Somoclu parallel working version (#318)
* Somoclu parallel working version

* Changing filenames in somoclu demo

* Changed the iterator to send a single large chunk when the data is in memory rather than in a file

* Testing why it is failing

* Seems that the mean z values are not always the same

* Another

* Seeing how close the output of somoclu is

* Fixing initialization bug

* Debugging the tests

* Clean up

* Fixing merge conflicts

* Solving a merge conflict
joselotl authored May 10, 2023
1 parent 393d03e commit 142a156
Showing 4 changed files with 184 additions and 94 deletions.
12 changes: 6 additions & 6 deletions examples/estimation_examples/somocluSOMcluster_demo.ipynb
@@ -18,7 +18,7 @@
"\n",
"This notebook shows a quick demonstration of the use of the `SOMocluSummarizer` summarization module. Algorithmically, this module is not very different from the NZDir estimator/summarizer. NZDir operates by finding neighboring photometric points around spectroscopic objects. SOMocluSummarizer takes a large training set of data in the `Inform_SOMocluSummarizer` stage and trains a self-organized map (SOM) (using code from the `somoclu` package available at: https://github.com/peterwittek/somoclu/). Once the SOM is set up, the \"best match units\" are determined for both the photometric/unknown data and a set of spectroscopic data with known redshifts. For each SOM cell, the algorithm constructs a histogram using the spectroscopic members mapped to that cell, and weights these by the number of photometric galaxies in that cell. Both the photometric and spectroscopic datasets can also employ an optional per-galaxy weight. <br>\n",
"\n",
-"The summarizer also identifies SOM cells that contain photometric data but do not contain and galaxies with a measured spec-z, and thus do not have an obvious redshift estimate. It writes out the (raveled) SOM cell indices that contain \"uncovered\"/uncalibratable data to the file specified by the `uncovered_cell_file` option as a list of integers. The cellIDs and galaxy/objIDs for all photometric galaxies will be written out to the file specified by the `cellid_output` parameter. Any galaxies in these cells should really be removed, and thus some iteration may be necessary in defining bin membership by looking at the properties of objects in these uncovered cells before a final N(z) is estimated, as otherwise a bias may be present.<br>\n",
+"The summarizer also identifies SOM cells that contain photometric data but do not contain any galaxies with a measured spec-z, and thus do not have an obvious redshift estimate. It writes out the (raveled) SOM cell indices that contain \"uncovered\"/uncalibratable data to the file specified by the `uncovered_cluster_file` option as a list of integers. The cellIDs and galaxy/objIDs for all photometric galaxies will be written out to the file specified by the `cellid_output` parameter. Any galaxies in these cells should really be removed, and thus some iteration may be necessary in defining bin membership by looking at the properties of objects in these uncovered cells before a final N(z) is estimated, as otherwise a bias may be present.<br>\n",
"\n",
"The shape and number of cells used in constructing the SOM affect performance, as do several tuning parameters. This paper, http://www.giscience2010.org/pdfs/paper_230.pdf, gives a rough guideline that the number of cells should be of order ~5 x sqrt(number of data rows x number of columns). Some studies have found a 2D SOM that is more elongated in one direction to be preferential, while others claim that a square layout is optimal; the user can set the number of cells in each SOM dimension via the `n_rows` and `n_cols` parameters. For more discussion of SOMs, see the Appendices of this KiDS paper: http://arxiv.org/abs/1909.09632.\n",
"\n",
@@ -412,7 +412,7 @@
"`nsamples` (int): number of bootstrap samples to generate<br>\n",
"`output` (str): name of the output qp file with N samples<br>\n",
"`single_NZ` (str): name of the qp file with fiducial distribution<br>\n",
-"`uncovered_cell_file` (str): name of hdf5 file containing a list of all of the cells with phot data but no spec-z objects: photometric objects in these cells will *not* be accounted for in the final N(z), and should really be removed from the sample before running the summarizer. Note that we return a single integer that is constructed from the pairs of SOM cell indices via `np.ravel_multi_index`(indices).<br>"
+"`uncovered_cluster_file` (str): name of hdf5 file containing a list of all of the clusters with phot data but no spec-z objects: photometric objects in these cells will *not* be accounted for in the final N(z), and should really be removed from the sample before running the summarizer. Note that we return a single integer that is constructed from the pairs of SOM cell indices via `np.ravel_multi_index`(indices).<br>"
]
},
{
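To make the weighted-histogram scheme and the raveled cell indices described above concrete, here is a minimal sketch of the idea, not the RAIL implementation. The arrays `phot_data`, `spec_data`, and `spec_z` are hypothetical inputs; the `somoclu` calls (`train`, `get_surface_state`, `get_bmus`) follow that package's public API, and the BMU column order is assumed to be (column, row):

```python
import numpy as np
import somoclu

# Hypothetical inputs (somoclu expects float32 feature matrices):
# phot_data: (N_phot, n_features); spec_data: (N_spec, n_features); spec_z: (N_spec,)
n_rows, n_cols = 32, 32
som = somoclu.Somoclu(n_cols, n_rows)
som.train(spec_data)

def raveled_cells(data):
    # BMU lookup for arbitrary data; ravel the 2D coordinates into the
    # single integers that the summarizer writes out.
    bmus = som.get_bmus(som.get_surface_state(data))
    return np.ravel_multi_index((bmus[:, 1], bmus[:, 0]), (n_rows, n_cols))

spec_cells = raveled_cells(spec_data)
phot_cells = raveled_cells(phot_data)

zgrid = np.linspace(0.0, 3.0, 101)
nz = np.zeros(len(zgrid) - 1)
uncovered = []
for cell in np.unique(phot_cells):
    members = spec_cells == cell
    if not members.any():
        uncovered.append(cell)  # phot data but no spec-z members: uncalibratable
        continue
    hist, _ = np.histogram(spec_z[members], bins=zgrid)
    # Weight the normalized spec-z histogram by the cell's photometric occupancy
    nz += hist / max(hist.sum(), 1) * np.sum(phot_cells == cell)

# The inverse of the raveling, to recover 2D coordinates from stored cell IDs:
rows, cols = np.unravel_index(np.array(uncovered, dtype=int), (n_rows, n_cols))
```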
@@ -441,7 +441,7 @@
"summ_dict = dict(model=\"output_SOMoclu_model.pkl\", hdf5_groupname='photometry',\n",
"                 spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                 output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
-"                 uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                 uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
"som_summarizer = somocluSOMSummarizer.make_stage(name='SOMoclu_summarizer', **summ_dict) \n",
@@ -510,7 +510,7 @@
"                     spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                     output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
"                     n_clusters=n_clusters,\n",
-"                     uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                     uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
" som_summarizer = somocluSOMSummarizer.make_stage(name='SOMoclu_summarizer', **summ_dict) \n",
@@ -581,7 +581,7 @@
"                     spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                     output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
"                     n_clusters=1000,\n",
-"                     uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                     uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
"\n",
@@ -625,7 +625,7 @@
"bright_dict = dict(model=\"output_SOMoclu_model.pkl\", hdf5_groupname='photometry',\n",
"                   spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                   output='BRIGHT_SOMoclu_ensemble.hdf5', single_NZ='BRIGHT_fiducial_SOMoclu_NZ.hdf5',\n",
-"                   uncovered_cell_file=\"BRIGHT_uncovered_cells.hdf5\",\n",
+"                   uncovered_cluster_file=\"BRIGHT_uncovered_clusters.hdf5\",\n",
" n_clusters=1000,\n",
" objid_name='id',\n",
" cellid_output='BRIGHT_output_cellIDs.hdf5')\n",
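All of the notebook hunks above configure a summarizer stage via `make_stage`. For context, such a stage is then run on the photometric and spectroscopic data; the relevant cells are not shown in this diff, but the call presumably follows the usual RAIL summarizer pattern:

```python
# Hedged sketch: `test_data` and `spec_data` stand in for the data handles
# loaded earlier in the notebook (not shown in this diff); summarize()
# returns a handle to the bootstrap ensemble of N(z) estimates.
SOM_ensemble = som_summarizer.summarize(test_data, spec_data)
```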
24 changes: 20 additions & 4 deletions src/rail/core/data.py
@@ -222,13 +222,29 @@ class Hdf5Handle(TableHandle):
    """DataHandle for a table written to HDF5"""
    suffix = 'hdf5'

+    @classmethod
+    def _initialize_write(cls, data, path, data_length, **kwargs):
+        initial_dict = cls._get_allocation_kwds(data, data_length)
+        comm = kwargs.get('communicator', None)
+        group, fout = tables_io.io.initializeHdf5WriteSingle(path, groupname=None, comm=comm, **initial_dict)
+        return group, fout
+
+    @classmethod
+    def _get_allocation_kwds(cls, data, data_length):
+        keywords = {}
+        for key, array in data.items():
+            shape = list(array.shape)
+            shape[0] = data_length
+            keywords[key] = (shape, array.dtype)
+        return keywords
+
    @classmethod
    def _write_chunk(cls, data, fileObj, groups, start, end, **kwargs):
-        if groups is None:
-            tables_io.io.writeDictToHdf5ChunkSingle(fileObj, data, start, end, **kwargs)
-        else: #pragma: no cover
-            tables_io.io.writeDictToHdf5Chunk(groups, data, start, end, **kwargs)
+        tables_io.io.writeDictToHdf5ChunkSingle(fileObj, data, start, end, **kwargs)
+
+    @classmethod
+    def _finalize_write(cls, data, fileObj, **kwargs):
+        return tables_io.io.finalizeHdf5Write(fileObj, **kwargs)

class FitsHandle(TableHandle):
"""DataHandle for a table written to fits"""
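The three new hooks form a simple initialize/write/finalize protocol for chunked (and, via the `communicator` keyword, MPI-parallel) HDF5 output: allocate full-size datasets up front, write each chunk into its `[start:end)` slice, then close the file. A minimal sketch of a hypothetical serial driver, with made-up column data and output path:

```python
import numpy as np
from rail.core.data import Hdf5Handle

n_tot, chunk_size = 100, 25
data = {"id": np.arange(n_tot), "zmode": np.linspace(0.0, 3.0, n_tot)}

# Shapes and dtypes for the allocation come from one representative chunk;
# only the leading dimension is replaced by the total length.
first = {k: v[:chunk_size] for k, v in data.items()}
group, fout = Hdf5Handle._initialize_write(first, "output.hdf5", n_tot)
for start in range(0, n_tot, chunk_size):
    chunk = {k: v[start:start + chunk_size] for k, v in data.items()}
    Hdf5Handle._write_chunk(chunk, fout, group, start, start + chunk_size)
Hdf5Handle._finalize_write(None, fout)  # data argument is unused by the hook
```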
9 changes: 7 additions & 2 deletions src/rail/core/stage.py
@@ -326,7 +326,7 @@ def input_iterator(self, tag, **kwargs):
        These will be passed to the Handle's iterator method
        """
        handle = self.get_handle(tag, allow_missing=True)
-        if self.config.hdf5_groupname:
+        if self.config.hdf5_groupname and handle.path:
            self._input_length = handle.size(groupname=self.config.hdf5_groupname)
            total_chunks_needed = ceil(self._input_length/self.config.chunk_size)
            if total_chunks_needed < self.size:  #pragma: no cover
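The file-backed branch hands chunking off to the handle's iterator with the `rank` and `parallel_size` keywords visible in the next hunk. A small hypothetical illustration of that kind of round-robin chunk assignment (the actual RAIL iterator code is elided from this diff):

```python
from math import ceil

def rank_chunks(n_rows, chunk_size, rank, parallel_size):
    # Hypothetical sketch: assign every parallel_size-th [start, end) chunk
    # to this rank, so all ranks together cover the table exactly once.
    n_chunks = ceil(n_rows / chunk_size)
    for i in range(rank, n_chunks, parallel_size):
        yield i * chunk_size, min((i + 1) * chunk_size, n_rows)

list(rank_chunks(10, 3, rank=0, parallel_size=2))  # [(0, 3), (6, 9)]
```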
@@ -342,8 +342,13 @@ def input_iterator(self, tag, **kwargs):
                                 parallel_size=self.size)
            kwcopy.update(**kwargs)
            return handle.iterator(**kwcopy)
+        # If the data lives in memory rather than in a file, it is small
+        # enough to process in a single chunk.
        else:  #pragma: no cover
-            test_data = self.get_data('input')
+            if self.config.hdf5_groupname:
+                test_data = self.get_data('input')[self.config.hdf5_groupname]
+            else:
+                test_data = self.get_data('input')
            s = 0
            e = len(list(test_data.items())[0][1])
            self._input_length = e
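The in-memory branch above amounts to a one-chunk iterator. A small illustration of that fallback, assuming a dict-of-arrays table like the ones RAIL handles carry:

```python
import numpy as np

def one_chunk_iterator(table):
    # Mirrors the else-branch above: take the length from the first column
    # and yield the whole table as a single (start, end, data) chunk.
    length = len(list(table.items())[0][1])
    yield 0, length, table

table = {"id": np.arange(10), "mag": np.ones(10)}
for start, end, chunk in one_chunk_iterator(table):
    print(start, end, len(chunk["id"]))  # 0 10 10
```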
