Somoclu parallel working version (#318)
* Somoclu parallel working version

* Changing filenames in somoclu demo

* Changed the iterator to send a single large chunk when the data is in memory rather than in a file

* Testing why it is failing

* Seems that the mean z values are not always the same

* Another

* Seeing how close the output of somoclu is

* Fixing initialization bug

* Debugging the tests

* Clean up

* Fixing merge conflicts

* Solving a merge conflict
joselotl authored May 10, 2023
1 parent 393d03e commit 142a156
Showing 4 changed files with 184 additions and 94 deletions.
12 changes: 6 additions & 6 deletions examples/estimation_examples/somocluSOMcluster_demo.ipynb
@@ -18,7 +18,7 @@
"\n",
"This notebook shows a quick demonstration of the use of the `SOMocluSummarizer` summarization module. Algorithmically, this module is not very different from the NZDir estimator/summarizer. NZDir operates by finding neighboring photometric points around spectroscopic objects. SOMocluSummarizer takes a large training set of data in the `Inform_SOMocluSummarizer` stage and trains a self-organized map (SOM) (using code from the `somoclu` package available at: https://github.com/peterwittek/somoclu/). Once the SOM is set up, the \"best match units\" are determined for both the photometric/unknown data and a set of spectroscopic data with known redshifts. For each SOM cell, the algorithm constructs a histogram using the spectroscopic members mapped to that cell, and weights these by the number of photometric galaxies in that cell. Both the photometric and spectroscopic datasets can also employ an optional per-galaxy weight. <br>\n",
"\n",
-"The summarizer also identifies SOM cells that contain photometric data but do not contain and galaxies with a measured spec-z, and thus do not have an obvious redshift estimate. It writes out the (raveled) SOM cell indices that contain \"uncovered\"/uncalibratable data to the file specified by the `uncovered_cell_file` option as a list of integers. The cellIDs and galaxy/objIDs for all photometric galaxies will be written out to the file specified by the `cellid_output` parameter. Any galaxies in these cells should really be removed, and thus some iteration may be necessary in defining bin membership by looking at the properties of objects in these uncovered cells before a final N(z) is estimated, as otherwise a bias may be present.<br>\n",
+"The summarizer also identifies SOM cells that contain photometric data but do not contain any galaxies with a measured spec-z, and thus do not have an obvious redshift estimate. It writes out the (raveled) SOM cell indices that contain \"uncovered\"/uncalibratable data to the file specified by the `uncovered_cluster_file` option as a list of integers. The cellIDs and galaxy/objIDs for all photometric galaxies will be written out to the file specified by the `cellid_output` parameter. Any galaxies in these cells should really be removed, and thus some iteration may be necessary in defining bin membership by looking at the properties of objects in these uncovered cells before a final N(z) is estimated, as otherwise a bias may be present.<br>\n",
"\n",
"The shape and number of cells used in constructing the SOM affect performance, as do several tuning parameters. This paper, http://www.giscience2010.org/pdfs/paper_230.pdf, gives a rough guideline that the number of cells should be of order ~5 x sqrt(number of data rows x number of columns). Some studies have found a 2D SOM that is more elongated in one direction to be preferential, while others claim that a square layout is optimal; the user can set the number of cells in each SOM dimension via the `n_rows` and `n_cols` parameters. For more discussion of SOMs, see the Appendices of this KiDS paper: http://arxiv.org/abs/1909.09632.\n",
"\n",
@@ -412,7 +412,7 @@
"`nsamples` (int): number of bootstrap samples to generate<br>\n",
"`output` (str): name of the output qp file with N samples<br>\n",
"`single_NZ` (str): name of the qp file with fiducial distribution<br>\n",
-"`uncovered_cell_file` (str): name of hdf5 file containing a list of all of the cells with phot data but no spec-z objects: photometric objects in these cells will *not* be accounted for in the final N(z), and should really be removed from the sample before running the summarizer. Note that we return a single integer that is constructed from the pairs of SOM cell indices via `np.ravel_multi_index`(indices).<br>"
+"`uncovered_cluster_file` (str): name of hdf5 file containing a list of all of the clusters with phot data but no spec-z objects: photometric objects in these cells will *not* be accounted for in the final N(z), and should really be removed from the sample before running the summarizer. Note that we return a single integer that is constructed from the pairs of SOM cell indices via `np.ravel_multi_index`(indices).<br>"
]
},
{
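To make the weighted-histogram scheme and the raveled cell indices described above concrete, here is a minimal sketch of the idea, not the RAIL implementation. The arrays `phot_data`, `spec_data`, and `spec_z` are hypothetical inputs; the `somoclu` calls (`train`, `get_surface_state`, `get_bmus`) follow that package's public API, and the BMU column order is assumed to be (column, row):

```python
import numpy as np
import somoclu

# Hypothetical inputs (somoclu expects float32 feature matrices):
# phot_data: (N_phot, n_features); spec_data: (N_spec, n_features); spec_z: (N_spec,)
n_rows, n_cols = 32, 32
som = somoclu.Somoclu(n_cols, n_rows)
som.train(spec_data)

def raveled_cells(data):
    # BMU lookup for arbitrary data; ravel the 2D coordinates into the
    # single integers that the summarizer writes out.
    bmus = som.get_bmus(som.get_surface_state(data))
    return np.ravel_multi_index((bmus[:, 1], bmus[:, 0]), (n_rows, n_cols))

spec_cells = raveled_cells(spec_data)
phot_cells = raveled_cells(phot_data)

zgrid = np.linspace(0.0, 3.0, 101)
nz = np.zeros(len(zgrid) - 1)
uncovered = []
for cell in np.unique(phot_cells):
    members = spec_cells == cell
    if not members.any():
        uncovered.append(cell)  # phot data but no spec-z members: uncalibratable
        continue
    hist, _ = np.histogram(spec_z[members], bins=zgrid)
    # Weight the normalized spec-z histogram by the cell's photometric occupancy
    nz += hist / max(hist.sum(), 1) * np.sum(phot_cells == cell)

# The inverse of the raveling, to recover 2D coordinates from stored cell IDs:
rows, cols = np.unravel_index(np.array(uncovered, dtype=int), (n_rows, n_cols))
```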
@@ -441,7 +441,7 @@
"summ_dict = dict(model=\"output_SOMoclu_model.pkl\", hdf5_groupname='photometry',\n",
"                 spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                 output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
-"                 uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                 uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
"som_summarizer = somocluSOMSummarizer.make_stage(name='SOMoclu_summarizer', **summ_dict) \n",
@@ -510,7 +510,7 @@
"                     spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                     output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
"                     n_clusters=n_clusters,\n",
-"                     uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                     uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
" som_summarizer = somocluSOMSummarizer.make_stage(name='SOMoclu_summarizer', **summ_dict) \n",
@@ -581,7 +581,7 @@
"                     spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                     output='SOM_ensemble.hdf5', single_NZ='fiducial_SOMoclu_NZ.hdf5',\n",
"                     n_clusters=1000,\n",
-"                     uncovered_cell_file='all_uncovered_cells.hdf5',\n",
+"                     uncovered_cluster_file='all_uncovered_clusters.hdf5',\n",
" objid_name='id',\n",
" cellid_output='output_cellIDs.hdf5')\n",
"\n",
@@ -625,7 +625,7 @@
"bright_dict = dict(model=\"output_SOMoclu_model.pkl\", hdf5_groupname='photometry',\n",
"                   spec_groupname='photometry', nzbins=101, nsamples=25,\n",
"                   output='BRIGHT_SOMoclu_ensemble.hdf5', single_NZ='BRIGHT_fiducial_SOMoclu_NZ.hdf5',\n",
-"                   uncovered_cell_file=\"BRIGHT_uncovered_cells.hdf5\",\n",
+"                   uncovered_cluster_file=\"BRIGHT_uncovered_clusters.hdf5\",\n",
" n_clusters=1000,\n",
" objid_name='id',\n",
" cellid_output='BRIGHT_output_cellIDs.hdf5')\n",
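All of the notebook hunks above configure a summarizer stage via `make_stage`. For context, such a stage is then run on the photometric and spectroscopic data; the relevant cells are not shown in this diff, but the call presumably follows the usual RAIL summarizer pattern:

```python
# Hedged sketch: `test_data` and `spec_data` stand in for the data handles
# loaded earlier in the notebook (not shown in this diff); summarize()
# returns a handle to the bootstrap ensemble of N(z) estimates.
SOM_ensemble = som_summarizer.summarize(test_data, spec_data)
```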
24 changes: 20 additions & 4 deletions src/rail/core/data.py
@@ -222,13 +222,29 @@ class Hdf5Handle(TableHandle):
    """DataHandle for a table written to HDF5"""
    suffix = 'hdf5'

+    @classmethod
+    def _initialize_write(cls, data, path, data_length, **kwargs):
+        initial_dict = cls._get_allocation_kwds(data, data_length)
+        comm = kwargs.get('communicator', None)
+        group, fout = tables_io.io.initializeHdf5WriteSingle(path, groupname=None, comm=comm, **initial_dict)
+        return group, fout
+
+    @classmethod
+    def _get_allocation_kwds(cls, data, data_length):
+        keywords = {}
+        for key, array in data.items():
+            shape = list(array.shape)
+            shape[0] = data_length
+            keywords[key] = (shape, array.dtype)
+        return keywords
+
    @classmethod
    def _write_chunk(cls, data, fileObj, groups, start, end, **kwargs):
-        if groups is None:
-            tables_io.io.writeDictToHdf5ChunkSingle(fileObj, data, start, end, **kwargs)
-        else: #pragma: no cover
-            tables_io.io.writeDictToHdf5Chunk(groups, data, start, end, **kwargs)
+        tables_io.io.writeDictToHdf5ChunkSingle(fileObj, data, start, end, **kwargs)
+
+    @classmethod
+    def _finalize_write(cls, data, fileObj, **kwargs):
+        return tables_io.io.finalizeHdf5Write(fileObj, **kwargs)

class FitsHandle(TableHandle):
"""DataHandle for a table written to fits"""
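The three new hooks form a simple initialize/write/finalize protocol for chunked (and, via the `communicator` keyword, MPI-parallel) HDF5 output: allocate full-size datasets up front, write each chunk into its `[start:end)` slice, then close the file. A minimal sketch of a hypothetical serial driver, with made-up column data and output path:

```python
import numpy as np
from rail.core.data import Hdf5Handle

n_tot, chunk_size = 100, 25
data = {"id": np.arange(n_tot), "zmode": np.linspace(0.0, 3.0, n_tot)}

# Shapes and dtypes for the allocation come from one representative chunk;
# only the leading dimension is replaced by the total length.
first = {k: v[:chunk_size] for k, v in data.items()}
group, fout = Hdf5Handle._initialize_write(first, "output.hdf5", n_tot)
for start in range(0, n_tot, chunk_size):
    chunk = {k: v[start:start + chunk_size] for k, v in data.items()}
    Hdf5Handle._write_chunk(chunk, fout, group, start, start + chunk_size)
Hdf5Handle._finalize_write(None, fout)  # data argument is unused by the hook
```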
9 changes: 7 additions & 2 deletions src/rail/core/stage.py
@@ -326,7 +326,7 @@ def input_iterator(self, tag, **kwargs):
        These will be passed to the Handle's iterator method
        """
        handle = self.get_handle(tag, allow_missing=True)
-        if self.config.hdf5_groupname:
+        if self.config.hdf5_groupname and handle.path:
            self._input_length = handle.size(groupname=self.config.hdf5_groupname)
            total_chunks_needed = ceil(self._input_length/self.config.chunk_size)
            if total_chunks_needed < self.size:  #pragma: no cover
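The file-backed branch hands chunking off to the handle's iterator with the `rank` and `parallel_size` keywords visible in the next hunk. A small hypothetical illustration of that kind of round-robin chunk assignment (the actual RAIL iterator code is elided from this diff):

```python
from math import ceil

def rank_chunks(n_rows, chunk_size, rank, parallel_size):
    # Hypothetical sketch: assign every parallel_size-th [start, end) chunk
    # to this rank, so all ranks together cover the table exactly once.
    n_chunks = ceil(n_rows / chunk_size)
    for i in range(rank, n_chunks, parallel_size):
        yield i * chunk_size, min((i + 1) * chunk_size, n_rows)

list(rank_chunks(10, 3, rank=0, parallel_size=2))  # [(0, 3), (6, 9)]
```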
@@ -342,8 +342,13 @@ def input_iterator(self, tag, **kwargs):
                                 parallel_size=self.size)
            kwcopy.update(**kwargs)
            return handle.iterator(**kwcopy)
+        # If the data lives in memory rather than in a file, it is small
+        # enough to process in a single chunk.
        else:  #pragma: no cover
-            test_data = self.get_data('input')
+            if self.config.hdf5_groupname:
+                test_data = self.get_data('input')[self.config.hdf5_groupname]
+            else:
+                test_data = self.get_data('input')
            s = 0
            e = len(list(test_data.items())[0][1])
            self._input_length = e
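The in-memory branch above amounts to a one-chunk iterator. A small illustration of that fallback, assuming a dict-of-arrays table like the ones RAIL handles carry:

```python
import numpy as np

def one_chunk_iterator(table):
    # Mirrors the else-branch above: take the length from the first column
    # and yield the whole table as a single (start, end, data) chunk.
    length = len(list(table.items())[0][1])
    yield 0, length, table

table = {"id": np.arange(10), "mag": np.ones(10)}
for start, end, chunk in one_chunk_iterator(table):
    print(start, end, len(chunk["id"]))  # 0 10 10
```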
