qiskit-community · nkanazawa1989 · Feb 6, 2024 · Jan 10, 2024 · Jan 18, 2024 · Jan 18, 2024
diff --git a/docs/howtos/rerun_analysis.rst b/docs/howtos/rerun_analysis.rst
@@ -17,7 +17,7 @@ Solution
     consult the `migration guide <https://docs.quantum.ibm.com/api/migration-guides/qiskit-runtime-from-provider>`_.\
 
 Once you recreate the exact experiment you ran and all of its parameters and options,
-you can call the :meth:`.add_jobs` method with a list of :class:`Job
+you can call the :meth:`.ExperimentData.add_jobs` method with a list of :class:`Job
 <qiskit.providers.JobV1>` objects to generate the new :class:`.ExperimentData` object.
 The following example retrieves jobs from a provider that has access to them via their
 job IDs:
@@ -47,7 +47,7 @@ job IDs:
 instead of overwriting the existing one.
 
 If you have the job data in the form of a :class:`~qiskit.result.Result` object, you can
-invoke the :meth:`.add_data` method instead of :meth:`.add_jobs`:
+invoke the :meth:`.ExperimentData.add_data` method instead of :meth:`.ExperimentData.add_jobs`:
 
 .. jupyter-input::
 

diff --git a/docs/tutorials/curve_analysis.rst b/docs/tutorials/curve_analysis.rst
@@ -240,6 +240,85 @@ generate initial guesses for parameters, from the ``AnalysisA`` class in the fir
 On the other hand, in the latter case, you need to manually copy and paste
 every logic defined in ``AnalysisA``.
 
+.. _data_management_with_scatter_table:
+
+Managing intermediate data
+--------------------------
+
+:class:`.ScatterTable` is the single source of truth for the data used in the curve fit analysis.
+Each data point in a 1-D curve fit may consist of the x value, y value, and
+standard error of the y value.
+In addition, such analysis may internally create several data subsets.
+Each data point is given a metadata triplet (`series_id`, `category`, `analysis`)
+to distinguish the subset.
+
+* The `series_id` is an integer key representing a label of the data, which may be classified by fit models.
+  When an analysis consists of multiple fit models and performs a multi-objective fit,
+  the created table may contain multiple datasets, one for each fit model.
+  Usually the index of the series matches the index of the fit model in the analysis.
+  The table also provides a `series_name` column which is a human-friendly text notation of the `series_id`.
+  The `series_name` and the corresponding `series_id` must refer to the same data subset,
+  and the `series_name` typically matches the name of the fit model.
+  You can find a particular data subset by either `series_id` or `series_name`.
+
+* The `category` is a string tag categorizing a group of data points.
+  The measured outcomes input as-is to the curve analysis are categorized as "raw".
+  In a standard :class:`.CurveAnalysis` subclass, the input data is formatted for
+  the fitting and the formatted data is also stored in the table with the "formatted" category.
+  You can filter the formatted data to run curve fitting with your custom program.
+  After the fit is successfully conducted and the model parameters are identified,
+  data points in the interpolated fit curves are stored with the "fitted" category
+  for visualization. The management of the data groups depends on the design of
+  the curve analysis protocol, and the convention of category naming might
+  be different in a particular analysis.
+
+* The `analysis` is a string key representing the name of
+  the analysis instance that generated the data point.
+  This allows a user to combine multiple tables from different analyses without collapsing the data points.
+  For a simple analysis class, all rows will have the same value,
+  but a :class:`.CompositeCurveAnalysis` instance consists of
+  nested component analysis instances containing statistically independent fit models.
+  Each component is given a unique analysis name, and datasets generated from each instance
+  are merged into a single table stored in the outermost composite analysis.
+
+Users must be aware of this triplet to extract data points that belong to a
+particular data subset. For example,
+
+.. code-block:: python
+
+    mini_table = table.filter(series="my_experiment1", category="raw", analysis="AnalysisA")
+    mini_x = mini_table.x
+    mini_y = mini_table.y
+
+This operation is equivalent to
+
+.. code-block:: python
+
+    mini_x = table.xvals(series="my_experiment1", category="raw", analysis="AnalysisA")
+    mini_y = table.yvals(series="my_experiment1", category="raw", analysis="AnalysisA")
+
+When an analysis only has a single model and the table is created from a single
+analysis instance, the `series_id` and `analysis` are trivial, and you only need to
+specify the `category` to get subset data of interest.
+
+The :class:`.ScatterTable` columns are described in full below:
+
+- `xval`: Parameter scanned in the experiment. This value must be defined in the circuit metadata.
+- `yval`: Nominal part of the outcome. The outcome is a quantity such as an expectation value,
+  which is computed from the experiment result with the data processor.
+- `yerr`: Standard error of the outcome, which is mainly due to sampling error.
+- `series_name`: Human-readable name of the data series. This is defined by the ``data_subfit_map`` option in the :class:`.CurveAnalysis`.
+- `series_id`: Integer corresponding to the name of the data series. This number is automatically assigned.
+- `category`: A tag for the data group. This is defined by a developer of the curve analysis.
+- `shots`: Number of measurement shots used to acquire a data point. This value can be defined in the circuit metadata.
+- `analysis`: The name of the curve analysis instance that generated a data point.
+
+This object helps analysis developers write a custom analysis class
+without the overhead of complex data management, and helps end users
+retrieve and reuse the intermediate data for their custom fitting workflow
+outside our curve fitting framework.
+Note that a :class:`.ScatterTable` instance may be saved in the :class:`.ExperimentData` as an artifact.
+
 .. _curve_analysis_workflow:
 
 Curve Analysis workflow
@@ -271,67 +350,71 @@ the data processor in the analysis option is internally called.
 This consumes input experiment results and creates the :class:`.ScatterTable` dataframe.
 This table may look like:
 
-.. code-block::
-
-        xval      yval      yerr name  class_id category  shots
-    0    0.1  0.153659  0.011258    A         0      raw   1024
-    1    0.1  0.590732  0.015351    B         1      raw   1024
-    2    0.1  0.315610  0.014510    A         0      raw   1024
-    3    0.1  0.376098  0.015123    B         1      raw   1024
-    4    0.2  0.937073  0.007581    A         0      raw   1024
-    5    0.2  0.323415  0.014604    B         1      raw   1024
-    6    0.2  0.538049  0.015565    A         0      raw   1024
-    7    0.2  0.530244  0.015581    B         1      raw   1024
-    8    0.3  0.143902  0.010958    A         0      raw   1024
-    9    0.3  0.261951  0.013727    B         1      raw   1024
-    10   0.3  0.830732  0.011707    A         0      raw   1024
-    11   0.3  0.874634  0.010338    B         1      raw   1024
+.. jupyter-input::
+
+    table = analysis._run_data_processing(experiment_data.data())
+    print(table)
+
+.. jupyter-output::
+
+        xval      yval      yerr  series_name  series_id  category  shots     analysis
+    0    0.1  0.153659  0.011258            A          0      raw    1024   MyAnalysis
+    1    0.1  0.590732  0.015351            B          1      raw    1024   MyAnalysis
+    2    0.1  0.315610  0.014510            A          0      raw    1024   MyAnalysis
+    3    0.1  0.376098  0.015123            B          1      raw    1024   MyAnalysis
+    4    0.2  0.937073  0.007581            A          0      raw    1024   MyAnalysis
+    5    0.2  0.323415  0.014604            B          1      raw    1024   MyAnalysis
+    6    0.2  0.538049  0.015565            A          0      raw    1024   MyAnalysis
+    7    0.2  0.530244  0.015581            B          1      raw    1024   MyAnalysis
+    8    0.3  0.143902  0.010958            A          0      raw    1024   MyAnalysis
+    9    0.3  0.261951  0.013727            B          1      raw    1024   MyAnalysis
+    10   0.3  0.830732  0.011707            A          0      raw    1024   MyAnalysis
+    11   0.3  0.874634  0.010338            B          1      raw    1024   MyAnalysis
 
 where the experiment consists of two subset series A and B, and the experiment parameter (xval)
 is scanned from 0.1 to 0.3 in each subset. In this example, the experiment is run twice
-for each condition. The role of each column is as follows:
-
-- ``xval``: Parameter scanned in the experiment. This value must be defined in the circuit metadata.
-- ``yval``: Nominal part of the outcome. The outcome is something like expectation value, which is computed from the experiment result with the data processor.
-- ``yerr``: Standard error of the outcome, which is mainly due to sampling error.
-- ``name``: Unique identifier of the result class. This is defined by the ``data_subfit_map`` option.
-- ``class_id``: Numerical index corresponding to the result class. This number is automatically assigned.
-- ``category``: The attribute of data set. The "raw" category indicates an output from the data processing.
-- ``shots``: Number of measurement shots used to acquire this result.
+for each condition.
+See :ref:`data_management_with_scatter_table` for details of the columns.
 
 3. Formatting
 ^^^^^^^^^^^^^
 
-Next, the processed dataset is converted into another format suited for the fitting and
-every valid result is assigned a class corresponding to a fit model.
+Next, the processed dataset is converted into another format suited for the fitting.
 By default, the formatter takes average of the outcomes in the processed dataset
 over the same x values, followed by the sorting in the ascending order of x values.
 This allows the analysis to easily estimate the slope of the curves to
 create algorithmic initial guess of fit parameters.
 A developer can inject extra data processing, for example, filtering, smoothing,
 or elimination of outliers for better fitting.
-The new class_id is given here so that its value corresponds to the fit model object index
-in this analysis class. This index mapping is done based upon the correspondence of
-the data name and the fit model name.
+The new `series_id` is assigned here so that its value corresponds to the fit model index
+defined in this analysis class. This index mapping is based on the correspondence between
+the `series_name` and the fit model name.
 
 This is done by calling :meth:`_format_data` method.
 This may return new scatter table object with the addition of rows like the following below.
 
-.. code-block::
+.. jupyter-input::
+
+    table = analysis._format_data(table)
+    print(table)
+
+.. jupyter-output::
 
-    12   0.1  0.234634  0.009183    A         0  formatted   2048
-    13   0.2  0.737561  0.008656    A         0  formatted   2048
-    14   0.3  0.487317  0.008018    A         0  formatted   2048
-    15   0.1  0.483415  0.010774    B         1  formatted   2048
-    16   0.2  0.426829  0.010678    B         1  formatted   2048
-    17   0.3  0.568293  0.008592    B         1  formatted   2048
+        xval      yval      yerr  series_name  series_id   category  shots     analysis
+    ...
+    12   0.1  0.234634  0.009183            A          0  formatted   2048   MyAnalysis
+    13   0.2  0.737561  0.008656            A          0  formatted   2048   MyAnalysis
+    14   0.3  0.487317  0.008018            A          0  formatted   2048   MyAnalysis
+    15   0.1  0.483415  0.010774            B          1  formatted   2048   MyAnalysis
+    16   0.2  0.426829  0.010678            B          1  formatted   2048   MyAnalysis
+    17   0.3  0.568293  0.008592            B          1  formatted   2048   MyAnalysis
 
 The default :meth:`_format_data` method adds its output data with the category "formatted".
 This category name must be also specified in the analysis option ``fit_category``.
 If overriding this method to do additional processing after the default formatting,
 the ``fit_category`` analysis option can be set to choose a different category name to use to
 select the data to pass to the fitting routine.
-The (x, y) value in each row is passed to the corresponding fit model object
+The (xval, yval) pair in each row is passed to the corresponding fit model object
 to compute residual values for the least square optimization.
 
 3. Fitting

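Note on the `curve_analysis.rst` changes above: the following is a minimal sketch of the (`series`, `category`, `analysis`) triplet and of the default formatting step, written against a plain pandas DataFrame rather than the real :class:`.ScatterTable` API. The helpers `select` and `format_by_mean` are hypothetical names introduced only for illustration, and the error propagation (root-sum-square divided by the number of rows) is an assumption that happens to reproduce the numbers in the documentation example when all rows have equal shots.

```python
import numpy as np
import pandas as pd

# Raw rows copied from the documentation example (series A and B, two runs each).
raw = pd.DataFrame(
    {
        "xval": [0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3],
        "yval": [0.153659, 0.590732, 0.315610, 0.376098, 0.937073, 0.323415,
                 0.538049, 0.530244, 0.143902, 0.261951, 0.830732, 0.874634],
        "yerr": [0.011258, 0.015351, 0.014510, 0.015123, 0.007581, 0.014604,
                 0.015565, 0.015581, 0.010958, 0.013727, 0.011707, 0.010338],
        "series_name": ["A", "B"] * 6,
        "series_id": [0, 1] * 6,
        "category": "raw",
        "shots": 1024,
        "analysis": "MyAnalysis",
    }
)


def select(table, series=None, category=None, analysis=None):
    """Hypothetical helper: filter a plain DataFrame by the metadata triplet."""
    mask = pd.Series(True, index=table.index)
    if series is not None:
        # An integer is treated as a series_id, a string as a series_name.
        column = "series_id" if isinstance(series, int) else "series_name"
        mask &= table[column] == series
    if category is not None:
        mask &= table["category"] == category
    if analysis is not None:
        mask &= table["analysis"] == analysis
    return table[mask]


def format_by_mean(table):
    """Average raw rows sharing the same x value per series, sorted by x."""
    keys = ["analysis", "series_name", "series_id", "xval"]
    grouped = select(table, category="raw").groupby(keys, sort=True)
    formatted = grouped.agg(
        yval=("yval", "mean"),
        # Assumed propagation for a simple mean of independent points.
        yerr=("yerr", lambda e: np.sqrt(np.sum(e**2)) / len(e)),
        shots=("shots", "sum"),
    ).reset_index()
    formatted["category"] = "formatted"
    # Append the formatted rows after the raw rows, keeping the column order.
    return pd.concat([table, formatted[table.columns]], ignore_index=True)


table = format_by_mean(raw)
subset = select(table, series="A", category="formatted", analysis="MyAnalysis")
print(subset[["xval", "yval", "yerr", "shots"]])
```

With a single analysis and a single series, passing only `category` to such a filter would be enough, which mirrors the remark in the new documentation section.
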
diff --git a/qiskit_experiments/curve_analysis/__init__.py b/qiskit_experiments/curve_analysis/__init__.py
@@ -39,6 +39,7 @@
 .. autosummary::
     :toctree: ../stubs/
 
+    ScatterTable
     SeriesDef
     CurveData
     CurveFitResult

diff --git a/qiskit_experiments/curve_analysis/composite_curve_analysis.py b/qiskit_experiments/curve_analysis/composite_curve_analysis.py
@@ -230,34 +230,35 @@ def _create_figures(
             A list of figures.
         """
         for analysis in self.analyses():
-            sub_data = curve_data[curve_data.group == analysis.name]
-            for name, data in list(sub_data.groupby("name")):
-                full_name = f"{name}_{analysis.name}"
+            group_data = curve_data.filter(analysis=analysis.name)
+            model_names = analysis.model_names()
+            for series_id, sub_data in group_data.iter_by_series_id():
+                full_name = f"{model_names[series_id]}_{analysis.name}"
                 # Plot raw data scatters
                 if analysis.options.plot_raw_data:
-                    raw_data = data[data.category == "raw"]
+                    raw_data = sub_data.filter(category="raw")
                     self.plotter.set_series_data(
                         series_name=full_name,
-                        x=raw_data.xval.to_numpy(),
-                        y=raw_data.yval.to_numpy(),
+                        x=raw_data.x,
+                        y=raw_data.y,
                     )
                 # Plot formatted data scatters
-                formatted_data = data[data.category == analysis.options.fit_category]
+                formatted_data = sub_data.filter(category=analysis.options.fit_category)
                 self.plotter.set_series_data(
                     series_name=full_name,
-                    x_formatted=formatted_data.xval.to_numpy(),
-                    y_formatted=formatted_data.yval.to_numpy(),
-                    y_formatted_err=formatted_data.yerr.to_numpy(),
+                    x_formatted=formatted_data.x,
+                    y_formatted=formatted_data.y,
+                    y_formatted_err=formatted_data.y_err,
                 )
                 # Plot fit lines
-                line_data = data[data.category == "fitted"]
+                line_data = sub_data.filter(category="fitted")
                 if len(line_data) == 0:
                     continue
-                fit_stdev = line_data.yerr.to_numpy()
+                fit_stdev = line_data.y_err
                 self.plotter.set_series_data(
                     series_name=full_name,
-                    x_interp=line_data.xval.to_numpy(),
-                    y_interp=line_data.yval.to_numpy(),
+                    x_interp=line_data.x,
+                    y_interp=line_data.y,
                     y_interp_err=fit_stdev if np.isfinite(fit_stdev).all() else None,
                 )
 
@@ -354,7 +355,7 @@ def _run_analysis(
             metadata["group"] = analysis.name
 
             table = analysis._format_data(analysis._run_data_processing(experiment_data.data()))
-            formatted_subset = table[table.category == analysis.options.fit_category]
+            formatted_subset = table.filter(category=analysis.options.fit_category)
             fit_data = analysis._run_curve_fit(formatted_subset)
             fit_dataset[analysis.name] = fit_data
 
@@ -376,32 +377,35 @@ def _run_analysis(
 
             if fit_data.success:
                 # Add fit data to curve data table
-                fit_curves = []
-                columns = list(table.columns)
                 model_names = analysis.model_names()
-                for i, sub_data in list(formatted_subset.groupby("class_id")):
-                    xval = sub_data.xval.to_numpy()
+                for series_id, sub_data in formatted_subset.iter_by_series_id():
+                    xval = sub_data.x
                     if len(xval) == 0:
                         # If data is empty, skip drawing this model.
                         # This is the case when fit model exist but no data to fit is provided.
                         continue
                     # Compute X, Y values with fit parameters.
-                    xval_fit = np.linspace(np.min(xval), np.max(xval), num=100)
-                    yval_fit = eval_with_uncertainties(
-                        x=xval_fit,
-                        model=analysis.models[i],
+                    xval_arr_fit = np.linspace(np.min(xval), np.max(xval), num=100, dtype=float)
+                    uval_arr_fit = eval_with_uncertainties(
+                        x=xval_arr_fit,
+                        model=analysis.models[series_id],
                         params=fit_data.ufloat_params,
                     )
-                    model_fit = np.full((100, len(columns)), np.nan, dtype=object)
-                    fit_curves.append(model_fit)
-                    model_fit[:, columns.index("xval")] = xval_fit
-                    model_fit[:, columns.index("yval")] = unp.nominal_values(yval_fit)
+                    yval_arr_fit = unp.nominal_values(uval_arr_fit)
                     if fit_data.covar is not None:
-                        model_fit[:, columns.index("yerr")] = unp.std_devs(yval_fit)
-                    model_fit[:, columns.index("name")] = model_names[i]
-                    model_fit[:, columns.index("class_id")] = i
-                    model_fit[:, columns.index("category")] = "fitted"
-                table = table.append_list_values(other=np.vstack(fit_curves))
+                        yerr_arr_fit = unp.std_devs(uval_arr_fit)
+                    else:
+                        yerr_arr_fit = np.zeros_like(xval_arr_fit)
+                    for xval, yval, yerr in zip(xval_arr_fit, yval_arr_fit, yerr_arr_fit):
+                        table.add_row(
+                            series_name=model_names[series_id],
+                            series_id=series_id,
+                            category="fitted",
+                            x=xval,
+                            y=yval,
+                            y_err=yerr,
+                            analysis=analysis.name,
+                        )
                 analysis_results.extend(
                     analysis._create_analysis_results(
                         fit_data=fit_data,
@@ -416,11 +420,11 @@ def _run_analysis(
                     analysis._create_curve_data(curve_data=formatted_subset, **metadata)
                 )
 
-            # Add extra column to identify the fit model
-            table["group"] = analysis.name
             curve_data_set.append(table)
 
-        combined_curve_data = pd.concat(curve_data_set)
+        combined_curve_data = ScatterTable.from_dataframe(
+            pd.concat([d.dataframe for d in curve_data_set])
+        )
         total_quality = self._evaluate_quality(fit_dataset)
 
         # After the quality is determined, plot can become a boolean flag for whether
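
Note on the `_run_analysis` change above: the sketch below illustrates, under simplifying assumptions, how "fitted" rows can be generated per series and how per-analysis tables can then be merged. It uses plain pandas and NumPy instead of `ScatterTable`, `eval_with_uncertainties`, and the fit model objects. The function `fitted_rows`, the callables in `models`, and the use of NaN for `shots` are all hypothetical stand-ins, not the library's implementation.

```python
import numpy as np
import pandas as pd

COLUMNS = ["xval", "yval", "yerr", "series_name", "series_id", "category", "shots", "analysis"]


def fitted_rows(formatted, models, model_names, analysis_name, num=100):
    """Build "fitted" rows by evaluating each model over the x range of its series."""
    frames = []
    for series_id, sub_data in formatted.groupby("series_id"):
        xval = sub_data["xval"].to_numpy()
        # Interpolate between the smallest and largest x value of this series.
        x_fit = np.linspace(np.min(xval), np.max(xval), num=num, dtype=float)
        y_fit = models[series_id](x_fit)
        frames.append(
            pd.DataFrame(
                {
                    "xval": x_fit,
                    "yval": y_fit,
                    # No covariance is available in this sketch, so fall back to zeros.
                    "yerr": np.zeros_like(x_fit),
                    "series_name": model_names[series_id],
                    "series_id": series_id,
                    "category": "fitted",
                    "shots": np.nan,  # assumption: interpolated points carry no shots
                    "analysis": analysis_name,
                }
            )
        )
    if not frames:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)[COLUMNS]


# Formatted points for two series; only xval and series_id are used here.
formatted = pd.DataFrame(
    {
        "xval": [0.1, 0.2, 0.3, 0.1, 0.2, 0.3],
        "series_id": [0, 0, 0, 1, 1, 1],
    }
)
models = [lambda x: 2.0 * x, lambda x: 1.0 - x]  # hypothetical fit models

fitted = fitted_rows(formatted, models, ["A", "B"], "MyAnalysis", num=5)
other = fitted_rows(formatted, models, ["A", "B"], "OtherAnalysis", num=5)

# Merge tables from different analyses; the "analysis" column keeps them distinguishable.
combined = pd.concat([fitted, other], ignore_index=True)
print(combined.groupby(["analysis", "series_name"]).size())
```

Because every row carries its own `analysis` value, the merged table can still be split back per component, which is the reason the old post-hoc `group` column is no longer needed.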