-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-34875: switch most DataFrame connections to ArrowAstropy #1010
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good; a bunch of small comments.
import esutil | ||
import hpgeom as hpg | ||
import numpy as np | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warms my heart.
|
||
table = df[persist_columns][goodSrc.selected].to_records() | ||
table = tbl[persist_columns][goodSrc.selected].as_array().view(np.recarray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why we need the .view(np.recarray)
here? Or why (equivalently) np.asarray(tbl[persist_columns][goodSrc.selected])
doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was being cautious because the old code (I think) was making actual np.recarray
instances, not just np.ndarray
instances with structured dtypes, and I didn't know if any recarray
functionality was actually needed. But unit tests pass with what you've suggested, so I'll switch to that.
import functools | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄
import numbers | ||
import os | ||
|
||
import numpy as np | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... sigh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it'd have been fantastic to be able to drop it from this file, but that's a long ways away. Just have to settle for not pretending it's part of the standard library anymore.
df = catalog.asAstropy().to_pandas().set_index("id", drop=True) | ||
df["visit"] = visit | ||
tbl = catalog.asAstropy() | ||
tbl.add_index("id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me nervous. Why do we need the index here? We aren't persisting it... Also astropy indexes are wonky and I don't want to rely on them since the support seems nonexistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I hadn't realized I'd already committed this before our Slack discussion about the Astropy indexes being inadvisable. I don't think it's actually needed; I'll drop it and rerun ci_*
.
outputCatalog = pd.DataFrame(data=visitEntries) | ||
outputCatalog.set_index("visitId", inplace=True, verify_integrity=True) | ||
outputCatalog = astropy.table.Table(rows=visitEntries) | ||
outputCatalog.add_index("visitId") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the astropy table index thing.
import numpy as np | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
df.set_index('sourceId', inplace=True) | ||
data_refs.append(lsst.pipe.base.InMemoryDatasetHandle(df, storageClass="DataFrame")) | ||
tbl = astropy.table.Table(table) | ||
handles.append(lsst.pipe.base.InMemoryDatasetHandle(tbl, storageClass="DataFrame")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be DataFrame
or ArrowAstropy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. But I guess nothing actually cares, given what the test actually does with the handle.
tables.append(df.to_records()) | ||
for handle in self.handles: | ||
tbl = handle.get() | ||
tables.append(tbl.as_array().view(np.recarray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again np.asarray(tbl)
?
tables.append(df.to_records()) | ||
for handle in self.handles: | ||
tbl = handle.get() | ||
tables.append(tbl.as_array().view(np.recarray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here.
81a3750
to
f924f94
Compare
b9a7f93
to
acbd837
Compare
87772b1
to
5201792
Compare
This will only change the output dataset type definitions in repositories where they are not already registered. It should be backwards compatible (both in reading inputs originally written as DataFrame, and in downstream tasks reading its outputs) due to storage class conversions.
Some of these previously returned a DataFrame directly while others returned a Struct.
This does not include TransformForcedSourceTable, as its string columns make it a little trickier. The inputs and calculations are still Pandas; we just convert to ArrowAstropy at the end. This gets the default dataset type definitions in shape with minimal effort. There is a temporary exception or TransformObjectTable when multiLevelOutput=True - that still uses DataFrame - but this option is now depecated (it wasn't being used in production already).
We can't do the same for Object and ForcedSource (at least not easily) because those use Pandas MultiIndexes.
Use "handle" as an abbreviation for DeferredDatasetHandle (or InMemoryDatasetHandle) rather than "ref", which used to mean DataRef in Gen2 to (analogous) but suggests DatasetRef in Gen3 (not analogous).
Since the internals are already all using structured numpy arrays, it's easy to fully remove Pandas here.
We're already setting these in transfromSourceTable from the data ID, overriding whatever was there in the pre-transform table, and this lets transformSourceTable run on calibrateImage's already-calibrated Astropy output, which lacks them. Note that I'm making minor adjustments to Source.yaml instead of switching to the existing configuration for initial_stars because I don't want to reconfigure downstream analysis tasks (which require many measurement columns that calibrateImage doesn't run by default) on this ticket. I expect the final configuration to be somewhere in between.
Prior to this ticket, there was no detector ID column (just the manged ccdVisitId), which seems like an odd oversight. Prior to this commit, the detector ID was just called "id", since that's what made sense in the per-visit input cataogs, but it doesn't make sense here.
Using this for MakeCcdVisitTableTask is trickier because we can't ask an ExposureCatalog how many rows it has before loading it. If that's needed, we can extend this code to do it on another branch.
5201792
to
e080d3d
Compare
No description provided.