Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(upload): improve SINAN upload; add helpers, validations & views to upload large files #732

Merged
merged 27 commits into from
Jan 7, 2025

Conversation

luabida
Copy link
Collaborator

@luabida luabida commented Dec 4, 2024

No description provided.

@luabida luabida force-pushed the chunked-upload-sinan branch from 992d5b5 to 709841f Compare December 6, 2024 13:11
@luabida luabida force-pushed the chunked-upload-sinan branch from 709841f to 7c6f032 Compare December 6, 2024 17:52
@luabida luabida force-pushed the chunked-upload-sinan branch from 7c6f032 to 97859f9 Compare December 9, 2024 19:45
@luabida luabida force-pushed the chunked-upload-sinan branch from 512d6ab to 7d9cc83 Compare December 20, 2024 21:16
@luabida luabida force-pushed the chunked-upload-sinan branch from 19b3a2e to fde6413 Compare December 27, 2024 15:10
@luabida luabida force-pushed the chunked-upload-sinan branch from 0afd8a4 to 8b37888 Compare December 30, 2024 14:41
@luabida luabida force-pushed the chunked-upload-sinan branch from eda79e3 to 29c3296 Compare January 4, 2025 03:53
logger.info("Converting Parquet file into chunks")
# df_chunk = df_chunk.dropna(
# subset=SINANUpload.REQUIRED_COLS, how="any"
# ) # it usually drops the entire chunk
Copy link
Collaborator Author

@luabida luabida Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if its supposed to happen this way, but this dropna usually drops the entire chunk, few rows don't have None in some of the required columns

Edit: found the issue, the dropna is correct since required_cols can't contain None values, I was incorrectly replacing a string by None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this DROPNA necessary for the conversion to Parquet? If not, we should retain all rows independently of the fact that they may contain NAs. This chunk_parquet_file should not make decisions about data quality. It should already receive sanitized data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that the parquet file would contain the raw data (the DBF converted to parquet, for instance) and the inserted rows of that file would be pointed in a separated table..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. The sanitization should be done as a separate step.

@luabida luabida force-pushed the chunked-upload-sinan branch from 74059a9 to a7119c6 Compare January 6, 2025 16:42
@luabida luabida force-pushed the chunked-upload-sinan branch from a7119c6 to aaebf2d Compare January 6, 2025 21:55
@luabida luabida force-pushed the chunked-upload-sinan branch from 2469f7e to 9a1ba83 Compare January 7, 2025 14:15
@luabida luabida force-pushed the chunked-upload-sinan branch 2 times, most recently from f606721 to 6a394dd Compare January 7, 2025 14:58
@luabida luabida force-pushed the chunked-upload-sinan branch from 6a394dd to 45648f9 Compare January 7, 2025 17:24
@luabida luabida marked this pull request as ready for review January 7, 2025 19:58
@luabida
Copy link
Collaborator Author

luabida commented Jan 7, 2025

@fccoelho I'll merge this PR the way it is right now so the SINAN upload can be used on production, but its still missing the file overview (merely visual) and the task to convert the dbf to parquet (until we resolve how the data should be included on the parquet file). Can you review it again please?

@luabida luabida requested a review from fccoelho January 7, 2025 20:03
@fccoelho
Copy link
Contributor

fccoelho commented Jan 7, 2025

@luabida I think that the Parquet should have the same exact content of the DBF(as you mentioned above), so that we don't need to keep the DBFs.
But we should have separate diagnostic step that can report problems in the DBF, before it is inserted in the production database. Simple things like checking it has all columns we need, counting the number of NAs per column, etc.

@luabida luabida force-pushed the chunked-upload-sinan branch from 4e74acc to 9596c41 Compare January 7, 2025 20:15
Copy link
Contributor

@fccoelho fccoelho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good.

@luabida luabida merged commit 623cf21 into AlertaDengue:main Jan 7, 2025
0 of 4 checks passed
@luabida luabida deleted the chunked-upload-sinan branch January 7, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants