feat(upload): improve SINAN upload; add helpers, validations & views to upload large files #732
Conversation
AlertaDengue/upload/tasks.py
Outdated
logger.info("Converting Parquet file into chunks")
# df_chunk = df_chunk.dropna(
#     subset=SINANUpload.REQUIRED_COLS, how="any"
# )  # it usually drops the entire chunk
I don't know if it's supposed to happen this way, but this dropna usually drops the entire chunk: only a few rows are free of None values in the required columns.
Edit: found the issue. The dropna is correct, since REQUIRED_COLS can't contain None values; I was incorrectly replacing a string with None.
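A minimal sketch of what was going on, assuming hypothetical stand-ins for `SINANUpload.REQUIRED_COLS` (the real column list lives in the model and is not shown in this diff):

```python
import pandas as pd

# Hypothetical stand-ins for SINANUpload.REQUIRED_COLS
REQUIRED_COLS = ["DT_NOTIFIC", "ID_MUNICIP"]

df_chunk = pd.DataFrame(
    {
        "DT_NOTIFIC": ["2024-01-02", "2024-01-03", "2024-01-04"],
        "ID_MUNICIP": ["3304557", "3304557", "3550308"],
        "NU_IDADE_N": ["4025", "", "4031"],  # optional column, blanks allowed
    }
)

# On the raw chunk the dropna is harmless: required columns never hold None.
print(len(df_chunk.dropna(subset=REQUIRED_COLS, how="any")))  # 3

# The bug described above: replacing a string with None in a required column
# puts a missing value in most rows, so the whole chunk appears to vanish.
broken = df_chunk.replace({"3304557": None})
print(len(broken.dropna(subset=REQUIRED_COLS, how="any")))  # 1
```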
Is this dropna necessary for the conversion to Parquet? If not, we should retain all rows regardless of whether they contain NAs. This chunk_parquet_file should not make decisions about data quality; it should already receive sanitized data.
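A rough sketch of that separation of concerns. The real chunk_parquet_file signature is not shown in the diff, so the parameters, paths, and the sanitize helper below are assumptions, not the PR's implementation:

```python
from pathlib import Path

import pandas as pd


def sanitize(df: pd.DataFrame, required_cols: list[str]) -> pd.DataFrame:
    # Data-quality decisions happen here, as a separate step before chunking.
    return df.dropna(subset=required_cols, how="any")


def chunk_parquet_file(parquet_path: Path, chunk_size: int = 100_000) -> list[Path]:
    # Only splits the (already sanitized) file it receives; never drops rows.
    df = pd.read_parquet(parquet_path)
    chunk_paths = []
    for i, start in enumerate(range(0, len(df), chunk_size)):
        chunk_path = parquet_path.with_name(f"{parquet_path.stem}.chunk{i}.parquet")
        df.iloc[start : start + chunk_size].to_parquet(chunk_path, index=False)
        chunk_paths.append(chunk_path)
    return chunk_paths
```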
I thought that the Parquet file would contain the raw data (the DBF converted to Parquet, for instance) and that the inserted rows of that file would be tracked in a separate table.
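Purely illustrative Django sketch of that idea; the model name, fields, and the "upload" app label are hypothetical and not part of the PR:

```python
from django.db import models


class SINANUploadedRows(models.Model):
    # Hypothetical tracking table: which row ranges of the raw Parquet file
    # have already been inserted into the database.
    upload = models.ForeignKey("upload.SINANUpload", on_delete=models.CASCADE)
    start_row = models.PositiveIntegerField()
    end_row = models.PositiveIntegerField()
    inserted_at = models.DateTimeField(auto_now_add=True)
```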
You are right. The sanitization should be done as a separate step.
@fccoelho I'll merge this PR as it is right now so the SINAN upload can be used in production, but it's still missing the file overview (purely visual) and the task to convert the DBF to Parquet (until we resolve how the data should be included in the Parquet file). Can you review it again, please?
@luabida I think the Parquet should have the exact same content as the DBF (as you mentioned above), so that we don't need to keep the DBFs.
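A rough sketch of what that DBF-to-Parquet task could look like, keeping every row and column as stored so the DBF no longer needs to be kept. It assumes the dbfread and pyarrow packages; the function name, encoding, and paths are illustrative, not the task shipped in this PR:

```python
import pandas as pd
from dbfread import DBF


def dbf_to_parquet(dbf_path: str, parquet_path: str) -> str:
    # Read every record exactly as stored; no row or column filtering.
    records = DBF(dbf_path, encoding="iso-8859-1", char_decode_errors="replace")
    df = pd.DataFrame(list(records))
    df.to_parquet(parquet_path, index=False)
    return parquet_path
```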
Looks Good.