Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added count_documents() implementation to builtin_timeseries #935

Merged
merged 11 commits into from
Sep 13, 2023

Conversation

MukuFlash03
Copy link
Contributor

@MukuFlash03 MukuFlash03 commented Aug 26, 2023

Implemented code for issue 933 in e-mission-docs for adding functionality to count number of documents. I've determined that 'key' parameter can be passed to retrieve appropriate timeseries db collection. A query is generated with optional extra_query keys list which returns filtered data set.

Pending: Overall testing with both regular and edge test cases.
Edit (09/11/23) - Completed implementation and testing; PR merged.

Implemented code for issue e-mission#933 in e-mission-docs for adding functionality to count number of documents.
I've determined that 'key' parameter can be passed to retrieve appropriate timeseries db collection.
A query is generated with optional extra_query keys list which returns filtered data set.
@nataliejschultz
Copy link
Contributor

Link to the issue

Mahadik, Mukul Chandrakant added 2 commits August 28, 2023 17:07
Updated variable name in definition but missed changing it in its usage in 'total_entries'.
- Fixed previous commits in the actual count_data() function.
I had ignored the ordering of function arguments passed to the _get_query() template method which takes in parameters other than "key, extra_query_list".
So, I used the actual keywords of the parameter name "extra_query_list" to assign parameter.

- Primarily, added a basic test case for the count_data() functionality.
Using the "background/location" key value for the "shankari_2015-aug-27" sample dataset.

TODO:
- Working on edge cases.
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Only one minor change...

emission/tests/storageTests/TestTimeSeries.py Outdated Show resolved Hide resolved
Mahadik, Mukul Chandrakant added 2 commits September 1, 2023 11:06
…ases

- Updated the function name and the function signature to include other common parameters like time_query, geo_query along with key and extra_queries to maintain consistency with existing functions.

- Modified test cases to include dataset samples with more specific keys.

- Provided code comments to clarify validation process for inputs and outputs of test datasets.
- Repositioned assert statements below their respective computations.
Comment on lines 97 to 100
count_ts1 = [ts1_aug_21.find_entries_count(key="background/location"), ts1_aug_21.find_entries_count(key="background/filtered_location")]
print("\nEntry counts for location, filtered_location on {} = {}".format("Aug_21", count_ts1))
self.assertEqual(count_ts1, [738, 508])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very basic tests - I don't see any corner-case testing, for example.

  • What happens if the key list is blank?
  • What happens if the key list is blablabla?
  • What happens if there is no data for that user?

I believe you had listed out a few similar cases during our discussion, I don't see any actual implementation of those here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not implemented for blank keys as it was to be decided whether key is to be made mandatory or not and assumed that a key would always be passed for getting count.
But yes, I will go ahead and make code changes to accommodate for blank keys by returning total counts from each of the two timeseries DBs as a tuple (count_timeseries_db, count_analysis_db).

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a doubt, regarding "key list". For now, I have only been taking a single key as opposed to a list.

  1. Is the count to be done for multiple keys passed as a key list?
  2. If yes, ´get_query' returns a combined query for multiple keys and gives total count of matching entries not segregated by keys. For instance, keylist = [A,B], then total count will be (A+B).
    Do we need a separate count for each key in that case, that is (A,B, ...)
    or combined count = (A+B) is needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #935 (comment)

@shankari
Copy link
Contributor

shankari commented Sep 5, 2023

I had not implemented for blank keys as it was to be decided whether key is to be made mandatory or not and assumed that a key would always be passed for getting count.

@MukuFlash03 Let me clarify.

It is fine to assume that a key needs to be passed in for getting count.
Does everybody read documentation and follow it perfectly? No.

So we should also define the behavior if it is not and test that behavior
That is pretty much the definition of unit testing

It is completely possible for the user to have no data if they have just signed up for the app.
Note that the invocation of this code from the admin dashboard will be through the aggregate subclass, not the individual timeseries. Where is the testing against the aggregate subclass?

Code Logic
- Updated the code to return counts from both original_timeseries and analysis_timeseries as a tuple of lists pertaining to each key for each timeseries db.

Test cases:
- Updated test cases to include counts from both DBs.
- Included new test cases for edge conditions.
- Test cases include:
1. Key_list of keys from both DBs
2. Key_list of keys from Original timeseries DB.
3. Key_list of keys from Analysis timeseries DB.
4. Empty key_list
5. Invalid keys
6. Aggregate timeseries DB data as input.
7. New user without any data
@MukuFlash03
Copy link
Contributor Author

With reference to the TestTimeSeries.py file's testcase for Aggregate subclass under the testFindEntriesCount module,
I am getting a Failed assertion for this test.

Problem:
Test passes on running only the specific test module or even the specific test file.
However, test fails when the entire suit of tests is run using runAllTests.sh both locally and through the GitHub CLI.

Possible issue:
I am fetching the distinct users under the aggregate class object, and for each user - fetching the timeseries and counting documents.

In case of running only current test module, only 1 distinct user is found and output matches expected counts. I have verified expected outputs using mongodb terminal to fetch the count values. Code comments contains details on my validation process.

On the entire test suite, this user may not be present or the count data may not be matching for the user.


Failed Assertion:

File "/Users/mmahadik/Documents/GitHub/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 156, in testFindEntriesCount
   self.assertEqual(count_ts7, ([1476, 1016], [5]))
AssertionError: Tuples differ: ([0, 0], [0]) != ([1476, 1016], [5])

First differing element 0:
[0, 0]
[1476, 1016]

- ([0, 0], [0])
+ ([1476, 1016], [5])

----------------------------------------------------------------------
Ran 345 tests in 1327.825s
FAILED (failures=1)

- Had misunderstood usage of aggregate timeseries class.
- Corrected test case for aggregate class to check for sum of counts for each user without explicitly fetching each user.
@MukuFlash03 MukuFlash03 requested a review from shankari September 6, 2023 21:42
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better, but still needs some work.

Comment on lines 475 to 477
orig_tsdb_counts = self._get_entries_counts_for_timeseries(orig_tsdb, orig_tsdb_keys, time_query, geo_query, extra_query_list)
analysis_tsdb_counts = self._get_entries_counts_for_timeseries(analysis_tsdb, analysis_tsdb_keys, time_query, geo_query, extra_query_list)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to this, but I thought that the plan was to only support one key for now so that we could make progress on other server tasks. Can you clarify why that changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blank key input scenario prompted me to go for supporting multiple keys.
In this case, I thought of returning total counts of all documents for the specific user not filtered by keys.

Since, this is a functionality of abstract_timeseries class and its subclasses, any function call for counting documents would happen through this class objects and should not directly involve any timeseries database.

Now, since we don't know what timeseries database is to be used for counting, my thought was to return count of documents pertaining to both timeseries databases for the user (original and analysis).

I do realize now, that this could still have been achieved with a single key instead of key_list but a series of code changes happened in my current thought process to accommodate code changes and new test cases.

The extra time spent on the modifying the initial requirement could have been utilized for the other pending tasks.

Output: Aug_21: ([738, 508], [0]), Aug_27: ([555, 327], [0])
- Actual output just returns a single number for count of entries.
- Validated using grep count of occurrences for keys: 1) "background/location" 2) "background/filtered_location"
- $ grep -c <key> <dataset>.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add the grep outputs here?

emission/tests/storageTests/TestTimeSeries.py Outdated Show resolved Hide resolved
emission/tests/storageTests/TestTimeSeries.py Outdated Show resolved Hide resolved
# Test case: Only original timeseries DB keys for Aug-27 dataset
key_list2=["background/location", "background/filtered_location"]
count_ts3 = ts2_aug_27.find_entries_count(key_list=key_list2)
self.assertEqual(count_ts3, ([555, 327], []))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here it's an empty array. Why the inconsistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty array is returned in case there were no keys pertaining to the respective timeseries database.
In this specific test case, only background keys are there, hence their count is returned.
No, analysis keys are present in the input keys, hence empty array returned.

This is to differentiate from the [0] case where a key might be present in the input but no matching documents found.
Whereas in this case of [], no key was present in the input itself.

Copy link
Contributor

@shankari shankari Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an interesting design choice, but I am not sure I agree with it. This seems to:

  1. very tied to this specific implementation: e.g. the difference between timeseries and analysis_timeseries. Whenever you are designing an abstract interface, think about what it would look like if the underlying implementation were completely different. What if we stored everything in giant csv files instead? What if we used a real timeseries database instead? What would the meaning of timeseries and analysis_timeseries be in that case?
  2. unnecesarily complex to use: again, think of the user of the interface, who is using the timeseries abstraction that we have outlined in chapter 5 of my thesis. they want to know how many entries there are in each stream. Do they want to know where the entries came from? Do they want to know the details of our implementation? Do they want to know the difference between the key being present but there being zero entries and the key being absent? No. We shouldn't make them do that. Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries.

Can you articulate the pros (if any) of this interface?

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for your feedback on the design.

Yes, I do agree that it is too tightly based on the current implementation, especially with regards to two types of timeseries that we have currently.

In terms of reducing the complexity, it again comes down to supporting just a single key as per the initial requirement. This would resolve much of the varying output scenarios that the current implementation has such as the [] vs [0] case.

Motivation for this design:
In the original issue sample code usage here, I saw that first the data was fetched and then count was called. However, I designed this keeping in mind that the user would not have to explicitly state which timeseries the user wants a count for.
I noticed that the key could be used to identify which timeseries is required and hence went ahead with it.

Pending issue:
Now, in this design scenario, there was the case of blank keys passed.
How would we decide what timeseries count was to be returned to the user for this case?
Hence, I went ahead with returning the total count of the two available timeseries.
This resulted in my output now changing from a single value to two separate values for the two timeseries.

Possible resolution:

  • Stick to a single key and return count.
  • Need to decide how blank keys are handled AND,
  • Which timeseries data to be considered if user wants count of all documents without specifying key?

Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

Copy link
Contributor

@shankari shankari Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design should be driven by what is easy for users to use and generalizable. Not by what is easy to implement.

Motivation for this design:

Motivation for which design? single key or multi-key?

How would we decide what timeseries count was to be returned to the user for this case?

I don't see why this is a problem. What does find_entries do if the user doesn't specify a key? Do the same things for counts.

Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

If you always returned the same number of entries (e.g. if ['background/location', 'background/filtered_location'] returned ([x, y], [0, 0]) then we could do np.array(retVal[0]) + np.array(retVal[1]) to get the total counts. But we cannot do that with your current design.

Concretely, after this change, what will the op-admin-dashboard code that is changed to use this look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why it is good, when you come to a decision point like this, to document the alternatives, think through the pros and cons, and document your decision instead of just refactoring.

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Motivation for which design? single key or multi-key?

Single key initially.
Approach to fetch timeseries data using keys is valid even in case of multi-key.
I moved to multi-key since I followed how find_entries was handling multiple keys.


How would we decide what timeseries count was to be returned to the user for this case?

I don't see why this is a problem. What does find_entries do if the user doesn't specify a key? Do the same things for counts.

Right, so I have done what find_entries is doing - in case of blank keys, fetch the total count of each specific dataset, which itself is done via a helper function _get_entries_for_timeseries.
Here, _get_entries_for_timeseries (used in find_entries) , adds up all counts for keys belonging to a dataset to give a single count.
Ex. ['background/location, 'background/filtered_location'] output would be [x+y]

The difference in my implementation of _get_entries_counts_for_timeseries is that I am returning individual counts for each key separately as an array element part of combined list.
Ex. ['background/location, 'background/filtered_location'] = [x,y]

This is done for each of the two timeseries and their respective query results are returned in find_entries.
Similarly, I am returning the respective counts in find_entries_count.

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Concretely, after this change, what will the op-admin-dashboard code that is changed to use this look like?

With reference to sample code in original issue here:
total_trips = edb.get_analysis_timeseries_db().count_documents(arguments include userid and keys... ),

Now, the count can be fetched by:
ts_user = esta.get_time_series(UUID(user_id))
key_list = ['analysis/confirmed_trip']
total_trips = ts_user.find_entries_count(key_list=key_list)[1][0]

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Also, I am a bit unclear what you mean by this - "Note that with this implementation, we don't even support just adding the two sets of counts together because they may not have the same number of entries."

If you always returned the same number of entries (e.g. if ['background/location, 'background/filtered_location']returned([x, y], [0, 0])then we could donp.array(retVal[0]) + np.array(retVal[1])` to get the total counts. But we cannot do that with your current design.

Alright, thank you for the clarification. I do get your point.
This would be a small change to match up the number of entries in the output.

However, I want to point out that the two lists [x,y] and [0,0] would refer to different timeseries datasets with varying keys.
For instance, if 2nd list also had non-zero values, say [a,b], corresponding to some analysis_timseries keys, each element inside the 1st list refers to a separate key which will not match with the key from 2nd list.
Would this be appropriate: x (= background/location ) + a = (analysis/some_key) = x + a ?

Copy link
Contributor

@shankari shankari Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so I have done what find_entries is doing - in case of blank keys, fetch the total count of each specific dataset, which itself is done via a helper function _get_entries_for_timeseries.

This is the implementation of find_entries. The interface of find_entries is that if you give it blank keys, you get a combined list of all entries in the time range. The count is not even exposed outside the implementation. The interface of find_entry_counts should be the same as the interface of find_entries, which is that it should return the number of matching entries for that time range.

This is done for each of the two timeseries and their respective query results are returned in find_entries.

The respective query results are not returned in find_entries. A combined query result is returned.
See itertools.chain(orig_ts_db_result, analysis_ts_db_result)

However, I want to point out that the two lists [x,y] and [0,0] would refer to different timeseries datasets with varying keys.

I fail to see who would be interested in the implementation detail that we have two timeseries collections internally, or why they would need to know which key is stored while accessing the result.

ts_user = esta.get_time_series(UUID(user_id))
key_list = ['analysis/confirmed_trip']
total_trips = ts_user.find_entries_count(key_list=key_list)[1][0]

and if it were location, we would need:

ts_user = esta.get_time_series(UUID(user_id))
key_list = ['background/location']
total_trips = ts_user.find_entries_count(key_list=key_list)[0][0]

can you see how that is leaking the implementation outside the interface?

miniconda.sh Outdated Show resolved Hide resolved
- Added counts of grep outputs for sample inputs.
- Explained expected scenarios for blank keys.
- Removed miniconda.sh file added mistakenly post re-setup.
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if a list of keys is passed in (e.g. ['background/location', 'background/filtered_location'], find_entries does not currently return lists of entries separated by key. There is no equivalent of # For each key in orig_tsdb keys, create a query. It returns a combined list which intersperses both sets of entries. In general, I would expect find_entries_count(...) to given the same results as len(find_entries(...))

# Test case: Only analysis timeseries DB keys
key_list3=["analysis/confirmed_trip"]
count_ts4 = ts2_aug_27.find_entries_count(key_list=key_list3)
self.assertEqual(count_ts4, ([], [0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected interface:

Suggested change
self.assertEqual(count_ts4, ([], [0]))
self.assertEqual(count_ts4, 0)

# Test case: Empty key_list which should return total count of all documents in the two DBs
key_list4=[]
count_ts5 = ts1_aug_21.find_entries_count(key_list=key_list4)
self.assertEqual(count_ts5, ([2125], [0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.assertEqual(count_ts5, ([2125], [0]))
self.assertEqual(count_ts5, 2125)

# Test case: Aggregate timeseries DB User data passed as input
ts_agg = esta.TimeSeries.get_aggregate_time_series()
count_ts7 = ts_agg.find_entries_count(key_list=key_list1)
self.assertEqual(count_ts7, ([1293, 835], [0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.assertEqual(count_ts7, ([1293, 835], [0]))
self.assertEqual(count_ts7, 2125)

- Reused existing functionality from builtin_timeseries.py to fetch computed count of entries
- Corrected return values to return combined sum of count of matching entries not segregated by keys or timeseries dataset.
@MukuFlash03
Copy link
Contributor Author

MukuFlash03 commented Sep 8, 2023

I have addressed the latest review comments.
Overall changes made include using current _get_entries_for_timeseries implementation to return the count.
Additionally, changed the returned values to include combined sum for all matching entries not segregated by keys and or timeseries datasets.


While the test for the specific test module and the test file runs successfully, the entire test suite fails when run on GitHub's CI as well as my local system.
The failure occurs due to an Assertion failure in the counts for Aggregate subclass test case.

The actual failed assertion is this:

FAIL: testFindEntriesCount (storageTests.TestTimeSeries.TestTimeSeries)
Test: Specific keys with other parameters not passed values.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/e-mission-server/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 166, in testFindEntriesCount
    self.assertEqual(count_ts8, 3607)
AssertionError: 3608 != 3607

I separated the timeseries counts in the outputs to see which one was not matching, which showed analysis count is getting some entries.

FAIL: testFindEntriesCount (storageTests.TestTimeSeries.TestTimeSeries)
Test: Specific keys with other parameters not passed values.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mmahadik/Documents/GitHub/e-mission-server/emission/tests/storageTests/TestTimeSeries.py", line 174, in testFindEntriesCount
    self.assertEqual(count_ts8, 3607)
AssertionError: (3607, 1) != 3607

I am unsure how would I validate this output manually especially for analysis timeseries counts as expected count calculated by me just has total number of values from the input files.

I did check the counts for builtin_timeseries for both UUIDs and the count matches my expected values.
But count mismatch occurs when testing for Aggregate subclass.

@shankari
Copy link
Contributor

shankari commented Sep 8, 2023

I am unsure how would I validate this output manually especially for analysis timeseries counts as expected count calculated by me just has total number of values from the input files.

You cannot validate it - is is probably incorrect. I would print out the key of the single analysis result. I bet some other test is not cleaning up properly after itself, which is why you are running into this only when running in concert with other tests.

@shankari
Copy link
Contributor

shankari commented Sep 8, 2023

Let's get the tests to pass before I review. There's no point in reviewing PRs that don't pass tests.

@MukuFlash03
Copy link
Contributor Author

MukuFlash03 commented Sep 9, 2023

I am unsure how would I validate this output manually especially for analysis timeseries counts as expected count calculated by me just has total number of values from the input files.

You cannot validate it - is is probably incorrect. I would print out the key of the single analysis result. I bet some other test is not cleaning up properly after itself, which is why you are running into this only when running in concert with other tests.

This is the one unaccounted analysis document entry being fetched on running all tests:

key = segmentation/raw_stop

Entry({'_id': ObjectId('64fbb82b7b555615e043ef49'), 'user_id': UUID('3f205663-671b-3333-91d8-347d74a88f83'), 'metadata': {'key': 'segmentation/raw_stop', 'platform': 'server', 'write_ts': 1694218283.960188, 'time_zone': 'America/Los_Angeles', 'write_local_dt': {'year': 2023, 'month': 9, 'day': 8, 'hour': 17, 'minute': 11, 'second': 23, 'weekday': 4, 'timezone': 'America/Los_Angeles'}, 'write_fmt_time': '2023-09-08T17:11:23.960188-07:00'}, 'data': {'enter_ts': 5, 'exit_ts': 6, 'trip_id': 'test_trip_id'}})

The specific test case I was testing was passing an empty key list to aggregate subclass.
If this invalid data isn't related to my implementation, for now can I ignore this test case and commit code changes without this specific test?
As it does seem to work with empty list, almost matching the value and aggregating entries except for this extra entry.

- Commented out test case for aggregate subclass with input as empty key list.
- Previous commit failed on running all tests together due to an unaccounted extra document entry in analysis_timeseries.
- Ignoring this test for now.
@shankari
Copy link
Contributor

I appreciate your desire to get this checked in soon, but the project is at a stage of maturity where I don't want to cut corners. And commenting out a test that doesn't work the way you want without understanding the reason why is optimizing the specific functionality you are working on over the health of the project as a whole.

If this invalid data isn't related to my implementation, for now can I ignore this test case and commit code changes without this specific test?

How do you know that this is not related to your implementation? You don't even know where the entry is coming from. Maybe it is actually correct. You have validated the background/location and background/filtered_location counts by using grep on the files. But have you tried to grep for any other keys in the files?

  1. We are not under time pressure for these changes.
  2. There is no external deadline we must meet.
  3. We are in the process of improving the tightness and production-readiness of the codebase
  4. There is no other magic team member who is going to go in and understand this new entry

We should understand where the entry is coming from and fix it, not fix by removing an inconvenient test

…subclass test case in TestTimeSeries.py

- Had initially removed aggregate subclass test case in TestTimeseries.py as the overall tests were failing.
- Debugged the entire issue and found out that some test files in emission/tests/storageTests did not have a tearDown() to clear up test data initiated during that specific test.
- These tests include:
TestAnalysisTimeseries, TestPlaceQueries, TestSectionQueries, TestStopQueries
- Added tearDown() with code to clear the database as was being done in all other storageTests/ .
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MukuFlash03 I am merging this for now because:

  1. you have investigated and fixed the issue, and
  2. there is a release train leaving tonight

However, I would like you to include the cleanup below in your next PR.
Please keep the comment unresolved until I check the next PR and resolve it

Please also file an issue to change the code in https://github.com/e-mission/op-admin-dashboard to use the new method

Comment on lines +173 to +181
except AssertionError as e:
print(f"Assertion failed for 3607...")
for ct in count_ts8:
cte = ecwe.Entry(ct)
print(f"CTE = ")
print(cte.user_id)
print(cte.metadata.key)
print(cte)
print("=== Trip:", cte.data.start_loc, "->", cte.data.end_loc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will mean that this test can never fail, so it is essentially the same as the test being commented out like it was before. It is true that you print out an error message if you catch the assertion, but who is checking for log statements in the automated tests, particularly while running in a Ci/CD environment?

The only thing that people pay attention to is the red or green badge 😄

If you have actually fixed the test, you should have confidence in it and fail the test if the assertion fails.
You can do this by removing the try/catch or by leaving this in for debugging, but rethrowing the error after printing out the CTE

If you do choose to keep the try/catch + rethrow, you should add a TODO: to remove it in Sept 2024, so that we don't have any more vestigial code sitting around.

@shankari shankari merged commit f976ab2 into e-mission:master Sep 13, 2023
shankari added a commit that referenced this pull request Sep 13, 2023
Merge conflict was because `invalidate_raw_entry` was added in this branch (in
c79722a)
and `find_entries_count` was added in master in
#935

Retained both
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Address code reviews with "future"
Development

Successfully merging this pull request may close these issues.

3 participants