Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added count_documents() implementation to builtin_timeseries #935

Merged
merged 11 commits into from
Sep 13, 2023
15 changes: 15 additions & 0 deletions emission/storage/timeseries/builtin_timeseries.py
Original file line number Diff line number Diff line change
Expand Up @@ -440,3 +440,18 @@ def update_data(user_id, key, obj_id, data):
logging.debug("updating entry %s into timeseries" % new_entry)
edb.save(ts.get_timeseries_db(key), new_entry)

def find_entries_count(self, key, time_query = None, geo_query = None, extra_query_list = None):
"""
Returns the total number of documents for the specific key referring to a timeseries db.
:param key: the metadata key we are querying for. Only supports one key for now.
:param time_query: the time range in which to search the stream
:param geo_query: the query for a geographical area
:param extra_query_list: any additional queries to filter out data
"""
logging.debug("builtin_timeseries.find_entries_count() called")
created_query = self._get_query([key], time_query, geo_query, extra_query_list)
result_dataset = self.get_timeseries_db(key)
total_entries = result_dataset.count_documents(created_query)
return total_entries


24 changes: 24 additions & 0 deletions emission/tests/storageTests/TestTimeSeries.py
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,30 @@ def testExtraQueries(self):
with self.assertRaises(AttributeError):
list(ts.find_entries(time_query=tq, extra_query_list=[ignored_phones]))

def testFindEntriesCount(self):
'''
Test: Specific keys with other parameters not passed values.
Input: For each dataset: ["background/location", "background/filtered_location]
- Testing this with sample dataset: "shankari_2015-aug-21", "shankari_2015-aug-27"
Output: Aug_21: [738, 508], Aug_27: [555, 327]
- Actual output just returns a single number for count of entries.
- Validated using grep count of occurrences for keys: 1) "background/location" 2) "background/filtered_location"
- $ grep -c <key> <dataset>.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please add the grep outputs here?

'''
ts1_aug_21 = esta.TimeSeries.get_time_series(self.testUUID1)
ts2_aug_27 = esta.TimeSeries.get_time_series(self.testUUID)

count_ts1 = [ts1_aug_21.find_entries_count(key="background/location"), ts1_aug_21.find_entries_count(key="background/filtered_location")]
print("\nEntry counts for location, filtered_location on {} = {}".format("Aug_21", count_ts1))
self.assertEqual(count_ts1, [738, 508])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are very basic tests - I don't see any corner-case testing, for example.

  • What happens if the key list is blank?
  • What happens if the key list is blablabla?
  • What happens if there is no data for that user?

I believe you had listed out a few similar cases during our discussion, I don't see any actual implementation of those here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not implemented for blank keys as it was to be decided whether key is to be made mandatory or not and assumed that a key would always be passed for getting count.
But yes, I will go ahead and make code changes to accommodate for blank keys by returning total counts from each of the two timeseries DBs as a tuple (count_timeseries_db, count_analysis_db).

Copy link
Contributor Author

@MukuFlash03 MukuFlash03 Sep 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a doubt, regarding "key list". For now, I have only been taking a single key as opposed to a list.

  1. Is the count to be done for multiple keys passed as a key list?
  2. If yes, ´get_query' returns a combined query for multiple keys and gives total count of matching entries not segregated by keys. For instance, keylist = [A,B], then total count will be (A+B).
    Do we need a separate count for each key in that case, that is (A,B, ...)
    or combined count = (A+B) is needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #935 (comment)

count_ts2 = [ts2_aug_27.find_entries_count(key="background/location"), ts2_aug_27.find_entries_count(key="background/filtered_location")]
print("Entry counts for location, filtered_location on {} = {}".format("Aug_27", count_ts2))
self.assertEqual(count_ts2, [555, 327])

print("Assert Test for Count Data successful!")


if __name__ == '__main__':
import emission.tests.common as etc
etc.configLogging()
Expand Down