This repository has been archived by the owner on Apr 14, 2023. It is now read-only.

Backfill Socrata data #35

Open
thekaveman opened this issue Jun 1, 2016 · 7 comments

Comments

@thekaveman
Contributor

Not so much an issue with our code, just a reminder to-do.

Let's backfill the daily report data into Socrata, starting January 1, 2016.

@allejo allejo self-assigned this Jun 7, 2016
@allejo
Collaborator

allejo commented Jun 7, 2016

Do the current GA IDs/properties have data dating back to Jan 1, 2016?

@thekaveman
Contributor Author

No, I guess not 😩

I think we probably have data starting April 26, see 361d487

@allejo
Collaborator

allejo commented Jun 7, 2016

The Google Analytics API accepts a YYYY-MM-DD date for the start-date query parameter, so we can use that to backfill. However, around that time the analytics filters were still being worked on, so there are well over 10,000 entries. That isn't an issue for GA, since we can just split the queries into smaller time periods. What I'd like to know is whether you want all of that unfiltered data (e.g. Google Mini, www vs non-www results, etc.) pushed to Socrata, or whether we should apply some sort of filter to this otherwise unfiltered data.

Here's what the report would look like; it's exactly the same as the all-pages report except with an explicit date.

frequency: once
meta:
  description: Data regarding each and every page dating back to 2016-04-26
  name: Back Fill
name: back-fill
query:
  dimensions:
  - ga:date
  - ga:hostname
  - ga:pagePath
  - ga:pageTitle
  end-date: yesterday
  start-date: 2016-04-26
  metrics:
  - ga:sessions
  - ga:percentNewSessions
  - ga:pageviews
  - ga:uniquePageviews
  - ga:avgTimeOnPage
  - ga:avgPageLoadTime
  - ga:entrances
  - ga:bounces
  - ga:exits
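For what it's worth, splitting the backfill into smaller time periods could look roughly like the sketch below. It is only an illustration: the `date_windows` helper and the 7-day window size are assumptions, not something in the repo, and each pair it yields would stand in for the `start-date` and `end-date` values of one query.

```python
from datetime import date, timedelta


def date_windows(start, end, days_per_window=7):
    """Yield (start-date, end-date) strings covering start..end inclusive.

    Hypothetical helper: the window size is an arbitrary choice and would
    need tuning so that each query stays under the row limit.
    """
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    cursor = start
    while cursor <= end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(cursor + timedelta(days=days_per_window - 1), end)
        yield cursor.isoformat(), window_end.isoformat()
        cursor = window_end + timedelta(days=1)


# Example: three weeks starting at the first date we seem to have data for.
windows = list(date_windows(date(2016, 4, 26), date(2016, 5, 16)))
for start_date, end_date in windows:
    print(f"start-date: {start_date}  end-date: {end_date}")
```

The windows are inclusive on both ends and do not overlap, so no day should be fetched twice.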

@allejo allejo removed their assignment Sep 6, 2016
@thekaveman
Contributor Author

The socrata job hasn't been running successfully since August 17, so once #42 is solved we need to make sure we run a few backfills (to save on our GA quota).

We'll probably want at least:

@thekaveman
Contributor Author

thekaveman commented Sep 7, 2016

@allejo regarding your question above:

[Should we] have all of that unfiltered data (e.g. Google Mini, www vs non-www results, etc.) pushed to Socrata or should we have some sort of filter in this, otherwise, unfiltered data.

I think we should make a best effort at filtering and normalizing. We might want to create one or more WebJobs, distinct from the socrata job, to deal with this backfilling. That way we can make full use of a programming environment to clean up the data as much as possible before it gets sent up.
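As a rough idea of what "filtering and normalizing" might mean here, this is a minimal sketch under a few assumptions: rows arrive as dicts keyed by the GA dimension/metric names, www and non-www hostnames should be merged, and some hosts (like the Google Mini) should be dropped. The function names and the `search.example.org` host are made up for illustration. Only an additive metric (`ga:pageviews`) is summed; averages such as `ga:avgTimeOnPage` can't be merged this way.

```python
from collections import defaultdict


def normalize_hostname(hostname):
    """Lower-case the hostname and drop a leading 'www.' (assumed rule)."""
    host = hostname.strip().lower()
    return host[4:] if host.startswith("www.") else host


def merge_rows(rows, excluded_hosts=frozenset()):
    """Sum pageviews per (date, normalized hostname, page path).

    Rows whose normalized hostname is in `excluded_hosts` are skipped.
    """
    totals = defaultdict(int)
    for row in rows:
        host = normalize_hostname(row["ga:hostname"])
        if host in excluded_hosts:
            continue
        key = (row["ga:date"], host, row["ga:pagePath"])
        totals[key] += int(row["ga:pageviews"])
    return dict(totals)


# Made-up sample rows; the hostnames are placeholders.
rows = [
    {"ga:date": "20160426", "ga:hostname": "www.example.org", "ga:pagePath": "/", "ga:pageviews": "3"},
    {"ga:date": "20160426", "ga:hostname": "example.org", "ga:pagePath": "/", "ga:pageviews": "2"},
    {"ga:date": "20160426", "ga:hostname": "search.example.org", "ga:pagePath": "/", "ga:pageviews": "9"},
]
merged = merge_rows(rows, excluded_hosts={"search.example.org"})
print(merged)
```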

@allejo
Collaborator

allejo commented Sep 7, 2016

Would a WebJob actually be necessary for a one-time job of backfilling data? I was thinking of running it locally, cleaning up the data, and then upserting it to Socrata. Or would this backfilling be happening regularly?
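A local one-off run could prepare the upserts along these lines. This is a sketch, not our actual code: it assumes the SODA convention of POSTing a JSON array of rows to the dataset's `/resource/<id>.json` endpoint, and the domain, dataset ID, token, and batch size are all placeholders. It only builds the requests and sends nothing; authentication for a real run is left out.

```python
import json
from urllib.request import Request


def batches(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def build_upsert_request(domain, dataset_id, rows, app_token):
    """Build (but do not send) a POST request carrying `rows` as JSON.

    Assumption: the dataset accepts upserts at /resource/<dataset_id>.json.
    A real run would also need credentials, which are omitted here.
    """
    url = f"https://{domain}/resource/{dataset_id}.json"
    body = json.dumps(rows).encode("utf-8")
    headers = {"Content-Type": "application/json", "X-App-Token": app_token}
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder values throughout; nothing here touches the network.
rows = [{"date": "2016-04-26", "pageviews": i} for i in range(5)]
requests_to_send = [
    build_upsert_request("data.example.org", "abcd-1234", batch, "HYPOTHETICAL_TOKEN")
    for batch in batches(rows, 2)
]
print(len(requests_to_send), "requests prepared")
```

Sending each request would then be a matter of passing it to `urllib.request.urlopen` once credentials are added.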

@thekaveman
Contributor Author

Yeah, you're totally right. I guess I just meant something external to the socrata job.
