Create a service to diff PDF files #36

Open
Mr0grog opened this issue Jun 1, 2017 · 28 comments
Comments

@Mr0grog
Member

Mr0grog commented Jun 1, 2017

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

  • a: A URL for the “before” version of the PDF
  • b: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.
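
As a rough illustration of the interface described above, here is a minimal sketch using only the Python standard library. The names `parse_diff_request` and `app` are hypothetical, and the response is a plain-text placeholder rather than a real diff; an actual service would fetch both PDFs and render the comparison.

```python
from urllib.parse import parse_qs


def parse_diff_request(query_string):
    """Extract the `a` (before) and `b` (after) PDF URLs from a query string."""
    params = parse_qs(query_string)
    missing = [name for name in ("a", "b") if name not in params]
    if missing:
        raise ValueError("missing query argument(s): " + ", ".join(missing))
    # parse_qs returns a list per key; use the first value of each.
    return params["a"][0], params["b"][0]


def app(environ, start_response):
    """Hypothetical WSGI entry point: validates `a` and `b`, then responds."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    try:
        a_url, b_url = parse_diff_request(environ.get("QUERY_STRING", ""))
    except ValueError as error:
        start_response("400 Bad Request", headers)
        return [str(error).encode("utf-8")]
    # Placeholder body: a real service would download and diff the PDFs here.
    start_response("200 OK", headers)
    return [f"would diff {a_url} against {b_url}".encode("utf-8")]
```

Because `app` is a plain WSGI callable, it could be served locally with `wsgiref.simple_server.make_server` for experimentation.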

If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.

Some open source libraries for diffing PDFs that might be useful:

@Mr0grog
Member Author

Mr0grog commented Jun 1, 2017

Here’s an example of a small, hard to see change:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/2d2ccc52-f467-4775-a034-bea5271c0b9f
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11512540.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74346-6228877/version-11522529.pdf

Here’s an interesting graphic page with changes:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/c0307603-0bae-4a6c-bf12-52cc6482b0bc
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-9608983.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/71555-6026691/version-11239564.pdf

Here’s one that’s just hard to scan by eye because it’s mostly reams of data:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/3edef8ea-de3f-4771-89f2-92840dad026b
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-9920428.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista1/74013-6199243/version-10713675.pdf

And another:
Record: https://web-monitoring-db.herokuapp.com/api/v0/pages/563b013c-883f-4099-8c98-ce6059a0b823
Version A: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11023958.pdf
Version B: https://edgi-versionista-archive.s3.amazonaws.com/versionista2/74279-6212866/version-11255938.pdf

@neiljp

neiljp commented Jun 1, 2017

I'm looking into this; see how I get on!

@Mr0grog
Member Author

Mr0grog commented Jun 1, 2017

@neiljp Awesome, thanks so much!

@neiljp

neiljp commented Jun 1, 2017

I have the second library working for all 4 examples you listed. I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.
Would it be helpful to show these images somewhere?
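
For anyone wondering what filtering out extraneous characters might look like, here is a minimal sketch. It assumes the text of each PDF has already been extracted by some other tool and does not reflect the library used above. The names `clean_text` and `diff_text` are hypothetical, and treating every Unicode control or format character as noise is an assumption.

```python
import difflib
import unicodedata


def clean_text(text):
    """Drop Unicode "Other" characters (control, format, etc.), keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def diff_text(before, after):
    """Return unified-diff lines between two cleaned, already-extracted texts."""
    before_lines = clean_text(before).splitlines()
    after_lines = clean_text(after).splitlines()
    return list(
        difflib.unified_diff(before_lines, after_lines, "a", "b", lineterm="")
    )
```

Texts that differ only in the filtered characters produce an empty diff, which is one way to cut down on false positives.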

@Mr0grog
Member Author

Mr0grog commented Jun 1, 2017

Sure! Go ahead and post them here. If you have this work in a repo, go ahead and link it, too.

@Mr0grog
Member Author

Mr0grog commented Jun 1, 2017

Are you on the Archivers Slack group? There’s more “live” conversation there, plus discussion of workflow, process, etc.

@neiljp

neiljp commented Jun 1, 2017

I'm generally not on Slack; is there an IRC mirror somewhere?

@neiljp

neiljp commented Jun 2, 2017

These are the results I have for the 4 tests, with the caveats as above:
1
2
3
4

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

These are wonderful. 👍 🎉

Unfortunately, I don’t think there is any mirror of the Slack :\

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

> I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.

No worries. I should have been clearer that this doesn’t have to be perfect. Even if there are false positives, being able to identify space people can definitely ignore is a big deal. This is super, super helpful.

@titaniumbones
Contributor

Hey @neiljp this looks great, thx. Great to have new people stepping in!

We have been talking about an IRC bridge for a while but haven't set one up - doh!

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

@neiljp I’m headed out for the night, but will be back on tomorrow at 9-ish Pacific Time if you are planning to do more work on it. I will also try and sign into the global sprint Gitter.im if you are using that (I did not do a good job of paying attention to it today, sorry).

Looking forward to getting this integrated as a running service!

@neiljp

neiljp commented Jun 2, 2017

@Mr0grog I'm back and on the gitter chat now.
Re chat: I'm currently on IRC (freenode, oftc), matrix.org and also experimenting with zulip (after some pycon sprints).
While I'm moving on with looking into this, were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

> were other online services looked at? Or is it that they cannot be deployed with different resource limitations, for example?

No—diffing PDFs is something that we simply haven't had time to get to at all yet.

In general, we haven’t found any great diffing services: neither ones we can feasibly deploy ourselves nor third-party ones we can integrate with in a way that lets us easily display the diff results in our own UI, alongside forms and other visualizations for analysts.

@neiljp

neiljp commented Jun 2, 2017

Progress today has my flask implementation (locally) working with the library and generating a png in the browser; how would you deploy that? I could try and deploy to a server I have access to, in theory.

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

We don’t have a great deploy process for anything that’s not Heroku yet—it’s very ad-hoc on Amazon EC2. If you can deploy to a server you manage and document the process, that’d be great.

@neiljp

neiljp commented Jun 2, 2017

Apparently Flask works on Heroku; the trick might be installing the other module(s), including one that I built as a binary, though it might not strictly need to be.

@Mr0grog
Member Author

Mr0grog commented Jun 2, 2017

Ah, yeah, binaries can be complicated on Heroku. You have to create a “buildpack”: https://devcenter.heroku.com/articles/buildpacks

@Mr0grog
Member Author

Mr0grog commented Jun 5, 2017

@neiljp Did you get anywhere on this? If not, do you mind posting what code you’ve got somewhere so others can help on this? Thanks!

@neiljp

neiljp commented Jun 6, 2017

@Mr0grog I didn't get any further than getting it to work locally in the end, but have submitted some PRs against the lib I used, and hope to document the process ASAP.

@Mr0grog
Member Author

Mr0grog commented Jun 20, 2017

@neiljp Any updates on this?

@neiljp

neiljp commented Jul 11, 2017

@Mr0grog Apologies, I got swept up in contributing to Zulip after PyCon. I'm now getting back to this, though I note there is other progress?

@Mr0grog
Member Author

Mr0grog commented Jul 11, 2017

@neiljp Yeah, we sorta have a more defined way to do this now. You can add your work as a module in the https://github.com/edgi-govdata-archiving/web-monitoring-processing repo, in the web_monitoring folder. There’s not much documentation on how the built-in diff server there works yet, but you can look at PR #59 in that repo. @danielballan can probably also help you out.

@Mr0grog
Member Author

Mr0grog commented Jul 18, 2017

Hey, @neiljp, just checking in. Any updates or anything I can help with here?

@stale

stale bot commented Jan 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Jan 23, 2019
@Mr0grog
Member Author

Mr0grog commented Jan 23, 2019

Well, this is still pretty critical. It would be lovely to get some help from someone on this, but it does need to get done.

@stale stale bot removed the stale label Jan 23, 2019
@stale stale bot added the stale label Oct 21, 2019
@stale stale bot removed the stale label Oct 21, 2019
@edgi-govdata-archiving edgi-govdata-archiving deleted a comment from stale bot Oct 21, 2019
@0xrishabh

Hey, if the issue is still alive, I would like to contribute.

@Mr0grog
Member Author

Mr0grog commented Jan 2, 2020

Hey @cyph3r1337, that would be great. These days, all the diff-related code lives in the web-monitoring-processing repo in the web_monitoring/diff directory.

You can then make your differ accessible via HTTP by adding it to the server here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L30-L53 Basically, this just maps a part of the URL path to a function. The server will examine your argument names to figure out what to send it. More info on that here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L455-L465
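
To make the "examine your argument names" idea concrete, here is a minimal sketch of that general technique. It is not the actual code from `server.py`: the routing table `DIFF_ROUTES`, the differ `pdf_text_diff`, the helper `call_differ`, and the argument names `a_text`, `b_text`, and `a_url` are all hypothetical stand-ins.

```python
import inspect


def pdf_text_diff(a_text, b_text):
    """Hypothetical differ: reports whether two extracted texts differ."""
    return {"changed": a_text != b_text}


# Hypothetical routing table: maps a URL path segment to a diff function.
DIFF_ROUTES = {"pdf_text": pdf_text_diff}


def call_differ(path_segment, available):
    """Look up the differ for a path segment and pass only the arguments it names."""
    func = DIFF_ROUTES[path_segment]
    wanted = inspect.signature(func).parameters
    # Supply just the values whose names appear in the function's signature.
    kwargs = {name: available[name] for name in wanted if name in available}
    return func(**kwargs)
```

With this shape, adding a new differ would only require writing a function with the argument names it needs and registering it in the table; the dispatcher would not need to change.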

5 participants