Create a service to diff PDF files #36
Comments
I'm looking into this; see how I get on!
@neiljp Awesome, thanks so much!
I have the second library working for all 4 examples you listed. I did have minor issues with the 3rd one, which is large, has an offset top page and had lots of extraneous characters which I needed to figure out how to filter out.
Sure! Go ahead and post them here. If you have this work in a repo, go ahead and link it, too.
Are you on the Archivers Slack group? There’s more “live” conversation there and workflow, process, etc.
I'm generally not on Slack; is there an IRC mirror somewhere?
These are wonderful. 👍 🎉 Unfortunately, I don’t think there is any mirror of the Slack :\
No worries. I should have been clearer that this doesn’t have to be perfect. Even if there are false positives, being able to identify space people can definitely ignore is a big deal. This is super, super helpful.
Hey @neiljp this looks great, thx. Great to have new people stepping in! We have been talking about an IRC bridge for a while but haven't set one up - doh!
@neiljp I’m headed out for the night, but will be back on tomorrow at 9-ish Pacific Time if you are planning to do more work on it. I will also try and sign into the global sprint Gitter.im if you are using that (I did not do a good job of paying attention to it today, sorry). Looking forward to getting this integrated as a running service!
@Mr0grog I'm back and on the gitter chat now.
No. Diffing PDFs is something that we simply haven't had time to get to at all yet. In general, we haven’t found any great diffing services that we can feasibly deploy ourselves, or third-party ones that we can integrate with and whose diff results we can easily display in our own UI alongside forms and other visualizations for analysts.
Progress today has my flask implementation (locally) working with the library and generating a png in the browser; how would you deploy that? I could try and deploy to a server I have access to, in theory.
We don’t have a great deploy process for anything that’s not Heroku yet; it’s very ad-hoc on Amazon EC2. If you can deploy to a server you manage and document the process, that’d be great.
Apparently flask works on heroku; the trick might be installing the other module(s), including one that I built as binary, though might not strictly need to be.
Ah, yeah, binaries can be complicated on Heroku. You have to create a “buildpack:” https://devcenter.heroku.com/articles/buildpacks
@neiljp Did you get anywhere on this? If not, do you mind posting what code you’ve got somewhere so others can help on this? Thanks!
@Mr0grog I didn't get any further than getting it to work locally in the end, but have submitted some PRs against the lib I used, and hope to document the process ASAP.
@neiljp Any updates on this?
@Mr0grog Apologies, I got swept up in contributing to Zulip after PyCon. I'm now getting back to this, though I note there is other progress?
@neiljp Yeah, we sorta have a more defined way to do this now. You can add your work as a module in the https://github.com/edgi-govdata-archiving/web-monitoring-processing repo, in the
Hey, @neiljp, just checking in. Any updates or anything I can help with here?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
Well, this is still pretty critical. It would be lovely to get some help from someone on this, but it does need to get done.
Hey, if the issue is still alive, I would like to contribute.
Hey @cyph3r1337, that would be great. These days, all the diff-related code lives in the web-monitoring-processing repo in the You can then make your differ accessible via HTTP by adding it to the server here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L30-L53 Basically, this just maps a part of the URL path to a function. The server will examine your argument names to figure out what to send it. More info on that here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/e77cc3cb56b9d66c82e3ad59f071d9d12b87254a/web_monitoring/diff_server/server.py#L455-L465
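To illustrate the idea of mapping a URL path to a function and inspecting its argument names, here is a minimal sketch. It is not the actual code in `server.py`; the route names, the two example differs, and the argument names (`a_text`, `b_text`, `a_body`, `b_body`) are assumptions chosen for illustration only.

```python
import inspect


def length_diff(a_text, b_text):
    """Hypothetical differ: reports the change in text length."""
    return {"diff": len(b_text) - len(a_text)}


def identical_bytes(a_body, b_body):
    """Hypothetical differ: reports whether the raw bytes match."""
    return {"diff": a_body == b_body}


# Maps the last part of the URL path to a diffing function (names are made up).
DIFF_ROUTES = {
    "length": length_diff,
    "identical_bytes": identical_bytes,
}


def call_differ(path, available):
    """Call the differ registered under `path`, passing only what it asks for.

    `available` holds every value the server could supply, keyed by name.
    The function's own parameter names decide which of them are sent.
    """
    func = DIFF_ROUTES[path]
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        if name in available:
            kwargs[name] = available[name]
        elif param.default is inspect.Parameter.empty:
            # A required argument that the server has no value for.
            raise ValueError(f"differ {path!r} needs {name!r}")
    return func(**kwargs)
```

With this shape, a new PDF differ would only need to declare the inputs it wants (for example the raw bodies) and be added to the routing table.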
We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.
This should be a simple web service that takes two query arguments:
a: A URL for the “before” version of the PDF
b: A URL for the “after” version of the PDF
It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.
If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.
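As a rough sketch of the interface described above, the following standard-library WSGI app accepts the `a` and `b` query arguments, passes any extra arguments through as options, and returns HTML. The `render_diff` function is a hypothetical placeholder; a real service would download both PDFs and run an actual PDF diffing library there.

```python
import html
from urllib.parse import parse_qs


def render_diff(a_url, b_url, options):
    """Hypothetical placeholder for the real PDF diff.

    It only echoes the two URLs; `options` carries any extra query arguments.
    """
    return "<html><body><p>Diff of {} and {}</p></body></html>".format(
        html.escape(a_url), html.escape(b_url)
    )


def app(environ, start_response):
    """WSGI entry point: expects ?a=<before URL>&b=<after URL>."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    missing = [name for name in ("a", "b") if name not in query]
    if missing:
        body = ("Missing required query argument(s): " + ", ".join(missing)).encode("utf-8")
        start_response("400 Bad Request", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    a_url = query["a"][0]
    b_url = query["b"][0]
    # Anything other than a/b is treated as an optional extra argument.
    options = {key: values[0] for key, values in query.items() if key not in ("a", "b")}

    body = render_diff(a_url, b_url, options).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, `app` can be served with `wsgiref.simple_server.make_server` from the standard library, or mounted in any WSGI-compatible server.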
Some open source libraries for diffing PDFs that might be useful: