Radical suggestion: Deprecate this repo and break up pieces #206

danielballan · 2018-08-01T00:51:14Z

I am tempted to propose deprecating web-monitoring-processing and creating several separate repos in its place.

Python RESTer to the Internet Archive CDX API
web-monitoring-db Python RESTer
diffing functions
diffing server, which could gradually grow into the Python implementation of the Ruby server -- see Rewrite this project in Python web-monitoring-db#119

Arguments in favor:

"Many small packages" is good for reuse and shared maintenance. Lots of people might be happy to use and potentially contribute to a modern, well-tested Python RESTer for the Internet Archive CDX API, but be uninterested in the other bits.
The diffing server and the diffing functions have very different (maybe totally nonintersecting?) dependency sets. It would be nice to attract more contributors to our diffing work, and it might be easier to get up and running if they didn't have to encouter the server code at all.
It's hard to properly implement semver [and, more generally, to manage release schedules] on a codebase that has so many loosely coupled pieces. This is why I propose separae repos, with one Python package per repo, as opposed to several Python packages that share one repo. It's better to have one package per repo so that the version on PyPI and the version in the git tag are in sync.
This would be a graceful way to give up on some of the old code that we have maintained (well, refrained from deleting at least) in the hope that some hackathon contributors might someday return to expand on them. It's been 18 months.

Arguments against:

There are already a lot of repos! And more repos can mean more baseline maintenance work. Cookiecutters (like this one I developed recently) can lower the startup cost, but there is an ongoing maintenance cost to keeping CI up to date, etc.
Coordinated releases can be tricky. Currently, if we want to change the API between the diffing server and the diffing functions, we can just do it in one granular step. If we split them up, we have to manage coordinated releases. This is probably not a big problem since we are the only user of this code.

Mr0grog · 2018-08-01T04:44:00Z

Note there is another concern not well covered here: scraping (#174) and monitoring (#172). That said, they could maybe be lumped in with a future “Python implementation of the Ruby server.”

My 2¢: I’m not sure I see as much value in splitting it up so granularly. I feel like I see two major areas here…

Diffing Service (which I kind of want to merge into a Python rewrite of the DB server, except I suspect that would be problematic for Internet Archive if they really want to make something serious out of the diffing work here — can you comment, @vbanos?)
Pile of useful tools for loading and analyzing data about web page snapshots (APIs to Internet Archive, our DB, etc.) that we and analysts might use in scripts. I would just keep that here instead of deprecating this.
- I have no concerns about cleaning out cobwebs and deleting unused & unmaintained stuff here — it can live on in Git’s history just fine.
- I’m skeptical that we have enough in the IA CDX API code to be worth publishing. Everything south of list_versions() is not relevant, and I’m not even sure list_versions() makes sense outside our context.
- I’m not sure the DB API module is worth publishing on its own, either — nobody outside of EDGI can really even use it since we still have no API that doesn’t require authentication.

I can see the actual diffing algorithms living in either place. From a practical standpoint so far, they’ve proven to have zero applicability outside our problem space and thus outside the diffing server. The diffing server is also nothing but a wrapper for them (and, more importantly, at least one past experience showed not having the server to go along with the algorithm to be a real barrier). On the other hand, I can at least clearly envision our analysts using the algorithms in their scripts.

It seems like we are finally semi-serious about a Python rewrite of the DB server. I think the scraping and monitoring tools would eventually move there, but they could continue to live here pretty comfortably until that becomes real enough.

Mr0grog · 2018-08-08T19:24:53Z

From last week’s dev meeting, the short-term plan is to break this repo in two:

Diffing server + diffing algorithms
Toolbox of other stuff, like our API client, Internet Archive CDX client, etc. (which probably stays here in this repo; we’ll just remove everything else that doesn’t fit and keep it in a branch)

With a potential future of splitting up the toolbox stuff here when we see it’s useful to do so.

@danielballan please edit if I got this wrong.

danielballan · 2018-09-14T13:21:40Z

To surface a conversation from private channels:

We still plan to do this eventually, and if possible before IA presents this work in October and possibly brings more eyeballs to the project.
We are not 100% sure what the right boundaries are to draw, and since splitting a project is much easier than combining projects, we want to wait until we are sure and do it right the first time.

stale · 2019-03-13T14:14:13Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog · 2019-04-24T21:09:58Z

Updating this from some recent discussions… the amount of work and oddball logic involved in #174, combined with questions I’ve seen from people on Wayback’s Slack, have convinced my of @danielballan’s original point that a package dedicated to Wayback would be good. (I’d love to answer some questions I’ve seen with “just use this,” but it’s impractical given how buried it is in this repo.)

Also marking this as never-stale, since I think we’ve at least come to consensus that this should be broken up a bit, if not on exactly how.

Mr0grog · 2019-12-05T05:15:00Z

We’ve actually made forward progress with this in #512! Most of what was in internetarchive.py has been extracted into the Wayback package.

Mr0grog · 2019-12-05T05:28:09Z

I’ve also made sure that the new analyst task sheet script is in a separate repo (https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/), which puts me in a bit of a mind to at least reorganize things internally here a bit for sanity’s sake. I’m thinking:

web_monitoring/
    tests/
        [same as today]
    diff/
        content_type.py
        differs.py
        diff_errors.py
        html_diff_render.py
        links_diff.py
    diff_server/
        diffing_server.py
    cli/                     # Directory to hold CLI command script stuff
        cli.py               # Entrypoint -- parses args and calls other modules
        ia_healthcheck.py    # Currently in scripts folder
        ia_import.py         # Most of what's currently in cli.py
    __init__.py
    _version.py
    utils.py
    db.py

I don’t think this organization magically solves anything, but I hope it makes the “grab-bag” feeling of this repo a little more navigable, and maybe helps us see where we can or should excise bits into separate places.

Also, to be clear, I’m not suggesting we make each of these folders separate packages. This is just a little bit of organization within the web_monitoring package.

danielballan · 2019-12-05T21:30:07Z

I like this layout. 👍 I agree it will make it easier to grok what the various pieces of this project are.

This reorganizes the content of web_monitoring into a hierarchy of modules for easier management and comprehension. See the discussion in #206 for more. scripts/ # Stubs for things in web_monitoring/cli annotations_import ia_healthcheck wm wm-diffing-server web_monitoring/ tests/ [same as today] diff/ content_type.py differs.py diff_errors.py html_diff_render.py links_diff.py diff_server/ server.py cli/ cli.py ia_healthcheck.py ia_import.py annotations_import.py __init__.py _version.py utils.py db.py This also drops `filtering.py`, which was vestigial and no longer used.

Mr0grog · 2020-03-25T03:20:45Z

I’m going ahead and closing this issue. I think we’ll continue to pull things apart or reorganize where it seems reasonable, but the above re-organization (plus the upcoming reduction in developer time for this project) probably get us as close as is currently reasonable to the main goals of this issue.

danielballan self-assigned this Aug 6, 2018

danielballan mentioned this issue Sep 14, 2018

Use doctr to deploy built docs to GH pages. #267

Merged

This was referenced Jan 10, 2019

Discuss splitting differs.py into multiple modules #63

Closed

Incorporate deployment.md into published docs. #118

Closed

Mr0grog mentioned this issue Jan 22, 2019

upload web-monitoring-processing into PyPI #168

Closed

stale bot added the stale label Mar 13, 2019

Mr0grog added the idea label Mar 13, 2019

stale bot removed the stale label Mar 13, 2019

Mr0grog added the never-stale label Apr 24, 2019

Mr0grog mentioned this issue May 30, 2019

Use html5-parser instead of lxml #410

Merged

Mr0grog mentioned this issue Dec 5, 2019

Reorganize content of web_monitoring into directories #528

Merged

6 tasks

Mr0grog closed this as completed Mar 25, 2020

Mr0grog mentioned this issue Oct 14, 2020

Extract diffs and diff server into separate repo & Python package #638

Closed

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Radical suggestion: Deprecate this repo and break up pieces #206

Radical suggestion: Deprecate this repo and break up pieces #206

danielballan commented Aug 1, 2018 •

edited

Loading

Mr0grog commented Aug 1, 2018 •

edited

Loading

Mr0grog commented Aug 8, 2018 •

edited

Loading

danielballan commented Sep 14, 2018

stale bot commented Mar 13, 2019

Mr0grog commented Apr 24, 2019

Mr0grog commented Dec 5, 2019

Mr0grog commented Dec 5, 2019

danielballan commented Dec 5, 2019

Mr0grog commented Mar 25, 2020

Radical suggestion: Deprecate this repo and break up pieces #206

Radical suggestion: Deprecate this repo and break up pieces #206

Comments

danielballan commented Aug 1, 2018 • edited Loading

Mr0grog commented Aug 1, 2018 • edited Loading

Mr0grog commented Aug 8, 2018 • edited Loading

danielballan commented Sep 14, 2018

stale bot commented Mar 13, 2019

Mr0grog commented Apr 24, 2019

Mr0grog commented Dec 5, 2019

Mr0grog commented Dec 5, 2019

danielballan commented Dec 5, 2019

Mr0grog commented Mar 25, 2020

danielballan commented Aug 1, 2018 •

edited

Loading

Mr0grog commented Aug 1, 2018 •

edited

Loading

Mr0grog commented Aug 8, 2018 •

edited

Loading