Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Radical suggestion: Deprecate this repo and break up pieces #206

Closed
danielballan opened this issue Aug 1, 2018 · 9 comments
Closed

Radical suggestion: Deprecate this repo and break up pieces #206

danielballan opened this issue Aug 1, 2018 · 9 comments
Assignees

Comments

@danielballan
Copy link
Contributor

danielballan commented Aug 1, 2018

I am tempted to propose deprecating web-monitoring-processing and creating several separate repos in its place.

Arguments in favor:

  • "Many small packages" is good for reuse and shared maintenance. Lots of people might be happy to use and potentially contribute to a modern, well-tested Python RESTer for the Internet Archive CDX API, but be uninterested in the other bits.
  • The diffing server and the diffing functions have very different (maybe totally nonintersecting?) dependency sets. It would be nice to attract more contributors to our diffing work, and it might be easier to get up and running if they didn't have to encouter the server code at all.
  • It's hard to properly implement semver [and, more generally, to manage release schedules] on a codebase that has so many loosely coupled pieces. This is why I propose separae repos, with one Python package per repo, as opposed to several Python packages that share one repo. It's better to have one package per repo so that the version on PyPI and the version in the git tag are in sync.
  • This would be a graceful way to give up on some of the old code that we have maintained (well, refrained from deleting at least) in the hope that some hackathon contributors might someday return to expand on them. It's been 18 months.

Arguments against:

  • There are already a lot of repos! And more repos can mean more baseline maintenance work. Cookiecutters (like this one I developed recently) can lower the startup cost, but there is an ongoing maintenance cost to keeping CI up to date, etc.
  • Coordinated releases can be tricky. Currently, if we want to change the API between the diffing server and the diffing functions, we can just do it in one granular step. If we split them up, we have to manage coordinated releases. This is probably not a big problem since we are the only user of this code.
@Mr0grog
Copy link
Member

Mr0grog commented Aug 1, 2018

Note there is another concern not well covered here: scraping (#174) and monitoring (#172). That said, they could maybe be lumped in with a future “Python implementation of the Ruby server.”

My 2¢: I’m not sure I see as much value in splitting it up so granularly. I feel like I see two major areas here…

  1. Diffing Service (which I kind of want to merge into a Python rewrite of the DB server, except I suspect that would be problematic for Internet Archive if they really want to make something serious out of the diffing work here — can you comment, @vbanos?)

  2. Pile of useful tools for loading and analyzing data about web page snapshots (APIs to Internet Archive, our DB, etc.) that we and analysts might use in scripts. I would just keep that here instead of deprecating this.

    • I have no concerns about cleaning out cobwebs and deleting unused & unmaintained stuff here — it can live on in Git’s history just fine.
    • I’m skeptical that we have enough in the IA CDX API code to be worth publishing. Everything south of list_versions() is not relevant, and I’m not even sure list_versions() makes sense outside our context.
    • I’m not sure the DB API module is worth publishing on its own, either — nobody outside of EDGI can really even use it since we still have no API that doesn’t require authentication.

I can see the actual diffing algorithms living in either place. From a practical standpoint so far, they’ve proven to have zero applicability outside our problem space and thus outside the diffing server. The diffing server is also nothing but a wrapper for them (and, more importantly, at least one past experience showed not having the server to go along with the algorithm to be a real barrier). On the other hand, I can at least clearly envision our analysts using the algorithms in their scripts.

It seems like we are finally semi-serious about a Python rewrite of the DB server. I think the scraping and monitoring tools would eventually move there, but they could continue to live here pretty comfortably until that becomes real enough.

@danielballan danielballan self-assigned this Aug 6, 2018
@Mr0grog
Copy link
Member

Mr0grog commented Aug 8, 2018

From last week’s dev meeting, the short-term plan is to break this repo in two:

  • Diffing server + diffing algorithms
  • Toolbox of other stuff, like our API client, Internet Archive CDX client, etc. (which probably stays here in this repo; we’ll just remove everything else that doesn’t fit and keep it in a branch)

With a potential future of splitting up the toolbox stuff here when we see it’s useful to do so.

@danielballan please edit if I got this wrong.

@danielballan
Copy link
Contributor Author

To surface a conversation from private channels:

  • We still plan to do this eventually, and if possible before IA presents this work in October and possibly brings more eyeballs to the project.
  • We are not 100% sure what the right boundaries are to draw, and since splitting a project is much easier than combining projects, we want to wait until we are sure and do it right the first time.

@stale
Copy link

stale bot commented Mar 13, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Mar 13, 2019
@Mr0grog Mr0grog added the idea label Mar 13, 2019
@stale stale bot removed the stale label Mar 13, 2019
@Mr0grog
Copy link
Member

Mr0grog commented Apr 24, 2019

Updating this from some recent discussions… the amount of work and oddball logic involved in #174, combined with questions I’ve seen from people on Wayback’s Slack, have convinced my of @danielballan’s original point that a package dedicated to Wayback would be good. (I’d love to answer some questions I’ve seen with “just use this,” but it’s impractical given how buried it is in this repo.)

Also marking this as never-stale, since I think we’ve at least come to consensus that this should be broken up a bit, if not on exactly how.

@Mr0grog
Copy link
Member

Mr0grog commented Dec 5, 2019

We’ve actually made forward progress with this in #512! Most of what was in internetarchive.py has been extracted into the Wayback package.

@Mr0grog
Copy link
Member

Mr0grog commented Dec 5, 2019

I’ve also made sure that the new analyst task sheet script is in a separate repo (https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/), which puts me in a bit of a mind to at least reorganize things internally here a bit for sanity’s sake. I’m thinking:

web_monitoring/
    tests/
        [same as today]
    diff/
        content_type.py
        differs.py
        diff_errors.py
        html_diff_render.py
        links_diff.py
    diff_server/
        diffing_server.py
    cli/                     # Directory to hold CLI command script stuff
        cli.py               # Entrypoint -- parses args and calls other modules
        ia_healthcheck.py    # Currently in scripts folder
        ia_import.py         # Most of what's currently in cli.py
    __init__.py
    _version.py
    utils.py
    db.py

I don’t think this organization magically solves anything, but I hope it makes the “grab-bag” feeling of this repo a little more navigable, and maybe helps us see where we can or should excise bits into separate places.

Also, to be clear, I’m not suggesting we make each of these folders separate packages. This is just a little bit of organization within the web_monitoring package.

@danielballan
Copy link
Contributor Author

I like this layout. 👍 I agree it will make it easier to grok what the various pieces of this project are.

Mr0grog added a commit that referenced this issue Dec 9, 2019
This reorganizes the content of web_monitoring into a hierarchy of modules for easier management and comprehension. See the discussion in #206 for more.

    scripts/  # Stubs for things in web_monitoring/cli
        annotations_import
        ia_healthcheck
        wm
        wm-diffing-server
    web_monitoring/
        tests/
            [same as today]
        diff/
            content_type.py
            differs.py
            diff_errors.py
            html_diff_render.py
            links_diff.py
        diff_server/
            server.py
        cli/
            cli.py
            ia_healthcheck.py
            ia_import.py
            annotations_import.py
        __init__.py
        _version.py
        utils.py
        db.py

This also drops `filtering.py`, which was vestigial and no longer used.
@Mr0grog
Copy link
Member

Mr0grog commented Mar 25, 2020

I’m going ahead and closing this issue. I think we’ll continue to pull things apart or reorganize where it seems reasonable, but the above re-organization (plus the upcoming reduction in developer time for this project) probably get us as close as is currently reasonable to the main goals of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants