Skip to content

ukwa/crawl-log-viewer

Repository files navigation

Crawl Log Viewer

A simple crawl-log-viewer as a standalone web service. Retrieves and filters log streams stored as Kafka topics. Here's an example, showing how you can hover over the status code to get an explanation of what it means:

log-viewer-screenshot

Usage

The service can be deployed via Docker, and needs a configuration file that points it to where the Kafka brokers are, and what topics to inspect.

Once up and running, the view defaults to the previous days activity from the first topic in the configuration. This works great for small crawls, but for bigger crawls you can use the filters.

Filters

The filters use the fnmatch library to provide a simple filtering syntax. Here are some examples, with links that should work if you're running the service locally (as per the Local Development Setup below).

  • Status Code:
    • 404 show Not Found URLs
    • 3* show all redirects
    • -9998 Show all URLs blocked by robots.txt
    • -* show lines with Heritrix's negative status codes (usually errors/problems)
    • [!-]* don't show lines with negative status codes.
  • URL:
  • Hop Path:
    • _ show all seeds (marked in this system with underscores _).
    • P show prerequisites (DNS/robots.txt etc.)
    • P* show prerequisites and any other URLs found via prerequisites.
  • Content Type:

Integrations

The tool is preconfigured to links URLs to the UKWA internal (QA) Wayback service, and if the events are tagged with a source that looks like tid:<NNN>:<URL>, that will link back to the relevant record in our W3ACT curation service.

Linking the other way around, any tool that can lookup when a URL was crawled (e.g. pywb, or OutbackCDX) can then be used to build a link to this service with the appropriate time offset and filters, in order to inspect the details of a particular crawl.

Local Development Setup

You can run and populate a local Kafka service using Docker Compose:

docker-compose up -d kafka

...wait a bit, then...

./populate-test-kafka.sh

If you want to check what's in there, use:

docker-compose up kafka-ui

And go to http://localhost:9000 to look around.

Once there's a Kafka available, you can set up a development environment:

virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt

After which the app can be run like this:

export FLASK_DEBUG=1
FLASK_APP=logs.py flask run

About

A simple web service for viewing crawl logs.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •