A simple crawl-log viewer, provided as a standalone web service, that retrieves and filters log streams stored as Kafka topics. For example, you can hover over a status code to get an explanation of what it means.
The service can be deployed via Docker, and needs a configuration file specifying where the Kafka brokers are and which topics to inspect.
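As a rough illustration, a minimal configuration sketch might look like the following; the actual file format and field names are defined by the service itself, so `brokers` and `topics` here are assumptions rather than the documented schema:

```yaml
# Hypothetical sketch only: field names are illustrative assumptions,
# not the service's documented configuration schema.
brokers: "localhost:9092"
topics:
  - "crawl-log-1"
  - "crawl-log-2"
```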
Once up and running, the view defaults to the previous day's activity from the first topic in the configuration. This works well for small crawls, but for bigger crawls you can use the filters.
The filters use the fnmatch library to provide a simple filtering syntax. Here are some examples, with links that should work if you're running the service locally (as per the Local Development Setup below); a short sketch of how the matching behaves follows the list.
- Status Code
- URL: e.g. `*.webarchive.org.uk` to match URLs against a hostname.
- Hop Path
- Content Type: e.g. `image/*` to show all images.
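For reference, the patterns above are shell-style wildcards as implemented by Python's standard-library `fnmatch` module, not regular expressions. A quick, self-contained check of how they behave:

```python
from fnmatch import fnmatch

# Shell-style wildcard matching, as used by the filters above:
print(fnmatch("www.webarchive.org.uk", "*.webarchive.org.uk"))  # True
print(fnmatch("image/png", "image/*"))                          # True
print(fnmatch("text/html", "image/*"))                          # False
```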
The tool is preconfigured to link URLs to the UKWA internal (QA) Wayback service, and if the events are tagged with a source that looks like `tid:<NNN>:<URL>`, that will link back to the relevant record in our W3ACT curation service.
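As a minimal sketch (assuming `<NNN>` is a numeric W3ACT record ID; the viewer's actual parsing code may differ), a source tag in that form could be picked apart like this:

```python
import re

# Matches source tags of the form tid:<NNN>:<URL> described above.
TID_PATTERN = re.compile(r"^tid:(\d+):(.*)$")

def parse_source(source):
    """Return (record_id, url) if the source tag matches, else None."""
    match = TID_PATTERN.match(source)
    if match:
        return int(match.group(1)), match.group(2)
    return None

print(parse_source("tid:12345:https://example.com/"))  # (12345, 'https://example.com/')
```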
Linking the other way around, any tool that can look up when a URL was crawled (e.g. pywb or OutbackCDX) can be used to build a link to this service with the appropriate time offset and filters, in order to inspect the details of a particular crawl.
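As a hypothetical sketch of that reverse linking (the query parameter names `date` and `filter_url` below are assumptions, not the service's documented API; check a running instance for the actual parameters it accepts):

```python
from urllib.parse import urlencode

def crawl_log_link(base_url, crawl_date, url_pattern):
    """Build a link into the log viewer for a given day and URL filter."""
    params = urlencode({
        "date": crawl_date,         # assumed parameter: day of the crawl, e.g. "2019-01-30"
        "filter_url": url_pattern,  # assumed parameter: an fnmatch pattern for the URL field
    })
    return "%s/?%s" % (base_url.rstrip("/"), params)

# e.g. after a CDX lookup reports the URL was crawled on 2019-01-30:
print(crawl_log_link("http://localhost:5000", "2019-01-30", "*.example.com"))
```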
## Local Development Setup

You can run and populate a local Kafka service using Docker Compose:
```
docker-compose up -d kafka
```
...wait a bit, then...
```
./populate-test-kafka.sh
```
If you want to check what's in there, use:
```
docker-compose up kafka-ui
```
And go to http://localhost:9000 to look around.
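If you want to push an extra test event by hand, a minimal sketch using the kafka-python client could look like this (assuming the `kafka-python` package is installed, the Compose broker is reachable on `localhost:9092`, and a topic name of `crawl-log-1`, all of which may differ in your setup):

```python
from kafka import KafkaProducer

# Send one JSON-encoded test event to the local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("crawl-log-1", value=b'{"url": "https://example.com/", "status_code": 200}')
producer.flush()  # make sure the message is actually delivered before exiting
```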
Once a Kafka service is available, you can set up a development environment:
```
virtualenv -p python3 venv
source venv/bin/activate
pip install -r requirements.txt
```
After which the app can be run like this:
```
export FLASK_DEBUG=1
FLASK_APP=logs.py flask run
```

The app should then be available at http://127.0.0.1:5000/ (Flask's default).