Skip to content

Latest commit

 

History

History
57 lines (35 loc) · 4.32 KB

README.MD

File metadata and controls

57 lines (35 loc) · 4.32 KB

Parallel File Downloader

Installation

  1. Clone this repository.
  2. Get python3.6+.
  3. Run python3 -m virtualenv venv.
  4. Run the virtualenv activation script.
    • On Ubuntu, this is source venv/bin/activate
    • On Windows, you can run the bash script under venv/bin
  5. Run pip install -r requirements.txt to fetch the external modules.
  6. Set the relative download directory for all files in constants.py.
  7. To learn how to use the program, run ./downloader -h. For reference, usage is: ./downloader.py URL -c nThreads.

Assumptions

The intended behavior of the program is to download a file concurrently from an HTTP server, that ideally supports multi-part requests using byte-ranges. The file server must set the Content-Length in addition, to determine the size of the file prior to actually downloading it.

If either the file size cannot be determined using a HEAD HTTP request, or if the server does not support multi-part requests, the file will be downloaded serially.

Design Choices

I am using Python's ThreadPoolExecutor that uses the multithreaded concurrency model to download roughly equally sized parts of the file concurrently (if possible).

Even though Python has a Global Interpreter Lock (GIL) that forces threads to be interleaved on a single CPU as opposed to multiple cores, since file downloading is an I/O-heavy operation, this should still result in significant download time improvements relative to single-threaded downloading, as most of the time is spent waiting for data to arrive through the network.

Each part of the file is also streamed(a.k.a. "fetched in chunks") in the event that each part is too large to store in memory. We download a page at a time (4KiB), and store it to disk; this process continues until the entire chunk is downloaded.

Once the downloading completes, all the files are merged. This is done by appending each file after the first to the first file.

This merging process can be optimized, but I decided not to implement it, assuming that bottleneck on a server is not disk speed but network bandwidth.

Performance Bottlenecks

I have kept the maximum concurrent requests capped at 4, in order to prevent throttling from the server, and for this to be considered a denial-of-service attack. Most browsers have this limit set this to a number between 8 and 13 when fetching files.

If the number of threads is very high (possibly due to an unthinking user), the overhead incurred in switching between so many threads, as well as opening many simultaneous TCP connections, will lead to a significant slowdown.

Testing

Initially, I used manual testing, by downloading files as large as 240MiB, with a varying number of threads (1-8) and diffing them with the original file. The largest file I tested, showed a highest speedup of about 5.4x with 8 threads (137.9 seconds vs. 25.4 seconds).

Although the downloading process is limited by network bandwidth, the reason we see a speed-up is since the server treats different TCP connections separately.

I have also created a testing script (using the pytest and the subprocess modules) that invokes this script with a list of file URLs and and diffs each file with the output of wget.

The byte range generating function has been unit tested, to ensure the ranges are roughly equal.

To run the tests, run pytest -s -v.

Potential Improvements / Optimizations

  • The file merging process can be done in O(log n) time where n is the number of chunks of the file, as instead of merging serially, we merge consecutive blocks in parallel to minimize the on-disk I/O.

  • If this downloader is used multiple times, we can cache files that are downloaded multiple times, by keeping track of downloaded files in a persistent store (e.g. SQL database).

  • It would be nice if this downloader could support pause / resume semantics, similar to how deep learning packages can resume model training by reading past weights from a file.

  • If the downloading of multiple files needs to be supported, it probably makes sense to parallelize the download of each file, as opposed to chunks of files. This tool could support a configuration file / job description language which is parsed to process a list of files, and options for how they should be downloaded, including priority levels.

  • This tool could be designed to support other protocols such as SSH and FTP.