Merge pull request #47 from remix/v1.0.0

v1.0
remix · Dec 18, 2018 · 8813824
2 parents 8b67dae + f94595c
commit 8813824

Showing 26 changed files with 1,178 additions and 1,170 deletions.
diff --git a/.flake8 b/.flake8
@@ -0,0 +1,12 @@
+[flake8]
+exclude =
+    .eggs
+    .git
+    __pycache__
+    build
+    dist
+    docs
+    scratch
+    venv
+
+max-line-length = 100
diff --git a/.gitignore b/.gitignore
@@ -67,3 +67,7 @@ venv/
 
 scratch/
 .DS_Store
+.pytest_cache
+.ipynb_checkpoints/*
+*.ipynb
+.mypy_cache
diff --git a/.travis.yml b/.travis.yml
@@ -4,14 +4,9 @@
 language: python
 python:
   - 3.6
-  - 3.5
-  - 2.7
-  # TODO(DW) fix
-  # - 3.4
-  # - 3.3
 
 # command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
-install: pip install -U tox-travis flake8
+install: python setup.py install && pip install -U black flake8 mypy
 
 # command to run tests, e.g. python setup.py test
-script: tox && make lint
+script: make test
diff --git a/CONTRIBUTING.rst b/CONTRIBUTING.rst
@@ -76,13 +76,13 @@ Ready to contribute? Here's how to set up `partridge` for local development.
 
    Now you can make your changes locally.
 
-5. When you're done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox::
+5. When you're done making changes, check that your changes pass flake8 and the tests::
 
     $ flake8 partridge tests
     $ python setup.py test or py.test
     $ tox
 
-   To get flake8 and tox, just pip install them into your virtualenv.
+   To get flake8, just pip install it into your virtualenv.
 
 6. Commit your changes and push your branch to GitHub::
 
@@ -101,7 +101,7 @@ Before you submit a pull request, check that it meets these guidelines:
 2. If the pull request adds functionality, the docs should be updated. Put
    your new functionality into a function with a docstring, and add the
    feature to the list in README.rst.
-3. The pull request should work for Python 2.6, 2.7, 3.3, 3.4 and 3.5, and for PyPy. Check
+3. The pull request should work for Python 3.6+. Check
    https://travis-ci.org/remix/partridge/pull_requests
    and make sure that the tests pass for all supported Python versions.
 
@@ -110,5 +110,5 @@ Tips
 
 To run a subset of tests::
 
-$ py.test tests.test_partridge
+$ py.test tests.test_feed
 
diff --git a/HISTORY.rst b/HISTORY.rst
@@ -1,8 +1,31 @@
 History
 =======
 
+1.0.0 (2018-12-18)
+------------------
+
+This release is a combination of major internal refactorings and some minor interface changes. Overall, you should expect your upgrade from pre-1.0 versions to be relatively painless. A big thank you to @genhernandez and @csb19815 for their valuable design feedback.
+
+Here is a list of interface changes:
+
+* The class ``partridge.gtfs.feed`` has been renamed to ``partridge.gtfs.Feed``.
+* The public interface for instantiating feeds is ``partridge.load_feed``. This function replaces the previously undocumented function ``partridge.get_filtered_feed``.
+* A new function has been added for identifying the busiest week in a feed: ``partridge.read_busiest_week``.
+* The public function ``partridge.get_representative_feed`` has been removed in favor of using ``partridge.read_busiest_date`` directly.
+* The public function ``partridge.writers.extract_feed`` is now available via the top level module: ``partridge.extract_feed``.
+
+Miscellaneous minor changes:
+
+* Character encoding detection is now done by the ``cchardet`` package instead of ``chardet``. ``cchardet`` is faster, but may not always return the same result as ``chardet``.
+* Zip files are unpacked into a temporary directory instead of being read directly from the zip. These temporary directories are cleaned up when the feed is garbage collected or when the process exits.
+* The code base is now annotated with type hints and the build runs ``mypy`` to verify the types.
+* DataFrames are cached in a dictionary instead of with the ``functools.lru_cache`` decorator.
+* The ``partridge.extract_feed`` function now writes files concurrently to improve performance.
+
+
 0.11.0 (2018-08-01)
 -------------------
+
 * Fix major performance issue related to encoding detection. Thank you to @cjer for reporting the issue and advising on a solution.
 
 
@@ -23,9 +46,7 @@ History
 0.8.0 (2018-03-14)
 ------------------
 
-* Gracefully handle completely empty files. This change unifies the behavior of reading from a CSV
-with a header only (no data rows) and a completely empty (zero bytes)
-file in the zip.
+* Gracefully handle completely empty files. This change unifies the behavior of reading from a CSV with a header only (no data rows) and a completely empty (zero bytes) file in the zip.
 
 
 0.7.0 (2018-03-09)

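The 1.0.0 notes above say that zip files are unpacked into a temporary directory that is removed when the feed is garbage collected or the process exits. One way that behavior could be achieved is sketched below. This is an illustration under stated assumptions, not Partridge's actual code: the class name `UnpackedZip` is hypothetical, and the use of `weakref.finalize` is an assumption about a reasonable approach.

```python
import os
import shutil
import tempfile
import weakref
import zipfile


class UnpackedZip:
    """Hypothetical helper: unpack a zip into a self-cleaning temp directory."""

    def __init__(self, zip_path: str) -> None:
        self.path = tempfile.mkdtemp()
        # weakref.finalize calls shutil.rmtree when this object is garbage
        # collected, or at interpreter exit if it has not been called by then.
        # The callback only holds the path string, never a reference to self.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.path, ignore_errors=True
        )
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(self.path)

    def filenames(self) -> list:
        """Return the sorted names of the unpacked top-level entries."""
        return sorted(os.listdir(self.path))
```

Registering the cleanup with a finalizer rather than `__del__` means the directory is also removed at interpreter exit, which matches both conditions named in the release notes.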
diff --git a/Makefile b/Makefile
@@ -53,16 +53,20 @@ dependency-graph.png:
 
 dot: dependency-graph.png
 
-lint: ## check style with flake8
-	flake8 partridge tests
+black:
+	black partridge tests
+
+lint: ## check style with black
+	black --check --diff partridge tests
+	flake8
+
+type-check:
+	mypy partridge --ignore-missing-imports
 
 ## run tests quickly with the default Python
-test: lint
+test: lint type-check
 	py.test
 
-test-all: ## run tests on every Python version with tox
-	tox
-
 coverage: ## check code coverage quickly with the default Python
 	coverage run --source partridge -m pytest
 	coverage report -m

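The README changes in this commit describe pruning disconnected data along a dependency graph rooted at `trips.txt`. The idea can be sketched with plain dictionaries. This is a simplified illustration only: Partridge itself works on pandas DataFrames, the function names here are hypothetical, and only two edges of the graph (trips to stop_times, stop_times to stops) are modeled.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def filter_trips(trips: List[Row], service_ids: Set[str]) -> List[Row]:
    """Apply a view to the root node: keep trips with a matching service_id."""
    return [trip for trip in trips if trip["service_id"] in service_ids]


def prune_stop_times(stop_times: List[Row], trips: List[Row]) -> List[Row]:
    """Drop stop_times rows whose trip_id no longer appears in trips."""
    kept_trip_ids = {trip["trip_id"] for trip in trips}
    return [row for row in stop_times if row["trip_id"] in kept_trip_ids]


def prune_stops(stops: List[Row], stop_times: List[Row]) -> List[Row]:
    """Drop stops that are not referenced by any remaining stop_times row."""
    kept_stop_ids = {row["stop_id"] for row in stop_times}
    return [stop for stop in stops if stop["stop_id"] in kept_stop_ids]


if __name__ == "__main__":
    trips = [
        {"trip_id": "t1", "service_id": "weekday"},
        {"trip_id": "t2", "service_id": "saturday"},
    ]
    stop_times = [
        {"trip_id": "t1", "stop_id": "s1"},
        {"trip_id": "t2", "stop_id": "s2"},
    ]
    stops = [{"stop_id": "s1"}, {"stop_id": "s2"}]

    kept_trips = filter_trips(trips, {"weekday"})
    kept_stop_times = prune_stop_times(stop_times, kept_trips)
    kept_stops = prune_stops(stops, kept_stop_times)
    print(kept_stops)
```

Filtering only the root and letting the restriction propagate outward is what keeps every remaining table consistent with the view.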
diff --git a/README.rst b/README.rst
@@ -1,3 +1,4 @@
+=========
 Partridge
 =========
 
@@ -11,9 +12,11 @@ Partridge
 
 Partridge is a Python library for working with `GTFS <https://developers.google.com/transit/gtfs/>`__ feeds using `pandas <https://pandas.pydata.org/>`__ DataFrames.
 
-The implementation of Partridge is heavily influenced by our experience at `Remix <https://www.remix.com/>`__ ingesting, analyzing, and debugging thousands of GTFS feeds from hundreds of agencies.
+Partridge is heavily influenced by our experience at `Remix <https://www.remix.com/>`__ analyzing and debugging every GTFS feed we could find.
+
+At the core of Partridge is a dependency graph rooted at ``trips.txt``. Disconnected data is pruned away according to this graph when reading the contents of a feed.
 
-At the core of Partridge is a dependency graph rooted at ``trips.txt``. Disconnected data is pruned away according to this graph when reading the contents of a feed. The root node can optionally be filtered to create a view of the feed specific to your needs. It's most common to filter a feed down to specific dates (``service_id``), routes (``route_id``), or both.
+Feeds can also be filtered to create a view specific to your needs. It's most common to filter a feed down to specific dates (``service_id``) or routes (``route_id``), but any field can be filtered.
 
 .. figure:: dependency-graph.png
    :alt: dependency graph
@@ -36,57 +39,112 @@ The design of Partridge is guided by the following principles:
 - Do anything other than efficiently read GTFS files into DataFrames
 - Take an opinion on the GTFS spec
 
+
+Installation
+------------
+
+.. code:: console
+
+    pip install partridge
+
+
 Usage
 -----
 
-**Reading a feed**
+**Setup**
 
 .. code:: python
 
-    import datetime
     import partridge as ptg
 
-    path = 'path/to/sfmta-2017-08-22.zip'
+    inpath = 'path/to/caltrain-2017-07-24/'
+
+
+Inspecting the calendar
+~~~~~~~~~~~~~~~~~~~~~~~
+
+
+**The date with the most trips**
+
+.. code:: python
+
+    date, service_ids = ptg.read_busiest_date(inpath)
+    #  datetime.date(2017, 7, 17), frozenset({'CT-17JUL-Combo-Weekday-01'})
+
+
+**The week with the most trips**
+
+
+.. code:: python
+
+    service_ids_by_date = ptg.read_busiest_week(inpath)
+    #  {datetime.date(2017, 7, 17): frozenset({'CT-17JUL-Combo-Weekday-01'}),
+    #   datetime.date(2017, 7, 18): frozenset({'CT-17JUL-Combo-Weekday-01'}),
+    #   datetime.date(2017, 7, 19): frozenset({'CT-17JUL-Combo-Weekday-01'}),
+    #   datetime.date(2017, 7, 20): frozenset({'CT-17JUL-Combo-Weekday-01'}),
+    #   datetime.date(2017, 7, 21): frozenset({'CT-17JUL-Combo-Weekday-01'}),
+    #   datetime.date(2017, 7, 22): frozenset({'CT-17JUL-Caltrain-Saturday-03'}),
+    #   datetime.date(2017, 7, 23): frozenset({'CT-17JUL-Caltrain-Sunday-01'})}
+
+
+**Dates with active service**
+
+.. code:: python
 
     service_ids_by_date = ptg.read_service_ids_by_date(inpath)
 
-    date = datetime.date(2017, 9, 25)
-    service_ids = service_ids_by_date[date]
+    date, service_ids = min(service_ids_by_date.items())
+    #  (datetime.date(2017, 7, 15), frozenset({'CT-17JUL-Caltrain-Saturday-03'}))
 
-    feed = ptg.feed(path, view={
-        'trips.txt': {
-            'service_id': service_ids,
-            'route_id': '12300',
-        },
-    })
+    date, service_ids = max(service_ids_by_date.items())
+    #  (datetime.date(2019, 7, 20), frozenset({'CT-17JUL-Caltrain-Saturday-03'}))
 
-    assert service_ids == set(feed.trips.service_id)
 
-    len(feed.stops)
-    #  88
+**Dates with identical service**
+
+
+.. code:: python
 
-    feed.routes.head()
-    #  route_id agency_id route_short_name route_long_name route_desc  route_type  \
-    #     12300     SFMTA               18     46TH AVENUE        NaN           3
-    #
-    #  route_url route_color route_text_color
-    #        NaN         NaN              NaN
+    dates_by_service_ids = ptg.read_dates_by_service_ids(inpath)
+
+    busiest_date, busiest_service = ptg.read_busiest_date(inpath)
+    dates = dates_by_service_ids[busiest_service]
+
+    min(dates), max(dates)
+    #  datetime.date(2017, 7, 17), datetime.date(2019, 7, 19)
+
+
+Reading a feed
+~~~~~~~~~~~~~~
 
 
-**Extracting a new feed**
 
 .. code:: python
 
-    import partridge as ptg
+    _date, service_ids = ptg.read_busiest_date(inpath)
+
+    view = {
+        'trips.txt': {'service_id': service_ids},
+        'stops.txt': {'stop_name': 'Gilroy Caltrain'},
+    }
+
+    feed = ptg.load_feed(inpath, view)
+
+
+Extracting a new feed
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code:: python
 
-    inpath = 'gtfs.zip'
     outpath = 'gtfs-slim.zip'
 
     date, service_ids = ptg.read_busiest_date(inpath)
+    view = {'trips.txt': {'service_id': service_ids}}
 
-    ptg.writers.extract_feed(inpath, outpath, {'trips.txt': {'service_id': service_ids}})
+    ptg.extract_feed(inpath, outpath, view)
+    feed = ptg.load_feed(outpath)
 
-    assert service_ids == set(ptg.feed(outpath).trips.service_id)
+    assert service_ids == set(feed.trips.service_id)
 
 
 Features
@@ -100,13 +158,6 @@ Features
 -  Handle nested folders and bad data in zips
 -  Predictable type conversions
 
-Installation
-------------
-
-.. code:: console
-
-    pip install partridge
-
 Thank You
 ---------