Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

37 implement a quotient filter #111

Merged
merged 16 commits into from
Jan 4, 2024
2 changes: 1 addition & 1 deletion .pylintrc
Original file line number Diff line number Diff line change
Expand Up @@ -535,7 +535,7 @@ function-naming-style=snake_case
#function-rgx=

# Good variable names which should always be accepted, separated by a comma.
good-names=i,j,k,b,f,v,m,n,p,d,hh,st,ex,Run,_
good-names=i,j,k,b,f,v,m,n,p,d,hh,st,ex,Run,_,r,q

# Good variable names regexes, separated by a comma. If names match any regex,
# they will always be accepted
Expand Down
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
# PyProbables Changelog

### Version 0.6.0

* Add `QuotientFilter` implementation; [see issue #37](https://github.com/barrust/pyprobables/issues/37)
* Add `bitarray` implementation

### Version 0.5.9

* Add `py.typed` files so that mypy will find type annotations
Expand Down
35 changes: 35 additions & 0 deletions docs/source/code.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ operations. Bloom Filters guarantee a zero percent false negative rate
and a predetermined false positive rate. Once the number of elements inserted
exceeds the estimated elements, the false positive rate will increase over the
desired amount.

`Further Reading <https://en.wikipedia.org/wiki/Bloom_filter>`__


Expand Down Expand Up @@ -69,6 +70,7 @@ membership testing. Cuckoo filters support insertion, deletion, and lookup of
elements with low overhead and few false positive results. The name is derived
from the `cuckoo hashing <https://en.wikipedia.org/wiki/Cuckoo_hashing>`__
strategy used to resolve conflicts.

`Further Reading <https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>`__

CuckooFilter
Expand All @@ -92,6 +94,7 @@ data elements. The result is a probabilistic count of elements inserted into
the data structure. It will always provide the **maximum** number of times a
data element was encountered. Notice that the result may be **more** than the
true number of times it was inserted, but never fewer.

`Further Reading <https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch>`__


Expand Down Expand Up @@ -137,6 +140,38 @@ StreamThreshold

For more information of all methods and properties, see `CountMinSketch`_.

QuotientFilter
------------------

Quotient filters are an aproximate membership query filter (AMQ) that is both
space efficient and returns a zero false negative rate and a probablistic false
positive rate. Unlike Bloom filters, the quotient filter only requires a single
hash of the element to insert. The upper **q** bits denote the location within the
filter while the lower **r** bits are stored in the filter.

Quotient filters provide some useful benifits over Bloom filters including:

* Merging of two filters (not union)
* Resizing of the filter
* Ability to remove elements

`Further Reading <https://en.wikipedia.org/wiki/Quotient_filter>`__

QuotientFilter
+++++++++++++++++++++++++++++++

.. autoclass:: probables.QuotientFilter
:members:


Utilities
------------------

Bitarray
+++++++++++++++++++++++++++++++

.. autoclass:: probables.utilities.Bitarray
:members:

Exceptions
============================
Expand Down
9 changes: 8 additions & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,14 @@
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

html_theme_options = {
# "collapse_navigation": True,
# "sticky_navigation": True,
# "navigation_depth": 4,
# "includehidden": True,
# "titles_only": False,
}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
.. _home:
.. include:: ../../README.rst


.. toctree::

code
Expand Down
28 changes: 28 additions & 0 deletions docs/source/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,34 @@ The counting cuckoo filter is similar to the standard filter except that it
tracks the number of times a fingerprint has been added to the filter.


Quotient Filters
----------------

Quotient Filters provide set operations of large datasets while being relatively
small in memory footprint. They provide a zero percent false negative rate and a
small false positive rate.
`more information <https://en.wikipedia.org/wiki/Quotient_filter>`__


Import, Initialize, and Train
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python3

>>> qf = QuotientFilter(quotient=22)
>>> with open('war_and_peace.txt', 'r') as fp:
>>> for line in fp:
>>> for word in line.split():
>>> blm.add(word.lower()) # add each word to the bloom filter!


Query the Quotient Filter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python3

>>> words_to_check = ['borzoi', 'diametrically', 'fleches', 'rain', 'foo']
>>> for word in words_to_check:
>>> print(qf.check(word)) # prints: True, True, True, True, False

Custom Hashing Functions
----------------------------------
In many instances, to get the best raw performance out of the data structures,
Expand Down
20 changes: 6 additions & 14 deletions probables/__init__.py
Original file line number Diff line number Diff line change
@@ -1,19 +1,7 @@
""" pyprobables module """

from .blooms import (
BloomFilter,
BloomFilterOnDisk,
CountingBloomFilter,
ExpandingBloomFilter,
RotatingBloomFilter,
)
from .countminsketch import (
CountMeanMinSketch,
CountMeanSketch,
CountMinSketch,
HeavyHitters,
StreamThreshold,
)
from .blooms import BloomFilter, BloomFilterOnDisk, CountingBloomFilter, ExpandingBloomFilter, RotatingBloomFilter
from .countminsketch import CountMeanMinSketch, CountMeanSketch, CountMinSketch, HeavyHitters, StreamThreshold
from .cuckoo import CountingCuckooFilter, CuckooFilter
from .exceptions import (
CuckooFilterFullError,
Expand All @@ -22,6 +10,8 @@
ProbablesBaseException,
RotatingBloomFilterError,
)
from .quotientfilter import QuotientFilter
from .utilities import Bitarray

__author__ = "Tyler Barrus"
__maintainer__ = "Tyler Barrus"
Expand Down Expand Up @@ -50,4 +40,6 @@
"ExpandingBloomFilter",
"RotatingBloomFilter",
"RotatingBloomFilterError",
"QuotientFilter",
"Bitarray",
]
16 changes: 8 additions & 8 deletions probables/blooms/bloom.py
Original file line number Diff line number Diff line change
Expand Up @@ -286,7 +286,7 @@ def export(self, file: Union[Path, str, IOBase, mmap]) -> None:
"""Export the Bloom Filter to disk

Args:
filename (str): The filename to which the Bloom Filter will be written."""
file (str): The file or filepath to which the Bloom Filter will be written."""
if not isinstance(file, (IOBase, mmap)):
file = resolve_path(file)
with open(file, "wb") as filepointer:
Expand Down Expand Up @@ -658,23 +658,23 @@ def close(self) -> None:
self.__file_pointer.close()
self.__file_pointer = None

def export(self, filename: Union[str, Path]) -> None: # type: ignore
def export(self, file: Union[str, Path]) -> None: # type: ignore
"""Export to disk if a different location

Args:
filename (str): The filename to which the Bloom Filter will be exported
file (str|Path): The filename to which the Bloom Filter will be exported
Note:
Only exported if the filename is not the original filename"""
self.__update()
if filename and Path(filename) != self._filepath:
copyfile(self._filepath.name, str(filename))
if file and Path(file) != self._filepath:
copyfile(self._filepath.name, str(file))
# otherwise, nothing to do!

def _load(self, filepath: Union[str, Path], hash_function: Union[HashFuncT, None] = None): # type: ignore
def _load(self, file: Union[str, Path], hash_function: Union[HashFuncT, None] = None): # type: ignore
"""load the Bloom Filter on disk"""
# read the file, set the optimal params
# mmap everything
file = resolve_path(filepath)
file = resolve_path(file)
with open(file, "r+b") as filepointer:
offset = self._FOOTER_STRUCT.size
filepointer.seek(offset * -1, os.SEEK_END)
Expand All @@ -683,7 +683,7 @@ def _load(self, filepath: Union[str, Path], hash_function: Union[HashFuncT, None
fpr, n_hashes, n_bits = self._get_optimized_params(est_els, fpr)
self._set_values(est_els, fpr, n_hashes, n_bits, hash_function)
# setup a few additional items
self.__file_pointer = open(filepath, "r+b") # type: ignore
self.__file_pointer = open(file, "r+b") # type: ignore
self._bloom = mmap(self.__file_pointer.fileno(), 0) # type: ignore
self._on_disk = True

Expand Down
22 changes: 21 additions & 1 deletion probables/hashes.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
from struct import unpack
from typing import Callable, List, Union

from .constants import UINT64_T_MAX
from .constants import UINT32_T_MAX, UINT64_T_MAX

KeyT = Union[str, bytes]
SimpleHashT = Callable[[KeyT, int], int]
Expand Down Expand Up @@ -103,6 +103,26 @@ def fnv_1a(key: KeyT, seed: int = 0) -> int:
return hval


def fnv_1a_32(key: KeyT, seed: int = 0) -> int:
"""Pure python implementation of the 32 bit fnv-1a hash
Args:
key (str): The element to be hashed
seed (int): Add a seed to the initial starting point (0 means no seed)
Returns:
int: 32-bit hashed representation of key
Note:
Uses the lower 32 bits when overflows occur"""
max32mod = UINT32_T_MAX + 1
hval = (0x811C9DC5 + (31 * seed)) % max32mod
fnv_32_prime = 0x01000193
tmp = list(key) if not isinstance(key, str) else list(map(ord, key))
for t_str in tmp:
hval ^= t_str
hval *= fnv_32_prime
hval %= max32mod
return hval


@hash_with_depth_bytes
def default_md5(key: KeyT, *args, **kwargs) -> bytes:
"""The default md5 hashing routine
Expand Down
6 changes: 6 additions & 0 deletions probables/quotientfilter/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
""" Quotient Filters """


from .quotientfilter import QuotientFilter

__all__ = ["QuotientFilter"]
Empty file.
Loading