v0.1.14
This release adds CPU support and OCR to the Sycamore Partitioner, caching to improve performance and lower cost when using Textract for table extraction, an upgrade to Ray 2.10, and more.
What's Changed
- mark rps version as latest rc by @HenryL27 in #291
- Cleanup rewriting - cloning doesn't work by @eric-anderson in #292
- Fix integ test import error. by @bsowell in #293
- Change notebook working directory when running outside container. by @bsowell in #294
- Fix bug in undocumented/untested prefix limiting feature. by @eric-anderson in #295
- Implement CachedTextractTableExtractor by @bohou-aryn in #288
- Upgrade the openai Python library to 1.x and guidance to 0.1.x. by @bsowell in #242
- Reorder partitioner output and fix model loading inefficiency by @bohou-aryn in #277
- Refactor sycamore to apps, lib by @HenryL27 in #296
- add averaged_perceptron_tagger to nltk downloads by @HenryL27 in #301
- fix jupyter bind mount path by @HenryL27 in #302
- Make sure filetype property is already set. by @eric-anderson in #298
- initialize messages index on startup by @HenryL27 in #303
- Add demo UI by @HenryL27 in #300
- Address HTML viewer bug when doing sycamore_crawler_http_sort_all by @alexaryn in #304
- Make SycamorePartitioner runnable on CPUs. by @bsowell in #299
- Get all the containers building and working again. by @eric-anderson in #305
- Switch from Exception to RuntimeError by @eric-anderson in #306
- remove submodule steps from plugin checkout in dockerfile because sub… by @HenryL27 in #309
- Fix dockerfile to work post merge by @eric-anderson in #310
- Add some documentation for NDD: Sketcher at ingestion time. by @alexaryn in #307
- Add sketch() after explode() in all our default pipelines. by @alexaryn in #312
- Add remote processor service by @HenryL27 in #311
- use ADD instead of RUN git clone to checkout git repos by @HenryL27 in #313
- Change from nmslib to faiss everywhere. by @alexaryn in #314
- Add tesseract-ocr to container dependencies. by @bsowell in #316
- compile docs with poetry by @HenryL27 in #317
- Add support for OCR in the Sycamore partitioner. by @bsowell in #315
- Setup query-time NDD: pre-create RPS processors, add to pipelines by @alexaryn in #318
- Changes needed for vanilla build of importer and RPS containers. by @alexaryn in #320
- Add shingles to _source to enable query-time near duplicate detection by @alexaryn in #321
- Fix importer to check for user, apply similar fix to crawlers by @eric-anderson in #322
- Remove obsolete files from the quickstart -> sycamore repo merge. by @eric-anderson in #283
- Upgrade to Ray 2.10.0. by @bsowell in #319
- Upgrade guidance to 0.1.13. by @bsowell in #323
- Remove mypy --explicit-package-bases flag and fix issues. by @bsowell in #324
- Update poetry.lock files based on recent sycamore dependency changes. by @bsowell in #325
- Deal with renamed file. by @alexaryn in #329
- Added -anon switch to S3 crawler for public buckets. by @alexaryn in #327
- add docs for RPS by @HenryL27 in #328
- Add Jupyter notebook to demonstrate query-time NDD. by @alexaryn in #326
- Expand NDD doc into separate file. by @alexaryn in #330
- Bump version to 0.1.14. by @bsowell in #332
- Add .profile to container so that we get poetry python not container python by @eric-anderson in #331
- Update dedup.md by @jonfritz in #334
Full Changelog: v0.1.13...v0.1.14