Releases: aryn-ai/sycamore
v0.1.20
This release refactors Sycamore’s dependencies into extras, so that the dependencies for connectors and local inference (e.g. creating vector embeddings) are installed only when you ask for them. For example, to use the OpenSearch connector, run: pip install sycamore-ai[opensearch]. To run a local vector embedding model, run: pip install sycamore-ai[local-inference]. To do both, run: pip install sycamore-ai[opensearch,local-inference]
Also, this release includes performance and stability improvements.
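The extras named in this release note combine by comma-separating them inside the brackets. As a minimal sketch of that syntax, the helpers below build the requirement string and the matching pip command without running it. The function names are hypothetical and not part of Sycamore; only the package name and the two extras come from the note above.

```python
import sys


def requirement_with_extras(package: str, extras: list[str]) -> str:
    """Build a pip requirement string such as 'pkg[a,b]'.

    Hypothetical helper for illustration; not part of Sycamore.
    """
    if not extras:
        return package
    return f"{package}[{','.join(extras)}]"


def pip_install_command(package: str, extras: list[str]) -> list[str]:
    """Return the argument list for installing the package with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", requirement_with_extras(package, extras)]


# Extras taken from the release note: opensearch and local-inference.
print(requirement_with_extras("sycamore-ai", ["opensearch", "local-inference"]))
```

Passing the list from pip_install_command to subprocess.run would perform the install. When typing the command in a shell such as zsh, quote the requirement so the square brackets are not treated as a glob pattern.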
What's Changed
- Dependencies 1/n: Remove need to restart colab runtime by @bsowell in #728
- Don't require installing neo4j unless it's used. by @eric-anderson in #733
- Handle None cases for element.table = <> by @baitsguy in #735
- Fix materialize + S3 not working. by @eric-anderson in #734
- Fixed neo4j relationship property loading + added support for loading lists and dictionaries as properties by @RitxmSaha in #736
- Handle non-hashable data types in opensearch schema extractor by @baitsguy in #737
- docs: update README.md by @eltociear in #739
- Support concurrent libreoffice executions, fix bug to support s3 source paths in file_format_tools by @baitsguy in #741
- Fix calls to structured outputs so that they can be cached by @RitxmSaha in #738
- fix 'SycamorePartitioner' error message by @HenryL27 in #748
- Fix context test by @eric-anderson in #749
- Enforce the constraint that each cell is only in one spanning cell. by @bsowell in #754
- Add context_params decorator to read args from Context by @baitsguy in #747
- Remove unnecessary tracing code. by @mdwelsh in #752
- Dependencies 2/3: Move connectors to extras. by @bsowell in #740
- Allow any pinecone error on create index by @HenryL27 in #750
- Allow all Exceptions while creating Connector Targets by @karanataryn in #753
- Adding new ETL tutorial by @jonfritz in #751
- Add materialize to the ntsb loader for luna by @eric-anderson in #742
- Add Weaviate notebook by @karanataryn in #757
- Update get_started.rst by @jonfritz in #759
- Update pinecone.md by @jonfritz in #758
- added new document structure + tests by @RitxmSaha in #746
- Dependencies 3/3: Add partitioning extras. by @bsowell in #755
- Dependencies: Remove need to restart colab session for aryn-sdk by @bsowell in #756
- Default llm in transforms by @baitsguy in #760
- Improve materialize by @eric-anderson in #762
- adding neo4j s3 proxy for aura db + split_calls flag for entity and relationship extractor. by @RitxmSaha in #761
- Fix show_pages in Google Colab. by @bsowell in #763
- Jonfritz patch 3 tutorial by @jonfritz in #764
- Fix materialize to work even if it is re-executed on the same documents. by @eric-anderson in #765
- add clear_materialize(path=) by @eric-anderson in #767
- Jonfritz patch 3 consoledocs by @jonfritz in #768
- Update docs with more info on dependencies. by @bsowell in #769
- bump sycamore version to 0.1.20 by @HenryL27 in #770
New Contributors
- @eltociear made their first contribution in #739
Full Changelog: v0.1.19...v0.1.20
v0.1.19
This release adds a materialize operation and enhanced query functionality, along with stability and performance improvements. It also adds an experimental neo4j writer.
What's Changed
- Add comment to MetadataDocument superclass call by @eric-anderson in #607
- Update Copyright Year by @karanataryn in #617
- Merge elements in ntsb test ingest by @baitsguy in #619
- Add github ref name to Helicone logs. by @mdwelsh in #618
- Add local (no-ray) execution mode to speed up lineage development by @eric-anderson in #616
- Integrate LLM Extract logic into Sycamore Transforms by @tranade in #608
- parse html tables better by @HenryL27 in #621
- Change Aryn-SDK Error Message by @karanataryn in #622
- Small update to field_to_value by @tranade in #620
- Jonfritz patch docsupdate by @jonfritz in #624
- Avoid repeat take_all in Eval Pipeline by @aanya-p in #611
- Add Evaluate as Transform by @Soeb-aryn in #487
- Checking in notebook that calls APS to analyze financial document (10k). by @AbhijitP-009 in #626
- Refactor LogicalOperators to use pydantic. by @mdwelsh in #610
- Added Entity Extractor + HierarchicalDocument by @RitxmSaha in #601
- Rename SycamorePartitionerExample.ipynb to ArynPartitionerExample.ipynb by @jonfritz in #628
- Add LLMFilter as a DocSet Transform by @tranade in #623
- Create count_distinct for DocSet by @tranade in #625
- Jonfritz patch 3 update readme by @jonfritz in #629
- Update get_hash_context_file func by @pparmar30 in #603
- Change PDFMiner cache to $HOME/.sycamore/PDFMinerCache. by @mdwelsh in #634
- Add Context.config by @baitsguy in #627
- Fixup git repo from accidental pushes via overrides by @eric-anderson in #636
- Include match and range filter functions by @tranade in #630
- uncap python version for aryn-sdk by @HenryL27 in #638
- Add support to materialize to write documents out to files. by @eric-anderson in #640
- Refactor OpenSearchSchema to be more robust. by @mdwelsh in #639
- reading env variable as suggested and cosmetic changes by @Soeb-aryn in #609
- Fix code execution and trace display in Query UI by @tranade in #646
- Added OpenAI Async client by @RitxmSaha in #632
- A couple of small tweaks to make Sycamore more robust to missing or bogus data. by @mdwelsh in #649
- Add generic traverse by @eric-anderson in #648
- Shift more operations by @tranade in #631
- Refactor Context and support args in Map by @baitsguy in #637
- Changed OpenAI Cache integration test by @RitxmSaha in #651
- Run poetry-lock-all until the dependencies became consistent. by @eric-anderson in #652
- Fix range filter problem by @tranade in #654
- Fix codegen syntax/formatting by @baitsguy in #655
- fix table html parsing edge case by @HenryL27 in #656
- Code executor by @baitsguy in #657
- Switch Luna tracing to use materialize. by @mdwelsh in #653
- Add documentation on output of Aryn Partitioning Service by @MarkLindblad in #633
- Revamp Sycamore Query demo UI. by @mdwelsh in #659
- Adding new docs for Aryn Partitioning Service. Added a gentle introduction to APS docs and rearranged some of the existing APS docs. by @AbhijitP-009 in #660
- Bugfix for query dry-run mode by @baitsguy in #661
- Codegen with traces in UI by @tranade in #658
- Neo4j Writer by @RitxmSaha in #650
- Fixing the title for introduction page and making the 'specifying options' section its own page for APS docs by @AbhijitP-009 in #662
- Jonfritz patch 3 docs update by @jonfritz in #644
- Rename gentle_introduction.md to get_started.md by @jonfritz in #664
- Fixing APS docs to link to right doc also, making specifying options its own page by @AbhijitP-009 in #666
- Initial cut at chat UI. by @mdwelsh in #663
- Changing main title and reordering left pane documentation by @AbhijitP-009 in #669
- updated openai dependency by @RitxmSaha in #665
- Add support for subtasks by @aanya-p in #587
- Jonfritz sycamoredocsupdate by @jonfritz in #671
- Support arbitrary conversion to binary in materialize by @eric-anderson in #672
- implemented boilerplate transforms of documents. by @RitxmSaha in #668
- Add convert_file_to_pdf helper using libreoffice by @baitsguy in #670
- Origin/jonfritz patch 4 docs by @jonfritz in #673
- adding transforms and updating Ntsb demo notebooks by @Soeb-aryn in #645
- Making minor edits to the docs by @AbhijitP-009 in #679
- Fix Docs by @karanataryn in #675
- Add filter docs. by @bsowell in #682
- Change the file writer to create the output directory. by @bsowell in #681
- Add ssl_verify param to aryn-sdk by @HenryL27 in #684
- Add AutoMaterialize by @eric-anderson in #680
- Update Github integ test runner. by @bsowell in #683
- Update tutorial and remove old tutorial from ToC by @jonfritz in #676
- Fix flaky test. by @eric-anderson in #688
- import sycamore does not import ray by @eric-anderson in #687
- Update specifying_options.rst default threshold by @sohamkasar19 in #689
- Fix ArynPartitioner integration test by @MarkLindblad in #691
- Rename lineage files to materialize. by @eric-anderson in #693
- Defer model initialization. by @bsowell in #692
- Fixed a problem in detr_partitioner.py by @afriedman412 in #694
- Fixed an error in file_writer_ray.py by @afriedman412 in #696
- Fill in "gaps" in non-contiguous rows and columns from TATR. by @bsowell in #697
- Updated Entity Extractor + Infrastructure Changes by @RitxmSaha in #677
- corrected choosing beta client for openai by @RitxmSaha in #698
- added extract document structure by @RitxmSaha in #699
- Return an empty table when table transformers don't find a table. by @bsowell in #701
- Fix Override Text Bug by @karanataryn in #690
- SimplePrompt class by @baitsguy in #702
- Add IF_PRESENT reading mode to materialize by @eric-anderson in #703
- More dependencies at runtime. ~2x speedup on import sycamore test by @eric-anderson in #706
- implemented resolve graph entities by @RitxmSaha in #700
- add convert_image function by @HenryL27 in #708
- bump sdk to 0.1.3 by @HenryL27 in #709
- Fix bugs in json-encoding of documents by @eric-anderson in #713
- make sure writers finalize by @HenryL27 in #711
- Remove Override_Text by @karanataryn in #714
- Add Pinecone Source Tag by @karanataryn in #715
- Chunker by ...
v0.1.18
This Sycamore release contains a variety of new features, including interfaces for reading from and writing to vector stores, with implementations for OpenSearch, DuckDB, Elasticsearch, Pinecone, and Weaviate. This release also contains performance enhancements, dependency upgrades, and bug fixes.
This release coincides with the launch of the Aryn Partitioning Service, which provides an endpoint for partitioning PDFs. This service is integrated with Sycamore and free to try at https://www.aryn.ai/get-started.
What's Changed
- Provide better error messages on *Map mis-use by @eric-anderson in #450
- Run poetry lock in openai-proxy with poetry-lock-all.sh by @eric-anderson in #447
- use unstructured in weaviate IT by @HenryL27 in #453
- Allow disabling CUDA via env var by @eric-anderson in #454
- Lazily clean up temp files from DETR partitioner. by @alexaryn in #452
- fix pytest commands in contributing guide by @HenryL27 in #441
- Work around bad interaction between mypy and Python 3.9 by @alexaryn in #456
- Enable metadata by default. by @eric-anderson in #442
- Writer abstraction by @HenryL27 in #451
- Remove temporary file write in DETR partitioner. by @bsowell in #459
- Demo UI feature for Manual Filters and Aggregations on input query by @sohamkasar19 in #460
- convert opensearch writer to use base writer 1/3 by @HenryL27 in #461
- Upgrade torch and Ray. by @bsowell in #462
- handle llm flakiness in convert_timestamp better by @HenryL27 in #455
- [Bug Fix] Demo UI pdf viewer by @sohamkasar19 in #464
- convert weaviate writer to base db writer 2/3 by @HenryL27 in #465
- Adding OpenAITokenizer to sycamore.functions by @Soeb-aryn in #466
- Batch detr inference by @baitsguy in #467
- Add LogTime that logs time trace info via logging by @eric-anderson in #469
- Another approach to CUDA support in Docker, with less bloat. by @alexaryn in #449
- Choose MPS or CUDA automatically. by @alexaryn in #458
- Selectable sizes, autoscaling, better variable names. by @alexaryn in #473
- Fix show_pages to work with MetadataDocuments. by @bsowell in #471
- Force ray down to 2.20.0. by @eric-anderson in #477
- Instrumented more code with TimeTrace decorators. by @alexaryn in #474
- Add opensearch reader by @baitsguy in #476
- Batched sycamore pdf partitioner by @eric-anderson in #478
- Fix typo. by @alexaryn in #482
- Potential memory leak point by @bohou-aryn in #483
- Get integration tests working again. by @eric-anderson in #484
- Make it possible to pass in a schema for property extraction. by @bsowell in #481
- Switch batch at a time to True by default. by @eric-anderson in #485
- convert pinecone writer to base writer 3/3 by @HenryL27 in #468
- Add ArynPartitioner by @MarkLindblad in #470
- Remove old model server endpoint option by @MarkLindblad in #488
- Add DuckDB Writer by @karansampath in #480
- pinecone demo nb by @HenryL27 in #486
- Weaviate Scan by @HenryL27 in #490
- Add DuckDB Scan by @karansampath in #492
- Add filter_elements method on DocSet. by @bsowell in #493
- Support table deserialization from a dictionary. by @bsowell in #494
- Add flatten option to weaviate writer by @HenryL27 in #495
- Add PDFMiner caching by @pparmar30 in #489
- Use examples in few shot entity extraction by @aanya-p in #499
- Write intermediate results in BaseMap by @baitsguy in #498
- Add PDF to Image cache by @pparmar30 in #501
- Add S3 cache implementation, add cache option to Llm by @baitsguy in #503
- Add some writer unit tests by @HenryL27 in #504
- Extract Aryn token from config for ArynPartitioner by @MarkLindblad in #500
- turn off caching by default by @pparmar30 in #506
- Add Pinecone Reader by @karansampath in #502
- Revert "Add PDF to Image cache (#501)" by @pparmar30 in #507
- Disable caching by default by @pparmar30 in #510
- Combine Aryn and Sycamore Partitioners by @MarkLindblad in #497
- Support Aryn Partitioning Service v2 in ArynPartitioner by @MarkLindblad in #513
- Choose CPU as device when using remote ArynPartitioner by @MarkLindblad in #514
- process_batch() defaults table_structure_extractor. by @alexaryn in #515
- Don't start PyTorch thread just doing import. by @alexaryn in #516
- Log x-aryn-call-id in ArynPDFPartitioner by @MarkLindblad in #518
- Move initialization of easyocr Reader to Actor level. by @bsowell in #519
- OpenAI messages interface fix + OpenAI LLM cache fix by @baitsguy in #511
- Bugfix intermediate data by @baitsguy in #521
- Add Elasticsearch Writer by @karansampath in #517
- Fix example Jupyter notebook. by @mdwelsh in #522
- Allow custom names for Map/FlatMap nodes by @baitsguy in #523
- Add DuckDB Demo Notebook by @karansampath in #496
- Add DuckDB Documentation by @karansampath in #520
- Fix Pinecone Test by @karansampath in #524
- Add Aryn Partitioning Service page to docs by @MarkLindblad in #505
- Use printer-style page selection option to batch pages in ArynPartitioner by @MarkLindblad in #528
- Demote missing Aryn config error to debug log level by @MarkLindblad in #536
- Make ArynPartitioner retry on 502 Bad Gateway error by @MarkLindblad in #534
- add weaviate writer docs. Also add connectors to ToC by @HenryL27 in #529
- Add kwargs parameter to a few DocSet operators. by @mdwelsh in #532
- Fix partitioner docstring by @HenryL27 in #537
- Add Pinecone Writer Docs by @karansampath in #530
- Fix broken Dataset Integration Test by @karansampath in #535
- set table text rep to csv rep of table by @HenryL27 in #540
- Add Elasticsearch Demo Notebook and Docs by @karansampath in #538
- Add str methods to Document and Element. by @mdwelsh in #542
- Separate element reordering function into two parts for use in Partitioning Service by @MarkLindblad in #541
- HTTP client that acts more like curl. by @alexaryn in #545
- Change ArynPartitioner API to use use_partitioning_service instead of local by @MarkLindblad in #531
- Move Writer Notebooks to ArynPartitioner by @karansampath in #544
- Add Reader API and move DuckDB by @karansampath in #533
- Add preempt_work flag by @pparmar30 in #548
- Fix off by one error in ArynPartitioner by @MarkLindblad in #539
- Move Pinecone Reader to API by @karansampath in #553
- Revert "Add preempt_work flag" by @MarkLindblad in #551
- Temporarily serialize partition calls with ArynPartitioner when running remotely by @MarkLindblad in https://github.co...
v0.1.17
This Sycamore release contains new writers to the Weaviate and Pinecone vector databases, enhancements to the demo UI, and numerous small features and bug fixes.
What's Changed
- Add Sycamore Partitioner example notebook by @jonfritz in #379
- Various link updates and typo fixes in docs by @hsm207 in #381
- Fix notebook link in docs. by @bsowell in #383
- SummarizeImage example in the SycamorePartitionerExample notebook. by @bsowell in #382
- Jonfritz patch 1 updated description by @jonfritz in #384
- Rename ...Request -> ...Call and ...Response -> ...Reply by @alexaryn in #385
- Responsive Demo UI by @sohamkasar19 in #386
- lineage 1/n: add support for metadata. by @eric-anderson in #387
- Set table object to None when no table is found. by @bsowell in #389
- Fix integration tests. by @eric-anderson in #390
- Add GPT-4o support. by @bsowell in #392
- Updates to Demo UI by @sohamkasar19 in #391
- Updates in demo ui for filtering by @sohamkasar19 in #394
- ensure unique uuids post explode via sequence numbers by @HenryL27 in #398
- Fix model deserialization error by @bohou-aryn in #395
- Convert map.py transforms over to base map. by @eric-anderson in #399
- Convert bbox_merge and mark_misc to new Map style by @eric-anderson in #402
- Add TimeTrace and instrument major pieces of code. by @alexaryn in #388
- Use proxy to provide default settings to the UI by @baitsguy in #404
- Update CONTRIBUTING.md to note that integration tests are currently broken. by @eric-anderson in #403
- FIX: Pdf viewer error by @sohamkasar19 in #406
- Convert Merge from Ray Actor to Ray Task. by @alexaryn in #401
- Tools to look at TimeTrace output. by @alexaryn in #396
- Switch Filter and Embed over to BaseMapTransform by @eric-anderson in #405
- Convert classes over to new *Map classes 3/n by @eric-anderson in #408
- Explicitly enumerate notebooks to automatically test. by @bsowell in #335
- Add timing for OpenSearch writer. by @alexaryn in #407
- Add weaviate writer by @HenryL27 in #400
- Switch from uuid1() to uuid4() in explode. by @bsowell in #410
- Add remote model server support by @MarkLindblad in #397
- Refactor drawing code to support additional formats. by @bsowell in #409
- Adjust assertion for batch_size resource_arg. by @bsowell in #411
- Convert over to *Map 4/n by @eric-anderson in #412
- Fix OOM on CPU by reducing default batch size by @MarkLindblad in #415
- Update poetry lock files by @eric-anderson in #414
- Fix bug in runtests, it always detected changes. Add --force to force tests by @eric-anderson in #413
- Conversion to base map 5/n: Partition by @eric-anderson in #416
- Remove generate_map_class_from_callable -- *Map 6/n by @eric-anderson in #417
- Fix bug again. Make JSON encoding work. by @alexaryn in #418
- TimeTrace: add RSS and improve usability by @alexaryn in #419
- Fix split_and_convert_to_image when some pages have no elements. by @bsowell in #422
- remove empty lists from documents in weaviate writer by @HenryL27 in #421
- Switch spread props & ndd to BaseMap -- *Map 7/n by @eric-anderson in #425
- Add show_pages pdf utility for visualizing pdf partitioning. by @bsowell in #424
- TimeTrace: Fallback to RUSAGE_SELF when RUSAGE_THREAD isn't present. by @bsowell in #426
- Switch extract_schema to BaseMap -- *Map 8/8 by @eric-anderson in #427
- fixed typo by @tranade in #420
- Remove base64 from model server response by @MarkLindblad in #423
- Pin setuptools so that we can run DETR model on GPU on Linux. by @alexaryn in #431
- Prettier output for timetrace files when using ttviz & ttanal by @alexaryn in #430
- Add Detr json serializability test by @MarkLindblad in #429
- add term_frequency transform by @HenryL27 in #432
- Pinecone connector by @HenryL27 in #435
- Soham demo UI updates by @sohamkasar19 in #428
- Enable table extraction to use GPU. by @alexaryn in #434
- Initial checkin of speed performance benchmarking script. by @alexaryn in #433
- Add pinned dependencies to the sycamore and rps repos. by @bsowell in #436
- Fix cross-test error by @eric-anderson in #439
- Force dependency consistency by @eric-anderson in #440
- Add utility code for working with tables and showing them as HTML. by @bsowell in #438
- Demo UI eslint prettier config by @sohamkasar19 in #437
- bump sycamore version to 0.1.17 by @HenryL27 in #443
New Contributors
- @MarkLindblad made their first contribution in #397
- @tranade made their first contribution in #420
Full Changelog: v0.1.16...v0.1.17
v0.1.16
This release contains support in the SycamorePartitioner for extracting table structure and images, as well as a new transform for summarizing images. It also includes a number of bug fixes and enhancements.
What's Changed
- fix ui error when no title is extracted and we're not in ntsb setting by @HenryL27 in #352
- Fix almost all the pyproject.toml and poetry.lock files to have consistent requirements on python dependencies. by @eric-anderson in #345
- Bind mount to convey SSL cert/key to Jupyter & UI by @alexaryn in #349
- Use real SSL certificate for OpenSearch HTTP. by @alexaryn in #353
- copy lib/poetry-lock into containers to make poetry happy by @HenryL27 in #354
- copy lib/poetry-lock into remote-processor-service too. by @HenryL27 in #355
- copy in all of poetry-lock, not just the pyproject files by @HenryL27 in #356
- Update data model for table structure recognition. by @bsowell in #357
- Put token-protected HTTPS proxy in front of UI proxy. by @alexaryn in #359
- Arxiv switched to HTTP for these PDFs; make it work. by @alexaryn in #360
- Add apt update to UI Dockerfiles. by @alexaryn in #361
- Use chown in our copy commands to make sure all files are owned by app by @eric-anderson in #362
- Add TableStructureExtractor interface and TableTransformer impl. by @bsowell in #358
- fix zsh path by @eric-anderson in #367
- Jupyter container improvements by @eric-anderson in #369
- Don't say localhost if it's not going to work. by @alexaryn in #366
- bump deploy timeout for reranking model from 60 to 120 by @HenryL27 in #363
- ingest all ntsb docs, automatically detect docker v not, spread path … by @HenryL27 in #368
- Fix typos in README by @hsm207 in #370
- Fix default prep script when given an empty directory to import by @HenryL27 in #371
- fix typo by @HenryL27 in #372
- Add the ability to summarize images to partitioned docsets. by @bsowell in #365
- Store element bbox as a tuple rather than BoundingBox. by @bsowell in #374
- Jonfritz patch 1 partition update by @jonfritz in #376
- FIX: Error on initiate conversation without a conversation id by @sohamkasar19 in #375
- Add API docs for the SycamorePartitioner and table extraction. by @bsowell in #373
- Fix malformed text from beautiful soup. by @bohou-aryn in #351
- Handle deserializing JSON documents when elements is None. by @bsowell in #377
- Bump sycamore version to 0.1.16 by @bsowell in #378
New Contributors
- @hsm207 made their first contribution in #370
- @sohamkasar19 made their first contribution in #375
Full Changelog: v0.1.15...v0.1.16
v0.1.15
This release adds support for writing DocSets to JSONL files, as well as other incremental features and bug fixes.
What's Changed
- Cache entire Amazon Textract response by @baitsguy in #333
- New query chosen in consultation with Mehul. by @alexaryn in #336
- Fix unit test mocking. by @alexaryn in #338
- Added ability to write JSONL block files. by @alexaryn in #337
- Fix bug in updating a single property and most workarounds for the bug. by @eric-anderson in #341
- Set RPS default version to follow VERSION again by @HenryL27 in #342
- Initial Container ITs by @HenryL27 in #339
- Force to opensearch V2.12.0.0 to make build work by @eric-anderson in #343
- minor fixups to NDD doc by @alexaryn in #346
- Better container integration testing automation by @HenryL27 in #344
- Update NDD notebook with JSON/PDF ingestion options. by @alexaryn in #347
- Bump sycamore version to v0.1.15 by @bsowell in #348
Full Changelog: v0.1.14...v0.1.15
v0.1.14
This release includes CPU support and OCR in the Sycamore Partitioner, caching for better performance and lower cost when using Textract for table extraction, an upgraded version of Ray (2.10), and more.
What's Changed
- mark rps version as latest rc by @HenryL27 in #291
- Cleanup rewriting - cloning doesn't work by @eric-anderson in #292
- Fix integ test import error. by @bsowell in #293
- Change notebook working directory when running outside container. by @bsowell in #294
- Fix bug in undocumented/untested prefix limiting feature. by @eric-anderson in #295
- Implement CachedTextractTableExtractor by @bohou-aryn in #288
- Upgrade the openai Python library to 1.x and guidance to 0.1.x. by @bsowell in #242
- Reorder partitioner output and fix model loading inefficiency by @bohou-aryn in #277
- Refactor sycamore to apps, lib by @HenryL27 in #296
- add averaged_perceptron_tagger to nltk downloads by @HenryL27 in #301
- fix jupyter bind mount path by @HenryL27 in #302
- Make sure filetype property is already set. by @eric-anderson in #298
- initialize messages index on startup by @HenryL27 in #303
- Add demo UI by @HenryL27 in #300
- Address HTML viewer bug when doing sycamore_crawler_http_sort_all by @alexaryn in #304
- Make SycamorePartitioner runnable on CPUs. by @bsowell in #299
- Get all the containers building and working again. by @eric-anderson in #305
- Switch from Exception to RuntimeError by @eric-anderson in #306
- remove submodule steps from plugin checkout in dockerfile because sub… by @HenryL27 in #309
- Fix dockerfile to work post merge by @eric-anderson in #310
- Add some documentation for NDD: Sketcher at ingestion time. by @alexaryn in #307
- Add sketch() after explode() in all our default pipelines. by @alexaryn in #312
- Add remote processor service by @HenryL27 in #311
- use ADD instead of RUN git clone to checkout git repos by @HenryL27 in #313
- Change from nmslib to faiss everywhere. by @alexaryn in #314
- Add tesseract-ocr to container dependencies. by @bsowell in #316
- compile docs with poetry by @HenryL27 in #317
- Add support for OCR in the Sycamore partitioner. by @bsowell in #315
- Setup query-time NDD: pre-create RPS processors, add to pipelines by @alexaryn in #318
- Changes needed for vanilla build of importer and RPS containers. by @alexaryn in #320
- Add shingles to _source to enable query-time near duplicate detection by @alexaryn in #321
- Fix importer to check for user, apply similar fix to crawlers by @eric-anderson in #322
- Remove obsolete files from the quickstart -> sycamore repo merge. by @eric-anderson in #283
- Upgrade to Ray 2.10.0. by @bsowell in #319
- Upgrade guidance to 0.1.13. by @bsowell in #323
- Remove mypy --explicit-package-bases flag and fix issues. by @bsowell in #324
- Update poetry.lock files based on recent sycamore dependency changes. by @bsowell in #325
- Deal with renamed file. by @alexaryn in #329
- Added -anon switch to S3 crawler for public buckets. by @alexaryn in #327
- add docs for RPS by @HenryL27 in #328
- Add Jupyter notebook to demonstrate query-time NDD. by @alexaryn in #326
- Expand NDD doc into separate file. by @alexaryn in #330
- Bump version to 0.1.14. by @bsowell in #332
- Add .profile to container so that we get poetry python not container python by @eric-anderson in #331
- Update dedup.md by @jonfritz in #334
Full Changelog: v0.1.13...v0.1.14
v0.1.13
This release upgrades the Sycamore docker containers to use OpenSearch 2.12 and adds support for SSL. It also includes significant additions to the Sycamore documentation (https://sycamore.readthedocs.io/), and a number of other features and bug fixes.
What's Changed
- Upgrade test workflow to os 2.10 by @baitsguy in #240
- Evaluation code by @baitsguy in #239
- Quickstart: use SSL for all network communication: OpenSearch and Jupyter by @alexaryn in #231
- Update get_started.md by @jonfritz in #241
- Jonfritz patch 1 by @jonfritz in #245
- Upgrade dependencies and address dependabot alerts. by @bsowell in #248
- Update documentation for DocSetWriter and DocSet.write by @bsowell in #246
- Added eval metrics and fixed bugs by @baitsguy in #247
- Fix examples to use https for links to UI by @eric-anderson in #244
- Upgrade opensearch to 2.12 by @HenryL27 in #249
- Straggler comment. by @alexaryn in #250
- Add debug facility to sycamore-opensearch.sh entrypoint script. by @alexaryn in #251
- Address two timing-related problems with SSL/security setup. by @alexaryn in #253
- Henry's fix to detect failure properly for model deployment. by @alexaryn in #254
- Enable DEBUG and NOEXIT environment variables. by @eric-anderson in #255
- Add NOEXIT functionality to die function. by @alexaryn in #257
- By default, disable SSL for Jupyter, to avoid browser cert complaints. by @alexaryn in #256
- Added some debug messages that were missing. by @alexaryn in #259
- Improve recall metrics by @baitsguy in #258
- Suppress parse error that we expect. by @alexaryn in #262
- Remove need for passwordless sudo to run the default import notebook. by @eric-anderson in #264
- Improve http integration test debugging. by @eric-anderson in #263
- Increase default model deployment stability by @HenryL27 in #260
- Modify supplement_text for integrating text from pdfminer by @bohou-aryn in #265
- register model if not found in setup transient by @HenryL27 in #266
- add documentation for reranking by @HenryL27 in #261
- move model and pipeline configurations to python by @HenryL27 in #268
- Bump opensearch ssl startup wait time to 30 tries. by @eric-anderson in #269
- Document map_batch by @eric-anderson in #267
- Cleanup metrics classes + bug fixes by @baitsguy in #271
- Fixes to documentation. gen script to auto-add transforms. by @alexaryn in #272
- Add .sketch() to DocSet to access Sketcher transform directly. by @alexaryn in #273
- Update hybrid_search.md by @jonfritz in #274
- Fix notebooks -- proper protocol, truncate output. by @eric-anderson in #275
- Update docs for schema and property extractors. by @bsowell in #270
- More SSL/container fixes. by @eric-anderson in #276
- Minor doc cleanup: removed not-checked-in files. by @alexaryn in #278
- default opensearch to x86 by @HenryL27 in #279
- build ml-commons locally with correct dependencies by @HenryL27 in #280
- Upgrade the Sycamore version to 0.1.13. by @bsowell in #281
- unset default opensearch platform by @HenryL27 in #282
- Fix bug in dev example. by @eric-anderson in #285
- Added documentation for SSL=1 and general SSL background. by @alexaryn in #286
- Update hardware.md by @jonfritz in #287
- build and install remote processor plugin in opensearch dockerfile by @HenryL27 in #284
- add rps to compose.yaml by @HenryL27 in #289
- lowercase d in docker compose command by @HenryL27 in #290
Full Changelog: v0.1.12...v0.1.13
v0.1.12
This release adds components to Sycamore to enable search and analytics use cases, beyond data preparation. Sycamore can now be deployed using Docker containers, and you can also download the Python libraries for data preparation. The documentation has also been updated to reflect this change in scope.
This release also has other features and bug fixes.
What's Changed
- Correctly handle OpenAI model fallback. by @bsowell in #205
- Upgrade Ray to 2.9.0. by @bsowell in #207
- Convert distance function from average to min. Tune parameters. by @alexaryn in #209
- add nltk download punkt action by @HenryL27 in #211
- Address boundary conditions of sliding window, small docs. Re-tuned. by @alexaryn in #210
- Augment text by @HenryL27 in #208
- Update JSON scan to use Ray JSON reader. by @bsowell in #215
- Add docker image prefix as sycamore-importer. by @bohou-aryn in #216
- Make Textract disabled by default by @bohou-aryn in #218
- Add Aryn trained DETR model for entity detection by @bohou-aryn in #212
- Add some useful utility methods for DocSets. by @bsowell in #217
- Element splitter to prevent text elements with too many tokens. by @alexaryn in #184
- Update to Sycamore documentation for consolidation by @jonfritz in #222
- Jonfritz patch 1 docs by @jonfritz in #224
- Metadata extraction updates by @baitsguy in #220
- Prepare for merge of quickstart into sycamore. by @eric-anderson in #225
- Merge quickstart into sycamore by @eric-anderson in #226
- Endless piles of reformatting to get checks to pass. by @eric-anderson in #227
- Use classic shingles; simplified implementation; added debug; re-tuned by @alexaryn in #214
- Update README.md by @jonfritz in #229
- Jonfritz patch 2 by @jonfritz in #228
- Fix sycamore importer service name by @bohou-aryn in #232
- Jonfritz patch 2 by @jonfritz in #235
- Fix bugs on Deformable-DETR by @bohou-aryn in #236
- Jonfritz patch 1 by @jonfritz in #233
- Create notebook file with default ingest script by @bohou-aryn in #219
- Fix typo in docs, and fix formatting by @eric-anderson in #237
- Bump version to v0.1.12. by @bsowell in #238
Full Changelog: v0.1.11...v0.1.12
v0.1.11
This release removes support for OpenAI's text-davinci-003 model, which will be deprecated on 1/4/24, and replaces it with gpt-3.5-turbo-instruct. All users of Sycamore should upgrade.
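For code that still carries the old model name in its own configuration, the swap described above can be sketched as a small lookup. This is an illustrative assumption about how a caller might handle it, not how Sycamore implements the migration. The mapping and function names are hypothetical; only the one model pair comes from this release note.

```python
# Hypothetical mapping for illustration; the single entry is the pair named in this release note.
MODEL_REPLACEMENTS = {
    "text-davinci-003": "gpt-3.5-turbo-instruct",
}


def resolve_model(name: str) -> str:
    """Return the replacement for a retired model name, or the name unchanged."""
    return MODEL_REPLACEMENTS.get(name, name)


print(resolve_model("text-davinci-003"))
```

Model names that are not in the mapping pass through unchanged, so the lookup is safe to apply to every configured model.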
What's Changed
- Migrate from text-davinci-003 to gpt-3.5-turbo-instruct. by @bsowell in #202
- Bump version to v0.1.11. by @bsowell in #203
Full Changelog: v0.1.10...v0.1.11