Releases: aryn-ai/sycamore
v0.1.30
This Sycamore release contains several bug fixes and improvements.
What's Changed
- Add logging of the full exception in base_writer. by @eric-anderson in #1069
- Fix create_element to not crash on bad element types by @eric-anderson in #1070
- Add docset.take_stream() by @baitsguy in #1071
- Make temporary fix to `split_elements` to avoid exceeding recursion depth due to certain table elements by @MarkLindblad in #1073
- add TableMerger to merge elements docs by @HenryL27 in #1074
- Increase max recursion depth for `split_element`'s `split_one` by @MarkLindblad in #1075
- Merge-elements-LLM-filter by @dhruvkaliraman7 in #1076
- Add support for GPU to similarity. by @austintlee in #999
- Tolerate bad entity extraction. by @eric-anderson in #1078
- move deformable detr safe loading code by @HenryL27 in #1055
- Allow Doc reconstruct via function by @austintlee in #1072
- Add-tokenizer-and-reranking-to-LLM-ExtractEntity by @dhruvkaliraman7 in #1081
- Schema object + entity extraction support by @baitsguy in #1083
- Make ttviz.cpp compile again. by @alexaryn in #1082
- Keep newline in OpenAI Embedder by @dhruvkaliraman7 in #1086
- Changed the default embedding model to openai. by @akarshgupta7 in #1087
- Add Embed at Element Level by @dhruvkaliraman7 in #1084
- Get sycamore.query to work with Schema instead of only OpenSearchSchema by @baitsguy in #1088
- Add hybrid table extractor by @HenryL27 in #1089
- Add map reduce style summarize to handle large texts for summarization. by @austintlee in #1079
- fix max(nothing) bug by @HenryL27 in #1091
- Delay initializing openai client in embedder by @HenryL27 in #1092
- fix materialize on windows by @HenryL27 in #1093
- Add Retries for OpenSearch Writer by @karanataryn in #1085
- Property extraction type cast by @baitsguy in #1095
- Revert overzealous no-rootification by @HenryL27 in #1098
- Add support for Anthropic LLMs. by @bsowell in #1096
- Fix similarity assert condition for LLM Filter by @dhruvkaliraman7 in #1099
- Raise PartitionError with explicit status code. by @alexaryn in #1101
- Add `PartitionError` to `aryn_sdk.partition`'s `__init__.py` by @MarkLindblad in #1102
- Prompt update for property extraction by @baitsguy in #1103
- Add support for parallel read in OpenSearchReader by @austintlee in #1100
- Fix No Root Repetition in Test File by @karanataryn in #1097
- Bump version to 0.1.30. by @bsowell in #1109
Full Changelog: v0.1.29...v0.1.30
v0.1.29
This Sycamore release contains small bug fixes and enhancements.
What's Changed
- when there's no table structure, take the token bbox for the cell bbox by @HenryL27 in #1061
- Disable use of scroll in OpenSearch reader when running KNN queries. by @austintlee in #1062
- Binarize OCR Image to Improve Performance by @karanataryn in #1063
- Fix `split_elements` for table elements with no `elem.table` attribute by @MarkLindblad in #1064
- Fix Extract Schema Empty Return by @karanataryn in #1067
- Bump version to v0.1.29. by @bsowell in #1068
Full Changelog: v0.1.28...v0.1.29
v0.1.28
This release updates doc_ids from UUIDs to NanoIds, adds some document title functionality, and improves stability and performance.
What's Changed
- adding one shot prompting along with multimodal request by @Soeb-aryn in #1023
- Fix query-ui dependency on boto3 and re-lock. by @mdwelsh in #1028
- Updated NTSB queries and ground truth for CIDR-25 paper. by @mdwelsh in #1026
- Add streaming support and tests for query-server. by @mdwelsh in #1027
- Supply element types in output from MarkedMerger. by @alexaryn in #1031
- Fix SummarizeData so that downstream .materialize operations will work. by @mdwelsh in #1030
- add nanoid by @HenryL27 in #1034
- Removed duplicate code in query execution. by @akarshgupta7 in #1035
- Convert docids from UUID to NanoID. by @alexaryn in #1032
- Use NanoIDs in file_scan. by @alexaryn in #1036
- extract table properties prompt & bug fix by @Soeb-aryn in #1037
- Convert DocIDs to UUIDs for Qdrant & Weaviate; unit tests. by @alexaryn in #1038
- heuristics to get title from section headers by @Soeb-aryn in #1033
- updating function in pdf_miner class by @Soeb-aryn in #1041
- Added ragas to compute string metrics for evaluation. by @akarshgupta7 in #1039
- Fix sort so that it works with an unspecified or None default_value. by @eric-anderson in #1040
- Added correctness score to the metrics. by @akarshgupta7 in #1043
- Query planner improvements by @baitsguy in #1046
- Fix materialize to tolerate an empty input directory in ray mode by @eric-anderson in #1045
- PR fix by @baitsguy in #1047
- disable vectorsearch rerank by default in query by @baitsguy in #1048
- vectorsearch planner prompt changes by @baitsguy in #1049
- Make OpenAIEmbedder serializable after client has been initialized. by @bsowell in #1050
- Rename Embedding in ElasticSearch Notebook by @karanataryn in #1051
- Add deformable table extractor by @HenryL27 in #1053
- Add helper for thread local variables that can be used to add metadata to the output stream by @eric-anderson in #1052
- Propagate element level llm_filter output to doc.properties by @baitsguy in #1054
- Handle military clock time (0800) in time standardizer. by @alexaryn in #1056
- Fix incorrect docstring for promote-certain-elements-to-title feature by @MarkLindblad in #1057
- adding parameter for API in sdk and remote_partitioner by @Soeb-aryn in #1042
- bump sycamore version to 0.1.28 by @HenryL27 in #1058
- bump aryn sdk version to 0.1.10 by @HenryL27 in #1059
- don't die if box is None in try_draw_boxes by @HenryL27 in #1060
New Contributors
- @akarshgupta7 made their first contribution in #1035
Full Changelog: v0.1.27...v0.1.28
v0.1.27
This Sycamore release includes a variety of small bug fixes and improvements.
What's Changed
- Bump `aryn-sdk` version to 0.1.9 from 0.1.8 by @MarkLindblad in #1011
- Add plan validation by @baitsguy in #1001
- Sort retrieval docs by score properties if they exist by @baitsguy in #1012
- Add 120k max chars (default) for summarize_data by @baitsguy in #1013
- Queryeval docset write fix by @baitsguy in #1014
- Add notebook file for OpenSearch example by @jonfritz in #1015
- Fix up NTSB queries for query-eval tool. by @mdwelsh in #1016
- Rename from APS to DocParse by @karanataryn in #1017
- enable JSONifying tables by @HenryL27 in #1018
- Fix `aryn-sdk`'s `convert_image_element` example by @MarkLindblad in #1019
- Fix DocParse chunking example in `aryn-sdk` by @MarkLindblad in #1021
- blacksmith.sh: Migrate workflows to Blacksmith by @blacksmith-sh in #1020
- Revert Unit Tests to GitHub Actions by @karanataryn in #1025
- Bump version to 0.1.27. by @bsowell in #1024
Full Changelog: v0.1.26...v0.1.27
v0.1.26
This release includes several stability and reliability improvements.
What's Changed
- skip flaky test by @HenryL27 in #956
- Fix mypy warnings. by @mdwelsh in #947
- Work around hang observed during vcrpy recording. by @alexaryn in #950
- Postprocessing to modify plans returned by llm planner; minor issues with query-ui by @amolvdeshpande in #882
- bump sdk to 0.1.7 by @HenryL27 in #961
- Add HeaderAugmenterMerger by @dhruvkaliraman7 in #946
- Update docs to reflect OpenAIPropertyExtractor->LLMPropertyextractor by @bsowell in #964
- Couple of minor fixes and tweaks to the table merger. by @bsowell in #963
- Enable use_elements in query.summarize_data by @baitsguy in #966
- Fix typo in syntax in docstring for Summarize Images by @jonfritz in #967
- Add missing `tokenizer` argument in `MarkBreakByTokens` docstring by @MarkLindblad in #969
- Add Lots of Connector Unit Tests by @karanataryn in #957
- Add OCR Evaluation Code by @karanataryn in #685
- Fixed query tag check by @baitsguy in #968
- Fix SDK Threshold Bug by @karanataryn in #970
- Add score to each document in OpenSearch query result. by @bsowell in #971
- Fix HeaderAugmenterMerger by @MarkLindblad in #973
- Refactor `mark_bbox_preset` to expose function outside `DocSet` by @MarkLindblad in #972
- Fix `mark_bbox_preset`'s `MarkDropHeaderFooter` parameter by @MarkLindblad in #975
- OpenSearch improvements by @baitsguy in #974
- Adding a separate installation instructions page by @AbhijitP-009 in #977
- Union OCR / PDFMiner Tokens with Table Outputs by @karanataryn in #976
- Make Table Code More Robust by @karanataryn in #979
- fix divide by zero in align_headers by @HenryL27 in #978
- Allow for returning query traces on cached query executions. by @mdwelsh in #959
- Add Enhance Table Option to SDK by @karanataryn in #980
- Bump SDK Version by @karanataryn in #981
- Update Lockfiles by @karanataryn in #920
- Add query planning strategy objects by @baitsguy in #982
- Move tokenized data to device by @baitsguy in #983
- Update vectorsearch query test by @baitsguy in #984
- Integration test for Sycamore Query demo. by @mdwelsh in #985
- Add Closure of Client Connections for Connectors by @karanataryn in #989
- Work around lack of resource module on Windows. by @alexaryn in #962
- Update README.md by @karanataryn in #990
- Merge in Fixes from Luna Demo Deployment by @karanataryn in #992
- Add table-chunker by @dhruvkaliraman7 in #993
- chore: Added back to top , contributors section and star history graph by @samarth29jc in #987
- Return the list of documents referenced in a Luna query. by @mdwelsh in #995
- Sync Locks across all Directories by @karanataryn in #988
- Remove unused code (`_batchify`) by @MarkLindblad in #887
- Don't try to put footers in columns by @HenryL27 in #998
- Docprep notebook testing by @sohamkasar19 in #996
- Add expected documents in query-eval tool by @baitsguy in #997
- Move Aryn DocParse Docs Out of Sycamore by @karanataryn in #994
- Remove seed from rewrite prompt by @baitsguy in #1000
- Fix OpenAI reduce methods to handle Azure deployment names. by @bsowell in #1002
- Add support for custom source parameter for remote Aryn Partitioner by @MarkLindblad in #1003
- Fix mixed samples for schema extraction. by @mdwelsh in #1004
- updating extract table prop by @Soeb-aryn in #1005
- Update Opensearch domain in docprep notebook testing (GHA) by @sohamkasar19 in #1006
- Improve suggested install command by @HenryL27 in #1007
- Fix augment_text docstring by @HenryL27 in #1008
- Add support for using Aryn DocParse chunking from `aryn-sdk` by @MarkLindblad in #1010
- Update sycamore to 0.1.26 by @HenryL27 in #1009
New Contributors
- @amolvdeshpande made their first contribution in #882
- @samarth29jc made their first contribution in #987
Full Changelog: v0.1.25...v0.1.26
v0.1.25
This Sycamore release includes numerous bug fixes for connectors and other transforms. It also includes support for Anthropic LLMs via Amazon Bedrock.
What's Changed
- Sycamore Query evaluation tool. by @mdwelsh in #912
- Luna client local schema (take 2) by @dtecuci in #919
- Fix small bug in client. by @mdwelsh in #923
- Fix DuckDB Spelling Error by @karanataryn in #924
- Make OpenSearchSchema a proper Pydantic model. by @mdwelsh in #922
- Fix typo by @Yashbhatt786 in #927
- Bugfixes: DocumentSource enum serialization and missing element_id in old data by @baitsguy in #928
- Bug fixes: remove kwargs in docset.rerank, sycamore query codegen by @baitsguy in #932
- Add Table Merger by @dhruvkaliraman7 in #880
- Basic Bedrock LLM client. by @mdwelsh in #931
- Accept query plan examples in config by @baitsguy in #934
- Evaluate query plans in query-eval by @baitsguy in #936
- Add local mode support for json scan and json document scan by @bohou-aryn in #925
- Handle Drawing Missing Tables and Cells by @karanataryn in #938
- Support LLM selection in Sycamore Query Client. by @mdwelsh in #935
- Crop To Bbox Error by @karanataryn in #939
- Add plan correctness metrics summary + K in TopK optional by @baitsguy in #940
- don't embed the empty string with openai by @HenryL27 in #943
- Support SummarizeImages with non-OpenAI LLMs. by @bsowell in #941
- Add support for tags and notes. by @mdwelsh in #942
- Create LLMSchemaExtractor and LLMPropertyExtractor classes. by @bsowell in #945
- Don't run embedded weaviate in the unit tests by @HenryL27 in #951
- fix empty strings in section headers by @HenryL27 in #948
- Select pages by @bsowell in #937
- Fixup notebook tests by @eric-anderson in #933
- Use pytest-xdist for unit tests. by @mdwelsh in #952
- Update standardizer.py by @jonfritz in #944
- Fix bugs in Unflattening Data by @karanataryn in #930
- fix materialize bug with s3 filesystem by @eric-anderson in #954
- Bump version to 0.1.25. by @bsowell in #955
New Contributors
- @Yashbhatt786 made their first contribution in #927
Full Changelog: v0.1.24...v0.1.25
v0.1.24
This Sycamore release includes several bug fixes in the Weaviate and DuckDB connectors and in several of the example notebooks. Thanks to @Dnaynu for contributing to the Sycamore documentation!
What's Changed
- fix asdict in the reader too. duh by @HenryL27 in #907
- Add text reprentation for empty tables by @dhruvkaliraman7 in #909
- Refactor logical plan serialization. by @mdwelsh in #905
- microperformance improvement by @HenryL27 in #906
- Bugfix: Handle opensearch reader doc resconstruction when no parent doc in results by @baitsguy in #908
- Fix bug in entity extraction. by @eric-anderson in #911
- added ability to read schema from file by @dtecuci in #904
- Enable copying of the hash context. by @alexaryn in #910
- Add option to extract line-based bounding boxes from pdfminer. by @bsowell in #874
- Support random sample in local mode. by @bsowell in #913
- Opensearch kwargs fix by @baitsguy in #914
- Fix Typo in NTSB Demo by @karanataryn in #917
- Update using_jupyter.md by @jonfritz in #902
- Docs: Typo Fix by @Dnaynu in #918
- Update DuckDB Reader to Package Change by @karanataryn in #916
- Make metadata-extraction.ipynb work by @eric-anderson in #915
- Bump Sycamore version to 0.1.24. by @bsowell in #921
Full Changelog: v0.1.23...v0.1.24
v0.1.23
This is a small release that fixes a bug in the Weaviate writer and includes a few other bug fixes and documentation improvements.
What's Changed
- fix bug in weaviate writer causing api keys to be of wrong type by @HenryL27 in #893
- Expose local easyocr kwargs by @baitsguy in #894
- Fix PDFMiner Output Parsing by @karanataryn in #890
- Allow passing custom ocr object to arynpartitioner by @baitsguy in #895
- Update Elasticsearch Port by @karanataryn in #896
- Update Merger Parameters in Docs by @sohamkasar19 in #897
- Fix Elasticsearch Docs by @karanataryn in #899
- Cleanup Docs by @karanataryn in #900
- Add smaller pdfminer bboxs to large detr bboxs by doing iob and not iou by @dhruvkaliraman7 in #901
- Fix anonymous reading in materialize and add rate limited logging. by @eric-anderson in #898
- Bump version to v0.1.23. by @bsowell in #903
Full Changelog: v0.1.22...v0.1.23
v0.1.22
This Sycamore release includes support for Python 3.12, a connector for the Qdrant vector database, and many bug fixes and enhancements. Thanks to @Anush008 for contributing the Qdrant support!
What's Changed
- bump sdk to 0.1.4 by @HenryL27 in #823
- Fix issue with empty tool response leading to hallucinations. by @mdwelsh in #818
- Fix bug where prompt is modified by OpenAIEntityExtractor. by @mdwelsh in #824
- Fix poetry.lock with missing dependency. by @mdwelsh in #825
- Query trace viewer for Luna demo, and better PDF previews. by @mdwelsh in #828
- Batch Processing Bug Fix by @karanataryn in #829
- Get local mode working 1/n by @eric-anderson in #826
- Changing titles for some posts by @AbhijitP-009 in #827
- Transform to convert Document into Markdown. by @alexaryn in #811
- Fix query trace viewer. by @mdwelsh in #830
- Ingest more fields into OpenSearch schema for NTSB demo. by @mdwelsh in #834
- Fix bug with trace view. by @mdwelsh in #833
- Improved sorting of elements by bbox for one and two columns. by @alexaryn in #801
- Make PDFMiner Pipelined by @karanataryn in #807
- Fix error message on None value passed to DateTimeStandardizer. by @mdwelsh in #835
- Sundry improvements while using luna in a customer. by @eric-anderson in #832
- fix to pass string to tokenizer by @Soeb-aryn in #831
- Some improvements to query plans for Luna demo. by @mdwelsh in #836
- Update requires_modules type annotations to work with mypy. by @bsowell in #837
- Lazily Set Table Text Representation by @karanataryn in #839
- Have Luna use .keyword field for path field. by @mdwelsh in #841
- Add a simple logical query plan compare function by @baitsguy in #840
- Improve luna property handling by @eric-anderson in #842
- Add support for Python 3.12. by @bsowell in #838
- Fix Luna UI to show query plan operators. by @mdwelsh in #847
- bugfix to extract text summaries(dont just randomly assert) by @RitxmSaha in #848
- Ignore bad tables by @MarkLindblad in #849
- Add support for caching intermediate results of Luna queries. by @mdwelsh in #850
- add read.opensearch(reconstruct_document =True) option by @baitsguy in #845
- Fold in query-demo capability to query-ui. by @mdwelsh in #852
- Define parallelism on nodes by @eric-anderson in #853
- Basic documentation for APS markdown option. by @alexaryn in #854
- Implement output_format in Aryn SDK partition_file(). by @alexaryn in #857
- Add `local-inference` extra to `sycamore-ai` dependency in `apps/query-ui`. by @mdwelsh in #859
- Super basic FastAPI wrapper to Sycamore Query. by @mdwelsh in #855
- Support output_format in ArynPartitioner. by @alexaryn in #858
- Fix tile cannot extend outside image by @dhruvkaliraman7 in #856
- Support Jupyter saving to S3 by @eric-anderson in #860
- Add PaddleOCR and Refactor Text Extraction by @karanataryn in #745
- Fix broken test. by @mdwelsh in #863
- Get Local Mode working 2/n by @eric-anderson in #861
- Remove package-mode by @eric-anderson in #865
- Add similarity scoring and rerank transform by @baitsguy in #864
- adding docs for AssignDocProperties, Standardizer and ExtractTableProperties by @Soeb-aryn in #866
- Add newline before text elements. by @alexaryn in #862
- handle file paths in the sdk by @HenryL27 in #869
- Add packaging library to aryn-sdk pyproject.toml. by @bsowell in #870
- Do some escaping of special Markdown characters. by @alexaryn in #867
- fix type annotation for file by @HenryL27 in #871
- Element ordering and test improvements by @baitsguy in #872
- Test fixes and more local mode by @baitsguy in #873
- Add a few more files to .gitignore. by @bsowell in #875
- feat: Qdrant support by @Anush008 in #821
- Get llm_filter to support document structure + similarity sorting for elements by @baitsguy in #876
- Add documentation for Sycamore Query. by @mdwelsh in #878
- Move loaddata script to query-ui. by @mdwelsh in #877
- Remove deprecated query-demo UI. by @mdwelsh in #881
- Adjust Pinecene Docs for Clarity by @karanataryn in #883
- Add source_mode parameter to AutoMaterialize. by @bsowell in #885
- add optimization from training development by @HenryL27 in #886
- Fix documentation link, sentence grammar by @MarkLindblad in #879
- Clean Up Text Extraction by @karanataryn in #868
- Fix Parameter Error in Docs by @karanataryn in #888
- Enable document model in sycamore.query + query-ui improvements by @baitsguy in #884
- Fix parallelism bug. by @eric-anderson in #889
- fix issue when packages and containers do not align at all -> max([]) by @HenryL27 in #891
- Bump version to 0.1.22. by @bsowell in #892
Full Changelog: v0.1.21...v0.1.22
v0.1.21
This Sycamore release contains Aryn Partitioning Service client updates to support the new auto-threshold feature and add support for Microsoft Word (.doc and .docx) and Microsoft PowerPoint (.ppt and .pptx) files. It also contains a variety of bug fixes and stability improvements.
What's Changed
- Fix Lib/Sycamore README by @karanataryn in #771
- Allow custom SycamoreQueryClient in query-ui + cleanup by @baitsguy in #772
- Sycamore changes to support new NTSB demo. by @mdwelsh in #774
- improving ExtractTableProperties and standardizer transforms by @Soeb-aryn in #773
- add materialize to transform toc by @eric-anderson in #779
- Fix Bugs in Sycamore Pipeline by @karanataryn in #777
- New NTSB Luna demo UI. by @mdwelsh in #778
- neo4j, refactor pipeline to not auto resolve entities + add support for images in pipeline. by @RitxmSaha in #766
- Fix issue with duplicate widget keys. by @mdwelsh in #780
- Bug fixes in query path by @baitsguy in #781
- Add querydemo to pyproject.toml. by @mdwelsh in #783
- A few Luna demo fixes. by @mdwelsh in #784
- Bugfixes for context_vars by @baitsguy in #785
- Fix Local Mode Read Bug by @karanataryn in #786
- Make reorder_elements more like sorted() so we can use key= by @alexaryn in #787
- Add new OpenSearch writer notebook by @jonfritz in #788
- Fix function signature reading in contextvars by @baitsguy in #789
- Various Luna Demo fixes. by @mdwelsh in #790
- Making changes to docs. Better titles etc. by @AbhijitP-009 in #793
- Update our container support by @eric-anderson in #782
- Add natural language result flag. by @mdwelsh in #794
- Make Element Class More Robust by @karanataryn in #797
- updated docs to explain the new default threshold setting for ArynPartitioner by @dtecuci in #795
- Add support for pushing query filters down to OpenSearch. by @mdwelsh in #796
- neo4j writer docs by @RitxmSaha in #798
- Docs for nms change (take 2) by @dtecuci in #799
- Fix TableTransformer Bug by @karanataryn in #800
- Remove dead code. by @mdwelsh in #803
- Verify .map can run parallel classes by @eric-anderson in #802
- bugfix to extract graph entities by @RitxmSaha in #805
- Update default threshold values for ArynPartitioner. by @bsowell in #804
- Update type signatures for threshold in aryn_sdk. by @bsowell in #806
- Change Bounding Box Validity Assertion by @karanataryn in #808
- A few Luna demo tweaks. by @mdwelsh in #810
- Couple of Luna demo bugfixes. by @mdwelsh in #814
- Add QueryVectorDatabase to SycamoreQuery by @baitsguy in #813
- Add `.docx` documentation by @MarkLindblad in #812
- Ritam add example notebook by @RitxmSaha in #815
- query-ui: cosmetic changes by @baitsguy in #817
- Improved NTSB ingestion pipeline for Luna demo. by @mdwelsh in #816
- Bump version to v0.1.21. by @bsowell in #819
- Reverts README change to restore poetry build. by @bsowell in #820
- Fix typo scyamore -> sycamore. by @bsowell in #822
Full Changelog: v0.1.20...v0.1.21