Releases: aryn-ai/sycamore
v0.1.10
This Sycamore release adds support for near-duplicate detection via shingling. It also includes documentation improvements and incremental bug fixes.
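For readers unfamiliar with shingling, the general idea can be sketched in plain Python. This is an illustrative sketch only and does not use or reflect Sycamore's actual API: the function names (`shingles`, `jaccard`, `is_near_duplicate`), the word-level shingle size `k`, and the `threshold` are all assumptions made for the example. Production implementations typically hash the shingles and compare compact sketches rather than full sets.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(x: str, y: str, k: int = 3, threshold: float = 0.7) -> bool:
    """Hypothetical helper: flag two texts whose shingle sets overlap heavily."""
    return jaccard(shingles(x, k), shingles(y, k)) >= threshold


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(doc_a), shingles(doc_b))
```

Here the two sentences share six of eight distinct three-word shingles, so they would be tagged as near duplicates at the assumed threshold of 0.7, while unrelated texts would share few or none.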
What's Changed
- Render schema extraction documentation by @mkyl in #194
- Additional documentation for Schema extraction by @mkyl in #195
- Add async-timeout dependency by @eric-anderson in #198
- Add docstrings to all public document methods so they show up on sycamore.readthedocs.io by @eric-anderson in #197
- Near-Duplicate Detection in Sycamore: Document Tagging and Document Dropping by @alexaryn in #199
- Bump version to v0.1.10. by @bsowell in #200
Full Changelog: v0.1.9...v0.1.10
v0.1.9
This Sycamore release adds improved heuristics for partitioning documents. It also includes a new method of automatically inferring entities to extract from unstructured documents, as well as incremental features and bug fixes.
What's Changed
- Change the default merge size to 256. by @eric-anderson in #178
- Simplify running the http crawler. by @eric-anderson in #180
- Fix text chunking for html importing to improve result quality. by @eric-anderson in #185
- Remove docker_compose and opensearch files. They were moved to quickstart. by @eric-anderson in #183
- Change simple_ingest and s3_ingest to use GTE-small embedding model. by @alexaryn in #169
- Remove unneeded mapping in OpenSearch index settings. by @alexaryn in #186
- Added HTML ingest example. Fixed order in S3 ingester. by @alexaryn in #188
- Simple transform to perform regex replacement on Elements. by @alexaryn in #187
- Update README.md by @jonfritz in #179
- Entity Extraction by @mkyl in #161
- Merging/breaking elements based on heuristics including bbox by @alexaryn in #171
- Update aiohttp and cryptography to address dependabot alerts. by @bsowell in #192
- Bump version to v0.1.9. by @bsowell in #191
Full Changelog: v0.1.8...v0.1.9
v0.1.8
This Sycamore release contains code to build Docker containers as well as small improvements and bug fixes.
What's Changed
- Add take_all operator on docsets. by @bsowell in #140
- Merge in crawler by @eric-anderson in #143
- Speed up 'poetry lock'. by @alexaryn in #147
- Merge after extract_entity so that elements don't exceed size limit. by @alexaryn in #148
- Add docker compose yaml files that run sycamore + crawler + arynai/opensearch + demoui by @eric-anderson in #145
- Add the scripts to dockerize opensearch in a way that works with the other sycamore components by @eric-anderson in #146
- Dockerize sycamore importing by @eric-anderson in #154
- Fix bug in dockerization from merge by @eric-anderson in #164
- Upgrade pyarrow version. by @bsowell in #165
- Bump version for 0.1.8 release. by @bsowell in #166
- Update README. by @austintlee in #167
- Mount data volume to demo-ui container by @pparmar30 in #170
- Lots of improvements to get sort benchmark working better by @eric-anderson in #172
- Move library dependencies back under [tool.poetry.dependencies] by @bsowell in #174
- Fixup dockerization -- skip sycamore library & add build-stamps by @eric-anderson in #175
- Update poetry.lock. by @bsowell in #176
- Update S3 crawler Dockerfile to skip library dependencies and add build stamps by @eric-anderson in #177
New Contributors
- @austintlee made their first contribution in #167
Full Changelog: v0.1.7...v0.1.8
v0.1.7
This Sycamore release adds support for reading JSON and using Azure OpenAI to enrich data and generate vector embeddings. It also includes documentation improvements, improvements to merging and partitioning, new incremental features, and bug fixes.
What's Changed
- Fix docstrings for merge elements and add to docs by @bsowell in #126
- Add documentation for OpenAI and Bedrock embeddings by @bsowell in #125
- Update pyproject.toml to exclude 3.12 from list of Python versions by @bsowell in #122
- Use element merger in examples by @alexaryn in #123
- Merge transform fix by @baitsguy in #128
- Parameter to help with debugging bounding box issues by @bsowell in #137
- Update OpenAI LLM and OpenAIEmbedder to also support Azure OpenAI by @bsowell in #134
- Re-lock poetry config to fix a warning around the 'executing' package by @eric-anderson in #138
- Provide a way to modify the text to embed by @pparmar30 in #132
- Add manifest reader by @pparmar30 in #133
- Add random_sample transformation by @bsowell in #136
- Add a JSON reader by @pparmar30 in #117
- Bump version to 0.1.7 by @bsowell in #139
Full Changelog: v0.1.6...v0.1.7
v0.1.6
This Sycamore release adds basic support for ingesting PPTX files and support for OpenAI and Amazon Bedrock embedding models. It also contains small improvements and bug fixes.
What's Changed
- Fix bug in README.md; opensearch function expects kwargs. by @eric-anderson in #112
- Update partition_pdf parameters to match Unstructured. by @bsowell in #109
- Utility function for writing out elements. by @bsowell in #114
- Inject function name while creating Ray function by @pparmar30 in #116
- Add support for PPTX files by @pparmar30 in #107
- merge_elements transform by @HenryL27 in #115
- Add SerDe for Document by @bohou-aryn in #110
- Add support for OpenAI embeddings. by @bsowell in #118
- Inject function name while creating Ray callables by @pparmar30 in #120
- Add support for Amazon Bedrock embeddings. by @bsowell in #119
- Bump version to v0.1.6. by @bsowell in #121
New Contributors
- @eric-anderson made their first contribution in #112
Full Changelog: v0.1.5...v0.1.6
v0.1.5
This Sycamore release contains small improvements and bug fixes.
What's Changed
- Update end_to_end_tutorials.md by @HenryL27 in #97
- Add notebook tests and utilities to strip output from notebooks. by @bsowell in #95
- Examples to show simple ingestion using Sycamore. by @alexaryn in #98
- Assume a reasonable default; passing None yielded tiny elements. by @alexaryn in #99
- Allow bbox into index. by @alexaryn in #101
- Increase nbmake timeout to address integ test issues. by @bsowell in #104
- Fix example so it makes a proper KNN index. by @alexaryn in #102
- Transform to denormalize specified properties from parents to children by @alexaryn in #103
- Bump version to 0.1.5 by @bsowell in #106
Full Changelog: v0.1.4...v0.1.5
v0.1.4
This Sycamore release contains a variety of small improvements and bug fixes.
What's Changed
- Add demo vid as additional resource by @HenryL27 in #70
- Reword ETL -> data preparation in the docs. by @bsowell in #71
- Quick lint rule to detect references to notion.so in docs by @alexaryn in #69
- Run integ tests on larger runner by @bsowell in #75
- UnstructuredPdfPartitioner to inherit parent properties by @alexaryn in #78
- Fix TypeError in extract_table.py. Resolves #76 by @alexaryn in #80
- Fix up encoding and avoid encode-decode waste. Resolves #82 by @alexaryn in #86
- Add FileMetadataProvider by @baitsguy in #84
- Add and Update API docs by @pparmar30 in #81
- Fix bug of SentenceTransformer exception for subscripting None by @bohou-aryn in #77
- Add support for min partition char length by @pparmar30 in #88
- Pin torch version. by @bsowell in #89
- Standardize the bounding box representation. by @bohou-aryn in #87
- FileWriter Implementation by @bsowell in #38
- Fix RTD publishing by @pparmar30 in #90
- Update the doc building process for RTD by @bsowell in #91
- Bug fix for docset.show() function by @pparmar30 in #93
- Bump the version by @pparmar30 in #92
Full Changelog: v0.1.3...v0.1.4