Reformatting from running pre-commit. (#237)
* .ipnb -> .ipynb typo fix in two files
* Reformatting from running pre-commit
eric-anderson authored Feb 8, 2024
1 parent 830f202 commit 9dc1cd1
Showing 11 changed files with 26 additions and 28 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -28,15 +28,15 @@ You can easily deploy Sycamore locally or on a virtual machine using Docker.

With Docker installed:

-1. Clone the Sycamore repo:
+1. Clone the Sycamore repo:

```git clone https://github.com/aryn-ai/sycamore```

2. Set OpenAI Key:

```export OPENAI_API_KEY=YOUR-KEY```

-3. Go to:
+3. Go to:

```/sycamore```

@@ -1,10 +1,10 @@
# Data Preparation Concepts

-You can use the [default data preparation code](../../../notebooks/default-prep-script.ipnb) to segment, process, enrich, embed, and load your data into Sycamore. This runs automatically when using the [crawlers to load data](..//load_data.md#using-a-crawler), and is used in the [Get Started examples](../welcome_to_sycamore/get_started.md). However, to get the best results on complex data, you will likely need to write custom code specific for your data to prepare it for search and analytics.
+You can use the [default data preparation code](../../../notebooks/default-prep-script.ipynb) to segment, process, enrich, embed, and load your data into Sycamore. This runs automatically when using the [crawlers to load data](..//load_data.md#using-a-crawler), and is used in the [Get Started examples](../welcome_to_sycamore/get_started.md). However, to get the best results on complex data, you will likely need to write custom code specific for your data to prepare it for search and analytics.

Sycamore provides a toolkit for data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of your data. It uses your choice of generative AI models to make these operations simple and effective, and it enables quick experimentation and iteration. You write your data preparation code in Python, and Sycamore uses Ray to easily scale as your workloads grow.

-Sycamore data preparation code uses the concepts below, and available transforms are [here](/transforms.rst). Also, as an example, you can view the code for the default data preparation code [here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipnb) and learn more about how to run your custom code [here](/running_a_data_preparation_job.md).
+Sycamore data preparation code uses the concepts below, and available transforms are [here](/transforms.rst). Also, as an example, you can view the code for the default data preparation code [here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipynb) and learn more about how to run your custom code [here](/running_a_data_preparation_job.md).

## Sycamore data preparation concepts

@@ -28,7 +28,7 @@ docset = context.read\
A Document is a generic representation of an unstructured document in a format like PDF or HTML. Though different types of Documents may have different properties, they all contain [the following common fields](https://sycamore.readthedocs.io/en/stable/APIs/data/data.html#sycamore.data.document.Document):

* **binary_representation:** The raw content of the document. May not be present in elements after partitioning of non-binary inputs such as HTML.

* **doc_id:** A unique identifier for the Document. Defaults to a UUID.

* **elements:** A list of elements belonging to this Document. If the document has no elements, for example before it is chunked, this field will be [].
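
As a quick illustration of the fields listed above, here is a minimal sketch (assuming a local `sycamore` install and a sample PDF; this snippet is not part of the commit):

```python
# Sketch: inspect the common Document fields described above.
# Assumes the sycamore library and a sample PDF; illustrative only.
import sycamore

context = sycamore.init()
docset = context.read.binary("sample.pdf", binary_format="pdf")

for doc in docset.take(2):                   # materialize a couple of Documents
    print(doc.doc_id)                        # unique identifier, defaults to a UUID
    print(type(doc.binary_representation))   # raw content of the source file
    print(doc.elements)                      # [] until the Document is chunked
```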
@@ -8,4 +8,3 @@ Information on supported generative AI models for each operation are in the spec
* [Entity extraction](/transforms/extract_entity.md)
* [Schema extraction](/transforms/extract_schema.md)
* [Summarize](/transforms/summarize.md)
-
@@ -12,4 +12,4 @@ For certain PDF processing operations, you also need to install poppler, which

`brew install poppler`

-For an example Sycamore script, check out the [default preparation script](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipnb).
+For an example Sycamore script, check out the [default preparation script](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipynb).
docs/source/data_ingestion_and_preparation/load_data.md (2 changes: 1 addition & 1 deletion)
@@ -49,4 +49,4 @@ To copy a local HTML file, run:

## Use data preparation libraries to load data

-You can write data preparation jobs [using the Sycamore libraries](/installing_sycamore_libraries_locally.md) directly or [using Jupyter](/using_jupyter.md) and [load this data into your Sycamore stack](/running_a_data_preparation_job.md).
+You can write data preparation jobs [using the Sycamore libraries](/installing_sycamore_libraries_locally.md) directly or [using Jupyter](/using_jupyter.md) and [load this data into your Sycamore stack](/running_a_data_preparation_job.md).
@@ -19,7 +19,7 @@ The easiest way to run your data preparation code is to use the Jupyter notebook

## Using the Sycamore-Importer container

-You can also copy your code to the Sycamore-Importer container and run it there. However, we don’t recommend this method, and instead we suggest you use the Jupyter methods above. If you do copy your file to the Sycamore-Importer container, we recommend you save it to `/app/.scrapy` so it persists.
+You can also copy your code to the Sycamore-Importer container and run it there. However, we don’t recommend this method, and instead we suggest you use the Jupyter methods above. If you do copy your file to the Sycamore-Importer container, we recommend you save it to `/app/.scrapy` so it persists.

1. Copy your file to the Sycamore-Importer container:

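A hedged sketch of what this copy step typically looks like (the container name is an assumption, not taken from the diff):

```bash
# Sketch: copy a local script into the importer container's persistent path.
# "sycamore-importer" is an assumed container name; check `docker ps` for yours.
docker cp your-file-name.py sycamore-importer:/app/.scrapy/
```
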
@@ -60,4 +60,3 @@ os_client_args = {
To run your Sycamore job:

```python /path/to/your-file-name.py```
-
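
For context on the truncated `os_client_args` block above, a configuration for a local stack often looks roughly like the following (all values are assumptions to adapt to your deployment):

```python
# Illustrative OpenSearch client settings for a local Sycamore stack.
# Host, port, and credentials are assumptions; match your deployment.
os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "use_ssl": True,
    "verify_certs": False,
    "http_auth": ("admin", "admin"),
}
```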
@@ -22,4 +22,4 @@ If you already have custom data preparation code, the easiest way to run it is u

## Running Jupyter locally

-You can run Jupyter locally and load the output of your data preparation script into your Sycamore stack. The OpenSearch client configuration must match the endpoint of your Sycamore stack. For instructions on how to install and configure Jupyter locally for Sycamore, [click here](/sycamore-jupyter-dev-example.md#in-your-local-development-environment). For an example script, [click here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/sycamore_local_dev_example.ipynb).
+You can run Jupyter locally and load the output of your data preparation script into your Sycamore stack. The OpenSearch client configuration must match the endpoint of your Sycamore stack. For instructions on how to install and configure Jupyter locally for Sycamore, [click here](/sycamore-jupyter-dev-example.md#in-your-local-development-environment). For an example script, [click here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/sycamore_local_dev_example.ipynb).
docs/source/index.rst (8 changes: 4 additions & 4 deletions)
@@ -31,23 +31,23 @@ You can easily deploy Sycamore locally or on a virtual machine using Docker.

With Docker installed:

-1. Clone the Sycamore repo:
+1. Clone the Sycamore repo:

``git clone https://github.com/aryn-ai/sycamore``

2. Set OpenAI Key:

``export OPENAI_API_KEY=YOUR-KEY``

-3. Go to:
+3. Go to:

``/sycamore``

4. Launch Sycamore. Containers will be pulled from DockerHub:

``docker compose up --pull=always``

-5. The Sycamore demo query UI is located at:
+5. The Sycamore demo query UI is located at:

``http://localhost:3000/``

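Taken together, steps 1 through 5 amount to the following shell session (a sketch; substitute your own key for YOUR-KEY):

```bash
# Quickstart sketch assembled from the five steps above.
git clone https://github.com/aryn-ai/sycamore
export OPENAI_API_KEY=YOUR-KEY
cd sycamore
docker compose up --pull=always
# Then open http://localhost:3000/ for the demo query UI.
```
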
@@ -118,7 +118,7 @@ More Resources
/querying_data/hybrid_search.md
/querying_data/integrate_your_application.md
/querying_data/generative_ai_configurations.md



.. toctree::
docs/source/tutorials/sycamore-jupyter-dev-example.md (4 changes: 2 additions & 2 deletions)
@@ -238,12 +238,12 @@ print("Visit http://localhost:3000 and use the", index, " index to query these r
```

2. Once the data is loaded into Sycamore, you can use the [demo query UI](../querying_data/demo_query_ui.md) for conversational search on it.
-- Using your internet browser, visit `http://localhost:3000`.
+- Using your internet browser, visit `http://localhost:3000`.
- Make sure the index selected in the dropdown at the bottom of the UI has the same name you provided in step 1j
- Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add convo" icon on the right of the text box.
- As a sample question, you can ask "Who wrote Attention Is All You Need?"

-The results of the hybrid search are in the right hand panel, and you can click through to find the highlighted passage (step 1c enabled this).
+The results of the hybrid search are in the right hand panel, and you can click through to find the highlighted passage (step 1c enabled this).

Though we are getting good results back from hybrid search, it would be nice if we could have the titles and other information for each passage. In the next section, we will iterate on our Sycamore job, and use a Sycamore transform that leverages generative AI to extract some metadata.

docs/source/welcome_to_sycamore/get_started.md (16 changes: 8 additions & 8 deletions)
@@ -8,7 +8,7 @@ Sycamore is deployed using Docker, and you can launch it locally or on a virtual

`git clone https://github.com/aryn-ai/sycamore`

-2. Create OpenAI Key for LLM access. Sycamore’s default configuration uses OpenAI for RAG and entity extraction. You can create an OpenAI account [here](https://platform.openai.com/signup), or if you already have one, you can retrieve your key [here](https://platform.openai.com/account/api-keys).
+2. Create OpenAI Key for LLM access. Sycamore’s default configuration uses OpenAI for RAG and entity extraction. You can create an OpenAI account [here](https://platform.openai.com/signup), or if you already have one, you can retrieve your key [here](https://platform.openai.com/account/api-keys).

3. Set OpenAI Key:

@@ -22,7 +22,7 @@ Sycamore is deployed using Docker, and you can launch it locally or on a virtual

`Docker compose up --pull=always`

-Note: You can alternately remove the `--pull=always` and instead run `docker compose pull` to control when new images are downloaded. `--pull=always` guarantees you have the most recent images for the specified version.
+Note: You can alternately remove the `--pull=always` and instead run `docker compose pull` to control when new images are downloaded. `--pull=always` guarantees you have the most recent images for the specified version.

Congrats – you have launched Sycamore! Now, it’s time to ingest and prepare some data, and run conversational search on it. Continue on to the next section to do this with a sample dataset or a website that you specify.

@@ -40,15 +40,15 @@ Sycamore’s default data ingestion and preparation code can optionally use [Ama

If you have started Sycamore already, you'll need to restart it after following these instructions.

-1. If you do not have an AWS account, sign up [here](https://portal.aws.amazon.com/billing/signup). You will need this during configuration.
+1. If you do not have an AWS account, sign up [here](https://portal.aws.amazon.com/billing/signup). You will need this during configuration.

-2. Create an Amazon S3 bucket in your AWS account in the us-east-1 region for use with Textract (e.g. `s3://username-textract-bucket`). We recommend you set up bucket lifecycle rules that automatically delete files in this bucket, as the data stored here is only needed temporarily during a Sycamore data processing job.
+2. Create an Amazon S3 bucket in your AWS account in the us-east-1 region for use with Textract (e.g. `s3://username-textract-bucket`). We recommend you set up bucket lifecycle rules that automatically delete files in this bucket, as the data stored here is only needed temporarily during a Sycamore data processing job.

3. Enable Sycamore to use Textract by setting the S3 prefix/bucket name for Textract to use:

-`export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name-here`
+`export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name-here`

-4. Configure your AWS credentials. You can enable AWS SSO login with [these instructions](https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso), or you can use other methods to set up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.
+4. Configure your AWS credentials. You can enable AWS SSO login with [these instructions](https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso), or you can use other methods to set up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.

If using AWS SSO:

@@ -85,7 +85,7 @@ Sycamore will automatically start processing the new data. The processing job is

3. Use the demo query UI for conversational search. Using your internet browser, visit: `http://localhost:3000`. You can interact with the demo query UI while data is being added to the index, but the data won't all be available until the job is done. How to use the UI:

-* Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add conversation" icon on the right of the text box.
+* Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add conversation" icon on the right of the text box.
* Select your conversation, and then write a question into the text box in the middle panel. Hit enter.
* Ask follow up questions. You'll see the actual results from the Sycamore's hybrid search for your question in the right panel, and the conversational search in the middle panel.

@@ -161,7 +161,7 @@ No changes at [datetime] sleeping

A [Jupyter](https://jupyter.org/) notebook is a development environment that lets you
write and experiment with different data preparation jobs to improve the results of
-processing your documents. The [using Jupyter
+processing your documents. The [using Jupyter
tutorial](../tutorials/sycamore-jupyter-dev-example.md) will walk you through using the containerized Jupyter notebook included with Sycamore or installing it locally, and preparing a new set of documents.


sycamore/transforms/partition.py (6 changes: 3 additions & 3 deletions)
@@ -107,7 +107,7 @@ def __init__(
include_metadata: bool = True,
include_slide_notes: bool = False,
chunking_strategy: Optional[str] = None,
-**kwargs
+**kwargs,
):
self._include_page_breaks = include_page_breaks
self._include_metadata = include_metadata
@@ -126,7 +126,7 @@ def partition(self, document: Document) -> Document:
include_metadata=self._include_metadata,
include_slide_notes=self._include_slide_notes,
chunking_strategy=self._chunking_strategy,
-**self._kwargs
+**self._kwargs,
)

# Here we convert unstructured.io elements into our elements and
@@ -424,7 +424,7 @@ def execute(self) -> Dataset:
dataset = input_dataset.map(
generate_map_class_from_callable(self._partitioner.partition),
compute=ActorPoolStrategy(min_size=1, max_size=math.ceil(available_gpus / gpu_per_task)),
-**self.resource_args
+**self.resource_args,
)
else:
dataset = input_dataset.map(generate_map_function(self._partitioner.partition))
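
The whitespace fixes and trailing commas above are the signature of common pre-commit hooks; a minimal `.pre-commit-config.yaml` along these lines (a sketch, not necessarily the repo's actual configuration) would produce this kind of diff:

```yaml
# Sketch of a pre-commit configuration consistent with the changes above.
# Revisions are examples; the repo's actual config may differ.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace   # strips trailing spaces (most Markdown changes here)
      - id: end-of-file-fixer     # removes stray blank lines at end of file
  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black                 # adds trailing commas in multi-line signatures and calls
```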
