Reformatting from running pre-commit. (#237)
* .ipnb -> .ipynb typo fix in two files
* Reformatting from running pre-commit
eric-anderson authored Feb 8, 2024
1 parent 830f202 commit 9dc1cd1
Showing 11 changed files with 26 additions and 28 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -28,15 +28,15 @@ You can easily deploy Sycamore locally or on a virtual machine using Docker.

With Docker installed:

-1. Clone the Sycamore repo:
+1. Clone the Sycamore repo:

```git clone https://github.com/aryn-ai/sycamore```

2. Set OpenAI Key:

```export OPENAI_API_KEY=YOUR-KEY```

-3. Go to:
+3. Go to:

```/sycamore```

@@ -1,10 +1,10 @@
# Data Preparation Concepts

-You can use the [default data preparation code](../../../notebooks/default-prep-script.ipnb) to segment, process, enrich, embed, and load your data into Sycamore. This runs automatically when using the [crawlers to load data](..//load_data.md#using-a-crawler), and is used in the [Get Started examples](../welcome_to_sycamore/get_started.md). However, to get the best results on complex data, you will likely need to write custom code specific for your data to prepare it for search and analytics.
+You can use the [default data preparation code](../../../notebooks/default-prep-script.ipynb) to segment, process, enrich, embed, and load your data into Sycamore. This runs automatically when using the [crawlers to load data](..//load_data.md#using-a-crawler), and is used in the [Get Started examples](../welcome_to_sycamore/get_started.md). However, to get the best results on complex data, you will likely need to write custom code specific for your data to prepare it for search and analytics.

Sycamore provides a toolkit for data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of your data. It uses your choice of generative AI models to make these operations simple and effective, and it enables quick experimentation and iteration. You write your data preparation code in Python, and Sycamore uses Ray to easily scale as your workloads grow.

-Sycamore data preparation code uses the concepts below, and available transforms are [here](/transforms.rst). Also, as an example, you can view the code for the default data preparation code [here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipnb) and learn more about how to run your custom code [here](/running_a_data_preparation_job.md).
+Sycamore data preparation code uses the concepts below, and available transforms are [here](/transforms.rst). Also, as an example, you can view the code for the default data preparation code [here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipynb) and learn more about how to run your custom code [here](/running_a_data_preparation_job.md).

## Sycamore data preparation concepts

@@ -28,7 +28,7 @@ docset = context.read\
A Document is a generic representation of an unstructured document in a format like PDF or HTML. Though different types of Documents may have different properties, they all contain [the following common fields](https://sycamore.readthedocs.io/en/stable/APIs/data/data.html#sycamore.data.document.Document):

* **binary_representation:** The raw content of the document. May not be present in elements after partitioning of non-binary inputs such as HTML.

* **doc_id:** A unique identifier for the Document. Defaults to a UUID.

* **elements:** A list of elements belonging to this Document. If the document has no elements, for example before it is chunked, this field will be [].
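
As a quick illustration of the fields listed above, here is a minimal sketch (assuming a local `sycamore` install and a sample PDF; this snippet is not part of the commit):

```python
# Sketch: inspect the common Document fields described above.
# Assumes the sycamore library and a sample PDF; illustrative only.
import sycamore

context = sycamore.init()
docset = context.read.binary("sample.pdf", binary_format="pdf")

for doc in docset.take(2):                   # materialize a couple of Documents
    print(doc.doc_id)                        # unique identifier, defaults to a UUID
    print(type(doc.binary_representation))   # raw content of the source file
    print(doc.elements)                      # [] until the Document is chunked
```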
@@ -8,4 +8,3 @@ Information on supported generative AI models for each operation are in the spec
* [Entity extraction](/transforms/extract_entity.md)
* [Schema extraction](/transforms/extract_schema.md)
* [Summarize](/transforms/summarize.md)
-
@@ -12,4 +12,4 @@ For certain PDF processing operations, you also need to install poppler, which

`brew install poppler`

-For an example Sycamore script, check out the [default preparation script](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipnb).
+For an example Sycamore script, check out the [default preparation script](https://github.com/aryn-ai/sycamore/blob/main/notebooks/default-prep-script.ipynb).
docs/source/data_ingestion_and_preparation/load_data.md (2 changes: 1 addition & 1 deletion)
@@ -49,4 +49,4 @@ To copy a local HTML file, run:

## Use data preparation libraries to load data

-You can write data preparation jobs [using the Sycamore libraries](/installing_sycamore_libraries_locally.md) directly or [using Jupyter](/using_jupyter.md) and [load this data into your Sycamore stack](/running_a_data_preparation_job.md).
+You can write data preparation jobs [using the Sycamore libraries](/installing_sycamore_libraries_locally.md) directly or [using Jupyter](/using_jupyter.md) and [load this data into your Sycamore stack](/running_a_data_preparation_job.md).
@@ -19,7 +19,7 @@ The easiest way to run your data preparation code is to use the Jupyter notebook

## Using the Sycamore-Importer container

-You can also copy your code to the Sycamore-Importer container and run it there. However, we don’t recommend this method, and instead we suggest you use the Jupyter methods above. If you do copy your file to the Sycamore-Importer container, we recommend you save it to `/app/.scrapy` so it persists.
+You can also copy your code to the Sycamore-Importer container and run it there. However, we don’t recommend this method, and instead we suggest you use the Jupyter methods above. If you do copy your file to the Sycamore-Importer container, we recommend you save it to `/app/.scrapy` so it persists.

1. Copy your file to the Sycamore-Importer container:

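A hedged sketch of what this copy step typically looks like (the container name is an assumption, not taken from the diff):

```bash
# Sketch: copy a local script into the importer container's persistent path.
# "sycamore-importer" is an assumed container name; check `docker ps` for yours.
docker cp your-file-name.py sycamore-importer:/app/.scrapy/
```
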
@@ -60,4 +60,3 @@ os_client_args = {
To run your Sycamore job:

```python /path/to/your-file-name.py```
-
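
For context on the truncated `os_client_args` block above, a configuration for a local stack often looks roughly like the following (all values are assumptions to adapt to your deployment):

```python
# Illustrative OpenSearch client settings for a local Sycamore stack.
# Host, port, and credentials are assumptions; match your deployment.
os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "use_ssl": True,
    "verify_certs": False,
    "http_auth": ("admin", "admin"),
}
```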
@@ -22,4 +22,4 @@ If you already have custom data preparation code, the easiest way to run it is u

## Running Jupyter locally

-You can run Jupyter locally and load the output of your data preparation script into your Sycamore stack. The OpenSearch client configuration must match the endpoint of your Sycamore stack. For instructions on how to install and configure Jupyter locally for Sycamore, [click here](/sycamore-jupyter-dev-example.md#in-your-local-development-environment). For an example script, [click here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/sycamore_local_dev_example.ipynb).
+You can run Jupyter locally and load the output of your data preparation script into your Sycamore stack. The OpenSearch client configuration must match the endpoint of your Sycamore stack. For instructions on how to install and configure Jupyter locally for Sycamore, [click here](/sycamore-jupyter-dev-example.md#in-your-local-development-environment). For an example script, [click here](https://github.com/aryn-ai/sycamore/blob/main/notebooks/sycamore_local_dev_example.ipynb).
docs/source/index.rst (8 changes: 4 additions & 4 deletions)
@@ -31,23 +31,23 @@ You can easily deploy Sycamore locally or on a virtual machine using Docker.

With Docker installed:

-1. Clone the Sycamore repo:
+1. Clone the Sycamore repo:

``git clone https://github.com/aryn-ai/sycamore``

2. Set OpenAI Key:

``export OPENAI_API_KEY=YOUR-KEY``

-3. Go to:
+3. Go to:

``/sycamore``

4. Launch Sycamore. Containers will be pulled from DockerHub:

``docker compose up --pull=always``

-5. The Sycamore demo query UI is located at:
+5. The Sycamore demo query UI is located at:

``http://localhost:3000/``

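Taken together, steps 1 through 5 amount to the following shell session (a sketch; substitute your own key for YOUR-KEY):

```bash
# Quickstart sketch assembled from the five steps above.
git clone https://github.com/aryn-ai/sycamore
export OPENAI_API_KEY=YOUR-KEY
cd sycamore
docker compose up --pull=always
# Then open http://localhost:3000/ for the demo query UI.
```
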
@@ -118,7 +118,7 @@ More Resources
/querying_data/hybrid_search.md
/querying_data/integrate_your_application.md
/querying_data/generative_ai_configurations.md



.. toctree::
docs/source/tutorials/sycamore-jupyter-dev-example.md (4 changes: 2 additions & 2 deletions)
@@ -238,12 +238,12 @@ print("Visit http://localhost:3000 and use the", index, " index to query these r
```

2. Once the data is loaded into Sycamore, you can use the [demo query UI](../querying_data/demo_query_ui.md) for conversational search on it.
-- Using your internet browser, visit `http://localhost:3000`.
+- Using your internet browser, visit `http://localhost:3000`.
- Make sure the index selected in the dropdown at the bottom of the UI has the same name you provided in step 1j
- Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add convo" icon on the right of the text box.
- As a sample question, you can ask "Who wrote Attention Is All You Need?"

-The results of the hybrid search are in the right hand panel, and you can click through to find the highlighted passage (step 1c enabled this).
+The results of the hybrid search are in the right hand panel, and you can click through to find the highlighted passage (step 1c enabled this).

Though we are getting good results back from hybrid search, it would be nice if we could have the titles and other information for each passage. In the next section, we will iterate on our Sycamore job, and use a Sycamore transform that leverages generative AI to extract some metadata.

docs/source/welcome_to_sycamore/get_started.md (16 changes: 8 additions & 8 deletions)
@@ -8,7 +8,7 @@ Sycamore is deployed using Docker, and you can launch it locally or on a virtual

`git clone https://github.com/aryn-ai/sycamore`

-2. Create OpenAI Key for LLM access. Sycamore’s default configuration uses OpenAI for RAG and entity extraction. You can create an OpenAI account [here](https://platform.openai.com/signup), or if you already have one, you can retrieve your key [here](https://platform.openai.com/account/api-keys).
+2. Create OpenAI Key for LLM access. Sycamore’s default configuration uses OpenAI for RAG and entity extraction. You can create an OpenAI account [here](https://platform.openai.com/signup), or if you already have one, you can retrieve your key [here](https://platform.openai.com/account/api-keys).

3. Set OpenAI Key:

@@ -22,7 +22,7 @@ Sycamore is deployed using Docker, and you can launch it locally or on a virtual

`Docker compose up --pull=always`

-Note: You can alternately remove the `--pull=always` and instead run `docker compose pull` to control when new images are downloaded. `--pull=always` guarantees you have the most recent images for the specified version.
+Note: You can alternately remove the `--pull=always` and instead run `docker compose pull` to control when new images are downloaded. `--pull=always` guarantees you have the most recent images for the specified version.

Congrats – you have launched Sycamore! Now, it’s time to ingest and prepare some data, and run conversational search on it. Continue on to the next section to do this with a sample dataset or a website that you specify.

@@ -40,15 +40,15 @@ Sycamore’s default data ingestion and preparation code can optionally use [Ama

If you have started Sycamore already, you'll need to restart it after following these instructions.

-1. If you do not have an AWS account, sign up [here](https://portal.aws.amazon.com/billing/signup). You will need this during configuration.
+1. If you do not have an AWS account, sign up [here](https://portal.aws.amazon.com/billing/signup). You will need this during configuration.

-2. Create an Amazon S3 bucket in your AWS account in the us-east-1 region for use with Textract (e.g. `s3://username-textract-bucket`). We recommend you set up bucket lifecycle rules that automatically delete files in this bucket, as the data stored here is only needed temporarily during a Sycamore data processing job.
+2. Create an Amazon S3 bucket in your AWS account in the us-east-1 region for use with Textract (e.g. `s3://username-textract-bucket`). We recommend you set up bucket lifecycle rules that automatically delete files in this bucket, as the data stored here is only needed temporarily during a Sycamore data processing job.

3. Enable Sycamore to use Textract by setting the S3 prefix/bucket name for Textract to use:

-`export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name-here`
+`export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name-here`

-4. Configure your AWS credentials. You can enable AWS SSO login with [these instructions](https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso), or you can use other methods to set up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.
+4. Configure your AWS credentials. You can enable AWS SSO login with [these instructions](https://docs.aws.amazon.com/cli/latest/userguide/sso-configure-profile-token.html#sso-configure-profile-token-auto-sso), or you can use other methods to set up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.

If using AWS SSO:

@@ -85,7 +85,7 @@ Sycamore will automatically start processing the new data. The processing job is

3. Use the demo query UI for conversational search. Using your internet browser, visit: `http://localhost:3000`. You can interact with the demo query UI while data is being added to the index, but the data won't all be available until the job is done. How to use the UI:

-* Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add conversation" icon on the right of the text box.
+* Create a new conversation. Enter the name for your conversation in the text box in the left "Conversations" panel, and hit enter or click the "Add conversation" icon on the right of the text box.
* Select your conversation, and then write a question into the text box in the middle panel. Hit enter.
* Ask follow up questions. You'll see the actual results from the Sycamore's hybrid search for your question in the right panel, and the conversational search in the middle panel.

@@ -161,7 +161,7 @@ No changes at [datetime] sleeping

A [Jupyter](https://jupyter.org/) notebook is a development environment that lets you
write and experiment with different data preparation jobs to improve the results of
-processing your documents. The [using Jupyter
+processing your documents. The [using Jupyter
tutorial](../tutorials/sycamore-jupyter-dev-example.md) will walk you through using the containerized Jupyter notebook included with Sycamore or installing it locally, and preparing a new set of documents.


sycamore/transforms/partition.py (6 changes: 3 additions & 3 deletions)
@@ -107,7 +107,7 @@ def __init__(
include_metadata: bool = True,
include_slide_notes: bool = False,
chunking_strategy: Optional[str] = None,
-**kwargs
+**kwargs,
):
self._include_page_breaks = include_page_breaks
self._include_metadata = include_metadata
@@ -126,7 +126,7 @@ def partition(self, document: Document) -> Document:
include_metadata=self._include_metadata,
include_slide_notes=self._include_slide_notes,
chunking_strategy=self._chunking_strategy,
-**self._kwargs
+**self._kwargs,
)

# Here we convert unstructured.io elements into our elements and
@@ -424,7 +424,7 @@ def execute(self) -> Dataset:
dataset = input_dataset.map(
generate_map_class_from_callable(self._partitioner.partition),
compute=ActorPoolStrategy(min_size=1, max_size=math.ceil(available_gpus / gpu_per_task)),
-**self.resource_args
+**self.resource_args,
)
else:
dataset = input_dataset.map(generate_map_function(self._partitioner.partition))
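
The whitespace fixes and trailing commas above are the signature of common pre-commit hooks; a minimal `.pre-commit-config.yaml` along these lines (a sketch, not necessarily the repo's actual configuration) would produce this kind of diff:

```yaml
# Sketch of a pre-commit configuration consistent with the changes above.
# Revisions are examples; the repo's actual config may differ.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace   # strips trailing spaces (most Markdown changes here)
      - id: end-of-file-fixer     # removes stray blank lines at end of file
  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black                 # adds trailing commas in multi-line signatures and calls
```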
