diff --git a/api/getting-started/index.html b/api/getting-started/index.html index 6fcfbda..7e71bba 100644 --- a/api/getting-started/index.html +++ b/api/getting-started/index.html @@ -486,6 +486,17 @@ + + @@ -496,6 +507,52 @@ + + + + @@ -569,6 +626,41 @@ + + + @@ -587,7 +679,8 @@

Getting started

First, go to the Parsera web page and generate an API key.

-

Paste this key into the X-API-KEY header to send the request: +

Extract endpoint

+

Paste this key into the X-API-KEY header to send a request to the extract endpoint:

curl https://api.parsera.org/v1/extract \
 --header 'Content-Type: application/json' \
 --header 'X-API-KEY: <YOUR_API_KEY>' \
@@ -607,6 +700,27 @@ 

Getting started

}'

By default, proxy_country is random. It's recommended to set the proxy_country parameter to a specific country in the request, since a page might not be available from all locations. Here you can find a full list of available proxy countries.
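
For reference, the same extract request can be sent from Python. Below is a minimal sketch assuming the requests library (our choice of HTTP client, not part of the official docs); the payload mirrors the curl example above:

import requests

# Same payload as the curl example; replace <YOUR_API_KEY> with your key
response = requests.post(
    "https://api.parsera.org/v1/extract",
    headers={"X-API-KEY": "<YOUR_API_KEY>"},  # json= sets Content-Type automatically
    json={
        "url": "https://news.ycombinator.com/",
        "attributes": [
            {"name": "Title", "description": "News title"},
            {"name": "Points", "description": "Number of points"},
        ],
        "proxy_country": "UnitedStates",
    },
)
print(response.json())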

+

Parse endpoint

+

In addition to extract, there is a parse endpoint that can be used to parse data generated on your side instead of data fetched from a URL.
+There is a content attribute for passing data, which accepts both raw HTML and plain text:
+

curl https://api.parsera.org/v1/parse \
+--header 'Content-Type: application/json' \
+--header 'X-API-KEY: <YOUR_API_KEY>' \
+--data '{
+    "content": <HTML_OR_TEXT_HERE>,
+    "attributes": [
+        {
+            "name": "Title",
+            "description": "News title"
+        },
+        {
+            "name": "Points",
+            "description": "Number of points"
+        }
+    ]
+}'
+
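Likewise, here is a minimal Python sketch of the parse request, again assuming the requests library; the inline HTML snippet is a hypothetical example of data generated on your side:

import requests

# Hypothetical HTML produced on your side, e.g. by your own crawler
html = "<ul><li><span>Example news title</span><span>42 points</span></li></ul>"

response = requests.post(
    "https://api.parsera.org/v1/parse",
    headers={"X-API-KEY": "<YOUR_API_KEY>"},
    json={
        "content": html,  # accepts both raw HTML and plain text
        "attributes": [
            {"name": "Title", "description": "News title"},
            {"name": "Points", "description": "Number of points"},
        ],
    },
)
print(response.json())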

+

Swagger doc

You can also explore the Swagger doc of the API by following this link: https://api.parsera.org/docs#/.

diff --git a/search/search_index.json b/search/search_index.json index 3d90ca9..938a852 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to Parsera","text":"

Parsera is a lightweight Python library for scraping websites with LLMs.

There are 2 ways of using Parsera:

"},{"location":"#community","title":"Community","text":"

If you like this project, star it on GitHub and join our discussions on the Discord server.

"},{"location":"#contributors","title":"Contributors","text":"

If you are considering contributing to Parsera, check out the guidelines to get started.

"},{"location":"contributing/","title":"Contributing","text":"

Thanks for considering contributing to Parsera! This project is in the early stage of development, so any help will be highly appreciated. You can start by looking through existing issues, or by asking directly on Discord about the most helpful contributions.

"},{"location":"contributing/#issues","title":"Issues","text":"

The best way to ask a question, report a bug, or request a feature is to submit an Issue. It's much better than asking via email or Discord, since the conversation becomes publicly available and easy to navigate.

"},{"location":"contributing/#pull-requests","title":"Pull requests","text":""},{"location":"contributing/#installation-and-setup","title":"Installation and setup","text":"

Fork the repository on GitHub and clone your fork locally.

Next, install dependencies using poetry:

# Clone your fork and cd into the repo directory\ngit clone git@github.com:<your username>/parsera.git\ncd parsera\n\n# If you don't have poetry, install it first:\n# https://python-poetry.org/docs/\n# Then:\npoetry install\n# If you are using VS Code, you can get the python venv path to switch to:\npoetry run which python\n# To activate the virtual environment, run:\npoetry shell\n
Now you have a virtual environment with Parsera and all necessary dependencies installed.

"},{"location":"contributing/#code-style","title":"Code style","text":"

The project uses black and isort for formatting. Set them up in your IDE or run this before committing:

make format\n

"},{"location":"contributing/#commit-and-push-changes","title":"Commit and push changes","text":"

Commit your changes and push them to your fork, then create a pull request to the Parsera repository.

Thanks a lot for helping improve Parsera!

"},{"location":"getting-started/","title":"Welcome to Parsera","text":"

Parsera is a lightweight Python library for scraping websites with LLMs. You can clone and run it locally, or use the API, which provides a more scalable way and some extra features like a built-in proxy.

"},{"location":"getting-started/#installation","title":"Installation","text":"
pip install parsera\nplaywright install\n
"},{"location":"getting-started/#basic-usage","title":"Basic usage","text":"

If you want to use OpenAI, remember to set the OPENAI_API_KEY env variable. You can do this from Python with:

import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"YOUR_OPENAI_API_KEY_HERE\"\n

Next, you can run a basic version that uses gpt-4o-mini:

from parsera import Parsera\n\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\n\nscraper = Parsera()\nresult = scraper.run(url=url, elements=elements)\n

The result variable will contain JSON with a list of records:

[\n   {\n      \"Title\":\"Hacking the largest airline and hotel rewards platform (2023)\",\n      \"Points\":\"104\",\n      \"Comments\":\"24\"\n   },\n    ...\n]\n

There is also an async arun method available:

result = await scraper.arun(url=url, elements=elements)\n

"},{"location":"getting-started/#running-with-cli","title":"Running with CLI","text":"

Before you run Parsera as a command-line tool, don't forget to put your OPENAI_API_KEY into env variables or a .env file.

"},{"location":"getting-started/#usage","title":"Usage","text":"

You can configure elements to parse using a JSON string or a file. Optionally, you can provide a file to write the output to.

python -m parsera.main URL {--scheme '{\"title\":\"h1\"}' | --file FILENAME} [--output FILENAME]\n
"},{"location":"getting-started/#more-features","title":"More features","text":"

Check out further documentation to explore more features:

"},{"location":"api/getting-started/","title":"Getting started","text":"

First, go to the Parsera web page and generate an API key.

Paste this key into the X-API-KEY header to send the request:

curl https://api.parsera.org/v1/extract \\\n--header 'Content-Type: application/json' \\\n--header 'X-API-KEY: <YOUR_API_KEY>' \\\n--data '{\n    \"url\": \"https://news.ycombinator.com/\",\n    \"attributes\": [\n        {\n            \"name\": \"Title\",\n            \"description\": \"News title\"\n        },\n        {\n            \"name\": \"Points\",\n            \"description\": \"Number of points\"\n        }\n    ],\n    \"proxy_country\": \"UnitedStates\"\n}'\n

By default, proxy_country is random. It's recommended to set the proxy_country parameter to a specific country in the request, since a page might not be available from all locations. Here you can find a full list of available proxy countries.

You can also explore the Swagger doc of the API by following this link: https://api.parsera.org/docs#/.

"},{"location":"api/proxy/","title":"Proxy","text":""},{"location":"api/proxy/#setting-proxy-country","title":"Setting proxy country","text":"

You can use the proxy_country parameter to set a proxy country. The default is random, and it's recommended to change it, since your page might not be available from all locations.

To scrape a page from the United States, set proxy_country to UnitedStates:

curl https://api.parsera.org/v1/extract \\\n--header 'Content-Type: application/json' \\\n--header 'X-API-KEY: <YOUR-API-KEY>' \\\n--data '{\n    \"url\": <TARGET-URL>,\n    \"attributes\": [\n        {\n            \"name\": <First attribute name>,\n            \"description\": <First attribute description>\n        },\n        {\n            \"name\": <Second attribute name>,\n            \"description\": <Second attribute description>\n        }\n    ],\n    \"proxy_country\": \"UnitedStates\"\n}'\n

"},{"location":"api/proxy/#list-of-proxy-countries","title":"List of proxy countries","text":"

Send a GET request to https://api.parsera.org/v1/proxy-countries to get the list of countries programmatically.

Here is the list of countries available:

"},{"location":"features/custom-models/","title":"Custom models","text":""},{"location":"features/custom-models/#run-with-custom-model","title":"Run with custom model","text":"

You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:

import os\nfrom langchain_openai import AzureChatOpenAI\n\nfrom parsera import Parsera\n\nllm = AzureChatOpenAI(\n    azure_endpoint=os.getenv(\"AZURE_GPT_BASE_URL\"),\n    openai_api_version=\"2023-05-15\",\n    deployment_name=os.getenv(\"AZURE_GPT_DEPLOYMENT_NAME\"),\n    openai_api_key=os.getenv(\"AZURE_GPT_API_KEY\"),\n    openai_api_type=\"azure\",\n    temperature=0.0,\n)\n\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\nscraper = Parsera(model=llm)\nresult = scraper.run(url=url, elements=elements)\n

"},{"location":"features/custom-models/#run-local-model-with-trasformers","title":"Run local model with Trasformers","text":"

Currently, we only support models that include a system token.

You should install Transformers with either PyTorch (recommended) or TensorFlow 2.0.

Transformers Installation Guide

Example:

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM\nfrom parsera.engine.model import HuggingFaceModel\nfrom parsera import Parsera\n\n# Define the URL and elements to scrape\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\n\n# Initialize model with transformers pipeline\ntokenizer = AutoTokenizer.from_pretrained(\"microsoft/Phi-3-mini-128k-instruct\", trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\"microsoft/Phi-3-mini-128k-instruct\", trust_remote_code=True)\npipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=5000)\n\n# Initialize HuggingFaceModel\nllm = HuggingFaceModel(pipeline=pipe)\n\n# Scraper with HuggingFace model\nscraper = Parsera(model=llm)\nresult = scraper.run(url=url, elements=elements)\n

"},{"location":"features/custom-playwright/","title":"Custom playwright","text":""},{"location":"features/custom-playwright/#parserascript","title":"ParseraScript","text":"

With the ParseraScript class you can execute custom Playwright scripts during scraping. There are two types of code you can run:

"},{"location":"features/custom-playwright/#example-log-in-and-load-data","title":"Example: log in and load data","text":"

You can log in to parsera.org and get the credits amount with the following code:

from playwright.async_api import Page\nfrom parsera import ParseraScript\n\n# Define the script to execute during the session creation\nasync def initial_script(page: Page) -> Page:\n    await page.goto(\"https://parsera.org/auth/sign-in\")\n    await page.wait_for_load_state(\"networkidle\")\n    await page.get_by_label(\"Email\").fill(EMAIL)\n    await page.get_by_label(\"Password\").fill(PASSWORD)\n    await page.get_by_role(\"button\", name=\"Sign In\", exact=True).click()\n    await page.wait_for_selector(\"text=Playground\")\n    return page\n\n# This script is executed after the url is opened\nasync def repeating_script(page: Page) -> Page:\n    await page.wait_for_timeout(1000)  # Wait one second for page to load\n    return page\n\nparsera = ParseraScript(model=model, initial_script=initial_script)\nresult = await parsera.arun(\n    url=\"https://parsera.org/app\",\n    elements={\n        \"credits\": \"number of credits\",\n    },\n    playwright_script=repeating_script,\n)\n

"},{"location":"features/custom-playwright/#access-playwright-instance","title":"Access Playwright instance","text":"

The page is fetched via ParseraScript.loader, which contains the Playwright instance.

from parsera import ParseraScript\n\nparsera = ParseraScript(model=model)\n\n## You can manually initialize the playwright session and modify it:\nawait parsera.new_session()\nawait parsera.loader.load_content(url=url)\n\n## After the page is loaded you can access playwright elements, like Page:\nawait parsera.loader.page.get_by_role('button').click()\n\n## Next you can run the extraction process\nresult = await parsera.arun(\n    url=extraction_url,\n    elements=elements_dict,\n)\n

"},{"location":"features/docker/","title":"Docker","text":""},{"location":"features/docker/#running-in-docker","title":"Running in Docker","text":"

You can get access to the CLI or development environment using Docker.

"},{"location":"features/docker/#prerequisites","title":"Prerequisites","text":""},{"location":"features/docker/#quickstart","title":"Quickstart","text":"
  1. Create a .env file in the project root directory with the following content:
URL=https://parsera.org\nFILE=/app/scheme.json\nOUTPUT=/app/output/result.json\n
  2. Create a scheme.json file with the parsing scheme in the repository root directory.

  3. Run make up in this directory.

  4. The output will be saved to the output/result.json file.

"},{"location":"features/docker/#docker-make-targets","title":"Docker Make Targets","text":"
make build # Build Docker image\n\nmake up # Start containers using Docker Compose\n\nmake down # Stop and remove containers using Docker Compose\n\nmake restart # Restart containers using Docker Compose\n\nmake logs # View logs of the containers\n\nmake shell # Open a shell in the running container\n\nmake clean # Remove all stopped containers, unused networks, and dangling images\n
"},{"location":"features/extractors/","title":"Extractors","text":""},{"location":"features/extractors/#different-extractor-types","title":"Different extractor types","text":"

There are different types of extractors that provide output in different formats:

By default, the tabular extractor is used.

"},{"location":"features/extractors/#tabular-extractor","title":"Tabular extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.TABULAR)\n
The tabular extractor is used to find rows of tabular data and has output of the form:
[\n    {\"name\": \"name1\", \"price\": \"100\"},\n    {\"name\": \"name2\", \"price\": \"150\"},\n    {\"name\": \"name3\", \"price\": \"300\"},\n]\n

"},{"location":"features/extractors/#list-extractor","title":"List extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.LIST)\n
The list extractor is used to find lists of different values and has output of the form:
{\n    \"name\": [\"name1\", \"name2\", \"name3\"],\n    \"price\": [\"100\", \"150\", \"300\"]\n}\n

"},{"location":"features/extractors/#item-extractor","title":"Item extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.ITEM)\n
The item extractor is used to get singular items from a page like a title or price and has output of the form:
{\n    \"name\": \"name1\",\n    \"price\": \"100\"\n}\n

"},{"location":"features/proxy/","title":"Proxy","text":""},{"location":"features/proxy/#using-proxy","title":"Using proxy","text":"

You can serve the traffic via a proxy server when calling the run method:

proxy_settings = {\n    \"server\": \"https://1.2.3.4:5678\",\n    \"username\": <PROXY_USERNAME>,\n    \"password\": <PROXY_PASSWORD>,\n}\nresult = scraper.run(url=url, elements=elements, proxy_settings=proxy_settings)\n

Where proxy_settings contains your proxy credentials.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to Parsera","text":"

Parsera is a lightweight Python library for scraping websites with LLMs.

There are 2 ways of using Parsera:

"},{"location":"#community","title":"Community","text":"

If you like this project, star it on GitHub and join our discussions on the Discord server.

"},{"location":"#contributors","title":"Contributors","text":"

If you are considering contributing to Parsera, check out the guidelines to get started.

"},{"location":"contributing/","title":"Contributing","text":"

Thanks for considering contributing to Parsera! This project is in the early stage of development, so any help will be highly appreciated. You can start by looking through existing issues, or by asking directly on Discord about the most helpful contributions.

"},{"location":"contributing/#issues","title":"Issues","text":"

The best way to ask a question, report a bug, or request a feature is to submit an Issue. It's much better than asking via email or Discord, since the conversation becomes publicly available and easy to navigate.

"},{"location":"contributing/#pull-requests","title":"Pull requests","text":""},{"location":"contributing/#installation-and-setup","title":"Installation and setup","text":"

Fork the repository on GitHub and clone your fork locally.

Next, install dependencies using poetry:

# Clone your fork and cd into the repo directory\ngit clone git@github.com:<your username>/parsera.git\ncd parsera\n\n# If you don't have poetry, install it first:\n# https://python-poetry.org/docs/\n# Then:\npoetry install\n# If you are using VS Code, you can get the python venv path to switch to:\npoetry run which python\n# To activate the virtual environment, run:\npoetry shell\n
Now you have a virtual environment with Parsera and all necessary dependencies installed.

"},{"location":"contributing/#code-style","title":"Code style","text":"

The project uses black and isort for formatting. Set them up in your IDE or run this before committing:

make format\n

"},{"location":"contributing/#commit-and-push-changes","title":"Commit and push changes","text":"

Commit your changes and push them to your fork, then create a pull request to the Parsera repository.

Thanks a lot for helping improve Parsera!

"},{"location":"getting-started/","title":"Welcome to Parsera","text":"

Parsera is a lightweight Python library for scraping websites with LLMs. You can clone and run it locally, or use the API, which provides a more scalable way and some extra features like a built-in proxy.

"},{"location":"getting-started/#installation","title":"Installation","text":"
pip install parsera\nplaywright install\n
"},{"location":"getting-started/#basic-usage","title":"Basic usage","text":"

If you want to use OpenAI, remember to set the OPENAI_API_KEY env variable. You can do this from Python with:

import os\n\nos.environ[\"OPENAI_API_KEY\"] = \"YOUR_OPENAI_API_KEY_HERE\"\n

Next, you can run a basic version that uses gpt-4o-mini:

from parsera import Parsera\n\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\n\nscraper = Parsera()\nresult = scraper.run(url=url, elements=elements)\n

The result variable will contain JSON with a list of records:

[\n   {\n      \"Title\":\"Hacking the largest airline and hotel rewards platform (2023)\",\n      \"Points\":\"104\",\n      \"Comments\":\"24\"\n   },\n    ...\n]\n

There is also an async arun method available:

result = await scraper.arun(url=url, elements=elements)\n

"},{"location":"getting-started/#running-with-cli","title":"Running with CLI","text":"

Before you run Parsera as a command-line tool, don't forget to put your OPENAI_API_KEY into env variables or a .env file.

"},{"location":"getting-started/#usage","title":"Usage","text":"

You can configure elements to parse using a JSON string or a file. Optionally, you can provide a file to write the output to.

python -m parsera.main URL {--scheme '{\"title\":\"h1\"}' | --file FILENAME} [--output FILENAME]\n
"},{"location":"getting-started/#more-features","title":"More features","text":"

Check out further documentation to explore more features:

"},{"location":"api/getting-started/","title":"Getting started","text":"

First, go to the Parsera web page and generate an API key.

"},{"location":"api/getting-started/#extract-endpoint","title":"Extract endpoint","text":"

Paste this key into the X-API-KEY header to send a request to the extract endpoint:

curl https://api.parsera.org/v1/extract \\\n--header 'Content-Type: application/json' \\\n--header 'X-API-KEY: <YOUR_API_KEY>' \\\n--data '{\n    \"url\": \"https://news.ycombinator.com/\",\n    \"attributes\": [\n        {\n            \"name\": \"Title\",\n            \"description\": \"News title\"\n        },\n        {\n            \"name\": \"Points\",\n            \"description\": \"Number of points\"\n        }\n    ],\n    \"proxy_country\": \"UnitedStates\"\n}'\n

By default, proxy_country is random. It's recommended to set the proxy_country parameter to a specific country in the request, since a page might not be available from all locations. Here you can find a full list of available proxy countries.

"},{"location":"api/getting-started/#parse-endpoint","title":"Parse endpoint","text":"

In addition to extract, there is a parse endpoint that can be used to parse data generated on your side instead of data fetched from a URL. There is a content attribute for passing data, which accepts both raw HTML and plain text:

curl https://api.parsera.org/v1/parse \\\n--header 'Content-Type: application/json' \\\n--header 'X-API-KEY: <YOUR_API_KEY>' \\\n--data '{\n    \"content\": <HTML_OR_TEXT_HERE>,\n    \"attributes\": [\n        {\n            \"name\": \"Title\",\n            \"description\": \"News title\"\n        },\n        {\n            \"name\": \"Points\",\n            \"description\": \"Number of points\"\n        }\n    ]\n}'\n

"},{"location":"api/getting-started/#swagger-doc","title":"Swagger doc","text":"

You can also explore the Swagger doc of the API by following this link: https://api.parsera.org/docs#/.

"},{"location":"api/proxy/","title":"Proxy","text":""},{"location":"api/proxy/#setting-proxy-country","title":"Setting proxy country","text":"

You can use the proxy_country parameter to set a proxy country. The default is random, and it's recommended to change it, since your page might not be available from all locations.

To scrape a page from the United States, set proxy_country to UnitedStates:

curl https://api.parsera.org/v1/extract \\\n--header 'Content-Type: application/json' \\\n--header 'X-API-KEY: <YOUR-API-KEY>' \\\n--data '{\n    \"url\": <TARGET-URL>,\n    \"attributes\": [\n        {\n            \"name\": <First attribute name>,\n            \"description\": <First attribute description>\n        },\n        {\n            \"name\": <Second attribute name>,\n            \"description\": <Second attribute description>\n        }\n    ],\n    \"proxy_country\": \"UnitedStates\"\n}'\n

"},{"location":"api/proxy/#list-of-proxy-countries","title":"List of proxy countries","text":"

Send a GET request to https://api.parsera.org/v1/proxy-countries to get the list of countries programmatically.

Here is the list of countries available:

"},{"location":"features/custom-models/","title":"Custom models","text":""},{"location":"features/custom-models/#run-with-custom-model","title":"Run with custom model","text":"

You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:

import os\nfrom langchain_openai import AzureChatOpenAI\n\nfrom parsera import Parsera\n\nllm = AzureChatOpenAI(\n    azure_endpoint=os.getenv(\"AZURE_GPT_BASE_URL\"),\n    openai_api_version=\"2023-05-15\",\n    deployment_name=os.getenv(\"AZURE_GPT_DEPLOYMENT_NAME\"),\n    openai_api_key=os.getenv(\"AZURE_GPT_API_KEY\"),\n    openai_api_type=\"azure\",\n    temperature=0.0,\n)\n\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\nscraper = Parsera(model=llm)\nresult = scraper.run(url=url, elements=elements)\n

"},{"location":"features/custom-models/#run-local-model-with-trasformers","title":"Run local model with Trasformers","text":"

Currently, we only support models that include a system token.

You should install Transformers with either PyTorch (recommended) or TensorFlow 2.0.

Transformers Installation Guide

Example:

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM\nfrom parsera.engine.model import HuggingFaceModel\nfrom parsera import Parsera\n\n# Define the URL and elements to scrape\nurl = \"https://news.ycombinator.com/\"\nelements = {\n    \"Title\": \"News title\",\n    \"Points\": \"Number of points\",\n    \"Comments\": \"Number of comments\",\n}\n\n# Initialize model with transformers pipeline\ntokenizer = AutoTokenizer.from_pretrained(\"microsoft/Phi-3-mini-128k-instruct\", trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\"microsoft/Phi-3-mini-128k-instruct\", trust_remote_code=True)\npipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=5000)\n\n# Initialize HuggingFaceModel\nllm = HuggingFaceModel(pipeline=pipe)\n\n# Scraper with HuggingFace model\nscraper = Parsera(model=llm)\nresult = scraper.run(url=url, elements=elements)\n

"},{"location":"features/custom-playwright/","title":"Custom playwright","text":""},{"location":"features/custom-playwright/#parserascript","title":"ParseraScript","text":"

With the ParseraScript class you can execute custom Playwright scripts during scraping. There are two types of code you can run:

"},{"location":"features/custom-playwright/#example-log-in-and-load-data","title":"Example: log in and load data","text":"

You can log in to parsera.org and get the credits amount with the following code:

from playwright.async_api import Page\nfrom parsera import ParseraScript\n\n# Define the script to execute during the session creation\nasync def initial_script(page: Page) -> Page:\n    await page.goto(\"https://parsera.org/auth/sign-in\")\n    await page.wait_for_load_state(\"networkidle\")\n    await page.get_by_label(\"Email\").fill(EMAIL)\n    await page.get_by_label(\"Password\").fill(PASSWORD)\n    await page.get_by_role(\"button\", name=\"Sign In\", exact=True).click()\n    await page.wait_for_selector(\"text=Playground\")\n    return page\n\n# This script is executed after the url is opened\nasync def repeating_script(page: Page) -> Page:\n    await page.wait_for_timeout(1000)  # Wait one second for page to load\n    return page\n\nparsera = ParseraScript(model=model, initial_script=initial_script)\nresult = await parsera.arun(\n    url=\"https://parsera.org/app\",\n    elements={\n        \"credits\": \"number of credits\",\n    },\n    playwright_script=repeating_script,\n)\n

"},{"location":"features/custom-playwright/#access-playwright-instance","title":"Access Playwright instance","text":"

The page is fetched via ParseraScript.loader, which contains the Playwright instance.

from parsera import ParseraScript\n\nparsera = ParseraScript(model=model)\n\n## You can manually initialize the playwright session and modify it:\nawait parsera.new_session()\nawait parsera.loader.load_content(url=url)\n\n## After the page is loaded you can access playwright elements, like Page:\nawait parsera.loader.page.get_by_role('button').click()\n\n## Next you can run the extraction process\nresult = await parsera.arun(\n    url=extraction_url,\n    elements=elements_dict,\n)\n

"},{"location":"features/docker/","title":"Docker","text":""},{"location":"features/docker/#running-in-docker","title":"Running in Docker","text":"

You can get access to the CLI or development environment using Docker.

"},{"location":"features/docker/#prerequisites","title":"Prerequisites","text":""},{"location":"features/docker/#quickstart","title":"Quickstart","text":"
  1. Create a .env file in the project root directory with the following content:
URL=https://parsera.org\nFILE=/app/scheme.json\nOUTPUT=/app/output/result.json\n
  2. Create a scheme.json file with the parsing scheme in the repository root directory.

  3. Run make up in this directory.

  4. The output will be saved to the output/result.json file.

"},{"location":"features/docker/#docker-make-targets","title":"Docker Make Targets","text":"
make build # Build Docker image\n\nmake up # Start containers using Docker Compose\n\nmake down # Stop and remove containers using Docker Compose\n\nmake restart # Restart containers using Docker Compose\n\nmake logs # View logs of the containers\n\nmake shell # Open a shell in the running container\n\nmake clean # Remove all stopped containers, unused networks, and dangling images\n
"},{"location":"features/extractors/","title":"Extractors","text":""},{"location":"features/extractors/#different-extractor-types","title":"Different extractor types","text":"

There are different types of extractors that provide output in different formats:

By default, the tabular extractor is used.

"},{"location":"features/extractors/#tabular-extractor","title":"Tabular extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.TABULAR)\n
The tabular extractor is used to find rows of tabular data and has output of the form:
[\n    {\"name\": \"name1\", \"price\": \"100\"},\n    {\"name\": \"name2\", \"price\": \"150\"},\n    {\"name\": \"name3\", \"price\": \"300\"},\n]\n

"},{"location":"features/extractors/#list-extractor","title":"List extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.LIST)\n
The list extractor is used to find lists of different values and has output of the form:
{\n    \"name\": [\"name1\", \"name2\", \"name3\"],\n    \"price\": [\"100\", \"150\", \"300\"]\n}\n

"},{"location":"features/extractors/#item-extractor","title":"Item extractor","text":"

from parsera import Parsera\n\nscraper = Parsera(extractor=Parsera.ExtractorType.ITEM)\n
The item extractor is used to get singular items from a page like a title or price and has output of the form:
{\n    \"name\": \"name1\",\n    \"price\": \"100\"\n}\n

"},{"location":"features/proxy/","title":"Proxy","text":""},{"location":"features/proxy/#using-proxy","title":"Using proxy","text":"

You can serve the traffic via a proxy server when calling the run method:

proxy_settings = {\n    \"server\": \"https://1.2.3.4:5678\",\n    \"username\": <PROXY_USERNAME>,\n    \"password\": <PROXY_PASSWORD>,\n}\nresult = scraper.run(url=url, elements=elements, proxy_settings=proxy_settings)\n

Where proxy_settings contains your proxy credentials.

"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 7139d8a..36111d9 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,42 +2,42 @@ https://docs.parsera.org/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/contributing/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/getting-started/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/api/getting-started/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/api/proxy/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/features/custom-models/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/features/custom-playwright/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/features/docker/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/features/extractors/ - 2024-09-13 + 2024-09-23 https://docs.parsera.org/features/proxy/ - 2024-09-13 + 2024-09-23 \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 066eb2e..7f56590 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ