Code block extraction fails when using selectors on replicate.com docs #1116

fengyunzaidushi · 2025-01-01T03:47:33Z

Code block extraction fails when using selectors on replicate.com docs

Issue Description

When using the Jina AI Reader API to extract content from replicate.com documentation pages, code blocks are dropped from the output whenever selectors are used.

Current Behavior

Without Selectors:
- ✅ Code blocks are extracted correctly
- ❌ But includes unwanted navigation menus, headers, footers, etc.

With Selectors:
- ❌ Code blocks are completely lost
- ✅ Successfully removes unwanted content (menus, headers, etc.)

Expected Behavior

The API should:

- Extract code blocks correctly
- Remove unwanted page elements
- Maintain document structure

Reproduction Steps

- Try to extract content from any replicate.com documentation page with code blocks
- Use the following selector combinations:

# Attempt 1
headers = {
    'X-Target-Selector': 'div#mdx-content,figure[data-rehype-pretty-code-figure],pre[data-language]',
    'X-Remove-Selector': '.r8-btn,nav,header,footer,aside'
}

# Attempt 2
headers = {
    'X-Target-Selector': 'div#mdx-content,figure',
    'X-Remove-Selector': 'div#toc,header,footer,aside',
    'X-Wait-For-Selector': 'figure'
}

# Attempt 3
headers = {
    'X-Remove-Selector': 'div#toc,header,footer,aside,nav,.table-of-contents',
    'X-Wait-For-Selector': 'figure[data-rehype-pretty-code-figure]',
    'X-With-Shadow-Dom': 'true'
}
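
To compare the attempts above, it can help to count how many fenced code blocks survive in each response. The following is only a minimal sketch: `count_code_blocks` is a hypothetical helper name, and it assumes the returned content is markdown in which every code block is delimited by a pair of triple-backtick lines.

```python
import re

# Matches any line that starts with a triple-backtick fence.
FENCE_RE = re.compile(r"^```", re.MULTILINE)


def count_code_blocks(markdown_text: str) -> int:
    """Count fenced code blocks, assuming fences come in open/close pairs."""
    return len(FENCE_RE.findall(markdown_text)) // 2
```

A count of zero for a page that is known to contain code samples indicates that the code blocks were lost for that header combination.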

Technical Details

The code blocks on replicate.com use the following DOM structure:

<figure data-rehype-pretty-code-figure="">
  <div class="relative group -space-y-1">
    <div class="bg-r8-gray-2 border border-r8-gray-3 rounded mx-px">
      <pre><code>...</code></pre>
    </div>
  </div>
</figure>

Current Workaround

Currently, we have to:

- Not use any selectors
- Handle unwanted content removal through post-processing
- Use regex to clean up the content

This workaround has several limitations:

- Increased data transfer
- Additional processing overhead
- More complex maintenance
- Fragile regex patterns

Environment

API Version: Latest
Example URL: https://replicate.com/docs/get-started/nodejs

Request Headers:

headers = {
    'Accept': 'application/json',
    'Authorization': 'Bearer ***',
    'X-No-Cache': 'true'
}

Additional Context

- The issue seems specific to replicate.com's documentation pages
- Code blocks appear to be rendered using a custom syntax highlighting component
- The DOM structure might be dynamically generated

nomagick · 2025-01-07T10:23:02Z

This has something to do with "readability" being used in Reader by default.

Specify x-return-format: markdown.
This will prevent readability from "smartly" removing anything.
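
A minimal Python sketch of this suggestion follows. It assumes the public `https://r.jina.ai/` endpoint and reuses selectors from the attempts in the issue. `build_reader_request` is a hypothetical helper name, and the selector values are illustrative rather than a confirmed fix.

```python
import urllib.request

# Assumption: the public Reader endpoint, with the target URL appended to it.
READER_ENDPOINT = "https://r.jina.ai/"


def build_reader_request(target_url, api_key=None):
    """Build (without sending) a Reader request that asks for markdown output."""
    headers = {
        "Accept": "application/json",
        "X-No-Cache": "true",
        # Ask for markdown so readability does not filter the page content.
        "X-Return-Format": "markdown",
        # Selectors taken from the attempts in the issue; adjust as needed.
        "X-Target-Selector": "div#mdx-content",
        "X-Remove-Selector": "div#toc,header,footer,aside,nav",
    }
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(READER_ENDPOINT + target_url, headers=headers)


# To send the request:
#   req = build_reader_request("https://replicate.com/docs/get-started/nodejs")
#   body = urllib.request.urlopen(req, timeout=60).read().decode("utf-8")
```

The request is only built here, not sent, so the headers can be inspected before calling the API.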

fengyunzaidushi · 2025-01-17T14:05:56Z

Thank you!
