Code block extraction fails when using selectors on replicate.com docs #1116

Open
fengyunzaidushi opened this issue Jan 1, 2025 · 2 comments

Comments

@fengyunzaidushi


Issue Description

When using the Jina AI Reader API to extract content from replicate.com documentation pages, code blocks are missing from the output whenever selectors are used.

Current Behavior

  1. Without Selectors:

    • ✅ Code blocks are extracted correctly
    • ❌ Output also includes unwanted navigation menus, headers, footers, etc.
  2. With Selectors:

    • ❌ Code blocks are completely lost
    • ✅ Successfully removes unwanted content (menus, headers, etc.)

Expected Behavior

The API should:

  • Extract code blocks correctly
  • Remove unwanted page elements
  • Maintain document structure

Reproduction Steps

  1. Try to extract content from any replicate.com documentation page with code blocks
  2. Use the following selector combinations:
# Attempt 1
headers = {
    'X-Target-Selector': 'div#mdx-content,figure[data-rehype-pretty-code-figure],pre[data-language]',
    'X-Remove-Selector': '.r8-btn,nav,header,footer,aside'
}

# Attempt 2
headers = {
    'X-Target-Selector': 'div#mdx-content,figure',
    'X-Remove-Selector': 'div#toc,header,footer,aside',
    'X-Wait-For-Selector': 'figure'
}

# Attempt 3
headers = {
    'X-Remove-Selector': 'div#toc,header,footer,aside,nav,.table-of-contents',
    'X-Wait-For-Selector': 'figure[data-rehype-pretty-code-figure]',
    'X-With-Shadow-Dom': 'true'
}

Technical Details

The code blocks on replicate.com use the following DOM structure:

<figure data-rehype-pretty-code-figure="">
  <div class="relative group -space-y-1">
    <div class="bg-r8-gray-2 border border-r8-gray-3 rounded mx-px">
      <pre><code>...</code></pre>
    </div>
  </div>
</figure>

Current Workaround

Currently, we have to:

  1. Not use any selectors
  2. Handle unwanted content removal through post-processing
  3. Use regex to clean up the content

This workaround has several limitations:

  • Increased data transfer
  • Additional processing overhead
  • More complex maintenance
  • Fragile regex patterns
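
For illustration, the post-processing step in this workaround might look roughly like the sketch below. It drops lines that match a set of patterns but leaves anything inside a fenced code block untouched. The function name `strip_lines_outside_code` and the example patterns are hypothetical, not part of Reader, and the sketch assumes the returned Markdown uses standard backtick or tilde fences.

```python
import re

# Build the fence markers without writing them literally in this snippet.
BACKTICK_FENCE = "`" * 3
TILDE_FENCE = "~" * 3


def strip_lines_outside_code(markdown, drop_patterns):
    """Remove lines matching any pattern, except inside fenced code blocks.

    `drop_patterns` is a list of regex strings chosen by the caller
    (hypothetical examples: navigation or footer text).
    """
    compiled = [re.compile(p) for p in drop_patterns]
    kept = []
    in_code = False
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(BACKTICK_FENCE) or stripped.startswith(TILDE_FENCE):
            # Toggle code state on every fence line and always keep the fence.
            in_code = not in_code
            kept.append(line)
            continue
        if not in_code and any(p.search(line) for p in compiled):
            continue  # unwanted page chrome outside code
        kept.append(line)
    return "\n".join(kept)
```

Even with the code-block guard, the patterns still have to be maintained by hand for each site, which is the fragility listed above.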

Environment

Additional Context

  1. The issue seems specific to replicate.com's documentation pages
  2. Code blocks appear to be rendered using a custom syntax highlighting component
  3. The DOM structure might be dynamically generated
@nomagick
Member

nomagick commented Jan 7, 2025

This is related to "readability", which Reader uses by default.

Specify x-return-format: markdown.
This prevents readability from "smartly" removing anything.
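
A minimal sketch of that suggestion in Python, combined with selectors from the original report. The helper names `reader_headers` and `fetch_markdown` are made up for this example, the `https://r.jina.ai/` prefix is assumed to be the Reader endpoint in use, and the page URL in the usage comment is only a placeholder.

```python
import urllib.request

# Assumed Reader endpoint; the target page URL is appended to it.
READER_ENDPOINT = "https://r.jina.ai/"


def reader_headers(target_selector=None, remove_selector=None):
    """Build request headers that ask for markdown output.

    The selector values are passed through unchanged, so the ones from the
    issue report can be reused as-is.
    """
    headers = {"X-Return-Format": "markdown"}
    if target_selector:
        headers["X-Target-Selector"] = target_selector
    if remove_selector:
        headers["X-Remove-Selector"] = remove_selector
    return headers


def fetch_markdown(page_url, target_selector=None, remove_selector=None):
    """Fetch a page through Reader and return the response body as text."""
    request = urllib.request.Request(
        READER_ENDPOINT + page_url,
        headers=reader_headers(target_selector, remove_selector),
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


# Usage (performs a network request; the URL is a placeholder):
# text = fetch_markdown(
#     "https://replicate.com/docs",
#     target_selector="div#mdx-content",
#     remove_selector=".r8-btn,nav,header,footer,aside",
# )
```

Whether the code blocks then survive for a given selector combination still needs to be checked against the actual page.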

@fengyunzaidushi
Author

Thank you!
