Exclude one or more directories and/or files #772

Open
stevespringett opened this issue Jan 10, 2025 · 1 comment

@stevespringett

Thanks so much for Pagefind. I'm implementing it for the first time. The site I'm implementing it on has the typical marketing-type pages, but it also has documentation pages for multiple versions of a product. I want Pagefind to index only the most recent version of the product documentation and exclude all the others.

Ideally, pagefind could read robots.txt and utilize the allow and disallow patterns in there. Either way, that's essentially what I'm looking for: if I search for a term, I want results from the most recent version of the product to be returned, not results from legacy or unsupported versions.

The only way I've found to do this is some trickery with the build process, which is less than ideal.

@bglw
Contributor

bglw commented Jan 19, 2025

👋 @stevespringett I'll rattle off some ideas

Pagefind Globs

Pagefind's glob setting configures which files are ingested. A glob can contain multiple patterns, e.g.:

# in pagefind.yml
glob: "{pages/**/*.html,about/**/*.html,docs/latest/**/*.html}"
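
Before committing to a pattern like this, it can help to preview roughly which files it would pick up. The following is only a minimal sketch using Python's standard library: `expand_braces` and `preview_matches` are hypothetical helpers, not part of Pagefind, and Pagefind's own glob engine may differ from `pathlib` in edge cases, so treat the result as an approximation.

```python
import tempfile
from pathlib import Path


def expand_braces(pattern: str) -> list[str]:
    """Expand one top-level {a,b,c} group; nested groups are not handled."""
    start, end = pattern.find("{"), pattern.rfind("}")
    if start == -1 or end < start:
        return [pattern]
    head, body, tail = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    return [head + alternative + tail for alternative in body.split(",")]


def preview_matches(site_root: Path, pattern: str) -> list[str]:
    """Return site-relative paths of files matched by a brace-style glob.

    This approximates the glob with pathlib; it is not Pagefind's matcher.
    Patterns must be relative to site_root (no leading slash).
    """
    matched = set()
    for sub_pattern in expand_braces(pattern):
        for path in site_root.glob(sub_pattern):
            if path.is_file():
                matched.add(path.relative_to(site_root).as_posix())
    return sorted(matched)


if __name__ == "__main__":
    # Build a throwaway site layout (directory names are illustrative).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ("pages/index.html", "about/team/index.html",
                    "docs/latest/intro.html", "docs/v1/intro.html"):
            target = root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("<html></html>", encoding="utf-8")
        pattern = "{pages/**/*.html,about/**/*.html,docs/latest/**/*.html}"
        for match in preview_matches(root, pattern):
            print(match)
```

In this layout, `docs/v1/intro.html` falls outside every alternative in the pattern, which is the behavior the glob approach relies on.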

data-pagefind-body

If any pages on your site have this attribute, Pagefind will only index pages which have it configured. In this case, assuming the site is built via some static site generator, my go-to would be to have the data-pagefind-body attribute added to the template for all of the marketing pages, and for the latest version of the documentation pages. I'd give the older versions of the documentation a different layout/template that didn't include this attribute, which would make Pagefind omit their content.

This is the approach git-scm.com takes. If you compare the source of https://git-scm.com/docs/git-diff and https://git-scm.com/docs/git-diff/2.47.0, you'll see the former has data-pagefind-body on its div id="main" element, while the latter does not.
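
To check that the templates are doing what you expect, you could audit the built output for the attribute. This is a rough sketch with Python's standard `html.parser`; `has_pagefind_body` and `pages_without_body` are hypothetical helper names, and the check only looks for the attribute's presence anywhere in the page, which is a simplification of how Pagefind itself treats it.

```python
from html.parser import HTMLParser
from pathlib import Path


class _BodyAttrFinder(HTMLParser):
    """Records whether any start tag carries data-pagefind-body."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before passing them in.
        if any(name == "data-pagefind-body" for name, _value in attrs):
            self.found = True


def has_pagefind_body(html: str) -> bool:
    """Return True if the HTML contains an element with data-pagefind-body."""
    finder = _BodyAttrFinder()
    finder.feed(html)
    finder.close()
    return finder.found


def pages_without_body(site_root: Path) -> list[str]:
    """List site-relative HTML files that lack the attribute."""
    missing = []
    for html_file in sorted(site_root.rglob("*.html")):
        text = html_file.read_text(encoding="utf-8", errors="replace")
        if not has_pagefind_body(text):
            missing.append(html_file.relative_to(site_root).as_posix())
    return missing
```

Running `pages_without_body(Path("public"))` (where `public` is an assumed output directory) should then list only the legacy documentation pages, if the templates are set up as described.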

Indexing APIs

For anything super custom, either the Python API or the NodeJS API can be a good avenue. This lets you do more custom logic via those programming languages when building out your index.
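
As a rough illustration of the Python route, the sketch below walks a built site and only hands Pagefind the pages that pass a filter. It assumes the `pagefind` Python package is installed and that `PagefindIndex`, `IndexConfig`, and `add_html_file` behave as described in the Pagefind Python API docs, so check those docs before relying on it. The `docs/latest` layout and the `is_indexable` helper are assumptions made up for this example.

```python
from pathlib import Path


def is_indexable(rel_path: str, docs_dir: str = "docs", latest: str = "latest") -> bool:
    """Decide whether a site-relative path should be indexed.

    Assumed layout: versioned docs live in docs/<version>/..., and only
    docs/<latest>/ should be searchable. Everything else is kept.
    """
    parts = rel_path.split("/")
    if parts[0] != docs_dir:
        return True  # marketing pages and anything outside the docs tree
    # Files directly under docs/ are kept; versioned subtrees must be latest.
    return len(parts) < 3 or parts[1] == latest


async def build_index(site_root: Path, output_path: str) -> None:
    """Index only the pages accepted by is_indexable."""
    # Imported lazily so is_indexable can be used without the package.
    from pagefind.index import IndexConfig, PagefindIndex

    config = IndexConfig(output_path=output_path)
    # Per the Pagefind docs, the index files are written on leaving the context.
    async with PagefindIndex(config=config) as index:
        for html_file in sorted(site_root.rglob("*.html")):
            rel = html_file.relative_to(site_root).as_posix()
            if is_indexable(rel):
                await index.add_html_file(
                    content=html_file.read_text(encoding="utf-8"),
                    source_path=rel,
                )
```

You would run it with something like `asyncio.run(build_index(Path("public"), "public/pagefind"))`, where both paths are placeholders for your own build output.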


> Ideally, pagefind could read robots.txt and utilize the allow and disallow patterns in there

I do like this idea also! It would have to be opt-in as I have (and have seen) many use-cases where content is disallowed from external search engines, but in-scope for Pagefind. It does seem like a great setting to have available though.
It's not something I have time to jump on at the moment, so hopefully one of the existing approaches will be able to solve your use-case for now. Let me know how you get on :)
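
In the meantime, a robots.txt-based filter can be approximated outside Pagefind. The sketch below is not a Pagefind feature: it uses Python's standard `urllib.robotparser` to decide which URL paths are allowed, and the result could then feed one of the approaches above. `allowed_by_robots` is a hypothetical helper, and note that `robotparser` applies the first matching rule and does not support wildcard patterns, so more specific `Allow` lines need to come before broader `Disallow` lines.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url_paths: list[str],
                      user_agent: str = "*") -> list[str]:
    """Return the URL paths that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in url_paths if parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    # Example rules and paths are illustrative only.
    rules = "\n".join([
        "User-agent: *",
        "Allow: /docs/latest/",
        "Disallow: /docs/",
    ])
    candidates = ["/about/", "/docs/latest/intro.html", "/docs/v1/intro.html"]
    for path in allowed_by_robots(rules, candidates):
        print(path)
```

Keeping this as a separate, opt-in step matches the concern above: content that is disallowed for external crawlers can still be indexed for on-site search unless you explicitly choose to filter it.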
