Exclude one or more directories and/or files #772

stevespringett · 2025-01-10T01:25:14Z

Thanks so much for pagefind. I'm implementing it for the first time. The site I'm implementing it on has the typical marketing-type pages, but it also has documentation pages for multiple versions of a product. I want pagefind to index only the most recent version of the product's documentation and exclude all older versions.

Ideally, pagefind could read robots.txt and utilize the allow and disallow patterns in there. But that's essentially what I'm looking for. If I search for a term, I want the results from the most recent version of the product to be returned, not legacy or unsupported versions.

The only way I've found to do this is some trickery with the build process, which is less than ideal.

bglw · 2025-01-19T19:37:48Z

👋 @stevespringett I'll rattle off some ideas

Pagefind Globs

Pagefind's glob setting configures which files are ingested. A glob can combine multiple patterns, e.g.:

# in pagefind.yml
glob: "{pages/**/*.html,about/**/*.html,/docs/latest/**/*.html}"

`data-pagefind-body`

If any pages on your site have this attribute, Pagefind will only index pages which have it configured. In this case, assuming the site is built via some static site generator, my go-to would be to have the data-pagefind-body attribute added to the template for all of the marketing pages, and for the latest version of the documentation pages. I'd give the older versions of the documentation a different layout/template that didn't include this attribute, which would make Pagefind omit their content.

This is the approach git-scm.com takes — if you compare source between https://git-scm.com/docs/git-diff and https://git-scm.com/docs/git-diff/2.47.0 you'll see the former has data-pagefind-body on div id="main" while the latter does not.
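To check which templates actually emit the attribute, the built HTML can be inspected with a small script. This is a minimal sketch using only the Python standard library; the helper name `has_pagefind_body` is made up for illustration and is not part of Pagefind.

```python
from html.parser import HTMLParser


class _BodyAttrFinder(HTMLParser):
    """Records whether any start tag carries a data-pagefind-body attribute."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a valueless attribute
        # such as <div data-pagefind-body> arrives with value None.
        if any(name == "data-pagefind-body" for name, _ in attrs):
            self.found = True


def has_pagefind_body(html: str) -> bool:
    """Hypothetical helper: True if the markup contains the attribute."""
    finder = _BodyAttrFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Running this over the build output for a latest-version page and a legacy page should give `True` and `False` respectively if the templates are set up as described.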

Indexing APIs

For anything super custom, either the Python API or the NodeJS API can be a good avenue. This lets you do more custom logic via those programming languages when building out your index.
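As a rough illustration of the kind of custom logic this allows, the sketch below only selects which built HTML files should be indexed; it does not call Pagefind itself, and the resulting list would still need to be handed to whichever indexing API is used. The directory names in `EXCLUDED_PREFIXES` and both function names are assumptions for the example.

```python
from pathlib import Path, PurePosixPath

# Hypothetical layout: docs live under docs/<version>/ and the legacy
# versions below should stay out of the index. Adjust to the real site.
EXCLUDED_PREFIXES = ("docs/v1", "docs/v2")


def is_excluded(rel_path, excluded=EXCLUDED_PREFIXES):
    """True if rel_path sits inside one of the excluded directories.

    Matching is done per path component, so "docs/v1" excludes
    "docs/v1/index.html" but not "docs/v10/index.html".
    """
    parts = PurePosixPath(rel_path).parts
    for prefix in excluded:
        prefix_parts = PurePosixPath(prefix).parts
        if parts[: len(prefix_parts)] == prefix_parts:
            return True
    return False


def select_html_files(site_root, excluded=EXCLUDED_PREFIXES):
    """Return sorted site-relative paths of the HTML files to index."""
    root = Path(site_root)
    rel_paths = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.html"))
    return [rel for rel in rel_paths if not is_excluded(rel, excluded)]
```

Keeping the selection step separate from the indexing call makes it easy to print or test the list before anything is written to the index.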

> Ideally, pagefind could read robots.txt and utilize the allow and disallow patterns in there

I do like this idea also! It would have to be opt-in as I have (and have seen) many use-cases where content is disallowed from external search engines, but in-scope for Pagefind. It does seem like a great setting to have available though.
It's not something I have time to jump on at the moment, so hopefully one of the existing approaches will be able to solve your use-case for now. Let me know how you get on :)
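In the meantime, a robots.txt-driven filter can be approximated in a custom indexing script. The sketch below is not a Pagefind feature; it uses Python's standard `urllib.robotparser`, which applies simple prefix rules rather than the full wildcard syntax some crawlers support. The function name `build_robots_filter` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_filter(robots_txt, user_agent="*"):
    """Hypothetical helper: turn robots.txt text into an allow/deny check.

    Returns a function that takes a site-relative URL path and reports
    whether the rules for user_agent permit it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url_path):
        return parser.can_fetch(user_agent, url_path)

    return allowed


# Example rules (assumed for illustration): hide two legacy doc versions.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /docs/v1/
Disallow: /docs/v2/
"""

allowed = build_robots_filter(EXAMPLE_ROBOTS)
```

With these example rules, `allowed("/docs/v1/index.html")` is `False` while `allowed("/docs/latest/index.html")` is `True`, so the check can gate which pages are passed on for indexing.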
