Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Searching the documentation with filters #131

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

polibb
Copy link
Collaborator

@polibb polibb commented May 14, 2021

Searching the mathlib library contents in the documentation and filtering of the resulting set of declarations.
One is now able to:

  • search through the declaration names, the module names and the description texts that belong to any declarations of theorems, axioms, definitions, etc.
  • filter the search results before or after submitting the search query by choosing values for attributes and kind of the resulting declarations
  • selected attributes are combined with an OR. If an attribute in the list of attributes for any given declaration is found in the list submitted in the filters, then the declaration is legitimate.
  • selected kind options are combined with an OR.
  • if both any attributes are chosen and any kind options are selected, then the two sets of OR-connected values are combined with an AND. Hence, if a declaration fits the selection of one, but does not fit the possibilities allowed by the other filter, the declaration is considered illegitimate.

More information:

  1. Results are shown in prioritized sequence from most matches per declaration contents to least.
  2. Highest priority is given to the declaration name. Following is its descriptive text added as a comment in the mathlib library itself. Lastly, the module name the declaration is found in (file name) carries the least priority.
  3. Submitting the filters form at any point also re-submits the search query, if there is any.
  4. Maximum of 10 results is shown, but an option to expand and review all results is present in the UI.
  5. While the user is on any given page and searches for a term there for the first time the results will take awhile to load, but any subsequent search takes no extra time whatsoever. Thus, we recommend loading a search result once on any page and designating it for further searches, if needed, because it will save you time.
  6. Navigation with the keyboard is not possible at this point, please expect it in the near future.

polibb and others added 30 commits November 10, 2020 09:41
This way pylint will notice if you accidentally use the wrong key.
I'm not sure if this is officially "pythonic".

It seems like more could be done translating the json to actual data structures with named fields; then these lookup tables wouldn't be necessary!
But this might be a lot of work for little gain.
use class instead of dict for structure field names
@polibb
Copy link
Collaborator Author

polibb commented May 15, 2021

UPDATE: Fixes pushed, please pull to see them.

  • handled console errors for missing elements and trying to access methods on them
  • loading added when going from basic (max 10) results set to "Show all" when rendering the results on the DOM itself takes too long and might crash the site is abused

The script that has been used for almost a year and is running in the background for the search to function in this PR too is being ran by a SharedWorker which is not compatible with most browsers on a phone and Safari and IE on a desktop. This is not ideal, I understand, but will be improved in the future as it is not part of the scope at this time for this PR.

@bryangingechen
Copy link
Collaborator

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

Copy link
Contributor

@RaitoBezarius RaitoBezarius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's super cool :)

@@ -0,0 +1,53 @@
class mathlibStructures:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, that would make sense to make those inherit from enum.Enum also?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try using Enum at first, yes, but there was some issue - not sure what it was since I haven't focused on that in a while. Actually none of the classes here are used at the moment, this is going to be included in another PR which focuses on organizational structure of the python script itself, so that it's easier to read and update. I should probably remove this file from this PR and will look into using Enums again for the next one where those will be needed.

}
}

return hasKindFilter && hasAttrFilter ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to split these 3 ternary expressions into 3-ifs or early return?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ternary operator by itself is faster than if/else if no additional computation is involved during its execution, but what's more important I think is that this is more readable and easier to change or extend to more types of filters in the long run. I can return the result before getting here, of course - where the check for hasKindFilter is, for example, just return if there is no hasAttrFilter. Also return directly when hasAttrFilter gives a value to isAttrFilterIncluded, if there is no hasKindFilter. But this kind of structuring mixes two different actions: checking if the result matches any of the filters, and deciding whether this is sufficient. To be more exact, if we decide to change the logic from "you need to cover both the attribute filters and the kind filters, if both are present" to "you can cover either filter selected, as long as at least one of any of them matches your attributes or kind" we will have to change the return statement here only. Otherwise we will have to follow through the whole logic in detail and decide whether to move each return statement and how are they connected at all for the purpose of the method itself. I'd leave it like it is, to be honest

@gebner
Copy link
Member

gebner commented May 17, 2021

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

@bryangingechen
Copy link
Collaborator

#deploy

@github-actions
Copy link

github-actions bot commented Jun 8, 2021

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

searchWorker.js Outdated
// Adapted from the default tokenizer and
// https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BZ%7D&abb=on&c=on&esc=on&g=&i=
const SEPARATOR = /[._\n\r \u00A0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/u
req.open('GET', 'searchable_data.json', false /* blocking */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is 7-8x larger than the decl.bmp we currently load- I worry this will substantially increase load time for every search.

@bryangingechen
Copy link
Collaborator

One other thing that would be great to have before this goes live is an easily accessible / easy-to-find page describing all the search features. Maybe an extra section on the mostly-blank index page https://leanprover-community.github.io/mathlib_docs_demo/ will do?

@robertylewis
Copy link
Member

I think the only blocking thing here is the size of the json file as Eric points out. Was the reason for using a bmp before compression or caching? If we zip the json file when we generate it and unzip locally, I guess browsers won't cache the unzipped version, right?

For @bryangingechen 's suggestion, I think that's a good place to describe the search features and we could probably use some text from this PR.

@gebner
Copy link
Member

gebner commented Jun 29, 2021

The reason for the .bmp extension instead of .json is that github will then automatically gzip it. See #125

@gebner
Copy link
Member

gebner commented Jun 29, 2021

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

@eric-wieser
Copy link
Member

The bmp is now 1660854 bytes, and loads in half a second on my machine - I think it was 15mB before compression before, so the file extension change definitely helps.

However, the call to miniSearch.addAll(indexedData) takes 28 seconds for me, vs 6 seconds it takes on the live site.

@robertylewis
Copy link
Member

I've created polibb#4 to briefly document the search.

I agree with Eric that there's still a slowdown, but it's not near 28 seconds for me. To me the new performance is acceptable, and I'm okay merging, what do others think?

@robertylewis
Copy link
Member

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

@robertylewis
Copy link
Member

Hmm, the 404 page needs to be updated as well:
https://leanprover-community.github.io/mathlib_docs_demo/find/nat.addo

@robertylewis
Copy link
Member

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

@gebner
Copy link
Member

gebner commented Jan 28, 2022

#deploy

@github-actions
Copy link

This PR has been successfully deployed at http://leanprover-community.github.io/mathlib_docs_demo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants