Stop explicitly generating bibliography files (re #3997) #4294
base: master
Conversation
@mjpost This should be the last radical change I will propose to the build pipeline, for this holiday season at least ;)
What would be the implications for Zotero importing? I seem to recall that it relies on the .bib files, but maybe it could be modified to obtain BibTeX in a different way.
Would an alternative be to have the server generate individual .bib files on demand (when the URL is requested)? This would still break any tool that relies upon listing all the .bib files, but wget would work (it would just be slower for any files that have not been requested before).
I love the efficiency gains, but I wonder if there is a silent contingent of wget users. One thought: could we generate the paper bib files on the fly via an htaccess rule and a Python script that extracts them from the associated volume-level bib?
I just tried Zotero Connector on the branch preview and it works fine. Or is there another way to do Zotero importing I'm not aware of?
It would be the first non-static component we'd introduce on the server, no? You could also try
on the server to see if that could be an alternative. ;)
One more thought about this: not even arXiv seems to have this. BibTeX is not even embedded in the HTML, but generated on the fly via JavaScript.
More or less, but I'm more concerned to have a single cohesive source for everything, which we still would in this repo. The
The server has an incompatible Python (3.8) and upgrading got complicated, but runs on my MacBook routinely took between 1.5 and 2 seconds. That seems fine to me as a way to maintain this functionality while still getting the file savings.
This is a setting where we're superior, though, which is nice; I dislike working with arXiv's citations (which is why David Vilar and I first wrote bibsearch). Another is how much easier it is to manipulate our URLs to switch between the canonical paper page, the PDF, and the BibTeX file. I think this functions as a kind of superuser feature that adds some cachet to the Anthology (evidence). I'd hate to lose it. I'll try to get Python straightened out on the server tomorrow. We could test your quick proof-of-concept demo with some dummy extension.
As we still generate a full BibTeX file, it seems easier to me to return the snippet from the full BibTeX file. This would not require additional XML data on the server and would also guarantee consistency on the BibTeX end (and would most likely be much faster).
With ripgrep:
EDIT: Ah wait, closing braces can also appear within an entry. Needs some refinement. :)
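For reference, the refinement could count braces so that closing braces nested inside field values don't terminate the entry early. Here's a rough Python sketch (illustrative only, not code from this PR):

```python
def extract_entry(bibtext, anthology_id):
    """Extract a single BibTeX entry from a volume-level .bib string.

    Counts opening/closing braces so braces inside field values
    (e.g. title = {Foo {Bar}}) don't end the entry prematurely.
    """
    # Find the entry whose key matches the requested ID
    marker = "{" + anthology_id + ","
    pos = bibtext.find(marker)
    if pos == -1:
        return None
    # Back up to the '@' that opens this entry
    start = bibtext.rfind("@", 0, pos)
    depth = 0
    for i in range(start, len(bibtext)):
        if bibtext[i] == "{":
            depth += 1
        elif bibtext[i] == "}":
            depth -= 1
            if depth == 0:
                return bibtext[start : i + 1]
    return None  # unbalanced braces: malformed entry
```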
Here's the script, which can be placed in

```bash
#!/bin/bash
# Fall back to a positional argument when run outside CGI (for testing)
if [[ -z $QUERY_STRING ]]; then
    QUERY_STRING="anthology_id=$1"
fi
anthid=${QUERY_STRING#anthology_id=}

# CGI response headers (plain text), followed by the required blank line
echo "Content-Type: text/plain"
echo ""

# Derive the volume ID from the paper ID (e.g., 2020.acl-main.1 -> 2020.acl-main)
volume=$(echo "$anthid" | cut -d. -f1-2)
echo "Looking for $anthid in $volume..."
```

And here's the
The script would have to be updated to handle old-style IDs, too. We might just want to use a Python script, even if it just loads the volume file. I've got to move on to other things today.
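For illustration, handling both ID styles could look like this hypothetical helper (it assumes the usual Anthology conventions, e.g. that old-style workshop (`W`) volumes use two digits while other old-style volumes use one; the actual script may differ):

```python
def volume_id(anthology_id):
    """Map a paper ID to its volume ID, handling both ID styles.

    New-style:  2020.acl-main.1 -> 2020.acl-main
    Old-style:  P19-1001 -> P19-1,  W19-5201 -> W19-52
    (Sketch only; conventions assumed, not taken from this PR.)
    """
    if "." in anthology_id:
        # new-style: <year>.<venue-volume>.<paper> -- drop the paper number
        return anthology_id.rsplit(".", 1)[0]
    collection, number = anthology_id.split("-")
    # old-style: 'W' collections use two volume digits, others one
    digits = 2 if collection[0] == "W" else 1
    return collection + "-" + number[:digits]
```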
Note that generating these on the fly with a script would also allow us to handle the "v2" discrepancy.
Okay, I pushed htaccess rules and a CGI script that should work, even on this preview, provided that
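For context, the htaccess side of such a setup might look roughly like this (hypothetical file names and paths; the actual rules in this PR may differ):

```apache
# Hypothetical sketch: route requests for .bib files that don't exist
# on disk to a CGI script that extracts the entry on the fly
Options +ExecCGI
AddHandler cgi-script .cgi
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)\.bib$ /cgi-bin/fetch_bib.cgi?anthology_id=$1 [L]
```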
All right, this works now (example). Two comments:
Wow, impressive! Let me have a closer look later and try to see if it breaks anywhere. Another question is whether we want the same thing for MODS XML and Endnote URLs.
It still saves a lot of build time, as generating the BibTeX/MODS/Endnote is currently by far the slowest part of the pre-Hugo build process. Then again, it’s a lot faster now overall. One option could be to generate all the BibTeX, but skip generating the full anthology.bib and the MODS/Endnote formats. In this PR, this can be achieved by simply skipping the |
Another note (to myself, mainly): Before this is ready for merging, we need to go through all the .bib links in the Hugo templates and adapt their logic; they currently check if a physical .bib file exists for generating the link. |
I think we should. It's easy functionality to preserve, and I wouldn't be surprised if people have used it. I wonder if we should also continue to generate the volume-level MODS XML and Endnote files? It's related to an issue in https://github.com/acl-org/acl-style-files, where we sometimes consider dropping the Word template, but there are actual users of it. I still think there's an argument for dropping that, since it's hard to maintain compatibility (especially with the reviewer submission), but here, it's still very little trouble to generate those files.
Right, that should be easy to add. TODOs:
I was also trying to get the CGI script to return a 404 if the bib wasn't found, but I wonder if the script execution is already downstream of the headers being returned. I could only get it to return a file with 404 in the contents.
It should work by just starting the output with "Status: 404 Not Found". https://serverfault.com/a/121738 |
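For illustration: in CGI, the status is just another header line emitted before the blank header/body separator. A minimal sketch of the idea (not the PR's actual script):

```python
def cgi_response(entry):
    """Build a minimal CGI response string.

    A 'Status:' line before the blank separator is how a CGI script
    sets the HTTP status code; without it the server defaults to 200.
    (Hypothetical helper for illustration.)
    """
    if entry is None:
        return ("Status: 404 Not Found\n"
                "Content-Type: text/plain\n"
                "\n"
                "Not found\n")
    return "Content-Type: text/plain\n\n" + entry

# A CGI script would simply print the result to stdout:
#   print(cgi_response(found_entry), end="")
```

Note that nothing (not even a stray debugging print) may reach stdout before the `Status:` line, or the server will have already committed to a 200 response.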
That's exactly what I was doing. I'll look again later. |
Fixed this: I had a stray print statement at the top of
I generalized the script to work with MODS XML and Endnote. Once those volumes are generated we can test it (they're correctly 404s now):
There’s one functional change with this approach, btw: Abstracts are not part of the volume-level BibTeX, but they were part of the paper-level BibTeX before (and still are on the paper pages). If we fetch entries from the volume-level files, they will not contain abstracts. Not sure how we feel about that? |
Good catch. We definitely need to have the abstracts for individual BibTeX files. I can think of two solutions:
Another idea here: now that we're doing CGI, we could generate a single DB and turn every bib file request into a parameterized script, e.g., |
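This single-DB idea could be sketched as follows, e.g. with SQLite (entirely hypothetical schema and function names, just to show the shape of the approach):

```python
import sqlite3

def build_db(path, entries):
    """Build-time step: store BibTeX entries keyed by Anthology ID.

    `entries` maps anthology_id -> BibTeX entry string.
    (Hypothetical schema for illustration.)
    """
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS bib "
        "(anthology_id TEXT PRIMARY KEY, entry TEXT)"
    )
    con.executemany("INSERT OR REPLACE INTO bib VALUES (?, ?)", entries.items())
    con.commit()
    return con

def lookup(con, anthology_id):
    """Per-request step: what the parameterized CGI script would call."""
    row = con.execute(
        "SELECT entry FROM bib WHERE anthology_id = ?", (anthology_id,)
    ).fetchone()
    return row[0] if row else None
```

One request would then be a single indexed lookup instead of scanning a volume file.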
I just copied that behaviour from the old script. It also gave me an efficient (if slightly hacky) way to generate the anthology.bib both with and without abstracts in a separate script. But I'm sure this can be done differently.
At this point, we can just copy the JSON files generated by |
I implemented this, but they still correctly return 404s because the NOBIB=true behaviour doesn't generate them :) We could get rid of that, but let's maybe first figure out whether we even need these files in the first place.
My guess here is that you would just concatenate the individual volume bib files? This would break the macros and shortcuts that I've implemented to keep the full-size anthology bib under 50 MB.
No, I run the exact same macro and shortcut generation that you wrote on them. The efficiency comes from not having to load Anthology() and generate the BibTeX once again in the create_bib.py script. |
This is a proposal that comes with performance improvements, but also functional changes. I would like to hear your thoughts on this.
This PR addresses #3997 in the following ways:
- `create_hugo_data.py` now also writes BibTeX entries (with abstracts) to the Hugo data files.
- `create_hugo_data.py` also writes .bib files for entire volumes (without abstracts).
- `create_bib.py` (formerly `create_bibtex.py`) now creates the full Anthology bib files (as before), but also calls bibutils to generate MODS + Endnote formats and adds them to the Hugo data files. It works off what `create_hugo_data.py` produced; no calls to the library are necessary here.

**Functional changes / Disadvantages**

- `wget` can no longer be used to download bibliographic files, since there are no more files.

**Advantages**
**Building on master**

All runs were done starting with `make clean ; make venv/bin/activate`.

- `time make -j 4 hugo_data bibtex mods endnote` = 297 secs
- `time make hugo` = 211 secs

**Building on this branch**

- `time make hugo_data bib` = 141 secs
- `time make hugo` = 214 secs