Stop explicitly generating bibliography files (re #3997) #4294

Draft · wants to merge 16 commits into base: master
Conversation

@mbollmann (Member)

This is a proposal that comes with performance improvements, but also functional changes. I would like to hear your thoughts on this.

This PR addresses #3997 in the following ways:

  1. Only volume-level BibTeX files are generated; physical files are no longer created for individual papers, nor for the MODS and Endnote formats. This is achieved as follows:
  • create_hugo_data.py also writes BibTeX entries (with abstracts) to the Hugo data files.
  • create_hugo_data.py also writes .bib files for entire volumes (without abstracts).
  • create_bib.py (formerly create_bibtex.py) still creates the full Anthology bib files (as before), but now also calls bibutils to generate the MODS + Endnote formats and adds them to the Hugo data files.
    • All information for the full Anthology bib files is in what create_hugo_data.py produced; no calls to the library are necessary here.
    • The script uses multiprocessing to speed up the generation of the MODS + Endnote formats.
    • It also feeds in entire volumes at a time to reduce the number of subprocess calls to bib2xml and xml2end, then performs some simple parsing to split the generated output back up into papers (see the sketch below).
  2. Hugo templates use JavaScript to provide the file download functionality as before. This is achieved as follows:
  • We embed FileSaver.js, a rather lightweight JS library, which triggers a download of the bibliographic info as a file when the button on a paper page is clicked.
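To make the batching idea concrete, here is a rough sketch. The function and variable names are illustrative, not the actual create_bib.py code; it assumes bibutils' bib2xml and xml2end act as stdin/stdout filters and that Endnote (refer) records are separated by blank lines:

```python
import subprocess
from multiprocessing import Pool

def volume_to_endnote(volume_bibtex: str) -> list[str]:
    """Convert one volume's BibTeX to per-paper Endnote records via bibutils."""
    # One pipeline per volume instead of one subprocess call per paper.
    mods = subprocess.run(["bib2xml"], input=volume_bibtex,
                          capture_output=True, text=True).stdout
    endnote = subprocess.run(["xml2end"], input=mods,
                             capture_output=True, text=True).stdout
    # Simple parsing: refer-style records are separated by blank lines,
    # so the volume-level output can be split back into individual papers.
    return [rec for rec in endnote.split("\n\n") if rec.strip()]

if __name__ == "__main__":
    volumes: list[str] = []  # volume-level BibTeX strings, one per volume
    with Pool() as pool:
        records_per_volume = pool.map(volume_to_endnote, volumes)
```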

Functional changes / Disadvantages

  • On paper pages, everything should look and work exactly as before.
  • Author pages and volume pages no longer show the "bib" button for downloading a BibTeX file for individual papers, since that file no longer exists, and the information is currently only embedded into paper pages.
    • My gut feeling is that these are probably rarely used anyway, but I have no data to back this up.
  • Tools like wget can no longer be used to download bibliographic files, since there are no more files.
    • But getting bibliographic info programmatically can still be done with the Python library (see the sketch below).
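For anyone hit by this, the library route could look roughly as follows; the data path and paper ID are the same placeholders used in the one-liner quoted further down this thread:

```python
from acl_anthology import Anthology

# Point this at a local checkout of the Anthology data directory.
anthology = Anthology("../data")
paper = anthology.get_paper("2020.lrec-1.879")
print(paper.to_bibtex(with_abstract=True))
```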

Advantages

  • We avoid generating a large number of small files (currently over 300,000 files are being generated just for bibliographic information).
  • Hugo allocates a lot less memory (9.04 GB here vs. 23.18 GB on master).
  • The build time of Hugo does not seem to be affected much, but the generation of the bibliographic data is a lot faster (2m 21s here vs. 4m 57s on master).

Building on master

All runs were done starting with make clean ; make venv/bin/activate.

  • time make -j 4 hugo_data bibtex mods endnote = 297 secs
  • time make hugo = 211 secs
hugo v0.140.1+extended+withdeploy linux/amd64 BuildDate=2024-12-23T16:26:35Z VendorInfo=brew

INFO  static: removing all files from destination that don't exist in static dirs
INFO  static: syncing static files to / duration 45.261064888s
INFO  build:  step process substep collect files 44 files_total 44 pages_total 202816 resources_total 1 duration 10.699247093s
INFO  build:  step process duration 10.699304982s
INFO  dynacache: adjusted partitions' max size evicted 4030 numGC 119 limit 2.87 GB alloc 3.66 GB totalAlloc 21.01 GB
INFO  build:  step assemble duration 2.735582227s
INFO  dynacache: adjusted partitions' max size evicted 4028 numGC 120 limit 2.87 GB alloc 3.71 GB totalAlloc 22.03 GB
INFO  dynacache: adjusted partitions' max size evicted 3674 numGC 120 limit 2.87 GB alloc 4.86 GB totalAlloc 23.18 GB
INFO  build:  step render substep pages site en outputFormat html duration 2m26.160003988s
INFO  build:  step render substep pages site en outputFormat rss duration 4.141861147s
INFO  build:  step render pages 202832 content 95666 duration 2m30.525456322s
INFO  build:  step render deferred count 0 duration 2.214µs
INFO  build:  step postProcess duration 133.27µs
INFO  build:  duration 2m43.960821449s

                   |   EN
-------------------+---------
  Pages            | 202832
  Paginator pages  |      0
  Non-page files   |      1
  Static files     | 321503
  Processed images |      0
  Aliases          |      0
  Cleaned          |      0

Total in 209247 ms

Building on this branch

  • time make hugo_data bib = 141 secs
  • time make hugo = 214 secs
hugo v0.140.1+extended+withdeploy linux/amd64 BuildDate=2024-12-23T16:26:35Z VendorInfo=brew

INFO  static: removing all files from destination that don't exist in static dirs
INFO  static: syncing static files to / duration 857.012584ms
INFO  dynacache: adjusted partitions' max size evicted 4030 numGC 39 limit 2.87 GB alloc 3.27 GB totalAlloc 6.71 GB
INFO  dynacache: adjusted partitions' max size evicted 4028 numGC 40 limit 2.87 GB alloc 3.19 GB totalAlloc 8.20 GB
INFO  build:  step process substep collect files 44 files_total 44 pages_total 202816 resources_total 1 duration 17.859181235s
INFO  build:  step process duration 17.859243603s
INFO  dynacache: adjusted partitions' max size evicted 3525 numGC 40 limit 2.87 GB alloc 4.03 GB totalAlloc 9.04 GB
INFO  build:  step assemble duration 3.195823565s
INFO  build:  step render substep pages site en outputFormat html duration 3m3.958069813s
INFO  build:  step render substep pages site en outputFormat rss duration 5.883356589s
INFO  build:  step render pages 202832 content 95666 duration 3m10.167427315s
INFO  build:  step render deferred count 0 duration 2.194µs
INFO  build:  step postProcess duration 95.56µs
INFO  build:  duration 3m31.222840854s

                   |   EN
-------------------+---------
  Pages            | 202832
  Paginator pages  |      0
  Non-page files   |      1
  Static files     |   3109
  Processed images |      0
  Aliases          |      0
  Cleaned          |      0

Total in 212109 ms

@mbollmann (Member Author)

@mjpost This should be the last radical change I will propose to the build pipeline, for this holiday season at least ;)

@mbollmann linked an issue Jan 1, 2025 that may be closed by this pull request
@nschneid (Contributor) commented Jan 1, 2025

What would be the implications for Zotero importing? I seem to recall that it relies on the .bib files, but maybe it could be modified to obtain BibTeX in a different way.

@nschneid (Contributor) commented Jan 1, 2025

Would an alternative be to have the server generate individual .bib files on demand (when the URL is requested)? This would still break any tool that relies upon listing all the .bib files, but wget would work (it would just be slower for any files that have not been requested before).

@mjpost (Member) commented Jan 2, 2025

I love the efficiency gains but wonder if there is a silent contingent of wget users. One thought: I wonder if we could generate the paper bib files on the fly via an htaccess rule and a Python script that would extract them from the associated volume-level bib?

@mbollmann (Member Author)

What would be the implications for Zotero importing?

I just tried Zotero Connector on the branch preview and it works fine. Or is there another way to do Zotero importing I’m not aware of?

@mbollmann (Member Author)

I love the efficiency gains but wonder if there is a silent contingent of wget users. One thought: I wonder if we could generate the paper bib files on the fly via an htaccess rule and a Python script that would extract them from the associated volume-level bib?

It would be the first non-static component we’d introduce on the server, no?

You could also try

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

@mbollmann (Member Author)

I love the efficiency gains but wonder if there is a silent contingent of wget users.

One more thought about this: Not even arXiv seems to have this. BibTeX is not even embedded in the HTML, but generated on the fly via JavaScript.

@mjpost (Member) commented Jan 2, 2025

It would be the first non-static component we’d introduce on the server, no?

More or less, but I'm more concerned with having a single cohesive source for everything, which we would still have in this repo. The .htaccess file is already checked in and could be set up to run a script in the same repo. And even so, I wouldn't object out of hand to adding little dynamic components like this if they were important enough.

You could also try

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

The server has an incompatible Python (3.8) and upgrading got complicated, but running on my MacBook was routinely between 1.5 and 2 seconds. That seems fine to me, to maintain this functionality while still getting the file savings.

One more thought about this: Not even arXiv seems to have this. BibTeX is not even embedded in the HTML, but generated on the fly via JavaScript.

This is a setting where we're superior, though, which is nice; I dislike working with arXiv's citations (which is why David Vilar and I first wrote bibsearch). Another is how much easier it is to manipulate our URLs to switch between the canonical paper page, the PDF, and the BibTeX file. I think this functions as a kind of superuser feature that adds some cachet to the Anthology (evidence). I'd hate to lose it.

I'll try to get Python straightened out on the server tomorrow. We could test your quick proof-of-concept demo with some dummy extension.

@akoehn (Member) commented Jan 2, 2025

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

The server has an incompatible Python (3.8) and upgrading got complicated, but running on my MacBook was routinely between 1.5 and 2 seconds. That seems fine to me, to maintain this functionality while still getting the file savings.

As we still generate the full BibTeX, it seems easier to me to return the snippet from the full BibTeX file. This would not require additional XML data on the server, would guarantee consistency on the BibTeX end, and would most likely be much faster.

@mbollmann (Member Author) commented Jan 2, 2025

With ripgrep:

rg -U '^@[^}]*2024.acl-long.42/[^}]*}' 2024.acl-long.bib

EDIT: Ah wait, closing braces can also appear within an entry. Needs some refinement. :)
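One refinement that avoids the closing-brace problem is a small brace-counting scan instead of a regex. A minimal sketch: it matches the ID against the entry body (via the trailing-slash URL form, as the rg pattern above does) and assumes braces inside field values are balanced, which BibTeX requires:

```python
def extract_entry(bib_text: str, anthology_id: str):
    """Return the BibTeX entry mentioning anthology_id, or None."""
    pos = 0
    while (start := bib_text.find("@", pos)) != -1:
        # Scan forward, tracking brace depth, until the entry closes.
        depth, end = 0, len(bib_text)
        for i in range(start, len(bib_text)):
            if bib_text[i] == "{":
                depth += 1
            elif bib_text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        entry = bib_text[start:end]
        if anthology_id + "/" in entry:  # matches the url field
            return entry
        pos = end
    return None

with open("2024.acl-long.bib", encoding="utf-8") as f:
    print(extract_entry(f.read(), "2024.acl-long.42"))
```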

@mjpost (Member) commented Jan 2, 2025

Here's the script, which can be placed in cgi-bin/generate_bib.cgi:

#!/bin/bash

# When invoked from the command line, take the ID from the first argument.
if [[ -z $QUERY_STRING ]]; then
    QUERY_STRING="anthology_id=$1"
fi
anthid=${QUERY_STRING#anthology_id=}

# Set content type header for plain text
echo "Content-Type: text/plain"
echo ""

# Get volume name (everything up to the second dot)
volume=$(echo "$anthid" | cut -d. -f1-2)

echo "Looking for $anthid in $volume..."

And here are the .htaccess lines:

RewriteRule ^([A-Za-z]\d{2}\-\d{4})\.bib$ /dont-generate-bib-files/cgi-bin/generate_bib.cgi?anthology_id=$1 [L,NC]
RewriteRule ^(\d{4}\.[a-zA-Z\d]+-[a-zA-Z\d]+\.[a-zA-Z\d]+)\.bib$ /dont-generate-bib-files/cgi-bin/generate_bib.cgi?anthology_id=$1 [L,NC]

The script would have to be updated to handle old-style IDs, too. We might just want to do a Python script even if it just loads the volume file. I've got to move to other things today.

@mjpost (Member) commented Jan 2, 2025

Note that generating these on the fly with a script would also allow us to handle the "v2" discrepancy.

@mjpost (Member) commented Jan 2, 2025

Okay, I pushed htaccess rules and a CGI script that should work, even on this preview, provided that hugo/static/cgi-bin/generate_bib.cgi gets copied to cgi-bin/generate_bib.cgi in the build. I'll know once this builds, and if not, maybe you can advise.

github-actions bot commented Jan 3, 2025

Build successful. Some useful links:

This preview will be removed when the branch is merged.

@mjpost (Member) commented Jan 3, 2025

All right, this works now (example). Two comments:

  1. If you like, we can dispense with cgi-bin, and just put the CGI script in the root directory
  2. Now that we are not generating individual bibs, we could do away with the NOBIB=true setting for previews. We adopted this to save time+space generating all those files, which is no longer needed, and it's also often a source of confusion for people viewing the previews.

@mbollmann (Member Author)

All right, this works now (example).

Wow, impressive! Let me have a closer look later and try to see if it breaks anywhere.

Another question is whether we want the same thing for MODS XML and Endnote URLs.

  2. Now that we are not generating individual bibs, we could do away with the NOBIB=true setting for previews. We adopted this to save time+space generating all those files, which is no longer needed, and it's also often a source of confusion for people viewing the previews.

It still saves a lot of build time, as generating the BibTeX/MODS/Endnote is currently by far the slowest part of the pre-Hugo build process. Then again, it’s a lot faster now overall.

One option could be to generate all the BibTeX, but skip generating the full anthology.bib and the MODS/Endnote formats. In this PR, this can be achieved by simply skipping the make bib step, and removing the check for NOBIB=true from create_hugo_data.py.

@mbollmann (Member Author)

Another note (to myself, mainly): Before this is ready for merging, we need to go through all the .bib links in the Hugo templates and adapt their logic; they currently check whether a physical .bib file exists when generating the link.

@mjpost (Member) commented Jan 3, 2025

Another question is whether we want the same thing for MODS XML and Endnote URLs.

I think we should. It's easy functionality to preserve, and I wouldn't be surprised if people have used it.

I wonder if we should continue to generate the volume-level MODS XML and Endnote files, too? It's related to an issue in https://github.com/acl-org/acl-style-files, where we sometimes consider dropping the Word template, but there are actual users of it. I still think there's an argument for that, since it's hard to maintain compatibility (especially with the reviewer submission), but here, it's still very little trouble to generate those.

@mbollmann (Member Author) commented Jan 3, 2025

I wonder if we should continue to generate the volume-level MODS XML and Endnote files, too?

Right, that should be easy to add.

TODOs:

  • Review and test CGI script
  • Add CGI scripts for MODS & Endnote
  • Generate volume-level MODS & Endnote
  • Adapt link generation on all Hugo templates

@mjpost (Member) commented Jan 3, 2025

I was also trying to get the CGI script to return a 404 if the bib wasn't found, but I wonder if the script execution is already downstream of the headers being returned. I could only get it to return a file with 404 in the contents.

@mbollmann (Member Author)

I was also trying to get the CGI script to return a 404 if the bib wasn't found, but I wonder if the script execution is already downstream of the headers being returned. I could only get it to return a file with 404 in the contents.

It should work by just starting the output with "Status: 404 Not Found". https://serverfault.com/a/121738
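In CGI terms that just means the Status line has to be the very first thing the script prints. A minimal Python sketch (hypothetical, not the script in this PR):

```python
#!/usr/bin/env python3
# The Status header must precede all other output; any stray print()
# before this point silently turns the response back into a 200.
print("Status: 404 Not Found")
print("Content-Type: text/plain")
print()
print("No bibliography entry found for this ID.")
```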

@mjpost (Member) commented Jan 3, 2025

That's exactly what I was doing. I'll look again later.

@mjpost (Member) commented Jan 4, 2025

Fixed this—I had a stray print statement at the top of main().

@mjpost (Member) commented Jan 4, 2025

I generalized the script to work with MODS XML and Endnote. Once those volumes are generated we can test it (they're correctly 404s now):

@mbollmann (Member Author)

There’s one functional change with this approach, btw: Abstracts are not part of the volume-level BibTeX, but they were part of the paper-level BibTeX before (and still are on the paper pages). If we fetch entries from the volume-level files, they will not contain abstracts. Not sure how we feel about that?

@mjpost (Member) commented Jan 5, 2025

Good catch. We definitely need to have the abstracts for individual BibTeX files. I can think of two solutions:

  • We add the abstracts to the volume-level files. Is there any reason not to? I don't think we have file size limitations there, like the Overleaf-imposed max on the full-Anthology BibTeX.
  • We write a separate file with abstracts and modify the script to grab those from there. This could be easier—we could just write it out as a JSON dictionary, so the search would be the same for all formats and would just be a load + dict key lookup.
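The second option would amount to a plain load-plus-lookup on the server. A sketch, where the file name abstracts.json and its layout are assumptions:

```python
import json

# Hypothetical file written at build time: {"2024.acl-long.42": "...", ...}
with open("abstracts.json", encoding="utf-8") as f:
    abstracts = json.load(f)

# Format-independent lookup in the CGI script: load + dict key lookup.
abstract = abstracts.get("2024.acl-long.42", "")
```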

@mjpost (Member) commented Jan 5, 2025

Another idea here: now that we're doing CGI, we could generate a single DB and turn every bib file request into a parameterized script, e.g., file.bib?abstract=1&...

@mbollmann (Member Author)

  • We add the abstracts to the volume-level files. Is there any reason not to?

I just copied that behaviour from the old script. It also gave me an efficient (if a little hacky) way to generate the anthology.bib both with and without abstracts in a separate script. But I'm sure this can be done differently.

  • We write a separate file with abstracts and modify the script to grab those from there. This could be easier—we could just write it out as a JSON dictionary, so the search would be the same for all formats and would just be a load + dict key lookup.

At this point, we can just copy the JSON files generated by create_hugo_data.py onto the server and use those directly, including for volume-level files :)

@mbollmann (Member Author)

I generalized the script to work with MODS XML and Endnote. Once those volumes are generated we can test it (they're correctly 404s now):

I implemented this, but they still correctly return 404 because the NOBIB=true behaviour keeps them from being generated :)

Could get rid of that, but let’s maybe first figure out if we even need these files in the first place.

@mjpost (Member) commented Jan 5, 2025

I just copied that behaviour from the old script. It also gave me an efficient (if a little hacky) way to generate the anthology.bib both with and without abstracts in a separate script. But I'm sure this can be done differently.

My guess here is that you would just concatenate the individual volume bib files? This would break the macros and shortcuts that I've implemented to keep the full size anthology bib under 50 MB.

@mbollmann (Member Author)

My guess here is that you would just concatenate the individual volume bib files? This would break the macros and shortcuts that I've implemented to keep the full size anthology bib under 50 MB.

No, I run the exact same macro and shortcut generation that you wrote on them. The efficiency comes from not having to load Anthology() and generate the BibTeX once again in the create_bib.py script.
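For context, the macro trick under discussion factors repeated values (such as proceedings titles) into BibTeX @string definitions, which is roughly how the full file stays under the size limit. An illustrative sketch under those assumptions, not the actual create_bib.py logic:

```python
# Illustrative only: replace one repeated booktitle with an @string macro.
booktitle = ("Proceedings of the 62nd Annual Meeting of the "
             "Association for Computational Linguistics")
macro = "@string{acl2024 = {%s}}\n\n" % booktitle

with open("anthology.bib", encoding="utf-8") as f:
    bib = f.read()
# Every occurrence of the long title now costs a few bytes, not hundreds.
bib = bib.replace("booktitle = {%s}" % booktitle, "booktitle = acl2024")

with open("anthology-abbrev.bib", "w", encoding="utf-8") as f:
    f.write(macro + bib)
```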
