Stop explicitly generating bibliography files (re #3997) #4294

Draft · wants to merge 16 commits into base: master
Conversation

@mbollmann (Member)

This is a proposal that comes with performance improvements, but also functional changes. I would like to hear your thoughts on this.

This PR addresses #3997 in the following ways:

  1. Only volume-level BibTeX files are generated; physical files are no longer created for individual papers, nor for the MODS and Endnote formats. This is achieved as follows:
  • create_hugo_data.py also writes BibTeX entries (with abstracts) to the Hugo data files.
  • create_hugo_data.py also writes .bib files for entire volumes (without abstracts).
  • create_bib.py (formerly create_bibtex.py) still creates the full Anthology bib files (as before), but now also calls bibutils to generate the MODS + Endnote formats and adds them to the Hugo data files.
    • All information for the full Anthology bib files is in what create_hugo_data.py produced; no calls to the library are necessary here.
    • The script uses multiprocessing to speed up the generation of the MODS + Endnote formats.
    • It also feeds in entire volumes at a time to reduce the number of subprocess calls to bib2xml and xml2end, then performs some simple parsing to split the generated output back up into papers (see the sketch below).
  2. Hugo templates use JavaScript to provide the file download functionality as before. This is achieved as follows:
  • We embed FileSaver.js, a rather lightweight JS library, which triggers a download of the bibliographic info as a file when the button on a paper page is clicked.
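To make the batching idea concrete, here is a rough sketch. The function and variable names are illustrative, not the actual create_bib.py code; it assumes bibutils' bib2xml and xml2end act as stdin/stdout filters and that Endnote (refer) records are separated by blank lines:

```python
import subprocess
from multiprocessing import Pool

def volume_to_endnote(volume_bibtex: str) -> list[str]:
    """Convert one volume's BibTeX to per-paper Endnote records via bibutils."""
    # One pipeline per volume instead of one subprocess call per paper.
    mods = subprocess.run(["bib2xml"], input=volume_bibtex,
                          capture_output=True, text=True).stdout
    endnote = subprocess.run(["xml2end"], input=mods,
                             capture_output=True, text=True).stdout
    # Simple parsing: refer-style records are separated by blank lines,
    # so the volume-level output can be split back into individual papers.
    return [rec for rec in endnote.split("\n\n") if rec.strip()]

if __name__ == "__main__":
    volumes: list[str] = []  # volume-level BibTeX strings, one per volume
    with Pool() as pool:
        records_per_volume = pool.map(volume_to_endnote, volumes)
```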

Functional changes / Disadvantages

  • On paper pages, everything should look and work exactly as before.
  • Author pages and volume pages no longer show the "bib" button for downloading a BibTeX file for individual papers, since that file no longer exists, and the information is currently only embedded into paper pages.
    • My gut feeling is that these are probably rarely used anyway, but I have no data to back this up.
  • Tools like wget can no longer be used to download bibliographic files, since there are no more files.
    • But getting bibliographic info programmatically can still be done with the Python library (see the sketch below).
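For anyone hit by this, the library route could look roughly as follows; the data path and paper ID are the same placeholders used in the one-liner quoted further down this thread:

```python
from acl_anthology import Anthology

# Point this at a local checkout of the Anthology data directory.
anthology = Anthology("../data")
paper = anthology.get_paper("2020.lrec-1.879")
print(paper.to_bibtex(with_abstract=True))
```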

Advantages

  • We avoid generating a large number of small files (currently over 300,000 files are being generated just for bibliographic information).
  • Hugo allocates a lot less memory (9.04 GB here vs. 23.18 GB on master).
  • The build time of Hugo does not seem to be affected much, but the generation of the bibliographic data is a lot faster (2m 21s here vs. 4m 57s on master).

Building on master

All runs were done starting with make clean ; make venv/bin/activate.

  • time make -j 4 hugo_data bibtex mods endnote = 297 secs
  • time make hugo = 211 secs
hugo v0.140.1+extended+withdeploy linux/amd64 BuildDate=2024-12-23T16:26:35Z VendorInfo=brew

INFO  static: removing all files from destination that don't exist in static dirs
INFO  static: syncing static files to / duration 45.261064888s
INFO  build:  step process substep collect files 44 files_total 44 pages_total 202816 resources_total 1 duration 10.699247093s
INFO  build:  step process duration 10.699304982s
INFO  dynacache: adjusted partitions' max size evicted 4030 numGC 119 limit 2.87 GB alloc 3.66 GB totalAlloc 21.01 GB
INFO  build:  step assemble duration 2.735582227s
INFO  dynacache: adjusted partitions' max size evicted 4028 numGC 120 limit 2.87 GB alloc 3.71 GB totalAlloc 22.03 GB
INFO  dynacache: adjusted partitions' max size evicted 3674 numGC 120 limit 2.87 GB alloc 4.86 GB totalAlloc 23.18 GB
INFO  build:  step render substep pages site en outputFormat html duration 2m26.160003988s
INFO  build:  step render substep pages site en outputFormat rss duration 4.141861147s
INFO  build:  step render pages 202832 content 95666 duration 2m30.525456322s
INFO  build:  step render deferred count 0 duration 2.214µs
INFO  build:  step postProcess duration 133.27µs
INFO  build:  duration 2m43.960821449s

                   |   EN
-------------------+---------
  Pages            | 202832
  Paginator pages  |      0
  Non-page files   |      1
  Static files     | 321503
  Processed images |      0
  Aliases          |      0
  Cleaned          |      0

Total in 209247 ms

Building on this branch

  • time make hugo_data bib = 141 secs
  • time make hugo = 214 secs
hugo v0.140.1+extended+withdeploy linux/amd64 BuildDate=2024-12-23T16:26:35Z VendorInfo=brew

INFO  static: removing all files from destination that don't exist in static dirs
INFO  static: syncing static files to / duration 857.012584ms
INFO  dynacache: adjusted partitions' max size evicted 4030 numGC 39 limit 2.87 GB alloc 3.27 GB totalAlloc 6.71 GB
INFO  dynacache: adjusted partitions' max size evicted 4028 numGC 40 limit 2.87 GB alloc 3.19 GB totalAlloc 8.20 GB
INFO  build:  step process substep collect files 44 files_total 44 pages_total 202816 resources_total 1 duration 17.859181235s
INFO  build:  step process duration 17.859243603s
INFO  dynacache: adjusted partitions' max size evicted 3525 numGC 40 limit 2.87 GB alloc 4.03 GB totalAlloc 9.04 GB
INFO  build:  step assemble duration 3.195823565s
INFO  build:  step render substep pages site en outputFormat html duration 3m3.958069813s
INFO  build:  step render substep pages site en outputFormat rss duration 5.883356589s
INFO  build:  step render pages 202832 content 95666 duration 3m10.167427315s
INFO  build:  step render deferred count 0 duration 2.194µs
INFO  build:  step postProcess duration 95.56µs
INFO  build:  duration 3m31.222840854s

                   |   EN
-------------------+---------
  Pages            | 202832
  Paginator pages  |      0
  Non-page files   |      1
  Static files     |   3109
  Processed images |      0
  Aliases          |      0
  Cleaned          |      0

Total in 212109 ms

@mbollmann (Member Author)

@mjpost This should be the last radical change I will propose to the build pipeline, for this holiday season at least ;)

@mbollmann linked an issue Jan 1, 2025 that may be closed by this pull request
@nschneid (Contributor) commented Jan 1, 2025

What would be the implications for Zotero importing? I seem to recall that it relies on the .bib files, but maybe it could be modified to obtain BibTeX in a different way.

@nschneid (Contributor) commented Jan 1, 2025

Would an alternative be to have the server generate individual .bib files on demand (when the URL is requested)? This would still break any tool that relies upon listing all the .bib files, but wget would work (it would just be slower for any files that have not been requested before).

@mjpost (Member) commented Jan 2, 2025

I love the efficiency gains but wonder if there is a silent contingent of wget users. One thought: I wonder if we could generate the paper bib files on the fly via an htaccess rule and a Python script that would extract them from the associated volume-level bib?

@mbollmann (Member Author)

What would be the implications for Zotero importing?

I just tried Zotero Connector on the branch preview and it works fine. Or is there another way to do Zotero importing I’m not aware of?

@mbollmann (Member Author)

I love the efficiency gains but wonder if there is a silent contingent of wget users. One thought: I wonder if we could generate the paper bib files on the fly via an htaccess rule and a Python script that would extract them from the associated volume-level bib?

It would be the first non-static component we’d introduce on the server, no?

You could also try

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

@mbollmann (Member Author)

I love the efficiency gains but wonder if there is a silent contingent of wget users.

One more thought about this: Not even arXiv seems to have this. BibTeX is not even embedded in the HTML, but generated on the fly via JavaScript.

@mjpost (Member) commented Jan 2, 2025

It would be the first non-static component we’d introduce on the server, no?

More or less, but I'm more concerned with having a single cohesive source for everything, which we would still have in this repo. The .htaccess file is already checked in and could be set up to run a script in the same repo. And even so, I wouldn't object out of hand to adding little dynamic components like this if they were important enough.

You could also try

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

The server has an incompatible Python (3.8) and upgrading got complicated, but running on my MacBook was routinely between 1.5 and 2 seconds. That seems fine to me, to maintain this functionality while still getting the file savings.

One more thought about this: Not even arXiv seems to have this. BibTeX is not even embedded in the HTML, but generated on the fly via JavaScript.

This is a setting where we're superior, though, which is nice; I dislike working with arXiv's citations (which is why David Vilar and I first wrote bibsearch). Another is how much easier it is to manipulate our URLs to switch between the canonical paper page, the PDF, and the BibTeX file. I think this functions as a kind of superuser feature that adds some cachet to the Anthology (evidence). I'd hate to lose it.

I'll try to get Python straightened out on the server tomorrow. We could test your quick proof-of-concept demo with some dummy extension.

@akoehn (Member) commented Jan 2, 2025

time python -c "from acl_anthology import Anthology; Anthology('../data').get_paper('2020.lrec-1.879').to_bibtex(with_abstract=True)"

on the server to see if that could be an alternative. ;)

The server has an incompatible Python (3.8) and upgrading got complicated, but running on my MacBook was routinely between 1.5 and 2 seconds. That seems fine to me, to maintain this functionality while still getting the file savings.

As we still generate the full BibTeX, it seems easier to me to return the snippet from the full BibTeX file. This would not require additional XML data on the server, would guarantee consistency on the BibTeX end, and would most likely be much faster.

@mbollmann (Member Author) commented Jan 2, 2025

With ripgrep:

rg -U '^@[^}]*2024.acl-long.42/[^}]*}' 2024.acl-long.bib

EDIT: Ah wait, closing braces can also appear within an entry. Needs some refinement. :)
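One refinement that avoids the closing-brace problem is a small brace-counting scan instead of a regex. A minimal sketch: it matches the ID against the entry body (via the trailing-slash URL form, as the rg pattern above does) and assumes braces inside field values are balanced, which BibTeX requires:

```python
def extract_entry(bib_text: str, anthology_id: str):
    """Return the BibTeX entry mentioning anthology_id, or None."""
    pos = 0
    while (start := bib_text.find("@", pos)) != -1:
        # Scan forward, tracking brace depth, until the entry closes.
        depth, end = 0, len(bib_text)
        for i in range(start, len(bib_text)):
            if bib_text[i] == "{":
                depth += 1
            elif bib_text[i] == "}":
                depth -= 1
                if depth == 0:
                    end = i + 1
                    break
        entry = bib_text[start:end]
        if anthology_id + "/" in entry:  # matches the url field
            return entry
        pos = end
    return None

with open("2024.acl-long.bib", encoding="utf-8") as f:
    print(extract_entry(f.read(), "2024.acl-long.42"))
```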

@mjpost (Member) commented Jan 2, 2025

Here's the script, which can be placed in cgi-bin/generate_bib.cgi:

#!/bin/bash

# When invoked from the command line, take the ID from the first argument.
if [[ -z $QUERY_STRING ]]; then
    QUERY_STRING="anthology_id=$1"
fi
anthid=${QUERY_STRING#anthology_id=}

# Set content type header for plain text
echo "Content-Type: text/plain"
echo ""

# Get volume name (everything up to the second dot)
volume=$(echo "$anthid" | cut -d. -f1-2)

echo "Looking for $anthid in $volume..."

And here are the .htaccess lines:

RewriteRule ^([A-Za-z]\d{2}\-\d{4})\.bib$ /dont-generate-bib-files/cgi-bin/generate_bib.cgi?anthology_id=$1 [L,NC]
RewriteRule ^(\d{4}\.[a-zA-Z\d]+-[a-zA-Z\d]+\.[a-zA-Z\d]+)\.bib$ /dont-generate-bib-files/cgi-bin/generate_bib.cgi?anthology_id=$1 [L,NC]

The script would have to be updated to handle old-style IDs, too. We might just want to do a Python script even if it just loads the volume file. I've got to move to other things today.

@mjpost (Member) commented Jan 2, 2025

Note that generating these on the fly with a script would also allow us to handle the "v2" discrepancy.

@mjpost (Member) commented Jan 2, 2025

Okay, I pushed htaccess rules and a CGI script that should work, even on this preview, provided that hugo/static/cgi-bin/generate_bib.cgi gets copied to cgi-bin/generate_bib.cgi in the build. I'll know once this builds, and if not, maybe you can advise.

github-actions bot commented Jan 3, 2025

Build successful. Some useful links:

This preview will be removed when the branch is merged.

@mjpost (Member) commented Jan 3, 2025

All right, this works now (example). Two comments:

  1. If you like, we can dispense with cgi-bin, and just put the CGI script in the root directory
  2. Now that we are not generating individual bibs, we could do away with the NOBIB=true setting for previews. We adopted this to save time+space generating all those files, which is no longer needed, and it's also often a source of confusion for people viewing the previews.

@mbollmann (Member Author)

All right, this works now (example).

Wow, impressive! Let me have a closer look later and try to see if it breaks anywhere.

Another question is whether we want the same thing for MODS XML and Endnote URLs.

  2. Now that we are not generating individual bibs, we could do away with the NOBIB=true setting for previews. We adopted this to save time+space generating all those files, which is no longer needed, and it's also often a source of confusion for people viewing the previews.

It still saves a lot of build time, as generating the BibTeX/MODS/Endnote is currently by far the slowest part of the pre-Hugo build process. Then again, it’s a lot faster now overall.

One option could be to generate all the BibTeX, but skip generating the full anthology.bib and the MODS/Endnote formats. In this PR, this can be achieved by simply skipping the make bib step, and removing the check for NOBIB=true from create_hugo_data.py.

@mbollmann (Member Author)

Another note (to myself, mainly): Before this is ready for merging, we need to go through all the .bib links in the Hugo templates and adapt their logic; they currently check whether a physical .bib file exists when generating the link.

@mjpost (Member) commented Jan 3, 2025

Another question is whether we want the same thing for MODS XML and Endnote URLs.

I think we should. It's easy functionality to preserve, and I wouldn't be surprised if people have used it.

I wonder if we should continue to generate the volume-level MODS XML and Endnote files, too? It's related to an issue in https://github.com/acl-org/acl-style-files, where we sometimes consider dropping the Word template, but there are actual users of it. I still think there's an argument for that, since it's hard to maintain compatibility (especially with the reviewer submission), but here, it's still very little trouble to generate those.

@mbollmann (Member Author) commented Jan 3, 2025

I wonder if we should continue to generate the volume-level MODS XML and Endnote files, too?

Right, that should be easy to add.

TODOs:

  • Review and test CGI script
  • Add CGI scripts for MODS & Endnote
  • Generate volume-level MODS & Endnote
  • Adapt link generation on all Hugo templates

@mjpost (Member) commented Jan 3, 2025

I was also trying to get the CGI script to return a 404 if the bib wasn't found, but I wonder if the script execution is already downstream of the headers being returned. I could only get it to return a file with 404 in the contents.

@mbollmann (Member Author)

I was also trying to get the CGI script to return a 404 if the bib wasn't found, but I wonder if the script execution is already downstream of the headers being returned. I could only get it to return a file with 404 in the contents.

It should work by just starting the output with "Status: 404 Not Found". https://serverfault.com/a/121738
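In CGI terms that just means the Status line has to be the very first thing the script prints. A minimal Python sketch (hypothetical, not the script in this PR):

```python
#!/usr/bin/env python3
# The Status header must precede all other output; any stray print()
# before this point silently turns the response back into a 200.
print("Status: 404 Not Found")
print("Content-Type: text/plain")
print()
print("No bibliography entry found for this ID.")
```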

@mjpost (Member) commented Jan 3, 2025

That's exactly what I was doing. I'll look again later.

@mjpost (Member) commented Jan 4, 2025

Fixed this—I had a stray print statement at the top of main().

@mjpost (Member) commented Jan 4, 2025

I generalized the script to work with MODS XML and Endnote. Once those volumes are generated we can test it (they're correctly 404s now):

@mbollmann (Member Author)

There’s one functional change with this approach, btw: Abstracts are not part of the volume-level BibTeX, but they were part of the paper-level BibTeX before (and still are on the paper pages). If we fetch entries from the volume-level files, they will not contain abstracts. Not sure how we feel about that?

@mjpost (Member) commented Jan 5, 2025

Good catch. We definitely need to have the abstracts for individual BibTeX files. I can think of two solutions:

  • We add the abstracts to the volume-level files. Is there any reason not to? I don't think we have file size limitations there, like the Overleaf-imposed max on the full-Anthology BibTeX.
  • We write a separate file with abstracts and modify the script to grab those from there. This could be easier—we could just write it out as a JSON dictionary, so the search would be the same for all formats and would just be a load + dict key lookup.
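The second option would amount to a plain load-plus-lookup on the server. A sketch, where the file name abstracts.json and its layout are assumptions:

```python
import json

# Hypothetical file written at build time: {"2024.acl-long.42": "...", ...}
with open("abstracts.json", encoding="utf-8") as f:
    abstracts = json.load(f)

# Format-independent lookup in the CGI script: load + dict key lookup.
abstract = abstracts.get("2024.acl-long.42", "")
```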

@mjpost (Member) commented Jan 5, 2025

Another idea here: now that we're doing CGI, we could generate a single DB and turn every bib file request into a parameterized script, e.g., file.bib?abstract=1&...

@mbollmann (Member Author)

  • We add the abstracts to the volume-level files. Is there any reason not to?

I just copied that behaviour from the old script. It also gave me an efficient (if a little hacky) way to generate the anthology.bib both with and without abstracts in a separate script. But I'm sure this can be done differently.

  • We write a separate file with abstracts and modify the script to grab those from there. This could be easier—we could just write it out as a JSON dictionary, so the search would be the same for all formats and would just be a load + dict key lookup.

At this point, we can just copy the JSON files generated by create_hugo_data.py onto the server and use those directly, including for volume-level files :)

@mbollmann (Member Author)

I generalized the script to work with MODS XML and Endnote. Once those volumes are generated we can test it (they're correctly 404s now):

I implemented this, but they still correctly return 404 because the NOBIB=true behaviour keeps them from being generated :)

Could get rid of that, but let’s maybe first figure out if we even need these files in the first place.

@mjpost (Member) commented Jan 5, 2025

I just copied that behaviour from the old script. It also gave me an efficient (if a little hacky) way to generate the anthology.bib both with and without abstracts in a separate script. But I'm sure this can be done differently.

My guess here is that you would just concatenate the individual volume bib files? This would break the macros and shortcuts that I've implemented to keep the full size anthology bib under 50 MB.

@mbollmann (Member Author)

My guess here is that you would just concatenate the individual volume bib files? This would break the macros and shortcuts that I've implemented to keep the full size anthology bib under 50 MB.

No, I run the exact same macro and shortcut generation that you wrote on them. The efficiency comes from not having to load Anthology() and generate the BibTeX once again in the create_bib.py script.
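For context, the macro trick under discussion factors repeated values (such as proceedings titles) into BibTeX @string definitions, which is roughly how the full file stays under the size limit. An illustrative sketch under those assumptions, not the actual create_bib.py logic:

```python
# Illustrative only: replace one repeated booktitle with an @string macro.
booktitle = ("Proceedings of the 62nd Annual Meeting of the "
             "Association for Computational Linguistics")
macro = "@string{acl2024 = {%s}}\n\n" % booktitle

with open("anthology.bib", encoding="utf-8") as f:
    bib = f.read()
# Every occurrence of the long title now costs a few bytes, not hundreds.
bib = bib.replace("booktitle = {%s}" % booktitle, "booktitle = acl2024")

with open("anthology-abbrev.bib", "w", encoding="utf-8") as f:
    f.write(macro + bib)
```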
