reading multiple word files uses excessive space on /tmp #180

Open
gcpoole opened this issue Jan 3, 2025 · 3 comments

Comments


gcpoole commented Jan 3, 2025

readtext leaves temporary files on /tmp on Linux each time you read a .docx file. If you read a lot of large .docx files, the space on /tmp can fill up quickly.

I have been using Natural Language Processing on a corpus of 1000 books, where each book is a few hundred pages. Each time I open a book, about 10 MB of space is consumed on /tmp. The files are not cleaned up by readtext. So if I open and process all 1000 books (one at a time, in a loop), about 10 GB of disk space is consumed on /tmp!!

It would be really nice if readtext would clean up the /tmp files it creates before it returns its result, rather than leaving potentially large files behind.

A reprex would be to create a .docx file (foo.docx) consisting of about 300 pages of text, then run the following loop in R:

    for (i in 1:1000) {
      dummy <- readtext::readtext("/path/to/foo.docx")
    }

This will open foo.docx 1000 times, creating 1000 temporary folders on /tmp, which together will consume 5-10 GB of disk space.
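In the meantime, one possible workaround is to delete whatever each call leaves behind. This is only a sketch, and it assumes the leftover folders are created under the R session temporary directory reported by tempdir():

    # snapshot the session temp dir, read the file, then delete anything new
    before <- list.files(tempdir(), full.names = TRUE)
    dummy  <- readtext::readtext("/path/to/foo.docx")
    after  <- list.files(tempdir(), full.names = TRUE)
    unlink(setdiff(after, before), recursive = TRUE)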


kbenoit commented Jan 7, 2025

Thanks @gcpoole, I'll try to address this soon. Or even better... would you consider fixing it and issuing a PR?


gcpoole commented Jan 18, 2025

Hey @kbenoit. I was able to figure out the PR. The fix I implemented was pretty basic. As far as I could tell, there were two file types -- docx and odt -- whose corresponding "get_" functions wrote information to a temporary file. I added a call to unlink(path) to each of these "get_" functions.
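Roughly, the shape of the change is as follows. This is only a sketch: the actual internal function names and parsing code in readtext differ, and extract_docx_text() here is a hypothetical stand-in.

    get_docx <- function(path) {
      tmp <- tempfile()
      utils::unzip(path, exdir = tmp)   # a .docx is a zip archive; extract it to a temp folder
      doc <- file.path(tmp, "word", "document.xml")
      txt <- extract_docx_text(doc)     # hypothetical stand-in for the real XML parsing
      unlink(tmp, recursive = TRUE)     # the added cleanup before returning
      txt
    }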

I also noted the following comment (line 195 of readtext.R):

    # TODO: files need to be imported as they are discovered. Currently
    # list_files() uses a lot of storage space for temporary files when there
    # are a lot of archives.

I started to dig into that a bit, but came across this comment (line 78 of utils.R):

    #  The implementation of list_files and list_file might seem very
    #  complex, but it was arrived at after a lot of toil. The main design decision
    #  made here is ...

which gave me pause about messing with list_files(). So I've issued a PR for an incremental patch that solves my issue (in my own code I use dir() for file discovery and lapply() to read the files one at a time, roughly the pattern sketched below), but it seems like the /tmp issue might extend past my proposed solution.
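For reference, that pattern looks roughly like this (a sketch; the corpus path is a placeholder):

    # discover the files with dir(), then read them one at a time with lapply()
    files <- dir("/path/to/corpus", pattern = "\\.docx$", full.names = TRUE)
    texts <- lapply(files, readtext::readtext)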

Regardless, thanks for making your package available! It's been a big help to me in processing large .docx files!


kbenoit commented Jan 19, 2025

Excellent, thanks! I'll get to that this week.
