reading multiple word files uses excessive space on /tmp #180

Open
gcpoole opened this issue Jan 3, 2025 · 3 comments

Comments


gcpoole commented Jan 3, 2025

readtext leaves temporary files on /tmp on Linux each time you read a .docx file. If you read a lot of large .docx files, the space on /tmp can fill up quickly.

I have been using Natural Language Processing on a corpus of 1000 books, where each book is a few hundred pages. Each time I open a book, about 10 MB of space is consumed on /tmp. The files are not cleaned up by readtext. So if I open and process all 1000 books (one at a time, in a loop), about 10 GB of disk space is consumed on /tmp!!

It would be really nice if readtext would clean up the /tmp files it creates before it returns its result, rather than leaving potentially large files behind.

A reprex would be to create a .docx file (foo.docx) consisting of about 300 pages of text, then run the following loop in R:

    for (i in 1:1000) {
      dummy <- readtext::readtext("/path/to/foo.docx")
    }

This will open foo.docx 1000 times, creating 1000 temporary folders on /tmp, which together will consume 5-10 GB of disk space.
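In the meantime, one possible workaround is to delete whatever each call leaves behind. This is only a sketch, and it assumes the leftover folders are created under the R session temporary directory reported by tempdir():

    # snapshot the session temp dir, read the file, then delete anything new
    before <- list.files(tempdir(), full.names = TRUE)
    dummy  <- readtext::readtext("/path/to/foo.docx")
    after  <- list.files(tempdir(), full.names = TRUE)
    unlink(setdiff(after, before), recursive = TRUE)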


kbenoit commented Jan 7, 2025

Thanks @gcpoole, I'll try to address this soon. Or even better... would you consider fixing it and issuing a PR?


gcpoole commented Jan 18, 2025

Hey @kbenoit. I was able to figure out the PR. The fix I implemented was pretty basic. As far as I could tell, there were two file types -- docx and odt -- whose corresponding "get_" functions wrote information to a temporary file. I added a call to unlink(path) to each of these "get_" functions.
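Roughly, the shape of the change is as follows. This is only a sketch: the actual internal function names and parsing code in readtext differ, and extract_docx_text() here is a hypothetical stand-in.

    get_docx <- function(path) {
      tmp <- tempfile()
      utils::unzip(path, exdir = tmp)   # a .docx is a zip archive; extract it to a temp folder
      doc <- file.path(tmp, "word", "document.xml")
      txt <- extract_docx_text(doc)     # hypothetical stand-in for the real XML parsing
      unlink(tmp, recursive = TRUE)     # the added cleanup before returning
      txt
    }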

I also noted the following comment (line 195 of readtext.R):

    # TODO: files need to be imported as they are discovered. Currently
    # list_files() uses a lot of storage space for temporary files when there
    # are a lot of archives.

I started to dig into that a bit, but came across this comment (line 78 of utils.R):

    #  The implementation of list_files and list_file might seem very
    #  complex, but it was arrived at after a lot of toil. The main design decision
    #  made here is ...

which gave me pause about messing with list_files(). So I've issued a PR for an incremental patch that solves my issue (in my own code I use dir() for file discovery and lapply() to read the files one at a time, roughly the pattern sketched below), but it seems like the /tmp issue might extend past my proposed solution.
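For reference, that pattern looks roughly like this (a sketch; the corpus path is a placeholder):

    # discover the files with dir(), then read them one at a time with lapply()
    files <- dir("/path/to/corpus", pattern = "\\.docx$", full.names = TRUE)
    texts <- lapply(files, readtext::readtext)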

Regardless, thanks for making your package available! It's been a big help to me in processing large .docx files!


kbenoit commented Jan 19, 2025

Excellent, thanks! I'll get to that this week.
