reading multiple word files uses excessive space on /tmp #180
Thanks @gcpoole, I'll try to address this soon. Or even better... would you consider fixing it and issuing a PR?
Hey @kbenoit. I was able to figure out the PR. The fix I implemented was pretty basic. As far as I could tell, there were two file types -- docx and odt -- that wrote information to a temporary file as part of their corresponding "get_" functions. I added a call to "unlink(path)" to each of these "get_" functions (a sketch of the pattern appears after this comment). I also noted in the following comment (line 195, readtext.R):
I started to dig into that a bit, but came across this comment (line 78, utils.R):
which gave me pause about messing with list_files(). So, I've issued a PR for an incremental patch that solves my issue (I'm using
Regardless, thanks for making your package available! It's been a big help to me in processing large .docx files!
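A minimal sketch of the pattern described above (the function name and internals here are illustrative only, not the package's actual code):

```r
## Illustrative only: a simplified "get_"-style reader for .docx files.
## The fix described above amounts to removing the temporary files the
## reader creates before the function returns.
get_docx_sketch <- function(file) {
  # a .docx is a zip archive; extract it into a per-call temp directory
  tmp_dir <- tempfile("docx_")
  utils::unzip(file, exdir = tmp_dir)

  # read the main document XML and crudely strip the markup
  xml <- readLines(file.path(tmp_dir, "word", "document.xml"), warn = FALSE)
  txt <- gsub("<[^>]+>", " ", paste(xml, collapse = " "))

  # the fix: delete the extracted files so they do not pile up on /tmp
  unlink(tmp_dir, recursive = TRUE)

  txt
}
```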
Excellent, thanks! I'll get to that this week.
readtext leaves temporary files on /tmp in Linux each time you read a .docx file. If you read a lot of large .docx files, the space on /tmp can get used up quickly.

I have been using Natural Language Processing on a corpus of 1000 books, where each book is a few hundred pages. Each time I open a book, about 10 MB of space is consumed on /tmp, and the files are not cleaned up by readtext. So if I open and process all 1000 books (one at a time, in a loop), about 10 GB of disk space is consumed on /tmp!

It would be really nice if readtext would clean up the /tmp files it creates before it returns its result, rather than leaving potentially large files behind.
A reprex would be to create a .docx file (foo.docx) consisting of about 300 pages of text, then run the following loop in R.
This will open foo.docx 1000 times, creating 1000 temporary folders on /tmp, which will consume 5-10 GB of disk space.