Issues with stopwords when working with make_doc_from_text_chunks #230
I've been experimenting with this and I think I've found a workaround, though it may be very computationally inefficient: I initialize the document, then set the stopwords, then re-initialize the document. This not only solves the issue of not being able to reset stopwords, it also seems to fix the issue of some stopwords not being picked up on the first pass. The need to do this is very odd behavior, however.
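A minimal sketch of that workaround pattern using spaCy directly (textacy 0.6.x wraps a spaCy vocab underneath). The helper name, the use of `spacy.blank("en")`, and the keyword-extraction stand-in (just filtering non-stopwords) are my assumptions for illustration, not code from the original report:

```python
import spacy

def keywords_with_stopwords(text, stopwords, nlp):
    """Workaround sketch: parse once, reset/set stopword flags, parse again."""
    # first pass: populate the vocab with this text's lexemes
    nlp(text)
    # reset: clear any custom stopword flags left over from a previous document
    # (simple exact-text check against the defaults; ignores case variants)
    for lex in nlp.vocab:
        lex.is_stop = lex.text in nlp.Defaults.stop_words
    # set this document's custom stopwords on the shared vocab
    for word in stopwords:
        nlp.vocab[word].is_stop = True
    # second pass: the re-built doc now sees the updated flags
    doc = nlp(text)
    return [t.text for t in doc if not t.is_stop]

nlp = spacy.blank("en")
print(keywords_with_stopwords("the fiscal report", {"fiscal"}, nlp))
```

The double parse is wasteful for large documents, but it forces the lexeme flags to be consulted after they have been set, which is the behavior the first pass alone appears to miss.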
I'm having numerous issues with stopwords when using textacy's make_doc_from_text_chunks functionality.
Expected Behavior
I want to be able to load a model and then fire documents at it in order to find keywords, in a way that lets me reset the stopwords I'm using from document to document.
Current Behavior
Setting stopwords for the first document works fine, but when I attempt to reset the stopwords for the next document, they appear to revert to the defaults, and I cannot use a new, custom set of stopwords. Some stopwords are also missed on the first pass.
Possible Solution
I think a flag is being set somewhere in textacy when I call make_doc_from_text_chunks to set the stopwords, and I can't for the life of me find a way to unset it. I would say this is a bug somewhere.
Steps to Reproduce (for bugs)
In order to ensure reproducibility, I have provided both some example Python code showing the bug and a Dockerfile (in the environment section) which should make it easy to reproduce the problem. Example code:
Details of the docker container are given in the environment section.
Context
I want to create a tool which produces keywords from arbitrarily large documents, with stopwords set based on their context, and which does not require a restart when processing a different set of documents. For example, a series of financial reports should not return "fiscal" or "financial" in their keywords, and the tool should not have to restart to process a series of performance reviews with "performance" set as a stopword.
Your Environment
Run in a docker container using the following code:

and the requirements file is:
spacy version: 2.0.18
spacy models: en, en_core_web_sm
textacy version: 0.6.2