programmers_guide_eng
1. The user sends a file and additional parameters via a POST request.
2. The API module saves the file in a temporary directory and calls the manager. The API code is in `dedoc/api/dedoc_api.py`.
3. The manager renames the file, keeping its extension. It is important that the file name contains no spaces, non-ASCII symbols, injections, or other unnecessary content. The manager then tries to convert the file with `FileConverter`. The manager's code is in `dedoc/manager/dedoc_manager.py`.
4. `FileConverter` checks whether it can convert a file with the given extension. If it can, it performs the conversion and returns the name of the converted file. Otherwise it returns the input file name. The code is in `dedoc_project/converters/file_converter.py`.
5. After conversion, information is extracted from the file by `DocParser`. It returns the pair (`UnstructuredDocument`, flag indicating whether the file can contain attachments). The code is in `dedoc/readers/doc_parser.py`.
6. `StructureConstructor` creates the structured document. The constructor takes an `UnstructuredDocument` as its input parameter and returns a `DocumentContent`. An example can be found in `dedoc/structure_constructor/tree_constructor.py`.
7. `MetadataExtractor` enriches the document with metadata. Its code is in `dedoc/metadata_extractor/basic_metadata_extractor.py`.
8. (optional) Attachments are extracted and analyzed. This procedure is performed by the manager (each attachment file goes through stages 2 to 8).
9. The user gets the result as a response.
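The renaming stage performed by the manager can be illustrated with a minimal sketch. It is one possible approach and not necessarily what dedoc does: the function name `make_safe_name` and the naming scheme (timestamp plus random hex string) are assumptions made for this example. The original name is discarded entirely, so spaces, non-ASCII symbols, and injected commands cannot survive, while the extension is kept for the later stages.

```python
import os
import time
import uuid


def make_safe_name(original_name: str) -> str:
    """Build a new file name that keeps only the extension of the original.

    Hypothetical helper for illustration; not part of dedoc's API.
    """
    _, extension = os.path.splitext(original_name)
    suffix = extension[1:]  # extension without the leading dot
    # Drop the extension too if it is not plain ASCII letters and digits.
    if not (suffix.isascii() and suffix.isalnum()):
        extension = ""
    # The stem is generated, so nothing from the user-supplied name remains in it.
    return f"{int(time.time())}_{uuid.uuid4().hex}{extension}"


if __name__ == "__main__":
    print(make_safe_name("my report; rm -rf.docx"))
```

Because the stem is generated rather than cleaned, there is no need to enumerate every dangerous character in the incoming name.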
The API module is responsible for processing requests and sending responses back. It also contains helper functions (e.g. for handling online docs and displaying the logo). The code is in `dedoc/api/dedoc_api.py`.
The manager performs the major part of the work, but, as often happens, it does so by delegating tasks to its subordinates. The manager is responsible for all pipeline stages except receiving the request and sending the response back. It can process a file from a request as well as a file from the local file system. The manager is configured with a special config file, stored in `dedoc/manager_config.py`. The manager's code is in `dedoc/manager/dedoc_manager.py`.
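The delegation described above can be sketched as follows. This is a simplified illustration, not dedoc's actual manager: `SketchManager`, `ParsedFile`, and the five callables passed to the constructor are hypothetical names, and each callable stands in for one of the components (converter, parser, structure constructor, metadata extractor, attachment extractor).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ParsedFile:
    """Hypothetical stand-in for the result produced for one file."""
    name: str
    content: str
    metadata: dict
    attachments: List["ParsedFile"] = field(default_factory=list)


class SketchManager:
    """Runs the pipeline stages by delegating each one to a helper."""

    def __init__(
        self,
        convert: Callable[[str], str],
        read: Callable[[str], Tuple[str, bool]],
        build_structure: Callable[[str], str],
        extract_metadata: Callable[[str], dict],
        get_attachments: Callable[[str], List[str]],
    ) -> None:
        self.convert = convert
        self.read = read
        self.build_structure = build_structure
        self.extract_metadata = extract_metadata
        self.get_attachments = get_attachments

    def parse(self, path: str, with_attachments: bool = False) -> ParsedFile:
        converted = self.convert(path)                        # conversion stage
        unstructured, may_have_attachments = self.read(converted)  # reading stage
        content = self.build_structure(unstructured)          # structure stage
        metadata = self.extract_metadata(converted)           # metadata stage
        result = ParsedFile(path, content, metadata)
        # Optional stage: every attachment goes through the same stages again.
        if with_attachments and may_have_attachments:
            for attachment_path in self.get_attachments(converted):
                result.attachments.append(self.parse(attachment_path, with_attachments))
        return result
```

Passing the stages in as callables keeps the sketch short; the point is only that the manager itself contains no parsing logic and that attachments are handled by calling the same method recursively.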
`FileConverter` tries to convert the file using its list of basic converters. It asks each converter in turn whether it can process a file with this particular extension. If a converter can, the file is converted and `FileConverter` returns the name of the converted file. If none of the listed converters can perform the operation, `FileConverter` simply returns the original file name.
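A minimal sketch of this "ask every converter" logic is given below. The class and method names (`SketchConverter`, `can_convert`, `convert`) and the choice of `.doc`/`.odt` to `.docx` as the example conversion are assumptions for illustration; dedoc's real converters may be organized differently. The sketch does not touch the file system, it only computes the resulting file name.

```python
import os
from typing import FrozenSet, List


class SketchConverter:
    """Hypothetical base converter; not dedoc's real class."""

    source_extensions: FrozenSet[str] = frozenset()
    target_extension: str = ""

    def can_convert(self, extension: str) -> bool:
        return extension.lower() in self.source_extensions

    def convert(self, path: str) -> str:
        # A real converter would write a new file; this sketch only builds its name.
        root, _ = os.path.splitext(path)
        return root + self.target_extension


class SketchDocxConverter(SketchConverter):
    source_extensions = frozenset({".doc", ".odt"})
    target_extension = ".docx"


class SketchFileConverter:
    """Delegates to the first basic converter that accepts the extension."""

    def __init__(self, converters: List[SketchConverter]) -> None:
        self.converters = converters

    def convert(self, path: str) -> str:
        extension = os.path.splitext(path)[1]
        for converter in self.converters:
            if converter.can_convert(extension):
                return converter.convert(path)
        # No converter accepted the file: return the input name unchanged.
        return path
```

With `SketchFileConverter([SketchDocxConverter()])`, the name `report.doc` becomes `report.docx`, while `image.png` is returned as is.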
`DocParser` has a list of basic readers that it uses to read the file. One by one, `DocParser` asks every listed reader whether it is able to read the file (this depends on the file extension). If no reader is able to read the file, `BadFileFormatException` is raised. Otherwise, the file is read by one of the readers.
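The reader selection can be sketched in the same way. Only the name `BadFileFormatException` comes from the text above, and here it is redefined as a local stand-in; `SketchTxtReader`, `SketchDocParser`, and the methods `can_read`, `read`, and `parse` are hypothetical. The returned pair mirrors the (document, can-have-attachments) pair described earlier, with a plain list of strings in place of `UnstructuredDocument`.

```python
from typing import List, Tuple


class BadFileFormatException(Exception):
    """Local stand-in for the exception named in the text."""


class SketchTxtReader:
    """Hypothetical reader that accepts only .txt files."""

    def can_read(self, path: str) -> bool:
        return path.lower().endswith(".txt")

    def read(self, path: str) -> Tuple[List[str], bool]:
        with open(path, encoding="utf-8") as file:
            lines = file.read().splitlines()
        # Plain text files cannot carry attachments.
        return lines, False


class SketchDocParser:
    """Asks each reader in turn and uses the first one that accepts the file."""

    def __init__(self, readers: list) -> None:
        self.readers = readers

    def parse(self, path: str) -> Tuple[List[str], bool]:
        for reader in self.readers:
            if reader.can_read(path):
                return reader.read(path)
        raise BadFileFormatException(f"no reader can handle {path}")
```

Note the difference from the converter stage: an unsupported file is not passed through silently here, it results in an exception.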
`BaseReader` is used to extract the data and metadata of the document's content (`UnstructuredDocument`), along with information on whether the document can have attachments. `UnstructuredDocument` consists of a list of pages and lines, where every line is represented as a `LineWithMeta` object.
`LineWithMeta` contains the text, metadata about the text (the type of the line, the line number, etc.), a list of annotations (an annotation carries information about individual words and parts of the text), and a `HierarchyLevel`, which is necessary for folding the document.
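To make the shape of this data concrete, here is a rough sketch using dataclasses. It is not the real `LineWithMeta` definition: the class names `SketchLine` and `SketchAnnotation`, all field names, and the representation of the hierarchy level as a pair of integers are assumptions chosen to match the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SketchAnnotation:
    """Hypothetical annotation: marks a character range of the line."""
    start: int   # index of the first annotated character
    end: int     # index just past the last annotated character
    name: str    # what is annotated, e.g. "bold"
    value: str


@dataclass
class SketchLine:
    """Hypothetical stand-in for a line with its metadata."""
    text: str
    line_type: str                    # e.g. "header" or "paragraph"
    line_number: int
    hierarchy_level: Tuple[int, int]  # assumed (level1, level2) pair
    annotations: List[SketchAnnotation] = field(default_factory=list)

    def annotated_text(self, annotation: SketchAnnotation) -> str:
        """Return the part of the text that the annotation covers."""
        return self.text[annotation.start:annotation.end]


line = SketchLine(
    text="Chapter 1. Overview",
    line_type="header",
    line_number=0,
    hierarchy_level=(1, 1),
    annotations=[SketchAnnotation(0, 9, "bold", "True")],
)
print(line.annotated_text(line.annotations[0]))  # prints "Chapter 1"
```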
`HierarchyLevel` defines the nesting level. The nesting level is given by two numbers, `level1` and `level2` (the smaller the number, the more important the line). For example, when lines are listed with their nesting levels in brackets, the levels show that the first line is the heading, the second one is nested in the first, and the third one is nested in the second.
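The way nesting follows from the two numbers can be sketched as below. Two assumptions are made here that the text does not state: levels are compared lexicographically (`level1` first, then `level2`), and a line is nested in the closest preceding line that is strictly more important. The names `SketchLevel` and `find_parents` and the level values in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True, order=True)
class SketchLevel:
    """Hypothetical nesting level; order=True compares level1, then level2."""
    level1: int
    level2: int


def find_parents(levels: List[SketchLevel]) -> List[Optional[int]]:
    """For each line, return the index of the line it is nested in (None if top level)."""
    parents: List[Optional[int]] = []
    stack: List[int] = []  # indices of lines that can still receive children
    for index, level in enumerate(levels):
        # Close every open line that is not strictly more important than this one.
        while stack and levels[stack[-1]] >= level:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(index)
    return parents


levels = [SketchLevel(1, 1), SketchLevel(1, 2), SketchLevel(2, 1)]
print(find_parents(levels))  # prints [None, 0, 1]
```

With these example levels the first line is the heading, the second is nested in the first, and the third is nested in the second, which matches the situation described above.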
Look here to get more information.