document_processor
Description
This repository contains the document processing modules. The logic of this module is:
- The user uploads a document or selects a document from an available source.
- The document is processed. Every processor returns its result as a list of strings.
- The list of strings is wrapped in a JSON object that holds general execution information (the user who made the request, the chosen processor, the date, etc.) and a content field containing the list of strings.
- This JSON goes to the File Management module, which sends it to the table where the users' documents are stored, to the user's OpenAI Assistant, and to the S3 bucket with a reference to the user.
- The backend must let the user work with the processed document in all AI products. If a product needs the document by sections (Knowledge Extractor), it is sent by sections; if the product handles the entire document (Smart Assistant or Mirror), the sections are sent concatenated.
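The wrapping and delivery steps above can be sketched as follows. This is only an illustration: the function names and the field names other than content (user, processor, processed_at) are assumptions, not the schema used by the File Management module.

```python
import json
from datetime import datetime, timezone


def build_document_payload(sections, user_id, processor):
    """Wrap a processor's list of strings in a JSON-serialisable envelope.

    All field names except "content" are hypothetical.
    """
    return {
        "user": user_id,
        "processor": processor,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "content": list(sections),
    }


def content_for_product(payload, by_sections):
    """Return the sections as a list, or concatenated into a single string."""
    sections = payload["content"]
    return sections if by_sections else "\n\n".join(sections)


payload = build_document_payload(["Section 1", "Section 2"], "user-123", "sherpa")
print(json.dumps(payload, indent=2))
```

A product such as Knowledge Extractor would call content_for_product(payload, by_sections=True), while Smart Assistant or Mirror would pass by_sections=False.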
Processors
Sherpa Processor
This is the main processor for documents uploaded by the user or provided via URL. It supports the following formats: PDF, URL, HTML, DOCX, BYTES. The functions to interact with it are found in the sherpa_processor.py file, along with a demo.py.
It has two main parts: the ingestor and the LLM-Sherpa API.
The ingestor is a Docker image that must be started on a server with these commands:
docker pull ghcr.io/nlmatics/nlm-ingestor:latest
docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest-
The service is then reachable locally at: "http://localhost:5010/api/parseDocument?renderFormat=all"
To put this service into production, some kind of reverse proxy or gateway must be added.
-- UPDATE --
Sherpa-Ingestor deployed: "http://162.19.110.103/api/parseDocument?renderFormat=all"
--
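As a rough sketch of how a client might reach the ingestor, the snippet below builds the endpoint URL and passes it to the llmsherpa LayoutPDFReader. The helper names ingestor_url and read_document are hypothetical, and using LayoutPDFReader here is an assumption; demo.py remains the reference for how the repository actually obtains the doc.

```python
from urllib.parse import urlencode


def ingestor_url(host="localhost", port=5010, render_format="all"):
    """Build the parseDocument endpoint URL (hypothetical helper)."""
    netloc = host if port is None else f"{host}:{port}"
    query = urlencode({"renderFormat": render_format})
    return f"http://{netloc}/api/parseDocument?{query}"


def read_document(path_or_url, api_url=None):
    """Parse a PDF path or URL through the ingestor and return the doc."""
    # Imported lazily so the URL helper works without llmsherpa installed.
    from llmsherpa.readers import LayoutPDFReader

    reader = LayoutPDFReader(api_url or ingestor_url())
    return reader.read_pdf(path_or_url)


print(ingestor_url())
print(ingestor_url("162.19.110.103", port=None))
```

Passing port=None yields a URL without an explicit port, which matches the form of the deployed endpoint.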
Once the doc is obtained (see demo.py), its sections can be extracted.
The aim of this processor is to clean files and infer their sections intelligently. The get_section_text function produces the sections given a doc, and merge_small_sections merges small sections into larger ones.
The output is a list of strings.
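To illustrate the idea of merging small sections, here is a minimal sketch. It is not the repository's merge_small_sections: the function name, the min_chars threshold, and the choice to measure size in characters are all assumptions.

```python
def merge_small_sections_sketch(sections, min_chars=200):
    """Accumulate consecutive sections until the combined text reaches min_chars."""
    merged = []
    buffer = ""
    for section in sections:
        buffer = f"{buffer}\n{section}" if buffer else section
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Trailing short text: attach it to the last merged section,
        # or keep it on its own if nothing has been emitted yet.
        if merged:
            merged[-1] = f"{merged[-1]}\n{buffer}"
        else:
            merged.append(buffer)
    return merged


print(merge_small_sections_sketch(["Title", "Short intro", "x" * 300], min_chars=50))
```

The result is still a list of strings, so it fits the common output format of the processors.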
Eur-Lex Processor
This processor takes a Eur-Lex URL and, like the Sherpa Processor, returns a list of strings, with each article forming a section.
The functions are found in eurlex_processor.py, along with a demo.py.
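The "one article per section" idea can be sketched on plain text that has already been extracted from the page. The heading pattern and the split_into_articles name are assumptions for illustration; the actual parsing in eurlex_processor.py may work differently (for example, on the HTML structure).

```python
import re

# Assumed heading format: a line consisting only of "Article <number>".
ARTICLE_HEADING = re.compile(r"^Article[ \t]+\d+[a-z]?[ \t]*$", re.MULTILINE)


def split_into_articles(text):
    """Split text into one string per article, keeping each heading.

    Any text before the first heading is returned as its own section.
    """
    starts = [match.start() for match in ARTICLE_HEADING.finditer(text)]
    bounds = [0] + starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk]


sample = "Preamble\nArticle 1\nSubject matter\nArticle 2\nDefinitions\n"
for section in split_into_articles(sample):
    print(repr(section))
```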
BOE Processor
This processor retrieves the BOE Summary for a given date. The BOE Summary is a document that is published every day and contains the references to all the specific BOEs.
A specific BOE is a legislative document.
- The user flow is as follows: the user selects Source -> BOE.
- The frontend shows a date selector in which a single day can be selected.
- The get_boe function is then executed for that day.
- The extract_boe_refs function is applied to the result of get_boe.
- The result is a list of dictionaries. Each dictionary has a 'type'.
- The 'metadata' type contains the information of the BOE Summary.
- The 'reference' type contains the information of each specific BOE in that Summary.
- The frontend then shows a dropdown with the unique values of the 'department' field, from which the user selects one.
- Once the user selects a department, all the 'epigraph' values of that department are shown.
- Finally, the user selects one 'epigraph'.
- The urlPdf field of that Department-Epigraph is then shown to the user as a link that opens in another tab.
- If the user presses an 'Add' button, the urlPdf is sent to the Sherpa-Processor, which processes it as a URL.
- If that fails (the process is not infallible), the PDF at that URL is downloaded to a temporary folder and sent to the Sherpa-Processor as a PDF file instead.
- If it fails again, an error message is sent.
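The selection and fallback steps above can be sketched as below. The dictionary keys 'type', 'department', 'epigraph' and urlPdf come from the description; everything else is hypothetical, including the function names and the process_url, download_pdf and process_pdf callables, which stand in for the real Sherpa-Processor entry points.

```python
import tempfile


def unique_departments(entries):
    """Distinct 'department' values of the 'reference' entries, in order."""
    seen = []
    for entry in entries:
        dept = entry.get("department")
        if entry.get("type") == "reference" and dept is not None and dept not in seen:
            seen.append(dept)
    return seen


def epigraphs_for(entries, department):
    """All 'epigraph' values that belong to the given department."""
    return [e["epigraph"] for e in entries
            if e.get("type") == "reference" and e.get("department") == department]


def pdf_urls_for(entries, department, epigraph):
    """urlPdf values for the selected Department-Epigraph pair."""
    return [e["urlPdf"] for e in entries
            if e.get("type") == "reference"
            and e.get("department") == department
            and e.get("epigraph") == epigraph]


def process_with_fallback(url_pdf, process_url, download_pdf, process_pdf):
    """Try the URL first; on failure, download the PDF and process the file."""
    try:
        return process_url(url_pdf)
    except Exception:
        pass  # Fall through to the file-based attempt.
    try:
        with tempfile.TemporaryDirectory() as tmp_dir:
            pdf_path = download_pdf(url_pdf, tmp_dir)
            return process_pdf(pdf_path)
    except Exception as exc:
        raise RuntimeError(f"Could not process {url_pdf}") from exc
```

The callables are injected so the fallback order can be exercised without network access. The RuntimeError raised at the end is where the backend would produce the error message for the user.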
Test Files
The repository contains a 'test_files' folder with sample files for testing the Sherpa-Processor.