A Docker-powered service for extracting pieces of information from PDFs
- Dependencies
- Requirements
- Docker containers
- How to use it
- How to use GPU
- HTTP server
- Queue processor
- Service configuration
- Get service logs
- Set up environment for development
- Execute tests
- Execute performance test
- Troubleshooting
-
Docker 23.0.1 install link
-
Docker-compose
Note: On mac Docker-compose is installed with Docker
- 12Gb RAM memory
A redis server is needed to use the service asynchronously. For that matter, it can be used the
command make start:testing
that has a built-in
redis server.
Containers with make start
Containers with make start-test
-
Start the service with docker compose
make start
-
Post xml files to train
curl -X POST -F 'file=@/PATH/TO/PDF/xml_file_name.xml' localhost:5056/xml_to_train/tenant_name/id
-
Post xml files to get suggestions
curl -X POST -F 'file=@/PATH/TO/PDF/xml_file_name.xml' localhost:5056/xml_to_predict/tenant_name/id
-
Post labeled data
Text, numeric or date cases:
curl -X POST --header "Content-Type: application/json" --data '{"xml_file_name": "xml_file_name.xml",
"id": "property_id",
"tenant": "tenant_name",
"language_iso": "en",
"label_text": "text",
"page_width": 612,
"page_height": 792,
"xml_segments_boxes": [{"left": 124, "top": 48, "width": 83, "height": 13, "page_number": 1}],
"label_segments_boxes": [{"left": 124, "top": 48, "width": 83, "height": 13, "page_number": 1}]
}' localhost:5056/labeled_data
curl -X POST --header "Content-Type: application/json" --data '{"xml_file_name": "xml_file_name.xml",
"id": "property_id",
"tenant": "tenant_name",
"language_iso": "en",
"values": [{"id": "1", "label": "option 1"}, {"id": "2", "label": "option 2"}],
"page_width": 612,
"page_height": 792,
"xml_segments_boxes": [{"left": 124, "top": 48, "width": 83, "height": 13, "page_number": 1}]
}' localhost:5056/labeled_data
- Post data to predict
curl -X POST --header "Content-Type: application/json" --data '{"xml_file_name": "xml_file_name.xml",
"id": "property_id",
"tenant": "tenant_name",
"page_width": 612,
"page_height": 792,
"xml_segments_boxes": [{"left": 124, "top": 48, "width": 83, "height": 13, "page_number": 1}]
}' localhost:5056/prediction_data
- Create model and calculate suggestions
To create the model or calculate the suggestions, a message to redis should be sent. The name for the tasks queue is " information_extraction_tasks"
queue = RedisSMQ(host='127.0.0.1', port='6579', qname='information_extraction_tasks', quiet=False)
# Text, numeric or date cases:
# Create model
queue.sendMessage(delay=0).message('{"tenant": "tenant_name", "task": "create_model", "params": {"id": "property_id"}}').execute()
# Calculate suggestions
queue.sendMessage(delay=0).message('{"tenant": "tenant_name", "task": "suggestions", "params": {"id": "property_id"}}').execute()
# Multi-option case:
# Create model
# The options parameter are all the posible values for all the PDF
# The multi_value parameter tells if the algorithm can pick more than one option per PDF
queue.sendMessage(delay=0).message('{"tenant": "tenant_name", "task": "create_model", "params": {"id": "property_id" , "options": [{"id": "1", "label": "option 1"}, {"id": "2", "label": "option 2"}, {"id": "3", "label": "option 3"}], "multi_value": false}}').execute()
# Calculate suggestions
queue.sendMessage(delay=0).message('{"tenant": "tenant_name", "task": "suggestions", "params": {"id": "property_id"}}').execute()
- Get service logs
A Redis queue stores the service logs for both training and prediction processes.
queue = RedisSMQ(host='127.0.0.1', port='6579', qname='information_extraction_logs', quiet=False)
results_message = queue.receiveMessage().exceptions(False).execute()
# The logs have the following format
# {"tenant": "tenant_name",
# "extraction_name": "extraction_id",
# "severity": "info" || "error",
# "message": ""}
- Get results
There is a redis queue where it is possible to get notified when the different tasks finish
queue = RedisSMQ(host='127.0.0.1', port='6579', qname='information_extraction_results', quiet=False)
results_message = queue.receiveMessage().exceptions(False).execute()
# The models have been created message
# {"tenant": "tenant_name",
# "task": "create_model",
# "params": {"id": "property_id"},
# "success": true,
# "error_message": ""}
# The suggestions have been computed
# {"tenant": "tenant_name",
# "task": "suggestions",
# "params": {"id": "property_id"},
# "success": true,
# "error_message": "",
# "data_url":""}
Get suggestions
curl -X GET localhost:5056/get_suggestions/tenant_name/id
or in python
requests.get(results_message.data_url)
The suggestions have the following format:
Text, numeric or date cases:
[{
"tenant": "tenant",
"id": "property_id",
"xml_file_name": "xml_file_name_1",
"text": "suggestion_text_1",
"segment_text": "segment_text_1",
"segments_boxes": [{"left": 1, "top": 2, "width": 3, "height": 4, "page_number": 1}]
},
{
"tenant": "tenant",
"id": "property_id",
"xml_file_name": "xml_file_name_2",
"text": "suggestion_text_2",
"segment_text": "segment_text_2",
"segments_boxes": [{"left": 1, "top": 2, "width": 3, "height": 4, "page_number": 2}]
}, ... ]
Multi-option case:
[{
"tenant": "tenant",
"id": "property_id",
"xml_file_name": "xml_file_name_1",
"values": [{"id": "1", "label": "option 1"}],
"segment_text": "segment_text_1",
"segments_boxes": [{"left": 1, "top": 2, "width": 3, "height": 4, "page_number": 1}]
},
{
"tenant": "tenant",
"id": "property_id",
"xml_file_name": "xml_file_name_2",
"values": [{"id": "2", "label": "option 2"}],
"segment_text": "segment_text_2",
"segments_boxes": [{"left": 1, "top": 2, "width": 3, "height": 4, "page_number": 2}]
}, ... ]
-
Stop the service
make stop
To use the GPU in the docker containers
-
Install the package:
nvidia-container-toolkit
-
Restart docker service
systemctl restart docker
-
Start the service with
make start_gpu
The container HTTP server
is coded using Python 3.9 and uses the FastApi web
framework.
If the service is running, the end point definitions can be founded in the following url:
http://localhost:5056/docs
The end points code can be founded inside the file app.py
.
The errors are reported to the file models_data/service.log
, if the configuration is not changed (
see Get service logs)
The container Queue processor
is coded using Python 3.9, and it is on charge of the communication with redis.
The code can be founded in the file QueueProcessor.py
and it uses the library RedisSMQ
to interact with the redis
queues.
See environment variables in the file .env
It works with Python 3.9 [install] (https://runnable.com/docker/getting-started/)
make install_venv
make start_for_testing
make test
cd src && python check_performance.py
And the results are stored in src/performance/results
Solution: Change RAM memory used by the docker containers to 3Gb or 4Gb