This is a practical way to make your system read invoice data from PDFs and images and turn it into machine-readable XML, JSON, or CSV. We will start with the Python library invoice2data. Nowadays Machine Learning (ML) is a popular way to build programs that learn and adapt to new inputs on their own. Here, let's make PDF and image invoices readable and get the extracted text as formatted results. There are several popular tools for converting PDFs and images to text, such as Tesseract, OpenCV, pdftotext, and pdf2image. But these only give us raw text; we still have to write a custom program to read that text in a structured way.
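To see why raw text alone is not enough, here is a minimal sketch in plain Python of the kind of custom parsing you would otherwise write by hand. The sample text, the field names, and the regular expressions are all made up for illustration; a real invoice will need its own patterns.

```python
import re

# Hypothetical raw text, roughly what a PDF-to-text tool might return.
RAW_TEXT = """
ACME Hosting Ltd.
Invoice No: INV-2041
Date: 2019-03-14
Total: 1,250.50 EUR
"""

# One hand-written pattern per field (assumed layout, not a standard).
PATTERNS = {
    "invoice_number": r"Invoice No:\s*(\S+)",
    "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total:\s*([\d,]+\.\d{2})",
}


def parse_invoice_text(text):
    """Return a dict of the fields found in the text; missing fields are None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    print(parse_invoice_text(RAW_TEXT))
```

Every new invoice layout needs another set of patterns like these, which is the repetitive work a template-based tool is meant to take off your hands.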
invoice2data helps convert PDFs and images into readable, formatted results: you give it a PDF and get structured data back. We need a system with Python and invoice2data installed, plus a tool such as pdftotext, pdfminer, Tesseract, or OpenCV to convert the raw files to plain text. With that text, invoice2data can produce the formatted output.
The setup depends on your operating system; follow the invoice2data documentation to install Python and the other required dependencies.
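Before going further, it can help to check which text-extraction tools are already on your machine. The sketch below only looks for executables on the PATH; the list of tool names is an assumption and should be adjusted to the input module you plan to use.

```python
import shutil

# Command-line tools commonly used for text extraction (assumed list;
# adjust it to the input module you plan to use).
TOOLS = ["pdftotext", "tesseract"]


def check_tools(names):
    """Map each tool name to True if an executable with that name is on the PATH."""
    return {name: shutil.which(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_tools(TOOLS).items():
        print(f"{name}: {'found' if found else 'missing'}")
```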
After that, make a project folder and create two directories inside it: one for the PDF files and one for the regex (template) files. Writing the regexes is the hard part, because they have to match the layout of your own invoices.
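To give a feel for what a template does, here is a simplified sketch in plain Python. It is not the library's actual template format or matching code: the dict layout, the helper names `template_matches` and `apply_template`, and all values are hypothetical. The idea is that keywords decide whether a template applies to a document, and each field has one regex.

```python
import re

# Hypothetical template: keywords decide whether the template applies,
# and each field is extracted with one regular expression.
TEMPLATE = {
    "issuer": "ACME Hosting Ltd.",
    "keywords": ["ACME Hosting", "Invoice No"],
    "fields": {
        "invoice_number": r"Invoice No:\s*(\S+)",
        "date": r"Date:\s*(\d{2}\.\d{2}\.\d{4})",
        "amount": r"Total:\s*(\d+\.\d{2})",
    },
}


def template_matches(template, text):
    """A template applies only if every keyword occurs in the text."""
    return all(keyword in text for keyword in template["keywords"])


def apply_template(template, text):
    """Return the extracted fields, or None if the template does not apply."""
    if not template_matches(template, text):
        return None
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    sample = "ACME Hosting Ltd.\nInvoice No: A-17\nDate: 14.03.2019\nTotal: 99.90\n"
    print(apply_template(TEMPLATE, sample))
```

Keeping the regexes in a separate template per issuer means a new invoice layout only needs a new template, not new program code.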
Put the PDF files inside the invoices directory and the .yml files inside the templates directory, then create the convert.py file:
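    from invoice2data import extract_data
    from invoice2data.extract.loader import read_templates

    # Load every .yml template found in the templates directory.
    templates = read_templates('templates/')

    # Extract the fields from the invoice using those templates.
    result = extract_data('invoices/QualityHosting.pdf', templates=templates)
    print(result)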
Then open your terminal/command prompt and run the script:
~/Desktop/invoice2data$ python3 convert.py
This runs the script and prints the extracted results. It is a very basic example that simply uses pdftotext to convert the PDF into text, which invoice2data then parses. Other input modules can be used too; you need to configure the input module for that to take effect. You can also save the result to an output file such as CSV, JSON, or XML.
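As a rough illustration of the saving step, the sketch below writes a result to JSON and CSV using only the Python standard library. It does not use the library's own output helpers. The `RESULT` dict, its keys, and the helper names `to_plain`, `save_json`, and `save_csv` are placeholders, assuming the extracted result is a dict whose date field is a datetime.

```python
import csv
import json
from datetime import datetime

# Hypothetical extraction result; the keys and values are placeholders.
RESULT = {
    "issuer": "ACME Hosting Ltd.",
    "invoice_number": "INV-2041",
    "date": datetime(2019, 3, 14),
    "amount": 1250.5,
}


def to_plain(record):
    """Turn datetime values into YYYY-MM-DD strings so the record can be serialized."""
    return {
        key: value.strftime("%Y-%m-%d") if isinstance(value, datetime) else value
        for key, value in record.items()
    }


def save_json(records, path):
    """Write a list of records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump([to_plain(r) for r in records], handle, indent=2)


def save_csv(records, path):
    """Write a list of records to a CSV file, one row per record."""
    rows = [to_plain(r) for r in records]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    save_json([RESULT], "invoices.json")
    save_csv([RESULT], "invoices.csv")
```

Both helpers take a list of records, so the results of several invoices can be collected first and written in one go.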