Introduction
Hi, I am Akira, the editor-in-chief of Data Without Code. In our previous tutorial, we built an automated pipeline to clean messy CRM data before uploading. Your databases are now spotless.
But there is one final boss in the world of business automation. It is a file format so stubborn, so universally used, and so incredibly difficult to extract data from, that it makes DX managers want to pull their hair out: The PDF document.
If you have ever received a PDF invoice or a monthly vendor report, you know the struggle. You highlight a beautiful table in the PDF, press Copy, and paste it into Excel. Instantly, the entire table collapses into a single, chaotic block of text in one cell.
In this final Automation Hack, I will show you how to stop manually typing out numbers from your screen. We are going to learn how to automate PDF data extraction with KNIME using a powerful open-source extension.
Why PDF Extraction is So Difficult
Unlike a CSV file or a SQL database, a PDF is not designed to store structured data. It is essentially a digital piece of paper designed for printing. The computer only sees coordinates of letters on a page, not “rows” and “columns.”
To read a PDF, we need to convert that digital paper into raw text first, and then use our KNIME data preparation skills to organize it.
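If you are curious what “coordinates of letters” means in practice, here is a minimal Python sketch of the idea. The `fragments` list, the `fragments_to_lines` function, and the `y_tolerance` parameter are all made up for illustration; real PDF tools are far more sophisticated. It simply shows that rows have to be rebuilt by grouping text pieces that sit at roughly the same height on the page.

```python
from collections import defaultdict

# Hypothetical text fragments as a PDF might store them: (x, y, text).
# Note that they are not stored in reading order.
fragments = [
    (300, 700, "Amount"),
    (50, 700, "Item"),
    (300, 680, "120.00"),
    (50, 680, "Widget"),
]


def fragments_to_lines(fragments, y_tolerance=2):
    """Group fragments by vertical position, then read each group left to right."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Bucket y so fragments on nearly the same baseline share a row.
        rows[round(y / y_tolerance)].append((x, text))

    lines = []
    # PDF y-coordinates grow upward, so the largest y is the top of the page.
    for key in sorted(rows, reverse=True):
        ordered = sorted(rows[key])  # sort by x: left to right
        lines.append(" ".join(text for _, text in ordered))
    return lines


print(fragments_to_lines(fragments))
```

The “table” only appears after we sort and group the pieces ourselves, which is exactly why a plain copy and paste falls apart.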
Step 1: Install the Tika Parser Extension
By default, KNIME reads data files, not documents. To read a PDF, we need a special “Lego block” built by the Apache Software Foundation called Tika.
If you remember our guide on how to install KNIME extensions, open your installation menu and search for KNIME Textprocessing (which includes the Tika Parser). Install it, restart KNIME, and you are ready to go.
Step 2: Read the PDF (Tika Parser Node)
Now that you have the right tool, let’s extract the text from a PDF invoice.
- Search for the Tika Parser node and drag it onto your canvas.
- Double-click the node to configure it.
- In the “Input directory or file” box, click Browse and select your PDF file (e.g., Invoice_001.pdf).
Execute the node and view the output. You will see a table with a column called “Document”. This column contains every single word, number, and line of text from the entire PDF, crammed into one cell. It looks messy, but we have successfully extracted the data!
Step 3: Convert the Document to a String
Right now, KNIME views that cell as a special “Document” type. We need to turn it into standard text (a String) so we can clean it.
Search for the Document Data Extractor node. Connect it to your Tika Parser, double-click it, and select the “Text” checkbox. Execute it. You now have a standard String column containing all the raw text from the PDF.
Step 4: Clean and Structure the Data
This is where your Data Prep skills shine. The lines of text are usually separated by hidden “newline” characters (the same character you create by hitting Enter on a keyboard). We need to split this massive block of text into individual rows.
- Use the Cell Splitter node on your new text column.
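- In the delimiter box, type \n (which is computer code for a new line), and tick “Use \ as escape character” so KNIME reads it as a line break rather than as two literal characters.
- Choose the output format as “As list”.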
- Connect an Ungroup node to separate that list into perfect, individual rows!
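For readers who like to see the logic written out, here is a rough Python equivalent of the Cell Splitter plus Ungroup combination. This is only a sketch to show the idea, not what KNIME runs internally; `raw_text` is invented sample data and `split_into_rows` is a hypothetical helper name. Unlike the KNIME steps above, it also drops empty lines, which is an extra convenience I added.

```python
# Invented sample text standing in for the String column from Step 3.
raw_text = "ACME Corp\nInvoice 001\n\nWidget  120.00\nTotal Amount Due: $1,234.56\n"


def split_into_rows(text):
    """Split a block of text into one trimmed, non-empty row per line."""
    # splitlines() handles both \n and \r\n line endings.
    return [line.strip() for line in text.splitlines() if line.strip()]


rows = split_into_rows(raw_text)
for row in rows:
    print(row)
```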
From here, you can use the Row Filter node to keep only the row that contains the phrase “Total Amount Due”, and then use the String Manipulation node to extract the exact dollar amount from that row.
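The same filter-then-extract logic can be sketched in Python. Everything here is an assumption for illustration: the sample rows, the `extract_total` helper, and the regular expression, which expects amounts written like $1,234.56. Your invoices may format numbers differently, so treat the pattern as a starting point.

```python
import re
from decimal import Decimal

# Assumed format: optional "$", digits with optional thousands commas,
# and an optional two-digit decimal part (e.g. $1,234.56).
AMOUNT_PATTERN = re.compile(r"\$?\s*([0-9][0-9,]*(?:\.[0-9]{2})?)")


def extract_total(rows, label="Total Amount Due"):
    """Return the amount on the first row containing the label, or None."""
    for row in rows:
        if label in row:  # the "Row Filter" part
            # Only look at the text after the label.
            tail = row.split(label, 1)[1]
            match = AMOUNT_PATTERN.search(tail)  # the "String Manipulation" part
            if match:
                return Decimal(match.group(1).replace(",", ""))
    return None


sample_rows = ["Invoice 001", "Widget  120.00", "Total Amount Due: $1,234.56"]
print(extract_total(sample_rows))
```

I used Decimal instead of a float here because money values should not pick up rounding errors.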
(Akira’s Pro-Tip: If you want to process 100 PDF invoices at once, just wrap this entire process inside a loop! We covered this exactly in our guide on how to loop through multiple files in a folder.)
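To picture what that loop does, here is a small Python sketch. It is not the KNIME loop itself: `process_folder` and `fake_extract` are hypothetical names, and `fake_extract` only returns the file size as a stand-in for the real extraction steps. The demo builds a temporary folder with dummy files so it can run anywhere.

```python
import tempfile
from pathlib import Path


def process_folder(folder, extract):
    """Apply an extraction function to every .pdf file in a folder."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        results[pdf_path.name] = extract(pdf_path)
    return results


def fake_extract(pdf_path):
    # Hypothetical placeholder: a real version would parse the PDF
    # and return the total amount. Here we just return the file size.
    return pdf_path.stat().st_size


# Demo with dummy files (these are not real PDFs).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Invoice_001.pdf").write_bytes(b"demo")
    (Path(tmp) / "Invoice_002.pdf").write_bytes(b"demo!")
    (Path(tmp) / "notes.txt").write_text("ignored")
    demo_results = process_folder(tmp, fake_extract)

print(demo_results)
```

The pattern is the same in KNIME: list the files, run the same steps on each one, and collect the results into one table.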
Conclusion: Your Next Steps
Congratulations! You have conquered the PDF. By combining the Tika Parser with your text manipulation skills, you can now automate the extraction of invoices, contracts, and vendor reports.
With this tutorial, you have officially completed the Automation Hacks module! You now know how to pull data from local folders, websites, SQL databases, and even PDFs. You know how to schedule workflows and send automated emails.
You have the technical skills. Now, it is time to apply them to real-world business strategy.
Welcome to our final module: Use Cases. Are you ready to prove the financial value of your new automation skills to your marketing team? Join me in our next tutorial: Marketing campaign analysis: Measuring ROI with KNIME!
