I'm an ML engineer, worked as a part time data engineer consultant for a medical lines/claims extraction company, for 3 years, which majorly involved in extracting the tabular data from the PDFs and Images. Developer rules or parsers as such is JUST no help. You end up creating a new rule every time you miss the data extraction.

With that in consideration, and the existing resources are little help especially on skewed, blurry, handwritten and 2 different table structure in the input, I ended up creating an API service to extract tabular data from Images and PDFs - hosted as https://extracttable.com . We cared it to be robust, average extraction time on images is under 5 seconds. On top of maintaining accuracy, A bad extraction is eligible for credit usage refund, which literally not any service offer it.

i Invite HN users to give it a try and feel free to email [email protected] for extra API credits for the trail.

Hi, author and maintainer of Tabula (https://github.com/tabulapdf/tabula). We've been trying to contact you about the "Tabula Pro" version that you are offering.

Feel free to reachme at manuel at jazzido dot com