What does HackerNews think of grobid?

A machine learning software for extracting information from scholarly documents

Language: Java

PDF ChatBot – Upload, chat and interact with any PDF document | Apr 2023

Can you elaborate on how you parse the PDF? Are you simply converting it to text using a Python library, or something more robust like GROBID[1]?

[1] https://github.com/kermitt2/grobid

What's so hard about PDF text extraction? | Mar 2020

For academic papers: GROBID [0] is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.

[0] https://github.com/kermitt2/grobid/
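
To make the "structured XML/TEI" output mentioned above more concrete, here is a minimal Python sketch of reading a title and author list out of a TEI header. The embedded sample document, the `extract_header` helper, and the exact element paths are illustrative assumptions based on the general TEI header layout, not output copied from GROBID; real output contains far more fields and may nest them differently.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed TEI header used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper Title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Alan</forename>
                <surname>Turing</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header(tei_xml):
    """Return the title and author names found in a TEI document string."""
    root = ET.fromstring(tei_xml)

    # The main title is assumed to sit under titleStmt.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else None

    authors = []
    # Each author is assumed to carry a persName with forename/surname children.
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = [el.text for el in pers.findall("tei:forename", TEI_NS)]
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        parts = [p.strip() for p in forenames + [surname] if p and p.strip()]
        if parts:
            authors.append(" ".join(parts))

    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(extract_header(SAMPLE_TEI))
```

In practice the TEI string would come from a GROBID run on a PDF rather than from an inline constant; only the standard library is used here so the sketch runs without a GROBID installation.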

Textract, a Python package for extracting text from any document | Aug 2014

As far as I have been able to tell, the public state of the art in academic paper metadata parsing is Grobid: https://github.com/kermitt2/grobid

Not quite as simple a command-line interface as you suggest, but not too hard to set up, and pretty impressive. Now if only Google Scholar would open-source whatever they use...