PDF pagebreak hints in XML #40

Open

kthoden opened this issue Feb 18, 2020 · 3 comments

kthoden commented Feb 18, 2020

One possible approach: extract the first and last line of each page from the PDF, then search for that text in the XML to locate where the page breaks fall.
https://stackoverflow.com/questions/50502581/how-can-i-get-the-last-line-position-of-pdf-file-using-python
https://stackoverflow.com/questions/2481945/how-to-read-line-by-line-in-pdf-file-using-pypdf
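
A minimal sketch of that idea, assuming the per-page text has already been extracted from the PDF by some other tool. The function names (`normalize`, `first_and_last_line`, `locate_pages`) are hypothetical and not part of any existing code here. It flattens the XML to plain text and looks for each page's first and last line in order. Running headers, footers, and end-of-line hyphenation in the PDF will likely cause misses and would need extra handling.

```python
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse runs of whitespace so PDF line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def first_and_last_line(page_text):
    """Return (first, last) non-empty line of a page, or None for an empty page."""
    lines = [normalize(line) for line in page_text.splitlines() if line.strip()]
    if not lines:
        return None
    return lines[0], lines[-1]


def locate_pages(xml_string, page_texts):
    """Return (page_number, start, end) offsets into the flattened XML text.

    start is where the page's first line was found, end is just past its
    last line. Either value is None when the line could not be found.
    """
    root = ET.fromstring(xml_string)
    flat = normalize("".join(root.itertext()))
    spans = []
    cursor = 0  # search forward only, so repeated phrases match in page order
    for number, page_text in enumerate(page_texts, start=1):
        ends = first_and_last_line(page_text)
        if ends is None:
            spans.append((number, None, None))
            continue
        first, last = ends
        start = flat.find(first, cursor)
        end = None
        if start != -1:
            cursor = start
            last_pos = flat.find(last, start)
            if last_pos != -1:
                end = last_pos + len(last)
                cursor = end
        spans.append((number, start if start != -1 else None, end))
    return spans
```

The offsets refer to the flattened text, not to positions in the XML markup, so mapping them back to elements (to insert a pagebreak marker there) would be a separate step.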

kthoden commented Feb 21, 2020

Tried the GROBID service at http://cloud.science-miner.com/grobid/; sadly, the output contains no pagebreak information.

Proceedings11Chap06.pdf

Settings:

Output:
grobid.xml.txt

kthoden commented Apr 14, 2020

Evaluate Apache Tika and PDFBox (https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika)

kthoden commented Nov 30, 2020

Maybe pdfminer (http://www.unixuser.org/~euske/python/pdfminer/) or pdftotext (https://www.xpdfreader.com/pdftotext-man.html) might help, too.
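
For pdftotext, a rough sketch of how per-page text could be obtained. As far as I know, pdftotext marks page ends with a form feed character unless `-nopgbrk` is given, so splitting on `\f` should give one string per page. The helper names `split_pages` and `pdf_pages` are made up for illustration, and `pdf_pages` assumes a `pdftotext` binary is available on the PATH.

```python
import subprocess


def split_pages(pdftotext_output):
    """Split pdftotext output into per-page strings on form feed characters."""
    pages = pdftotext_output.split("\f")
    # A form feed after the last page leaves an empty trailing chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pdf_pages(pdf_path):
    """Run pdftotext on pdf_path and return a list with the text of each page.

    Assumption: pdftotext is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_pages(result.stdout)
```

The first and last non-empty line of each returned page could then serve as search strings for the XML.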
