PDF pagebreak hints in XML #40

Open
kthoden opened this issue Feb 18, 2020 · 3 comments

kthoden commented Feb 18, 2020

One possible approach: extract the first and last line of each page from the PDF, then search for that text in the XML to locate the page breaks.
https://stackoverflow.com/questions/50502581/how-can-i-get-the-last-line-position-of-pdf-file-using-python
https://stackoverflow.com/questions/2481945/how-to-read-line-by-line-in-pdf-file-using-pypdf
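A minimal sketch of the text search idea, using only the Python standard library. It assumes the per-page text has already been extracted from the PDF by some other tool; the function names (`normalize`, `first_line`, `locate_pagebreaks`) are made up for illustration and are not part of any existing code in this project. It returns character offsets into the whitespace-normalized plain text of the XML, not positions in the markup itself.

```python
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse whitespace runs so PDF and XML text can be compared."""
    return re.sub(r"\s+", " ", text).strip()


def first_line(page_text):
    """Return the first non-empty line of a page, whitespace-normalized."""
    for line in page_text.splitlines():
        if line.strip():
            return normalize(line)
    return ""


def locate_pagebreaks(page_texts, xml_string):
    """Map 1-based page numbers to offsets in the XML's plain text.

    A page whose first line cannot be found is mapped to None.
    """
    root = ET.fromstring(xml_string)
    plain = normalize("".join(root.itertext()))
    offsets = {}
    cursor = 0  # only search forward, pages appear in document order
    for number, page in enumerate(page_texts, start=1):
        needle = first_line(page)
        pos = plain.find(needle, cursor) if needle else -1
        if pos == -1:
            offsets[number] = None
        else:
            offsets[number] = pos
            cursor = pos + len(needle)
    return offsets


pages = ["Alpha beta\ngamma.", "\nDelta   epsilon\nzeta."]
xml = "<doc><p>Alpha beta gamma.</p><p>Delta epsilon zeta.</p></doc>"
print(locate_pagebreaks(pages, xml))
```

Exact matching like this will probably fail where the PDF has running headers, page numbers, or words hyphenated across lines, so a real version would likely need to skip such lines or use fuzzy matching.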

kthoden commented Feb 21, 2020

Tried the Grobid service at http://cloud.science-miner.com/grobid/; sadly, the output contains no pagebreak information.

Proceedings11Chap06.pdf

Settings:
Screenshot 2020-02-21 at 11 12 09

Output:
grobid.xml.txt

kthoden commented Apr 14, 2020

kthoden commented Nov 30, 2020

pdfminer (http://www.unixuser.org/~euske/python/pdfminer/) or pdftotext (https://www.xpdfreader.com/pdftotext-man.html) might help, too.
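For pdftotext, a rough sketch of how per-page text could be obtained: by default its output separates pages with form feed characters (unless `-nopgbrk` is given), so splitting on `"\f"` should yield one string per page. The helper names `split_pages` and `pdf_pages` are hypothetical, and `pdf_pages` assumes the `pdftotext` binary is installed and on the PATH.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into per-page strings on form feeds."""
    pages = text.split("\f")
    # A trailing form feed, if present, leaves an empty last element.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pdf_pages(pdf_path):
    """Run pdftotext on a file and return one text string per page.

    Assumes pdftotext is installed; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    return split_pages(result.stdout)


# Example on a literal string standing in for pdftotext output:
print(split_pages("page one\n\fpage two\n\f"))
```

The first and last non-empty line of each returned page could then serve as search strings for the XML.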
