A possible way: find the first and last line of each page in the PDF, then search for those lines in the XML to locate the page boundaries.
https://stackoverflow.com/questions/50502581/how-can-i-get-the-last-line-position-of-pdf-file-using-python
https://stackoverflow.com/questions/2481945/how-to-read-line-by-line-in-pdf-file-using-pypdf
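A rough sketch of that idea in Python, standard library only. It assumes the per-page text has already been extracted from the PDF by some other tool and is passed in as a list of strings. The function names (`xml_plain_text`, `locate_pages`) are made up for illustration, and the whitespace handling is a guess at what is needed to make PDF lines and XML text comparable. It will miss pages whose first or last line was altered during conversion (hyphenation, headers, footnotes).

```python
import re
import xml.etree.ElementTree as ET


def normalize(text):
    """Collapse all whitespace runs to single spaces so PDF and XML text compare."""
    return re.sub(r"\s+", " ", text).strip()


def xml_plain_text(xml_string):
    """Return the text content of the XML with all markup removed.

    Text nodes are joined with a space, which can insert a stray space
    around inline elements; acceptable for a sketch.
    """
    root = ET.fromstring(xml_string)
    return normalize(" ".join(root.itertext()))


def first_and_last_line(page_text):
    """Return (first, last) non-empty line of a page, or None for a blank page."""
    lines = [normalize(line) for line in page_text.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return None
    return lines[0], lines[-1]


def locate_pages(page_texts, body):
    """For each page, return (start, end) character offsets in body, or None.

    The search for each page starts where the previous match ended, so
    repeated lines earlier in the document are not matched again.
    """
    spans = []
    cursor = 0
    for page_text in page_texts:
        edges = first_and_last_line(page_text)
        if edges is None:
            spans.append(None)
            continue
        first, last = edges
        start = body.find(first, cursor)
        end_pos = body.find(last, start) if start != -1 else -1
        if start == -1 or end_pos == -1:
            spans.append(None)  # line not found verbatim in the XML text
            continue
        end = end_pos + len(last)
        spans.append((start, end))
        cursor = end
    return spans
```

The offsets refer to the plain text of the XML, not to the XML source itself. Mapping them back to elements would need an extra step that is not shown here.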
Used the Grobid service at http://cloud.science-miner.com/grobid/. Sadly, the output contains no page-break information.
Input: Proceedings11Chap06.pdf
Settings:
Output: grobid.xml.txt
Evaluate Apache Tika and PDFBox for extracting text page by page (https://stackoverflow.com/questions/5824867/is-it-possible-to-extract-text-by-page-for-word-pdf-files-using-apache-tika)
Maybe pdfminer (http://www.unixuser.org/~euske/python/pdfminer/) or pdftotext (https://www.xpdfreader.com/pdftotext-man.html) might help, too.
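For the pdftotext option, a minimal sketch of getting per-page text from Python. It relies on pdftotext ending each page with a form feed character (`\f`) unless `-nopgbrk` is given, and on `-` as the output file meaning stdout. The helper names `split_pages` and `pdf_pages` are made up here, and the `pdftotext` binary is assumed to be on the PATH.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into pages on the form feed character."""
    pages = text.split("\f")
    # A form feed follows the last page too, which leaves an empty tail.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pdf_pages(pdf_path):
    """Run pdftotext on pdf_path and return a list with one string per page."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return split_pages(result.stdout.decode("utf-8", errors="replace"))
```

Something like `pdf_pages("Proceedings11Chap06.pdf")` would then give the page texts whose first and last lines can be searched for in the Grobid output.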