# Document preparation

Conversion of docx documents to TEI XML

The first attempt used metypeset with the parameters `--prettytei --puretei`. This tool, however, removes the div structure that is important for the sectioning of the work.
A second attempt was therefore made with oxgarage, which retains the div structure.
# General info

The script takes three arguments:

* `teifile`: the TEI file
* `bibfile`: the bibliography file in bibtex format
* `figdir`: the place where the images of that publication are stored
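
The three arguments could be declared with `argparse` roughly as follows. This is only a sketch: it assumes the arguments are positional and given in the order listed, and `build_parser` is a hypothetical helper name, not taken from the script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three arguments (assumed to be positional)."""
    parser = argparse.ArgumentParser(
        description="Illustrative command-line interface for the TEI preparation script."
    )
    parser.add_argument("teifile", help="the TEI file")
    parser.add_argument("bibfile", help="the bibliography file in bibtex format")
    parser.add_argument("figdir", help="where the images of the publication are stored")
    return parser


# Example call with made-up file names; a real run would call parse_args()
# without a list so that the values are read from the command line.
args = build_parser().parse_args(["book.xml", "book.bib", "figures"])
print(args.teifile, args.bibfile, args.figdir)
```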

In a first step, some artifacts from the conversion are removed from the file. The output is written to `tmp_files/$TEIFILE-cleaned.xml` for inspection.
Next, some modifications are made to a string version of the XML tree. These adjust the way citations and references to figures are entered in the text. The output is written to `tmp_files/$TEIFILE-modified.xml` for inspection. XML parsing errors may show up at this point; they should be fixed in the original TEI file, after which the whole script has to be run again.
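
Because the modifications work on a string rather than on a parsed tree, it can help to check the result for well-formedness before going on. The following is a minimal sketch using the standard library; `check_well_formed` is a hypothetical helper and not part of the script.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text: str):
    """Return None if xml_text parses, otherwise (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # location reported by the parser
        return line, column, str(err)
    return None


# A mismatched tag of the kind string-level edits can introduce.
problem = check_well_formed("<TEI><p>broken</TEI>")
if problem is not None:
    print("XML error at line %d, column %d: %s" % problem)
```

The reported line and column refer to the string that was checked, so the corresponding place in the original TEI file still has to be located by hand.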
In a final step, a report about missing citations, missing figures and non-parseable page ranges is compiled, and other relevant data (information about document structure, figures, footnotes, citations) is stored in `tmp_files/dict.pickle`.
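
To look at the stored data, the pickle file can be loaded again. The sketch below assumes that the pickled object is a dictionary, as the file name suggests; its actual keys are not documented here, so the example only lists the top-level keys and the type of each value. `load_report_data` and `summarize` are hypothetical names.

```python
import pickle
from pathlib import Path


def load_report_data(path="tmp_files/dict.pickle"):
    """Load the pickled data. Only unpickle files you created yourself."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def summarize(data: dict) -> dict:
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in data.items()}


if Path("tmp_files/dict.pickle").exists():
    print(summarize(load_report_data()))
```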
The resulting XML file is suffixed with `-out` and can then be passed to the next part of the workflow.

# Handling of citations

We use bibtex to store bibliographic data. When producing PDF, we can use the LaTeX tools to format citations and references.
For the HTML view, a similar workflow was used (tralics etc.), but the output format of biber has since changed and we have not yet adapted to it.
Alternatively, one can use pandoc in conjunction with pandoc-citeproc to do the formatting.
The prepare_tei.py script produces a markdown file that contains only the references. Running

    pandoc -o ldaston.html -t html --filter=pandoc-citeproc --bibliography=03_daston.bib 03_daston-citations.md

produces an easily parseable HTML file from which we can extract the formatted bibliography and references.
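
Extracting the formatted bibliography from that HTML file might look like the sketch below. It rests on an assumption about the output layout, namely that pandoc-citeproc wraps each bibliography entry in a `<div>` whose `id` is `ref-` followed by the citation key; this should be checked against the actual file. `ReferenceExtractor` and `extract_references` are hypothetical names.

```python
from html.parser import HTMLParser


class ReferenceExtractor(HTMLParser):
    """Collect the text of every <div id="ref-KEY"> element, keyed by KEY."""

    def __init__(self):
        super().__init__()
        self.references = {}
        self._key = None    # citation key of the entry currently being read
        self._depth = 0     # nesting depth of <div> elements inside that entry
        self._chunks = []   # text fragments of the current entry

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._key is not None:
            self._depth += 1  # a nested div inside the current entry
            return
        element_id = dict(attrs).get("id") or ""
        if element_id.startswith("ref-"):
            self._key = element_id[len("ref-"):]
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or self._key is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the fragments and normalise whitespace.
            self.references[self._key] = " ".join("".join(self._chunks).split())
            self._key = None

    def handle_data(self, data):
        if self._key is not None:
            self._chunks.append(data)


def extract_references(html_text: str) -> dict:
    """Return a mapping from citation key to formatted reference text."""
    parser = ReferenceExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.references
```

The function returns plain text only; inline markup such as italics is dropped, so a version that keeps the HTML of each entry would be needed if the formatting has to survive.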