Document preparation

Conversion of docx documents to TEI XML

We first used metypeset with the parameters --prettytei --puretei. This tool, however, removes the div structure, which is important for sectioning the work.

Therefore, we made another attempt with oxgarage, which retains the div structure.

General info

The script takes three arguments:

  • teifile: the TEI file
  • bibfile: the bibliography file in bibtex format
  • figdir: the place where the images of that publication are stored
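As a minimal sketch of how the three arguments listed above could be read, assuming they are positional and given in that order (the actual script may parse them differently; the function name build_parser is hypothetical):

```python
import argparse


def build_parser():
    """Build a parser for the three arguments described above.

    Assumption: all three are positional and appear in this order.
    """
    parser = argparse.ArgumentParser(
        description="Prepare a TEI file for the next part of the workflow."
    )
    parser.add_argument("teifile", help="the TEI file")
    parser.add_argument("bibfile", help="the bibliography file in bibtex format")
    parser.add_argument(
        "figdir", help="the place where the images of the publication are stored"
    )
    return parser
```

Calling build_parser().parse_args() inside the script would then expose the values as args.teifile, args.bibfile and args.figdir.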

In a first step, some artifacts from the conversion are removed from the file. The output is written to tmp_files/$TEIFILE-cleaned.xml for inspection.

Next, some modifications are made to a string version of the XML tree. These adjust the way citations and references to figures are entered in the text. The output is written to tmp_files/$TEIFILE-modified.xml for inspection. XML parsing errors may show up at this point; they should be fixed in the original TEI file, after which the whole script has to be run again.
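To locate such a parsing error before fixing the original TEI file, a well-formedness check along the following lines may help. This is only a sketch using the Python standard library; the function name check_well_formed is hypothetical and not part of the script.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position holds the line and column of the problem.
        line, column = err.position
        return (line, column, str(err))
    return None
```

The text of tmp_files/$TEIFILE-modified.xml can be read into a string and passed to this function; the reported line number then points at the place to inspect.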

In a final step, a report about missing citations, missing figures and non-parseable page ranges is generated, and other relevant data (information about document structure, figures, footnotes, citations) is stored in tmp_files/dict.pickle.
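The stored data can be inspected afterwards. The sketch below assumes that the pickle file contains a single dictionary, as the file name suggests; its actual keys are not documented here, so the code only lists whatever it finds. Both function names are hypothetical.

```python
import pickle


def load_report_data(path="tmp_files/dict.pickle"):
    """Load the pickled data written in the final step."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(data):
    """Return (key, type name) pairs for the top-level entries, sorted by key."""
    return [(key, type(data[key]).__name__) for key in sorted(data, key=str)]
```

As with any pickle file, this should only be used on files produced by your own run of the script.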

The resulting XML file is suffixed with -out and can then be passed to the next part of the workflow.

Handling of citations

We use bibtex to store bibliographic data. When producing PDF, we can use the LaTeX tools to format citations and references.

For the HTML view, a similar workflow was used (tralics etc.), but the output format of biber has since changed and we have not yet adapted to it.

One can use pandoc in conjunction with pandoc-citeproc to do the formatting.

The prepare_tei.py script produces a markdown file that contains only the references. Running pandoc on it with

pandoc -o ldaston.html -t html  --filter=pandoc-citeproc --bibliography=03_daston.bib 03_daston-citations.md

produces an easily parseable HTML file from which we can extract the formatted bibliography and references.
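Extracting the formatted bibliography from that HTML file could look roughly like the following sketch. It assumes that each bibliography entry is wrapped in a div whose id is ref- followed by the citation key; this should be checked against the actual output file. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ReferenceExtractor(HTMLParser):
    """Collect the text of every div whose id starts with 'ref-'."""

    def __init__(self):
        super().__init__()
        self.references = {}   # citation key -> formatted reference text
        self._current = None   # key of the entry currently being read
        self._depth = 0        # div nesting depth inside the current entry
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1
            return
        div_id = dict(attrs).get("id") or ""
        if div_id.startswith("ref-"):
            self._current = div_id[len("ref-"):]
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the text pieces and normalize whitespace.
            text = " ".join("".join(self._chunks).split())
            self.references[self._current] = text
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)


def extract_references(html_text):
    """Return a dict mapping citation keys to formatted reference text."""
    parser = ReferenceExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.references
```

Only the standard library is used, so no extra dependency is needed; inline markup such as italics is dropped and only the plain text of each entry is kept.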