Technical documentation
Project architecture
This project is built on the TEIC html2tei stylesheet. The aim was to exploit the fact that Hdoc is based on HTML to make the conversion to TEI easy. During development, two additional converters proved necessary. The first one preformats the input file by changing some tags. The second one runs after html2tei, because html2tei does not handle some elements, such as caption (for tables) or object.
build.xml
This file helps you launch the main file, hdoc_to_tei.ant. It simply calls that file with predefined arguments.
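As an illustration, a launcher of this kind could look like the sketch below. It is not the project's actual build.xml: the project name, the target name, and the property names and values (input.file, output.dir) are assumptions made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical build.xml: names and values below are illustrative assumptions. -->
<project name="hdoc_to_tei_launcher" default="convert" basedir=".">

  <target name="convert">
    <!-- Delegate to the main build file; nested properties act as its arguments. -->
    <ant antfile="hdoc_to_tei.ant">
      <property name="input.file" value="input/sample.hdoc"/>
      <property name="output.dir" value="output/sample"/>
    </ant>
  </target>

</project>
```

Because the ant task names no target here, the default target of hdoc_to_tei.ant would run.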
What does hdoc_to_tei.ant do?
This Ant file contains 8 targets, which are called one by one, in order. Each target has a specific role.
initialization Creates the output directory, clears the output path, creates the directory that the output path refers to, and creates the temp directory
unzip Unzips the hdoc contents into the temp directory
extract_resources Copies resources from the extracted hdoc directory to the output
copy_schema Copies the TEI schema into the output directory
build_before_ant Finds the content file in the extracted hdoc directory, then builds an Ant file that runs XSLT on it with before.xsl and writes the result to a temp file: past_before.xml
apply_html2tei Applies html2tei.xsl to past_before.xml and writes the result to a file named past_convert.xml
apply_after Applies after.xsl to past_convert.xml and writes the result to the output path under the name content.xml
clear Deletes the temp directory and all other temp files
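The chain of targets above can be sketched as an Ant skeleton. This is a minimal sketch, not the real hdoc_to_tei.ant: the property names (input.file, output.dir, tmp.dir), the directory layout, the schema path, the location of the container file, and the use of depends to enforce the order are all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical skeleton; input.file and output.dir are expected from the caller. -->
<project name="hdoc_to_tei" default="clear" basedir=".">
  <property name="tmp.dir" location="tmp"/>

  <target name="initialization">
    <delete dir="${output.dir}"/>
    <mkdir dir="${output.dir}"/>
    <mkdir dir="${tmp.dir}"/>
  </target>

  <target name="unzip" depends="initialization">
    <unzip src="${input.file}" dest="${tmp.dir}/hdoc"/>
  </target>

  <target name="extract_resources" depends="unzip">
    <!-- Assumption: everything except metadata and top-level XML is a resource. -->
    <copy todir="${output.dir}">
      <fileset dir="${tmp.dir}/hdoc" excludes="META-INF/**,*.xml"/>
    </copy>
  </target>

  <target name="copy_schema" depends="extract_resources">
    <!-- Assumption: the schema path is a placeholder. -->
    <copy file="schema/teilite.rng" todir="${output.dir}"/>
  </target>

  <target name="build_before_ant" depends="copy_schema">
    <!-- Generate a small Ant file, then run it to produce past_before.xml. -->
    <xslt in="${tmp.dir}/hdoc/META-INF/container.xml"
          out="${tmp.dir}/before.ant"
          style="extract_contents.xsl"/>
    <ant antfile="${tmp.dir}/before.ant"/>
  </target>

  <target name="apply_html2tei" depends="build_before_ant">
    <!-- html2tei.xsl may need an XSLT 2.0 processor such as Saxon on Ant's classpath. -->
    <xslt in="${tmp.dir}/past_before.xml"
          out="${tmp.dir}/past_convert.xml"
          style="html2tei.xsl"/>
  </target>

  <target name="apply_after" depends="apply_html2tei">
    <xslt in="${tmp.dir}/past_convert.xml"
          out="${output.dir}/content.xml"
          style="after.xsl"/>
  </target>

  <target name="clear" depends="apply_after">
    <delete dir="${tmp.dir}"/>
  </target>
</project>
```

With this layout, running the default target clear pulls in the seven other targets through the depends chain, in the order listed.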
before.xsl
This stylesheet preformats the content file. It changes some XML tags that would otherwise be erased or ignored by the html2tei converter. It also removes the hdoc namespace and adds the HTML namespace.
The tags concerned are header and footer. Without this step, these tags were simply removed (but not their contents), which caused validation errors against the TEI Lite schema.
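The preformatting described above could be written along the following lines. This is a sketch under assumptions, not the project's before.xsl: the hdoc namespace URI is assumed to be http://www.utc.fr/ics/hdoc/xhtml, and replacing header and footer with div (keeping the old name in a class attribute) is one possible choice that the text above does not confirm.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.utc.fr/ics/hdoc/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <!-- Identity rule: copy attributes, text, comments and any other node as-is. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rebuild each hdoc element under the same local name in the XHTML namespace. -->
  <xsl:template match="h:*">
    <xsl:element name="{local-name()}" namespace="http://www.w3.org/1999/xhtml">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>

  <!-- Assumption: header and footer become div so that their contents stay wrapped. -->
  <xsl:template match="h:header | h:footer">
    <div class="{local-name()}">
      <xsl:apply-templates select="node()"/>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

The last template wins over the generic h:* rule because a pattern naming a specific element has a higher default priority than a wildcard.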
html2tei files (html2tei.xsl, commonTEIstructures.xsl, functions.xsl)
These files come from the TEI Stylesheets project. They convert an HTML file into a TEI XML file.
One change was made, concerning how unknown tags are handled. The default behavior is to comment out the tag itself, which was not a good solution because it meant a net loss of information. This behavior was changed: an unknown tag is now left as it was.
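The changed behavior can be pictured as a fallback template that copies an unknown element instead of commenting it out. The snippet below is a standalone illustration, not the code actually found in html2tei.xsl; the match pattern and the priority value are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical fallback: applies only when no other template matches. -->
  <xsl:template match="*" priority="-1">
    <!-- Keep the unknown tag and its attributes, then keep converting its children. -->
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the element in the output is what lets a later pass, such as after.xsl, still find it and convert it.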
after.xsl
This stylesheet fixes the tags that html2tei does not handle.
caption This tag was not handled. It is the heading of a table. The solution was to change the tag to head, which is a TEI tag.
object Because the TEI Recommendations have no tag for embedded digital documents, a figure tag is used instead. It is filled with the value of the data attribute, wrapped in a url attribute.
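The two fixes above can be sketched in XSLT as follows. This is an illustration rather than the project's after.xsl. It assumes that the leftover caption and object elements may be in any namespace (hence the match on local-name()), and that the url attribute is carried by a graphic element inside figure, which is where TEI defines that attribute.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Identity rule: everything html2tei already converted passes through unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- caption becomes the TEI head element. -->
  <xsl:template match="*[local-name() = 'caption']">
    <head>
      <xsl:apply-templates select="node()"/>
    </head>
  </xsl:template>

  <!-- object becomes a figure; the data attribute is carried over as a url. -->
  <xsl:template match="*[local-name() = 'object']">
    <figure>
      <graphic url="{@data}"/>
    </figure>
  </xsl:template>

</xsl:stylesheet>
```

For example, an input element object with data="media/video.mp4" would come out as a figure containing a graphic whose url is media/video.mp4.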
extract_contents.xsl
This file is used to find content.xml and to create an Ant file that applies before.xsl to content.xml.
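A stylesheet that generates such an Ant file might look like the sketch below. Several points are assumptions: that the input is the hdoc container file, that it lists the content file in a rootfile element with a full-path attribute, and that the parameters hdocDir and tmpDir (hypothetical names) give the working directories.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameters; the defaults are placeholders. -->
  <xsl:param name="hdocDir" select="'tmp/hdoc'"/>
  <xsl:param name="tmpDir" select="'tmp'"/>

  <xsl:template match="/">
    <!-- Assumption: the first rootfile entry names the content file. -->
    <xsl:variable name="content"
        select="(//*[local-name() = 'rootfile'])[1]/@full-path"/>

    <!-- Emit a one-target Ant project that applies before.xsl to that file. -->
    <project name="apply_before" default="apply_before">
      <target name="apply_before">
        <xslt in="{$hdocDir}/{$content}"
              out="{$tmpDir}/past_before.xml"
              style="before.xsl"/>
      </target>
    </project>
  </xsl:template>

</xsl:stylesheet>
```

The paths are built from stylesheet parameters rather than Ant properties because curly braces inside a literal attribute are read as XPath expressions by XSLT, so a literal `${...}` would need its braces doubled.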