Technical documentation

HTML2HDOC folder

  • an Ant file: html_to_hdoc.ant

  • a build.properties file

  • an XSLSources folder with 2 .xsl files:

    • xhtmlToHdoc.xsl

    • removeEmptyTag.xsl

  • a JavaSources folder with 3 .jar files, including:

    • htmlcleaner-2.5.jar

    • XMLParser.jar

The sources of the XML parser are in the src folder inside the JavaSources folder. The htmlcleaner sources are available at http://htmlcleaner.sourceforge.net/

  • a src folder containing a sample file:

    • file.html

  • an out folder

Purpose of each file

  • html_to_hdoc.ant

    Calls the other scripts to transform the .html file into .hdoc

  • build.properties

    File that contains the paths to the various files used

  • xhtmlToHdoc.xsl

    XSL file that performs the main transformation from XHTML to hdoc

  • removeEmptyTag.xsl

    XSL file that removes the empty tags

  • htmlcleaner-2.5.jar

    Library that transforms the code from HTML to XHTML.

  • XMLParser.jar

    SAX parser that processes the XHTML code and handles the problems we met with lists that include "p" tags, or with content of type "flow" (see hdoc-content.rng) contained in "p" tags.
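To illustrate the kind of problem XMLParser handles, here is a minimal Python sketch of one way a SAX handler could use a stack to move block content out of "p" tags. It is not the project's Java code: the class name HoistOutOfP, the function hoist, and the BLOCK_TAGS set are hypothetical, and the sketch only handles block elements that are direct children of a "p" tag.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Assumption: elements that may not stay inside <p>. The real list is
# determined by the project's XMLParser and hdoc-content.rng.
BLOCK_TAGS = {"ul", "ol", "img", "table", "video"}


class HoistOutOfP(xml.sax.ContentHandler):
    """Closes an open <p> before a block element and reopens it afterwards."""

    def __init__(self):
        super().__init__()
        self.out = []    # serialized output fragments
        self.stack = []  # (element name, was it hoisted out of a <p>?)

    def startElement(self, name, attrs):
        # Only block elements whose direct parent is <p> are hoisted.
        hoisted = (name in BLOCK_TAGS
                   and bool(self.stack)
                   and self.stack[-1][0] == "p")
        if hoisted:
            self.out.append("</p>")
        attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.out.append(f"<{name}{attr_text}>")
        self.stack.append((name, hoisted))

    def endElement(self, name):
        _, hoisted = self.stack.pop()
        self.out.append(f"</{name}>")
        if hoisted:
            # Reopen the paragraph that was interrupted by the block element.
            self.out.append("<p>")

    def characters(self, content):
        self.out.append(escape(content))


def hoist(xhtml: str) -> str:
    """Parses well-formed XHTML and returns it with block content moved out of <p>."""
    handler = HoistOutOfP()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return "".join(handler.out)


if __name__ == "__main__":
    print(hoist('<body><p>a<img src="x.png"/>b</p></body>'))
```

Reopening the paragraph after each hoisted element can leave empty "p" tags behind when nothing follows the block element, so a sketch like this relies on a later clean-up pass to remove them.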

How the script works

  1. The file html_to_hdoc.ant first calls htmlcleaner-2.5.jar to convert the HTML into well-formed XHTML

  2. Then it calls XMLParser, which uses a stack to remove from "p" tags the lists and the tags such as img, table, and video that cannot be included in a "p" tag

  3. Then it calls htmlcleaner-2.5.jar a second time, because the Saxon parser is confused by "&" and "<", and htmlcleaner solves this problem (there is surely a better solution, but I did not have the time to find and implement it)

  4. Then it calls our xhtmlToHdoc.xsl file, which performs the main transformation of the HTML content to the hdoc format. It transforms all the non-empty "div" tags of the "body" tag, and all their children, into "div". Only the content of these "div" tags is taken into account.

  5. Then XMLParser is called again to solve the same kind of problems as in step 2

  6. htmlcleaner is then called once more

  7. Finally, removeEmptyTag.xsl is called to remove all empty tags
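The last step can be pictured with a short Python sketch. It is not a translation of removeEmptyTag.xsl, whose exact rules are not described here: it assumes that "empty" means no child elements and no non-whitespace text, and the KEEP_EMPTY set and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: elements that are legitimately empty and must be kept.
KEEP_EMPTY = {"img", "br"}


def is_empty(elem: ET.Element) -> bool:
    """True when the element has no children and no non-whitespace text."""
    return (len(elem) == 0
            and not (elem.text or "").strip()
            and elem.tag not in KEEP_EMPTY)


def remove_empty(parent: ET.Element) -> None:
    """Removes empty descendants bottom-up, so parents emptied on the way go too."""
    previous = None  # last sibling that was kept
    for child in list(parent):
        remove_empty(child)
        if is_empty(child):
            # Keep the text that followed the removed element.
            if child.tail:
                if previous is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)
        else:
            previous = child


def strip_empty_tags(xml_text: str) -> str:
    """Parses a well-formed document and returns it without empty elements."""
    root = ET.fromstring(xml_text)
    remove_empty(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_empty_tags("<div><p></p><p>text</p></div>"))
```

Processing children before their parent means that an element containing only empty elements is itself removed in the same pass. The root element is always kept.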