Technical documentation
HTML2HDOC folder
an Ant file: html_to_hdoc.ant
a properties file: build.properties
a folder XSLSources with two .xsl files:
xhtmlToHdoc.xsl
removeEmptyTag.xsl
a folder JavaSources with 3 .jar files, including:
htmlcleaner-2.5.jar
XMLParser.jar
The sources of the XML parser are in the src folder of the JavaSources folder. The sources of htmlcleaner can be found on its website: http://htmlcleaner.sourceforge.net/
a folder src containing a sample file:
file.html
a folder out
Purpose of each file
html_to_hdoc.ant
Calls the other scripts to transform the .html file into a .hdoc file.
build.properties
File that contains the paths to the various files used by the script.
xhtmlToHdoc.xsl
XSL file that performs the main transformation from XHTML to hdoc.
removeEmptyTag.xsl
XSL file that removes the empty tags.
htmlcleaner-2.5.jar
Library that converts HTML code into XHTML.
XMLParser.jar
SAX parser that rewrites the XHTML code to handle the problems we met with lists that include "p" tags, and with content of type "flow" (see hdoc-content.rng) contained in "p" tags.
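To illustrate the stack-based idea behind XMLParser, here is a minimal sketch of a SAX handler that moves lists and tags such as img, table and video out of a "p" tag. It is not the project's XMLParser source: the class name ParagraphSplitter, the method split and the set of lifted tags are assumptions made for this example, and the reopened paragraph does not keep the attributes of the original "p".

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the actual XMLParser sources.
class ParagraphSplitter extends DefaultHandler {
    // Assumed set of tags that must not stay inside a <p>.
    private static final Set<String> BLOCKS =
        new HashSet<>(Arrays.asList("ul", "ol", "img", "table", "video"));

    private final StringBuilder out = new StringBuilder();
    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();
    // Parallel stack: was the matching element lifted out of a <p>?
    private final Deque<Boolean> lifted = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // A block tag whose direct parent is <p> forces the paragraph to close first.
        boolean lift = BLOCKS.contains(qName) && "p".equals(open.peek());
        if (lift) {
            out.append("</p>");
        }
        open.push(qName);
        lifted.push(lift);
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        out.append("</").append(qName).append('>');
        open.pop();
        // Reopen the paragraph that was closed before the lifted element.
        if (lifted.pop()) {
            out.append("<p>");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Re-escape text so that the output stays well formed.
        out.append(escape(new String(ch, start, length)));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses well-formed XHTML and returns the rewritten markup.
    static String split(String xhtml) {
        try {
            ParagraphSplitter handler = new ParagraphSplitter();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("<div><p>a<ul><li>b</li></ul>c</p></div>"));
    }
}
```

With this input, the sketch prints <div><p>a</p><ul><li>b</li></ul><p>c</p></div>. When the lifted element is the first or last child of the paragraph, an empty "p" is left behind, which is the kind of residue an empty-tag removal pass can clean up afterwards.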
How the script works
The file html_to_hdoc.ant first calls htmlcleaner-2.5.jar to convert the HTML into well-formed XHTML.
Then it calls XMLParser, which uses a stack to move lists and tags such as img, table and video, which cannot be included in a p tag, out of the p tag.
Then it calls htmlcleaner-2.5.jar a second time, because the Saxon parser gets confused by "&" and "<", and htmlcleaner solves this problem (there is surely a better solution, but I did not have the time to find and implement it).
Then it calls our xhtmlToHdoc.xsl file, which performs the main transformation of the HTML content into the hdoc format. It transforms all the non-empty "div" tags of the "body" tag, and all their children, into "div". Only the content of these "div" tags is taken into account.
Then XMLParser is called again to solve the same kind of problem as the first time.
htmlcleaner is then called once again.
Finally, removeEmptyTag.xsl is called to remove all the empty tags.
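The last step can be sketched as follows. This is not the content of removeEmptyTag.xsl: the embedded stylesheet, the class name EmptyTagRemover and the definition of "empty" (an element with no non-whitespace text and no attribute anywhere in its subtree) are assumptions made for this example. The sketch uses the XSLT 1.0 processor bundled with the JDK rather than Saxon.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch, not the project's removeEmptyTag.xsl.
class EmptyTagRemover {
    // Identity transform plus one template that drops "empty" elements.
    // Assumption: an element is empty when its subtree holds no
    // non-whitespace text and no attribute (so <img src="..."/> is kept).
    private static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='*[not(descendant-or-self::*/@*) and not(normalize-space())]'/>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string.
    static String removeEmpty(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(removeEmpty(
            "<div><p>text</p><p></p><div><p> </p></div><img src=\"a.png\"/></div>"));
    }
}
```

In this example the paragraph containing "text" and the img tag are kept, while the empty "p" and the "div" that only wraps a blank "p" are dropped. Matching on the whole subtree means a parent that would become empty once its children are removed disappears in the same pass.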