Technical documentation [Hdoc Converter Project]

Test website

The test website itself is available at the following address: tuxa.sme.utc/~nf29a012/nf29/

Explanation of the steps

Inside the main folder, only two folders (ant and uploads) are directly related to my project; the others are part of the website itself and have not been modified.

The “uploads” folder contains both the uploaded source/test files and the converted result files. Files with the .etherpad extension are renamed exports from the Etherpad platform, and files with the .hdoc extension are the converted versions of these files.

The “ant” folder contains one sub-folder per converter; mine is “etherpad_to_hdoc”. Inside it, there are 2 files and 5 folders, including the following:

etherpad_to_hdoc.ant: The Ant build file launched by the server when a conversion starts.
build.properties: A file containing the variables and paths used in etherpad_to_hdoc.ant.
input: A storage folder I use to keep test files (Etherpad documents which can be uploaded and converted).
src: The source folder of this project. It contains a Perl script (normalization.pl) and an XSL transformation file (etherpadTohdoc.xsl).
tmp: This folder stores the intermediate file produced during a conversion. That file is the output of the Perl script (launched first) and is then used as the input of the XSL transformation.
to_zip: The output folder of every task in the Ant script except the final zipping. It contains the structure of an Hdoc package and the converted XML file for the document.

The steps of the Ant build file are the following:

fill_container: creates the “empty” structure of an Hdoc package in the to_zip folder.
perl_exec: launches the first transformation on the .etherpad file (which is actually an HTML file renamed with the .etherpad extension) and writes its output to tmp/intermediaire.xml.
xslt_exect: launches the second transformation on tmp/intermediaire.xml and writes the resulting XML file to the right sub-directory of the to_zip folder.
zip: creates the compressed archive of the Hdoc package from the contents of the to_zip folder.
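
The order of these four steps can be sketched in Python as follows. This is only an illustration of the data flow, not the Ant build file itself: the function names, the “content” sub-directory and the “document.xml” file name are hypothetical, and the two transformations are passed in as plain functions that stand in for the Perl script and the XSL stylesheet.

```python
import zipfile
from pathlib import Path
from typing import Callable


def fill_container(to_zip: Path) -> Path:
    """Step 1: create the empty package skeleton inside to_zip."""
    content_dir = to_zip / "content"  # hypothetical sub-directory name
    content_dir.mkdir(parents=True, exist_ok=True)
    return content_dir


def zip_package(to_zip: Path, destination: Path) -> Path:
    """Step 4: compress every file under to_zip into a single archive."""
    with zipfile.ZipFile(destination, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(to_zip.rglob("*")):
            if path.is_file():
                # Store paths relative to to_zip so the archive has no prefix.
                archive.write(path, path.relative_to(to_zip).as_posix())
    return destination


def convert(
    source: Path,
    work_dir: Path,
    normalize: Callable[[str], str],  # stands in for the Perl script
    transform: Callable[[str], str],  # stands in for the XSL transformation
) -> Path:
    """Run the four steps in order and return the path of the archive."""
    to_zip = work_dir / "to_zip"
    tmp = work_dir / "tmp"
    tmp.mkdir(parents=True, exist_ok=True)

    # fill_container
    content_dir = fill_container(to_zip)

    # perl_exec: first transformation, written to the intermediate file
    intermediate = tmp / "intermediaire.xml"
    raw = source.read_text(encoding="utf-8")
    intermediate.write_text(normalize(raw), encoding="utf-8")

    # xslt_exect: second transformation, written into the package skeleton
    result = transform(intermediate.read_text(encoding="utf-8"))
    (content_dir / "document.xml").write_text(result, encoding="utf-8")

    # zip: the archive is written next to to_zip, not inside it
    return zip_package(to_zip, work_dir / (source.stem + ".hdoc"))
```

Keeping the intermediate file on disk, as the tmp folder does, makes it possible to inspect the output of the first transformation when the second one fails.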

We might wonder why this conversion process includes a Perl script, for a total of two transformations. There are two reasons for using Perl as an intermediate step:

A Perl script makes it easy to clean the raw output of the Etherpad HTML export, by looping over the input file and filtering it with regular expressions.
Formatting markup such as specific divs or titles had to be handled and transformed into valid XML tags; otherwise the file could not be processed by XSLT.
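
To illustrate the kind of clean-up meant here, the following Python sketch loops over the lines of an HTML export and applies a few regular-expression rules so that the result is well-formed XML. It is not a port of normalization.pl: the rules shown (dropping the doctype, closing br tags, replacing the &nbsp; entity, removing blank lines) are assumptions about what such an export may contain, chosen because each of them would otherwise make an XML parser reject the file or add noise to it.

```python
import re
import xml.etree.ElementTree as ET

# Assumed clean-up rules; the real script may handle different markup.
DOCTYPE = re.compile(r"<!doctype[^>]*>", re.IGNORECASE)
UNCLOSED_BR = re.compile(r"<br\s*>", re.IGNORECASE)


def normalize(html: str) -> str:
    """Return the export with a few HTML-only constructs made XML-safe."""
    cleaned = []
    for line in html.splitlines():
        line = DOCTYPE.sub("", line)             # an HTML doctype is not needed
        line = UNCLOSED_BR.sub("<br/>", line)    # empty elements must be closed
        line = line.replace("&nbsp;", "&#160;")  # &nbsp; is undefined in plain XML
        if line.strip():                         # skip lines left empty
            cleaned.append(line.strip())
    return "\n".join(cleaned)


def is_well_formed(xml_text: str) -> bool:
    """Check that an XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

For example, is_well_formed returns False for "<html><body>Title<br></body></html>" and True once the same string has gone through normalize, which is the property the XSL transformation depends on.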