The Java parser

The first task converts the HTML content into HDoc content, without the high-level structure (in fact, the generated content is valid only from the “body” tag onward).

  • The compiled script is a jar named XMLParser1.jar and located in the java folder.

The Java source is provided in the XMLParser folder. It contains a Main class and two SAX handler classes: TitlesToHeadersHandler and TitlesToSectionsHandler.

The Main class

The Main class simply launches the two handlers, after first wrapping the HTML content in a root tag named “root” so that it is well-formed XML (otherwise it would be impossible to parse it with SAX). This wrapping is done by the addRoot() method.
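To illustrate why the root tag is needed, here is a minimal sketch, not the actual Main class. The class name AddRootSketch, the signature of addRoot() and the isWellFormed() helper are assumptions made for this example. The real method may read and write files instead of strings.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AddRootSketch {

    // Hypothetical signature: wraps an HTML fragment in a single <root> element.
    static String addRoot(String htmlFragment) {
        return "<root>" + htmlFragment + "</root>";
    }

    // Hypothetical helper: returns true if a SAX parser accepts the text.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // SAX reports a fatal error when there is more than one top-level element.
            return false;
        }
    }

    public static void main(String[] args) {
        String fragment = "<h1>Title</h1><p>Text</p>";
        System.out.println(isWellFormed(fragment));          // two top-level elements
        System.out.println(isWellFormed(addRoot(fragment))); // single root element
    }
}
```

A fragment with several top-level elements is rejected by the parser, while the same fragment wrapped in a single root element is accepted.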

The TitlesToSectionsHandler

The first handler is the TitlesToSectionsHandler. This handler transforms linear sequences of HTML titles and paragraphs into hierarchical HDoc sections. It works by opening and closing section tags as needed when it encounters title tags (h1, h2 and h3).

The parser also tries to preserve the hierarchy order. For example, the tag sequence H1 H2 H1 H2 H3 produces:

Section (H1)

------Section (H2)

Section (H1)

------Section (H2)

------------Section (H3)
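The nesting logic can be sketched with a stack of open section levels. This is an illustration written for this document, not the actual TitlesToSectionsHandler. The class name SectionNestingSketch and the convert() method are assumptions, attributes are dropped for brevity, and titles are assumed to be direct children of the root element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SectionNestingSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    // Levels (1, 2 or 3) of the sections that are currently open, innermost first.
    private final Deque<Integer> openLevels = new ArrayDeque<>();

    // Returns 1..3 for h1..h3 and 0 for any other tag.
    private static int titleLevel(String tag) {
        switch (tag) {
            case "h1": return 1;
            case "h2": return 2;
            case "h3": return 3;
            default: return 0;
        }
    }

    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int level = titleLevel(qName);
        if (level > 0) {
            // Close every open section at the same or a deeper level.
            while (!openLevels.isEmpty() && openLevels.peek() >= level) {
                out.append("</section>");
                openLevels.pop();
            }
            out.append("<section>");
            openLevels.push(level);
        }
        // The title tag itself is copied; a later pass can turn it into a header.
        out.append('<').append(qName).append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("root")) {
            // End of input: close whatever sections are still open.
            while (!openLevels.isEmpty()) {
                out.append("</section>");
                openLevels.pop();
            }
        }
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Hypothetical entry point: parses a rooted document and returns the result.
    static String convert(String xml) {
        try {
            SectionNestingSketch handler = new SectionNestingSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert(
                "<root><h1>A</h1><h2>B</h2><h1>C</h1><h2>D</h2><h3>E</h3></root>"));
    }
}
```

For the H1 H2 H1 H2 H3 sequence, the second H1 closes both the H2 section and the first H1 section before opening a new one, which reproduces the hierarchy shown above.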

The TitlesToHeadersHandler

The second handler is the TitlesToHeadersHandler. This handler transforms the HTML title tags into valid section headers. In HDoc, a title is represented by an h1 tag wrapped in a header tag located at the beginning of a section. So, to create valid titles, the handler replaces every title tag with a header tag containing an h1 tag. The HDoc schema also requires the content of a section (other than subsections) to be wrapped in a div tag, so the handler wraps all content that is not a subsection in a div tag. Note that this handler can also ignore HTML tags that are not supported by HDoc, via the doWeIgnore() method.
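The header and div wrapping can be sketched as follows. This is an illustration, not the actual TitlesToHeadersHandler. The class name HeaderWrappingSketch, the convert() method, the signature of doWeIgnore() and the list of ignored tags are all assumptions. The sketch also assumes that an ignored tag is dropped while its text content is kept, which may differ from the real behaviour.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeaderWrappingSketch extends DefaultHandler {
    // Assumption: illustrative list only, not the tags the real parser ignores.
    private static final Set<String> IGNORED = new HashSet<>(Arrays.asList("font", "center"));

    private final StringBuilder out = new StringBuilder();
    private int sectionDepth = 0;    // number of enclosing section tags
    private boolean divOpen = false; // true while a content div is open
    private boolean inTitle = false; // true between a title's start and end tag

    // Hypothetical signature for the doWeIgnore() method named in the text.
    static boolean doWeIgnore(String tag) {
        return IGNORED.contains(tag);
    }

    private static boolean isTitle(String tag) {
        return tag.equals("h1") || tag.equals("h2") || tag.equals("h3");
    }

    private void closeDiv() {
        if (divOpen) {
            out.append("</div>");
            divOpen = false;
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("section")) {
            closeDiv(); // a subsection must not sit inside the content div
            sectionDepth++;
            out.append("<section>");
        } else if (isTitle(qName)) {
            inTitle = true;
            out.append("<header><h1>"); // every title level becomes header + h1
        } else {
            if (sectionDepth > 0 && !divOpen && !inTitle) {
                out.append("<div>");
                divOpen = true;
            }
            if (!doWeIgnore(qName)) {
                out.append('<').append(qName).append('>');
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("section")) {
            closeDiv();
            sectionDepth--;
            out.append("</section>");
        } else if (isTitle(qName)) {
            inTitle = false;
            out.append("</h1></header>");
        } else if (!doWeIgnore(qName)) {
            out.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        out.append(text.replace("&", "&amp;").replace("<", "&lt;"));
    }

    // Hypothetical entry point: parses sectioned XML and returns the result.
    static String convert(String xml) {
        try {
            HeaderWrappingSketch handler = new HeaderWrappingSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("input could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("<root><section><h1>A</h1><p>x</p>"
                + "<section><h2>B</h2><p>z</p></section></section></root>"));
    }
}
```

A single boolean is enough to track the div in this sketch because the div is always closed before a subsection opens and before the enclosing section closes.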

That's it for the Java parser. At the end of the task, the Main class has created a file named intermediate_addroot.xml in the java/out folder, the first handler has created a file named intermediate.xml in the same folder, and the second and last handler has written its output to the path given by the second argument that the ANT task passes to the Java parser, i.e. content.xml in the java/out folder.