Capitalisation

Thanks to this project, I had the opportunity to improve my knowledge of the following technologies:

  • SAX

  • Ant

  • XSLT

First, I tried to do the conversion using XSLT, but I quickly realized that it was rather complicated to handle all the problems that way. So I wrote a SAX parser and an Ant Java class. The combination of the two technologies works quite well.

HTML content is very large and variable, so using SAX seems to be the most efficient way to do the conversion.

HTML content and the way tags are used change a lot from one site to another. For this reason, it is very hard to make a good html2hdoc converter. In an HTML file, a lot of information is carried by class attributes, which we cannot keep or interpret reliably. For instance, it is very common to find a "div" tag with the attribute id="footer"; we could assume it is a footer and try to use that attribute for the hdoc conversion. But this is not a sound approach, except perhaps in a far more advanced html2hdoc library.

From what I learned by working on this project, the main aim is to integrate everything little by little, without worrying that some HTML files produce empty output in the meantime. For instance, when tables were not handled at all, the Wikipedia page, which uses a lot of tables, was very poor in content after the conversion. I was very disappointed by that, so I immediately worked on supporting tables instead of concentrating on "div" content, for instance.

If I had to redo this work, I would take a tag, choose which kind of tag I want it to become in hdoc, and first try to get its content as plain text without caring about the tags inside it. Then I would deal with the children of that specific tag. The problem with HTML, when trying to do something a little more general, is that there is always an exception, so I think it is better to be more specific.
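
To illustrate this tag-by-tag idea, here is a minimal sketch of a SAX handler that picks one tag and collects its content as plain text, ignoring whatever tags are nested inside it. It is not the code of the actual converter: the class name TagTextCollector and the method collect are hypothetical, and the sketch assumes the HTML has already been cleaned into well-formed XHTML, since a plain SAX parser rejects malformed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the flattened text of every occurrence of one tag.
class TagTextCollector extends DefaultHandler {
    private final String tag;
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // > 0 while we are inside the target tag

    TagTextCollector(String tag) {
        this.tag = tag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++;                 // nested tag: keep its text, ignore the tag itself
        } else if (qName.equals(tag)) {
            depth = 1;               // entering the target tag
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            texts.add(current.toString().trim()); // target tag closed
        }
    }

    // Assumption: the input is well-formed XHTML without a DOCTYPE.
    static List<String> collect(String xhtml, String tag) {
        TagTextCollector handler = new TagTextCollector(tag);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.texts;
    }
}
```

Calling TagTextCollector.collect(xhtml, "p") would return one string per "p" element. Once the text of a tag is extracted reliably, its children can then be handled one by one, as described above.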