Technical documentation

Processing diagram and explanation

  • wp.xml : input file ;

  • Transform shortcodes + dates : XSLT cannot parse shortcodes, so a Java program parses them and converts them to the appropriate hdoc tags. The same program also converts the dates, because hdoc requires ISO dates. Both tasks are performed by java_converter.jar ;

  • Remove CDATA from content tag : By default, the publication content in wp.xml is wrapped in a CDATA section, so it is not parsed as markup. This content must be parsed, either to extract the tags produced by "Transform shortcodes + dates" for result.xml, or to keep the whole publication content HTML for result_for_html-to-hdoc.html. The wp_to_hdoc_clean.xsl file therefore removes the CDATA ;

  • Transform all the meta-data : The global, post and page meta-data are transformed (see the Supported section for more details). In addition, wp_to_hdoc.xsl keeps the tags produced by "Transform shortcodes + dates" and excludes the content HTML. This process produces result.xml, a file that respects the hdoc schema (hdoc1-xhtml.rng). wp_to_hdoc_htmltohdocVersion.xsl performs the same transformation, except that it keeps the whole publication content HTML. This second process produces result_for_html-to-hdoc.html, which does not necessarily respect the hdoc schema and is intended to be converted by the html_to_hdoc converter.
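The "Transform shortcodes + dates" step can be illustrated with a minimal Java sketch. This is not the code of convertMedia.java. The class name ConverterSketch, the [audio src="..."] shortcode, the <object data="..."/> target tag and the RFC 1123 input date format are all assumptions chosen for illustration; the shortcodes and hdoc tags actually handled by java_converter.jar may differ.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConverterSketch {

    // Hypothetical shortcode: [audio src="file.mp3"]
    private static final Pattern AUDIO =
            Pattern.compile("\\[audio\\s+src=\"([^\"]*)\"\\s*\\]");

    // Assumes the input date uses the RFC 1123 layout,
    // e.g. "Mon, 05 Jan 2015 10:30:00 +0000".
    // Returns the ISO 8601 calendar date (yyyy-MM-dd).
    public static String toIsoDate(String pubDate) {
        OffsetDateTime parsed =
                OffsetDateTime.parse(pubDate, DateTimeFormatter.RFC_1123_DATE_TIME);
        return parsed.toLocalDate().toString();
    }

    // Replaces each hypothetical audio shortcode with a placeholder
    // XML tag; the real hdoc target tag is not specified here.
    public static String convertShortcodes(String content) {
        Matcher matcher = AUDIO.matcher(content);
        return matcher.replaceAll("<object data=\"$1\"/>");
    }

    public static void main(String[] args) {
        System.out.println(toIsoDate("Mon, 05 Jan 2015 10:30:00 +0000"));
        System.out.println(convertShortcodes("Intro [audio src=\"song.mp3\"] outro"));
    }
}
```

A regular expression is enough for a flat shortcode like the one assumed here. Nested or paired shortcodes would need a small parser instead.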

Sources description

The wp_to_hdoc.zip file contains the following files and folders:

  • html_to_hdoc (Folder that contains the html_to_hdoc converter) ;

  • jars (Folder that contains all the jar files) :

    • java_converter.jar (File that converts the dates and content shortcodes).

  • javaSources (Folder that contains the Java source files) :

    • convertMedia.java (Java source code of java_converter.jar).

  • xsl (Folder that contains all the xsl files) :

    • wp_to_hdoc.xsl (the xsl file that creates an xml file which respects the hdoc schema by excluding the publication content html) ;

    • wp_to_hdoc_clean.xsl (the xsl file that removes the CDATA from the <content:encoded> tag in order to enable html parsing) ;

    • wp_to_hdoc_htmltohdocVersion.xsl (the xsl file that creates an html file which does not necessarily respect the hdoc schema, because it also includes the publication content html).

  • Launch.bat (batch file that launches the converter) ;

  • wp.xml (sample used by default by the converter) ;

  • wp_to_hdoc.ant (ant file that manages the conversion process by calling the appropriate files at the right time) ;

  • wp_to_hdoc.properties (properties file used by wp_to_hdoc.ant that mainly contains the paths).

You can find detailed explanations of each source file in the comments.
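To illustrate what a CDATA-removal stylesheet such as wp_to_hdoc_clean.xsl has to achieve, here is a minimal Java sketch that applies a small inline XSLT 1.0 stylesheet through the standard javax.xml.transform API. The inline stylesheet is an assumption, not the content of wp_to_hdoc_clean.xsl: it copies everything unchanged and writes the text of <content:encoded> with disable-output-escaping, which is one common way to turn CDATA-protected HTML into real markup. The class name CdataUnwrapSketch and the sample input are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CdataUnwrapSketch {

    // Assumed stylesheet: identity copy, plus unescaped output for the
    // text inside <content:encoded>.
    private static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"content:encoded/text()\">"
        + "<xsl:value-of select=\".\" disable-output-escaping=\"yes\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result.
    public static String unwrap(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml =
              "<item xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">"
            + "<content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>"
            + "</item>";
        System.out.println(unwrap(xml));
    }
}
```

The output is only parseable by a later XSLT step if the unwrapped content is itself well-formed XML. Also, disable-output-escaping is an optional serializer feature, so it takes effect only when the result is written to a stream or file, as it is here.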