How to convert Mediawiki text to HTML

Introduction

This page shows some examples of how to convert Wikipedia markup text into HTML text.

For the following examples, make sure that the current bliki-core-3.0.xx.jar (version >= 3.0.19) is on your classpath. It is available in the latest download at https://code.google.com/p/gwtwiki/downloads/list.

The bliki core depends on the following JARs, which can be downloaded from https://gwtwiki.googlecode.com/files/bliki.core.libs.001.zip:

  • commons-logging-1.0.4.jar
  • commons-lang-2.4.jar
  • commons-codec-1.2.jar
  • commons-httpclient-3.1.jar for the MediaWiki api.php support (see the MediaWikiAPISupport page)
  • commons-compress-1.0.jar for the MediaWiki XML dump support (see the MediaWikiDumpSupport page)
  • junit-4.5.jar (only for JUnit tests)

If you use Eclipse (https://www.eclipse.org/), make sure that your project's text file encoding is set to UTF-8.
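
The same care applies at runtime: wiki text read from disk is best decoded as UTF-8 explicitly rather than with the platform default charset. The following is a minimal sketch using only the Java standard library. The class Utf8WikiTextSketch and its method names are hypothetical and not part of bliki.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of bliki: reads and writes wiki text as UTF-8.
final class Utf8WikiTextSketch {

    // Decode the file's bytes as UTF-8, independent of the platform default charset.
    static String readWikiText(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Encode the text as UTF-8 before writing it to the file.
    static void writeWikiText(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("wikitext", ".txt");
        String original = "Z\u00fcrich [[Gr\u00f6\u00dfe]]";
        writeWikiText(tmp, original);
        // Report whether the non-ASCII characters survived the round trip.
        System.out.println(readWikiText(tmp).equals(original));
        Files.delete(tmp);
    }
}
```

The string returned by readWikiText could then be handed to one of the conversion methods described in the next section.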

Basic Usage

The following Java code snippet:

#!java
...
  import info.bliki.wiki.model.WikiModel;
...


...
  String htmlText = WikiModel.toHtml("This is a simple [[Hello World]] wiki tag");
...

returns the following HTML text string:

#!html
<p>This is a simple <a href="/Hello_World" title="Hello World">Hello World</a> wiki tag</p>

If you would like to write to a java.io.Writer, you can use the toHtml() method like this:

#!java
  java.io.StringWriter writer = new java.io.StringWriter();

  try {
    WikiModel.toHtml("This is a simple [[Hello World]] wiki tag", writer);
    writer.flush();
    ...

    writer.close();
  } catch (java.io.IOException e) {
    e.printStackTrace();
  }

Customizing the conversion output

The general idea behind the IWikiModel.java interface for wiki-to-HTML conversion is that the rendering of common wiki syntax is hidden inside the internal WikipediaParser. Users of the API should derive a new wiki model class from WikiModel.java or AbstractWikiModel.java and customize the conversion by overriding the predefined class methods.

A simple wiki text to HTML conversion looks like this:

#!java
public static void main(String[] args) {
    WikiModel wikiModel = new WikiModel("https://www.mywiki.com/wiki/${image}",
                                        "https://www.mywiki.com/wiki/${title}");
    String htmlStr = wikiModel.render("This is a simple [[Hello World]] wiki tag");
    System.out.print(htmlStr);
}

and creates the following HTML snippet:

#!html
<p>This is a simple <a href="https://www.mywiki.com/wiki/Hello_World" title="Hello World">Hello World</a> wiki tag</p>

As you can see, the ${title} variable is replaced by the target of the wiki link, following the rules specified in the https://meta.wikimedia.org/wiki/Help:Link article.
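
The substitution can be pictured with a minimal sketch. It is not bliki's implementation and covers only the one rule visible in the output above (spaces become underscores). The class TitleTemplateSketch and its expand method are hypothetical names used for illustration.

```java
// Hypothetical helper, not bliki's implementation: expands a ${title} link template.
final class TitleTemplateSketch {

    // Replace runs of whitespace in the title with single underscores,
    // then substitute the result for the literal ${title} placeholder.
    static String expand(String template, String title) {
        String normalized = title.trim().replaceAll("\\s+", "_");
        // String.replace treats "${title}" literally, not as a regular expression.
        return template.replace("${title}", normalized);
    }

    public static void main(String[] args) {
        System.out.println(expand("https://www.mywiki.com/wiki/${title}", "Hello World"));
        // prints https://www.mywiki.com/wiki/Hello_World
    }
}
```

The Help:Link article describes further rules that this sketch leaves out.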

The preferred way to use your own model implementation is to create a new MyWikiModel(...) instance for every wiki text you would like to render.

If you would like to reuse an existing myWikiModel instance instead, call the setUp() method before rendering and the tearDown() method after rendering has finished.

#!java
    try {
      myWikiModel.setUp();

      String htmlStr = myWikiModel.render(<some text>);
      System.out.print(htmlStr);
    } finally {
      myWikiModel.tearDown();
    }

In the setUp() or tearDown() method of your own MyWikiModel, you have to call the AbstractWikiModel#initialize() method or reinitialize the protected attributes of AbstractWikiModel and WikiModel to avoid memory leaks.
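
Why the reset matters can be shown with a small stand-in that uses no bliki classes. ReusableModelSketch is a hypothetical class that accumulates per-render state (the link targets it has seen). If setUp() and tearDown() did not clear that state, it would grow with every render and leak into later results.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable wiki model; it is not a bliki class.
final class ReusableModelSketch {
    // Per-render state that must not survive from one render to the next.
    private final List<String> collectedLinks = new ArrayList<>();

    void setUp() { collectedLinks.clear(); }

    void tearDown() { collectedLinks.clear(); }

    // Pretend rendering: record the target of every [[...]] link and return the text unchanged.
    String render(String wikiText) {
        int start = 0;
        while ((start = wikiText.indexOf("[[", start)) >= 0) {
            int end = wikiText.indexOf("]]", start);
            if (end < 0) {
                break;
            }
            collectedLinks.add(wikiText.substring(start + 2, end));
            start = end + 2;
        }
        return wikiText;
    }

    // Return a copy so callers keep their result after tearDown() clears the state.
    List<String> links() { return new ArrayList<>(collectedLinks); }

    // Same try/finally shape as the snippet above.
    static List<String> renderAndCollect(ReusableModelSketch model, String text) {
        try {
            model.setUp();
            model.render(text);
            return model.links();
        } finally {
            model.tearDown();
        }
    }

    public static void main(String[] args) {
        ReusableModelSketch model = new ReusableModelSketch();
        System.out.println(renderAndCollect(model, "[[A]]"));
        System.out.println(renderAndCollect(model, "[[B]] and [[C]]"));
    }
}
```

Because the state is cleared around each render, the second call reports only the links of the second text.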

You can, for example, override the WikiModel#parseInternalImageLink() method to change the default rendering behaviour of the [[Image:...]] tag.

#!java
public class WikiTestModel extends WikiModel {
  public WikiTestModel(String imageBaseURL, String linkBaseURL) {
    super(imageBaseURL, linkBaseURL);
  }
  public void parseInternalImageLink(String imageNamespace, String rawImageLink) {
    ...

    ...
  }
}

To avoid cross-site scripting risks, the rendering engine does not allow the style attribute by default. You can define the style attribute as allowed in a static block of your WikiModel implementation.

#!java
  static {
    TagNode.addAllowedAttribute("style");
    ...
  }

Look in the WikiModel.java and AbstractWikiModel.java sources for an example.
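
The idea behind such an allowlist can be sketched without bliki. AttributeAllowlistSketch is a hypothetical class and does not reflect how TagNode is implemented. The initial set of allowed names is an assumption chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of an attribute allowlist; not bliki's TagNode.
final class AttributeAllowlistSketch {
    // Attribute names are compared case-insensitively.
    private static final Set<String> ALLOWED = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);

    static {
        // Assumed defaults for this sketch only.
        ALLOWED.add("title");
        ALLOWED.add("class");
    }

    static void addAllowedAttribute(String name) {
        ALLOWED.add(name);
    }

    // Keep only attributes whose names are on the allowlist; everything else is dropped.
    static Map<String, String> filter(Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            if (ALLOWED.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("class", "infobox");
        attrs.put("style", "color:red");
        attrs.put("onclick", "evil()");
        System.out.println(filter(attrs).keySet()); // style and onclick are dropped
        addAllowedAttribute("style");
        System.out.println(filter(attrs).keySet()); // style is now kept, onclick still dropped
    }
}
```

An allowlist fails closed: an attribute that nobody registered, such as onclick here, never reaches the output.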

Advanced example for converting Wikipedia texts to HTML

A more advanced example can be found in the HTMLCreatorExample.java file. The first time you run this example, it downloads the Tom Hanks wiki source from Wikipedia through the Wikipedia API. The downloaded wiki texts and templates are stored in an Apache Derby database, and the associated images are downloaded into the image directory C:\temp\WikiImages, which must already exist (see the APIWikiModel#getRawWikiContent() method). The first run creates a new Derby database in the directory C:\temp\WikiDB. Every subsequent run downloads only the Tom Hanks wiki source, because the associated templates and images are already cached in the Derby database and in the image directory.

#!java
    public static void testWikipediaENAPI(String title) {
        String[] listOfTitleStrings = {
            title
        };
        String titleURL = Encoder.encodeTitleLocalUrl(title);
        User user = new User("", "", "https://en.wikipedia.org/w/api.php");
        user.login();
        String mainDirectory = "c:/temp/";
        // the following subdirectory should not exist if you would like to create a
        // new database
        String databaseSubdirectory = "WikiDB";
        // the following directory must exist for image downloads
        String imageDirectory = "c:/temp/WikiImages";
        // the generated HTML will be stored in this file name:
        String generatedHTMLFilename = mainDirectory + titleURL + ".html";

        WikiDB db = null;

        try {
            db = new WikiDB(mainDirectory, databaseSubdirectory);
            APIWikiModel wikiModel = new APIWikiModel(user, db, "${image}", "${title}", imageDirectory);
            DocumentCreator creator = new DocumentCreator(wikiModel, user, listOfTitleStrings);
            creator.setHeader(HTMLConstants.HTML_HEADER1 + HTMLConstants.CSS_SCREEN_STYLE + HTMLConstants.HTML_HEADER2);
            creator.setFooter(HTMLConstants.HTML_FOOTER);
            wikiModel.setUp();
            creator.renderToFile(generatedHTMLFilename);

        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e1) {
            e1.printStackTrace();
        } finally {
            if (db != null) {
                try {
                    db.tearDown();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void testCreator001() {
        testWikipediaENAPI("Tom Hanks");
    }

    public static void main(String[] args) {
        testCreator001();
    }
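
The caching behaviour described for this example (templates are fetched once and then served from the local store) can be pictured with a small in-memory sketch. TemplateCacheSketch is a hypothetical class. A HashMap stands in for the Derby database, and the downloader function stands in for the Wikipedia API call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of "download once, then reuse"; not bliki's WikiDB.
final class TemplateCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> downloader;
    private int downloads = 0;

    TemplateCacheSketch(Function<String, String> downloader) {
        this.downloader = downloader;
    }

    // Return the cached content if present; otherwise download it and remember it.
    String get(String title) {
        String content = cache.get(title);
        if (content == null) {
            content = downloader.apply(title);
            downloads++;
            cache.put(title, content);
        }
        return content;
    }

    int downloadCount() {
        return downloads;
    }

    public static void main(String[] args) {
        TemplateCacheSketch cache = new TemplateCacheSketch(title -> "content of " + title);
        cache.get("Template:Infobox");
        cache.get("Template:Infobox");
        System.out.println(cache.downloadCount()); // prints 1
    }
}
```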

Example: how to keep the IMG tag and ignore INPUTBOX in Wikipedia texts converted to HTML

Wiki text might contain <img> tags if MediaWiki allows raw HTML for images. You might also want to ignore some MediaWiki-specific tags that you cannot convert properly, e.g. <INPUTBOX>. To do so, use a custom configuration and call Configuration.addTokenTag() with HTMLTag or IgnoreTag handlers.

#!java
import info.bliki.wiki.model.WikiModel;
import info.bliki.wiki.model.Configuration;
import info.bliki.wiki.tags.*;
import java.io.*;

public class WikiParser {
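    public static void main(String[] args) {
        // Note: DEFAULT_CONFIGURATION is a shared static instance,
        // so the tags added below apply wherever it is used
        Configuration conf = Configuration.DEFAULT_CONFIGURATION;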
        // Allow custom user <IMG> tags
        conf.addTokenTag("img", new HTMLTag("img"));

        // Ignore custom <INPUTBOX> tags
        conf.addTokenTag("inputbox", new IgnoreTag("inputbox"));

        WikiModel wiki = new WikiModel(conf, "${image}", "${title}");

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            StringBuilder sb = new StringBuilder();
            String newLine = System.getProperty("line.separator");
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
                sb.append(newLine);
            }
            String htmlText = wiki.render(sb.toString());
            System.out.print(htmlText);

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
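
The effect of registering per-tag handlers can be illustrated with a standalone sketch that uses no bliki classes. TagPolicySketch is hypothetical and is not how bliki parses tags. It makes three assumptions for the sake of the example: a KEEP tag is emitted unchanged, an IGNORE tag is dropped together with its content, and an unregistered tag is escaped so that it appears as text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of per-tag policies; not bliki's parser.
final class TagPolicySketch {
    enum Policy { KEEP, IGNORE }

    // Matches an opening, closing, or self-closing tag such as <img src="x">, </b>, or <br/>.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    static String apply(String text, Map<String, Policy> policies) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(text);
        int pos = 0;
        while (m.find(pos)) {
            out.append(text, pos, m.start()); // plain text passes through unchanged
            String name = m.group(2).toLowerCase(Locale.ROOT);
            Policy policy = policies.get(name);
            pos = m.end();
            if (policy == Policy.KEEP) {
                out.append(m.group()); // registered as KEEP: emit the tag as-is
            } else if (policy == Policy.IGNORE) {
                boolean opening = m.group(1).isEmpty() && !m.group(3).endsWith("/");
                if (opening) {
                    // Skip everything up to and including the matching closing tag, if any.
                    Matcher close = Pattern
                            .compile("</" + name + "\\s*>", Pattern.CASE_INSENSITIVE)
                            .matcher(text);
                    if (close.find(pos)) {
                        pos = close.end();
                    }
                }
            } else {
                // Unregistered tag: escape it so it is displayed as text.
                out.append(m.group().replace("<", "&lt;").replace(">", "&gt;"));
            }
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Policy> policies = new HashMap<>();
        policies.put("img", Policy.KEEP);
        policies.put("inputbox", Policy.IGNORE);
        System.out.println(apply("a <IMG src=\"x.png\"> b <inputbox>type=search</inputbox> c", policies));
    }
}
```

The sketch does not handle nested elements of the same name. It is only meant to show how a lookup from tag name to handler decides what reaches the output.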
