info.bliki.wiki / MediaWikiDumpSupport

Helper classes to work with MediaWiki XML dump files.

How to get Wikipedia dumps

You can download a dump of all Wikipedia articles from the Database dump progress page.

I tested these helper classes with the dump listed under this headline:

Articles, templates, image descriptions, and primary meta-pages.

This contains current versions of article content, and is the archive most mirror sites will probably want.

Example how to print each wiki article's title and raw text

The demo application DumpExample.java iterates through a Wikipedia XML dump file and prints the title and raw wiki text of each article it contains. The dump file may be compressed or uncompressed; the format is selected from the file extension (.gz, .bz2 or .xml).

...
        static class DemoArticleFilter implements IArticleFilter {

                // Called once for every page found in the dump.
                public void process(WikiArticle page, Siteinfo siteinfo) throws SAXException {
                        System.out.println("----------------------------------------");
                        System.out.println(page.getTitle());
                        System.out.println("----------------------------------------");
                        System.out.println(page.getText());
                }

        }
...
        public static void main(String[] args) {
                if (args.length != 1) {
                        System.err.println("Usage: Parser <XML-FILE>");
                        System.exit(-1);
                }
                // Example:
                // String dumpFilename = "c:\\temp\\<the dump file name>.xml.bz2";
                // The file may end in .gz, .bz2 or .xml.
                String dumpFilename = args[0];
                try {
                        // The handler receives every page the parser reads from the dump.
                        IArticleFilter handler = new DemoArticleFilter();
                        WikiXMLParser wxp = new WikiXMLParser(dumpFilename, handler);
                        wxp.parse();
                } catch (Exception e) {
                        e.printStackTrace();
                }
        }
...
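
The extension-based handling mentioned above can be illustrated with a small standalone sketch. This is not the library's implementation; `DumpStreams`, `Format`, `detect` and `open` are hypothetical names used only for illustration. It relies on the JDK alone, which ships a gzip decompressor but no bzip2 one, so the `.bz2` case is only detected here and would need a third-party decompressor in practice.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the bliki API.
final class DumpStreams {

    enum Format { GZIP, BZIP2, XML }

    // Picks the dump format from the file extension, ignoring case.
    static Format detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) {
            return Format.GZIP;
        }
        if (lower.endsWith(".bz2")) {
            return Format.BZIP2;
        }
        if (lower.endsWith(".xml")) {
            return Format.XML;
        }
        throw new IllegalArgumentException("Unsupported dump file extension: " + fileName);
    }

    // Opens the dump as a stream of XML bytes.
    // Assumption: only .gz and .xml are handled, because the JDK has no bzip2 support.
    static InputStream open(Path file) throws IOException {
        Format format = detect(file.getFileName().toString());
        if (format == Format.BZIP2) {
            throw new UnsupportedOperationException(
                    "bzip2 requires a third-party decompressor: " + file);
        }
        InputStream raw = new BufferedInputStream(Files.newInputStream(file));
        return format == Format.GZIP ? new GZIPInputStream(raw) : raw;
    }
}
```

Checking the extension before opening the file means an unsupported name fails fast, before any bytes are read.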

Example how to print all wiki titles with no namespace prefix

    static class DemoMainArticleFilter implements IArticleFilter {

        public void process(WikiArticle page, Siteinfo siteinfo) throws SAXException {
            if (page.isMain()) {
                System.out.println(page.getTitle());
            }
        }

    }

Example how to print all wiki titles with template namespace

    static class DemoTemplateArticleFilter implements IArticleFilter {

        public void process(WikiArticle page, Siteinfo siteinfo) throws SAXException {
            if (page.isTemplate()) {
                System.out.println(page.getTitle());
            }
        }

    }

Example how to print all wiki titles with category namespace

    static class DemoCategoryArticleFilter implements IArticleFilter {

        public void process(WikiArticle page, Siteinfo siteinfo) throws SAXException {
            if (page.isCategory()) {
                System.out.println(page.getTitle());
            }
        }

    }
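
The three filters above distinguish pages by namespace. As a rough illustration of what "no namespace prefix" means, the following standalone sketch classifies a title by the text before its first colon. It is not how the library decides; `NamespaceSketch` and its hard-coded English namespace names are assumptions made for this example (a real dump declares its own namespace names in its site information).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the bliki API.
final class NamespaceSketch {

    // Assumed set of English namespace names, lower-cased for comparison.
    static final Set<String> KNOWN = new HashSet<>(
            Arrays.asList("template", "category", "talk", "user", "help"));

    // Returns the lower-cased namespace of a title, or "" for the main namespace.
    static String namespaceOf(String title) {
        int colon = title.indexOf(':');
        if (colon <= 0) {
            return "";
        }
        String prefix = title.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        // A colon alone does not make a namespace: the prefix must be a known name.
        return KNOWN.contains(prefix) ? prefix : "";
    }

    static boolean isMain(String title) {
        return namespaceOf(title).isEmpty();
    }

    static boolean isTemplate(String title) {
        return namespaceOf(title).equals("template");
    }

    static boolean isCategory(String title) {
        return namespaceOf(title).equals("category");
    }
}
```

Under these assumptions a title such as "2001: A Space Odyssey" still counts as a main-namespace page, because "2001" is not a known namespace name.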

Example how to convert all wiki pages from a dump into static HTML files

The Dump2HTMLCreatorExample.java example converts all pages from a MediaWiki dump into static HTML files by iterating through the dump twice:

  • in the first pass all templates are stored in a Derby database for faster retrieval
  • in the second pass all static HTML files are generated for each title
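
The two passes listed above can be sketched in a few lines of standalone Java. This is only an illustration of the idea, not the code in Dump2HTMLCreatorExample.java: an in-memory map stands in for the Derby database, a naive `{{Name}}` text replacement stands in for real wiki rendering, and `TwoPassSketch`, `Page`, `collectTemplates` and `renderPages` are hypothetical names. One plausible reason for a separate first pass is that a page may use a template that appears later in the dump.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of the bliki API.
final class TwoPassSketch {

    // Minimal stand-in for a page read from the dump.
    static final class Page {
        final String title;
        final String text;

        Page(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    static final String TEMPLATE_PREFIX = "Template:";

    // Pass 1: store every template body under its name (a map replaces the database here).
    static Map<String, String> collectTemplates(List<Page> dump) {
        Map<String, String> templates = new LinkedHashMap<>();
        for (Page page : dump) {
            if (page.title.startsWith(TEMPLATE_PREFIX)) {
                templates.put(page.title.substring(TEMPLATE_PREFIX.length()), page.text);
            }
        }
        return templates;
    }

    // Pass 2: expand parameterless {{Name}} calls and wrap each non-template page in HTML.
    // No wiki markup is rendered and nothing is HTML-escaped; a real converter does both.
    static Map<String, String> renderPages(List<Page> dump, Map<String, String> templates) {
        Map<String, String> htmlByTitle = new LinkedHashMap<>();
        for (Page page : dump) {
            if (page.title.startsWith(TEMPLATE_PREFIX)) {
                continue;
            }
            String body = page.text;
            for (Map.Entry<String, String> template : templates.entrySet()) {
                body = body.replace("{{" + template.getKey() + "}}", template.getValue());
            }
            htmlByTitle.put(page.title,
                    "<html><body><h1>" + page.title + "</h1><p>" + body + "</p></body></html>");
        }
        return htmlByTitle;
    }
}
```

In this sketch both passes walk the same in-memory list; with a real dump each pass would re-read the file, which is why the source describes iterating through the dump twice.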
