This is a small set of tools to help folks who want to import
XML dumps into a local instance of MediaWiki.

Some scripts are provided as well in the 'scripts' subdirectory;
please see the separate README.scripts there for more information.

Tools:

mwxml2sql

This is a bare bones converter of MediaWiki XML dumps to sql.
It does one thing and hopefully at least doesn't suck at it.
If you want the swiss army knife of converters, better shop
elsewhere. This program is not intended to process upload xml
elements, nor logging xml elements.  It works with the
xml dumps from the Wikimedia projects and that is all
that it's intended to do.

To install and run this program, you will need to have the libz and
bz2 development libraries installed, as well as the gcc
toolchain or an equivalent C compiler and its supporting
libraries. You'll also need the 'make' utility.

sql2txt

This converts sql produced by mysqldump, or by an XML dump to sql
file converter such as mwxml2sql, into a format that can be read by
MySQL via LOAD DATA INFILE, a much faster process than having mysql
do a pile of INSERTs. If you feed it data formatted differently from that
output, or with rows broken across lines etc, it may produce garbage.
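
For the curious, here is a rough sketch of what "a format that can be
read by LOAD DATA INFILE" means.  It follows MySQL's documented default
for that statement (tab between fields, newline after each row,
backslash escapes, \N for NULL).  It is NOT the sql2txt source and the
exact output of sql2txt may differ (quoting of strings, for example);
the function names are made up for illustration.

```python
def escape_field(value):
    """Escape one column value for MySQL's default LOAD DATA INFILE format."""
    if value is None:
        return "\\N"  # MySQL reads \N as NULL
    text = str(value)
    # Escape the backslash first so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    text = text.replace("\t", "\\t")
    text = text.replace("\n", "\\n")
    text = text.replace("\0", "\\0")
    return text


def to_line(row):
    """Turn one tuple of column values into one tab delimited line."""
    return "\t".join(escape_field(value) for value in row) + "\n"


if __name__ == "__main__":
    # A made-up row: an id, a title containing a tab, and a NULL column.
    print(to_line((1, "a\tb", None)), end="")
```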

To install and run this program, you will need to have the libz and
bz2 development libraries installed, as well as the gcc
toolchain or an equivalent C compiler and its supporting
libraries. You'll also need the 'make' utility.

sqlfilter

This filters an sql dump against sets of values per specified columns
in each sql tuple (row) and writes out only the tuples that match.
You can use this to write out all rows in the pages table that are
in a given namespace, or all rows in the category links table
that have page ids in a given list, etc.

To install and run this program, you will need to have the libz and
bz2 development libraries installed, as well as the gcc
toolchain or an equivalent C compiler and its supporting
libraries. You'll also need the 'make' utility.

INSTALLATION

Unpack the distribution someplace, cd into it, 'make'
and if you like 'make install'.  The binaries will be
installed into /usr/local/bin.  If you want them someplace
else, edit the Makefile and change the line
'PREFIX=/usr/local'  as desired.

RUNNING

A. mwxml2sql

'mwxml2sql --help' will give you a comprehensive usage
message.

TL;DR version:  you need a page content xml file and a stubs xml
file (each bz2 or gz compressed) for input; sql files for the page,
revision and text tables will be produced on output.

Example use:

mwxml2sql -v -s testing/enwiki-stubs-1114-1300.gz \
         -t testing/enwiki-pages-1114-1300.gz  \
         -f testing/sql_enwiki-1114-1300.gz -p t_ -m 1.20

In this example, we have:
-v: print progress messages
-s: stubs file for input
-t: text (page content) file for input
-f: filename used to build the page/revision/text output filenames;
    in this case the files sql_enwiki-1114-1300-page.sql.gz,
    sql_enwiki-1114-1300-revs.sql.gz and sql_enwiki-1114-1300-text.sql.gz
    are produced
-p: prefix of tables in your database (corresponds to $wgDBprefix
    in LocalSettings.php; leave out this option if you have no prefix)
-m: MediaWiki version for which to produce output
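
How the three output names relate to the -f argument can be sketched
like this.  The rule is inferred only from the single example above
(compression suffix split off, '-page', '-revs' or '-text' plus '.sql'
inserted before it); that the files land next to the -f path is an
assumption too.  Check 'mwxml2sql --help' for what the program really
does.

```python
import posixpath


def output_names(f_arg):
    """Guess the page/revs/text output names built from the -f argument."""
    directory, base = posixpath.split(f_arg)
    stem, ext = base, ""
    for known in (".gz", ".bz2"):
        if base.endswith(known):
            # Split off the compression suffix so it can go back on the end.
            stem, ext = base[: -len(known)], known
            break
    return [
        posixpath.join(directory, f"{stem}-{table}.sql{ext}")
        for table in ("page", "revs", "text")
    ]


if __name__ == "__main__":
    for name in output_names("testing/sql_enwiki-1114-1300.gz"):
        print(name)
```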

This program took about 1.5 hours to run on the en wikipedia
stubs and page files from December 2012, producing gzipped output
files for MediaWiki version 1.20.  Total pages (and therefore
revisions) processed: 12680605. Hardware: Sony laptop running a
linux desktop (Xfce, gnome, evolution and pidgin running but not
actively doing other tasks).  Dual core P8700, 2.53GHz, 8GB memory,
HTS723232L9SA60 disk at 7200 rpm.  Size of the gzipped output files:
460M for page table, 905M for revision table, 9.6GB for text table.

I ran mwdumper on this same file on the same hardware and it took
just over two hours.  Somewhere there is a small memory leak,
not enough to concern someone working with the current page content
only.

A run of mwimport (perl script) on the same hardware took slightly
over two hours.

B. sql2txt

'sql2txt --help' will give you a comprehensive usage
message.

TL;DR version:  you need a possibly compressed (gz or bz2) file
containing sql insert statements; provide this to sql2txt and
save the tab delimited and escaped results to a (possibly
compressed) file.

Example use:

sql2txt  -v -s testing/enwiki-iwlinks.sql.bz2 \
         -t testing/enwiki-iwlinks.tabs.bz2

In this example, we have:
-v: print progress messages
-s: sql file for input
-t: tab delimited text file for output

C. sqlfilter

'sqlfilter --help' will give you a comprehensive usage
message.

TL;DR version:  you need a possibly compressed (gz or bz2) file
containing sql insert statements, along with a file of columns and
values to match; provide this to sqlfilter and save the results to
a (possibly compressed) file.

Example use:

sqlfilter -s enwqdump/enwikiquote-20130405-pagelinks.sql.gz \
	  -o testout.gz -f enwq/wikiquote-en-2013-04-09-192958-pageids.gz

In this example, we have:
-s: sql file for input
-o: filtered sql output
-f: file of columns and values to match
-v: print progress messages (not used in the example above)

Warning: this program expects one INSERT statement per line,
possibly with multiple tuples in it. You cannot split the statement
across multiple lines and get anything reasonable out.  Anything
not an INSERT statement will be passed through and written unaltered.
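
The matching itself can be pictured with the sketch below, which works
on rows that have already been parsed out of an INSERT statement.  It
is not the sqlfilter source.  The function name, the use of 0-based
column numbers and the choice to require a match on every listed
column are all assumptions made for illustration.

```python
def filter_rows(rows, wanted):
    """Keep the rows that match on every listed column.

    rows:   iterable of tuples, one per sql tuple
    wanted: dict mapping a 0-based column number to the set of
            values accepted for that column
    """
    return [
        row
        for row in rows
        if all(row[col] in values for col, values in wanted.items())
    ]


if __name__ == "__main__":
    # Made-up page rows: (page_id, page_namespace, page_title)
    pages = [(1, 0, "Main_Page"), (2, 14, "Physics"), (3, 0, "Foo")]
    # Keep only rows in namespace 0.
    for row in filter_rows(pages, {1: {0}}):
        print(row)
```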

USING THE OUTPUT

Making a local mirror with all current pages

The sql files produced keep the page, revision and text ids of the
input XML files.  This means that if your database uses only XML
files from a single source and project (e.g. only the WMF en wikipedia
dumps), you will not have an id conflict but if you edit locally
between imports or you import on top of an existing database you
will encounter problems.

Convert these files and all other needed sql tables to the format
needed for LOAD DATA INFILE, using the command

zcat blah.sql.gz | sql2txt > blah.gz

(see 'RUNNING' for more on this command).  You may decide you would
rather use bz2, which will be slower but will save a lot of disk space.

You can skip converting the image, imagelinks, oldimage and
user_groups tables.  Loading in the other tables saves you from
needing to rebuild things like the links tables afterwards
(horribly slow).

Make sure the database is set up with the right character set
(we like 'binary'), set your client character set (utf-8) and
change any settings you want to for speed (turning off for
example foreign_key_checks and/or unique_checks).
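
As a sketch of the settings just mentioned, the helper below builds
the statements you might send to mysql for one table.  The helper name,
file path and table name are placeholders.  The LOAD DATA INFILE
statement uses MySQL's defaults and reads an uncompressed file on the
server side; depending on what sql2txt wrote you may need to add a
FIELDS clause, so treat this as a starting point only.

```python
def load_statements(data_file, table):
    """Build the sql for loading one tab delimited file into one table."""
    return [
        "SET NAMES utf8;",              # client character set
        "SET foreign_key_checks = 0;",  # speed: skip foreign key checks
        "SET unique_checks = 0;",       # speed: skip uniqueness checks
        # Naive quoting: assumes the path contains no quote characters.
        f"LOAD DATA INFILE '{data_file}' INTO TABLE {table};",
        "SET unique_checks = 1;",
        "SET foreign_key_checks = 1;",
    ]


if __name__ == "__main__":
    # Print the statements so they can be piped into the mysql client.
    print("\n".join(load_statements("/tmp/page.txt", "t_page")))
```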

Now load your files in one at a time. (Tested on MySQL 5.5,
probably ok on 5.1, probably broken on 4.x.) If you are working
with large (> 500MB) files you may want to feed them to mysql
in chunks.  You can do this by using the fifo_to_mysql.pl
perl script; see the README.scripts in the scripts directory for
more on this command.

If you don't have a ton of space you may want to convert your
sql files one at a time and remove the output file after
MySQL has read it in.
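
One way to arrange that is sketched below.  The sql2txt call uses only
the -s and -t options shown under 'RUNNING'; the '.txt' naming and the
'load' callable (whatever you use to feed a file to MySQL) are
assumptions for illustration.

```python
import os
import subprocess


def convert_with_sql2txt(sql_path, txt_path):
    """Run sql2txt on one file (-s input, -t output, as under 'RUNNING')."""
    subprocess.run(["sql2txt", "-s", sql_path, "-t", txt_path], check=True)


def process_one_at_a_time(sql_files, convert, load):
    """Convert, load and clean up each file before touching the next."""
    for sql_path in sql_files:
        txt_path = sql_path + ".txt"  # hypothetical naming scheme
        convert(sql_path, txt_path)
        load(txt_path)  # if this raises, the file is left for inspection
        os.remove(txt_path)  # free the space only after a successful load
```

You would call it as process_one_at_a_time(files, convert_with_sql2txt,
your_load_function), so that at most one converted file sits on disk
at a time.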

Note that the other sql tables are not guaranteed to be 100% consistent
with the XML files, since they are produced at different times and we
don't lock the tables before dumping.  Null edits to a page ought to
fix any rendering issues it may have based on out of sync tables.

Making a local mirror with a subset of current pages

If you want to import only a subset of the current pages from a wiki
instead of the entire thing, see the scripts in the 'scripts' subdirectory
for how you might do that.  Another alternative is to use importDump.php
depending on how much content you want to import.

If you want to import files while keeping pre-existing content, these
are NOT the tools for you; they will cause mysql to whine
about index conflicts due to page, revision or text ids already in
your database that are included in your sql files to be loaded. Consider
using maintenance/importDump.php (in your MediaWiki installation
directory) instead.

WARNINGS

This has been tested only for output MediaWiki 1.20 and input xsd 0.7
and not very comprehensively.  Please help discover bugs!

Report bugs to <https://bugzilla.wikimedia.org/>.

This does NOT support dumps from wikis with LiquidThreads enabled.
That's a feature set for a future version (or not).

Other untested features: importing output with gzipped text revisions,
using history xml dumps.

Changes to the xml schema, in particular new tags or a switch in
the order of items, will break the script.

LICENSE

The files sha1.c and sha1.h are released by Christophe Devine under
GPLv2 (see the file COPYING in this directory).  His web site is
no longer available and the code has since been folded into many
other projects but you can find it via archive.org:
http://web.archive.org/web/20031123112259/http://www.cr0.net:8040/code/crypto/sha1/

The file uthash.h is Copyright (c) 2003-2012, Troy D. Hanson, with
permission to use, distribute, and modify subject to attribution
and replication of the license; see the top of the file for license
details. The web site for his code is http://uthash.sourceforge.net

The remaining files are copyright Ariel Glenn 2013 and also released
under the GPLv2 (see again the file COPYING in this directory).
