Wikipedia:Reference desk/Archives/Computing/Early/ParseMediaWikiDump

This page is currently inactive and is retained for historical reference.
Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.

Parse::MediaWikiDump is a Perl module created by Triddle that makes it easy to access the information in a MediaWiki dump file. Its successor, MediaWiki::DumpFile, was written by the same author and is also available on CPAN.

Download

The latest versions of Parse::MediaWikiDump and MediaWiki::DumpFile are available at https://metacpan.org/pod/Parse::MediaWikiDump and https://metacpan.org/pod/MediaWiki::DumpFile

Examples

Find uncategorized articles in the main namespace

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift(@ARGV) or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;

while(defined($page = $pages->next)) {
    #main namespace only
    next unless $page->namespace eq '';

    print $page->title, "\n" unless defined($page->categories);
}

Find double redirects in the main namespace

This program does not follow MediaWiki's case-sensitivity rules when matching article titles; see the documentation that comes with the module for a much more complete version of this program.

#!/usr/bin/perl -w

use strict;
use Parse::MediaWikiDump;

my $file = shift or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my %redirs;

while(defined(my $page = $pages->page)) {
    next unless $page->namespace eq '';
    next unless defined($page->redirect);

    my $title = $page->title;

    $redirs{$title} = $page->redirect;
}

while (my ($key, $redirect) = each(%redirs)) {
    if (defined($redirs{$redirect})) {
        print "$key\n";
    }
}
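
The case-sensitivity caveat for the double-redirect program can be illustrated with a small sketch. It assumes the wiki uses the default MediaWiki behaviour (as on Wikipedia), where underscores and spaces are interchangeable and only the first character of a title is case-insensitive; other wikis may be configured differently. The helper normalize_title is hypothetical and not part of Parse::MediaWikiDump, and it handles non-ASCII first letters only if the titles are decoded character strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::MediaWikiDump): fold a title
# into one canonical spelling before using it as a hash key.
sub normalize_title {
    my ($title) = @_;
    $title =~ tr/_/ /;          # underscores and spaces are equivalent
    $title =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
    return ucfirst $title;      # only the first character is case-insensitive
}

# Toy redirect table standing in for the %redirs hash built from a dump.
my %redirs;
$redirs{ normalize_title('Foo') }     = normalize_title('bar_baz');
$redirs{ normalize_title('Bar baz') } = normalize_title('Qux');

# Same double-redirect test as above, now on normalized keys.
for my $key (sort keys %redirs) {
    print "$key\n" if exists $redirs{ $redirs{$key} };
}
```

Without the normalization step, the redirect from "Foo" to "bar_baz" would not be recognized as pointing at the "Bar baz" redirect.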

Import only a certain category of pages

#!/usr/bin/perl

use strict;
use warnings;
use Parse::MediaWikiDump;
use DBI;
use DBD::mysql;

my $server   = "localhost";
my $name     = "dbname";
my $user     = "admin";
my $password = "pass";

my $dsn = "DBI:mysql:database=$name;host=$server;";
my $dbh = DBI->connect($dsn, $user, $password)
    or die "cannot connect to database: $DBI::errstr";

my $source = 'pages_articles.xml';

# The dump is read as a stream: pages are parsed one at a time in the loop below.
my $pages = Parse::MediaWikiDump::Pages->new($source);
print "Opened $source.\n";

while(defined(my $page = $pages->page)) {
    # categories is undef for a page without categories
    my $c = $page->categories or next;

    # Matches all categories with the string "Mathematics" anywhere in their text.
    # For exact match, use {$_ eq "Mathematics"}
    if (grep {/Mathematics/} @$c) {
        my $id    = $page->id;
        my $title = $page->title;
        my $text  = $page->text;   # reference to the wikitext; use $$text

        #$dbh->do("insert ..."); #details of SQL depend on the database setup

        print "title '$title' id $id was inserted.\n";
    }
}
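
The difference between the substring match and the exact match mentioned in the comment is easy to get wrong, so here is a minimal sketch on a made-up category list (the category names are illustrative only and do not come from a real dump). Note that the pattern as written is case-sensitive.

```perl
use strict;
use warnings;

# Made-up category list, in the same shape as the array reference
# returned by $page->categories in the script above.
my $c = [ 'Applied mathematics', 'Mathematics education', 'Physics' ];

my @substring = grep { /Mathematics/ } @$c;        # case-sensitive substring
my @any_case  = grep { /mathematics/i } @$c;       # case-insensitive substring
my @exact     = grep { $_ eq 'Mathematics' } @$c;  # whole category name only

print scalar(@substring), " substring, ",
      scalar(@any_case),  " any-case, ",
      scalar(@exact),     " exact\n";
```

Here the case-sensitive pattern misses "Applied mathematics", and the exact match finds nothing because no category is named just "Mathematics".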

Extract articles linked to important wikis but not to a specific one

The script checks whether an article contains interwiki links to :de, :es, :it, :ja and :nl but not to :fr. This helps with linking "popular" articles to a specific wiki, and it may also hint at which articles should be prioritized for translation.

#!/usr/bin/perl -w

# Code : Dake
use strict;
use Parse::MediaWikiDump;
use utf8;
    
my $file = shift(@ARGV) or die "must specify a Mediawiki dump file";
my $pages = Parse::MediaWikiDump::Pages->new($file);
my $page;
    
binmode STDOUT, ":utf8";

while(defined($page = $pages->next)) {
    #main namespace only
    next unless $page->namespace eq '';

    my $text = $page->text;
    if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
        ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
        ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i))
     {
         print $page->title, "\n";
     }
}
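
If the set of languages changes often, the chained conditions above can be replaced by a list-driven check. The following is only a sketch: wanted_interwikis is a hypothetical helper, not part of Parse::MediaWikiDump, and the sample text is made up. It takes a reference to the wikitext, in the same form that $page->text is used above.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $$text links to every language in
# @$required and to none of the languages in @$excluded.
sub wanted_interwikis {
    my ($text, $required, $excluded) = @_;
    for my $lang (@$required) {
        return 0 unless $$text =~ /\[\[\Q$lang\E:/i;
    }
    for my $lang (@$excluded) {
        return 0 if $$text =~ /\[\[\Q$lang\E:/i;
    }
    return 1;
}

# Made-up wikitext for illustration.
my $sample = "Text [[de:Haus]] [[es:Casa]] [[it:Casa]] [[ja:X]] [[nl:Huis]]";

print wanted_interwikis(\$sample, [qw(de es it ja nl)], ['fr'])
    ? "match\n"
    : "no match\n";
```

Inside the main loop, the condition would become wanted_interwikis($text, [qw(de es it ja nl)], ['fr']).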

Related software

Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps, builds link tables and category hierarchies, collects anchor text for each article, etc.
Wikipedia:WikiProject Interlanguage Links/Ideas from the Hebrew Wikipedia - a project in the Hebrew Wikipedia to add relevant interwiki (interlanguage) links to as many articles as possible. It uses Parse::MediaWikiDump for searching for pages without links. It is now being exported to other Wikipedias.
