Parse::MediaWikiDump.3pm

Language: en

Version: 2008-10-29 (ubuntu - 08/07/09)

Section: 3 (Library functions)

NAME

Parse::MediaWikiDump - Tools to process MediaWiki dump files

SYNOPSIS

   use Parse::MediaWikiDump;
 
   $source = 'dump_filename.ext';
   $source = \*FILEHANDLE;
 
   $pages = Parse::MediaWikiDump::Pages->new($source);
   $links = Parse::MediaWikiDump::Links->new($source);
 
   #get all the records from the dump files, one record at a time
   while(defined($page = $pages->next)) {
     print "title '", $page->title, "' id ", $page->id, "\n";
   }
 
   while(defined($link = $links->next)) {
     print "link from ", $link->from, " to ", $link->to, "\n";
   }
 
   #information about the page dump file
   $pages->sitename;
   $pages->base;
   $pages->generator;
   $pages->case;
   $pages->namespaces;
   $pages->namespaces_names;
   $pages->current_byte;
   $pages->size;
 
   #information about a page record
   $page->redirect;
   $page->categories;
   $page->title;
   $page->namespace;
   $page->id;
   $page->revision_id;
   $page->timestamp;
   $page->username;
   $page->userid;
   $page->minor;
   $page->text;
 
   #information about a link
   $link->from;
   $link->to;
   $link->namespace;
 
 

DESCRIPTION

This module provides the tools needed to process the contents of various MediaWiki dump files.

USAGE

To use this module you must create an instance of a parser for the type of dump file you are trying to parse. The current parsers are:
Parse::MediaWikiDump::Pages
Parse the contents of the page archive.
Parse::MediaWikiDump::Links
Parse the contents of the links dump file.

General

Both parsers require an argument to new that is a location of source data to parse; this argument can be either a filename or a reference to an already open filehandle. This entire software suite will die() upon errors in the file, inconsistencies on the stack, etc. If this concerns you then you can wrap the portion of your code that uses these calls with eval().
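Since the suite die()s on malformed input, a batch job can trap failures with eval() and continue. A minimal sketch, using a hypothetical stand-in parse_dump in place of the real constructor call:

```perl
use strict;
use warnings;

# Stand-in for a parser call that die()s on a malformed dump file;
# a real Parse::MediaWikiDump::Pages->new($source) call would be
# wrapped the same way.
sub parse_dump { die "inconsistency on the stack\n" }

my $pages = eval { parse_dump('broken_dump.xml') };
if (!defined($pages)) {
    warn "skipping dump: $@";   # recover instead of aborting
}
```

On failure, eval() returns undef and the error message is available in $@.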

Parse::MediaWikiDump::Pages

It is possible to create a Parse::MediaWikiDump::Pages object two ways:
$pages = Parse::MediaWikiDump::Pages->new($filename);
$pages = Parse::MediaWikiDump::Pages->new(\*FH);

After creation the following methods are available:

$pages->next
Returns the next available record from the dump file if it is available, otherwise returns undef. Records returned are instances of Parse::MediaWikiDump::page; see below for information on those objects.
$pages->sitename
Returns the plain-text name of the instance the dump is from.
$pages->base
Returns the base url to the website of the instance.
$pages->generator
Returns the version of the software that generated the file.
$pages->case
Returns the case-sensitivity configuration of the instance.
$pages->namespaces
Returns an array reference to the list of namespaces in the instance. Each namespace is stored as an array reference which has two items; the first is the namespace number and the second is the namespace name. In the case of namespace 0 the text stored for the name is ''.
$pages->namespaces_names
Returns an array reference to a list of namespace names only, without the namespace numbers; this is a single-dimensional array of plain text strings. The main namespace name is ''.
$pages->current_byte
Returns the number of bytes parsed so far.
$pages->size
Returns the size of the dump file in bytes.
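The pairs returned by namespaces can be turned into a number-to-name lookup table with a few lines of Perl. A sketch using hard-coded sample pairs in place of a real $pages->namespaces result:

```perl
use strict;
use warnings;

# Sample of the structure returned by $pages->namespaces: an array
# reference of [number, name] pairs, with '' for the main namespace.
# These entries are illustrative, not a complete namespace list.
my $namespaces = [ [0, ''], [1, 'Talk'], [2, 'User'] ];

# Build a number-to-name lookup table from the pairs.
my %name_of = map { ($_->[0] => $_->[1]) } @$namespaces;

for my $ns (@$namespaces) {
    my ($number, $name) = @$ns;
    printf "%d => '%s'\n", $number, $name;
}
```

The same table can then be used to map the numeric namespace of a link record back to its name.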

Parse::MediaWikiDump::page

The Parse::MediaWikiDump::page object represents a distinct MediaWiki page, article, module, or other record. These objects are returned by the next() method of a Parse::MediaWikiDump::Pages instance. The scalar returned is a reference to a hash that contains all the data of the page in a straightforward manner. While it is possible to access this hash directly, and doing so involves less overhead than using the methods below, it is beyond the scope of the interface and is undocumented.

Some of the methods below, such as namespace, redirect, and categories, require additional processing. In these cases the result is cached inside the object so the processing does not have to be redone. This is transparent to you; you do not need to optimize calls to these methods to limit processing overhead.

The following methods are available:

$page->id
$page->title
$page->namespace
Returns an empty string ('') for the main namespace, or a string containing the name of the namespace.
$page->text
A reference to a scalar containing the plaintext of the page.
$page->redirect
The plain text name of the article redirected to or undef if the page is not a redirect.
$page->categories
Returns a reference to an array that contains a list of categories or undef if there are no categories. This method does not understand templates and may not return all the categories the article actually belongs in.
$page->revision_id
$page->timestamp
$page->username
$page->userid
$page->minor

Parse::MediaWikiDump::Links

This parser also accepts either a filename or a reference to an already open filehandle. For example:
   $links = Parse::MediaWikiDump::Links->new($filename);
   $links = Parse::MediaWikiDump::Links->new(\*FH);
 
 

It is then possible to extract links one at a time using the next method, which returns an instance of Parse::MediaWikiDump::link, or undef when there is no more data. For instance:

   while(defined($link = $links->next)) {
     print 'from ', $link->from, ' to ', $link->to, "\n";
   }
 
 

Parse::MediaWikiDump::link

Instances of this class are returned by the next method of a Parse::MediaWikiDump::Links instance. The following methods are available:

$link->from
The numerical id of the page the link was found in.
$link->to
The plain text name of the page the link points to, minus the namespace.
$link->namespace
The numerical id of the namespace the link points to.
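To illustrate these accessors, here is a sketch that tallies links per target namespace. The Link package below is a hypothetical stand-in that mimics the accessors of Parse::MediaWikiDump::link; real code would read records from $links->next instead of a hard-coded list:

```perl
use strict;
use warnings;

package Link;
# Hypothetical stand-in mimicking the from/to/namespace accessors
# of Parse::MediaWikiDump::link.
sub new { my ($class, %f) = @_; return bless {%f}, $class }
sub from      { $_[0]{from} }
sub to        { $_[0]{to} }
sub namespace { $_[0]{namespace} }

package main;

my @records = (
    Link->new(from => 10, to => 'Perl',      namespace => 0),
    Link->new(from => 10, to => 'Perl',      namespace => 1),
    Link->new(from => 12, to => 'MediaWiki', namespace => 0),
);

# Count how many links point into each namespace.
my %per_namespace;
$per_namespace{ $_->namespace }++ for @records;

print "namespace $_: $per_namespace{$_} links\n"
    for sort { $a <=> $b } keys %per_namespace;
```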

EXAMPLES

Extract the article text for a given title

   #!/usr/bin/perl
   
   use strict;
   use warnings;
   use Parse::MediaWikiDump;
   
   my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
   my $title = shift(@ARGV) or die "must specify an article title";
   my $dump = Parse::MediaWikiDump::Pages->new($file);
   
   binmode(STDOUT, ':utf8');
   binmode(STDERR, ':utf8');
   
   #this is the only currently known value but there could be more in the future
   if ($dump->case ne 'first-letter') {
     die "unable to handle any case setting besides 'first-letter'";
   }
   
   $title = case_fixer($title);
   
   while(my $page = $dump->next) {
     if ($page->title eq $title) {
       print STDERR "Located text for $title\n";
       my $text = $page->text;
       print $$text;
       exit 0;
     }
   }
   
   print STDERR "Unable to find article text for $title\n";
   exit 1;
   
   #removes any case sensitivity from the very first letter of the title
   #but not from the optional namespace name
   sub case_fixer {
     my $title = shift;
   
     #check for namespace
     if ($title =~ /^(.+?):(.+)/) {
       $title = $1 . ':' . ucfirst($2);
     } else {
       $title = ucfirst($title);
     }
   
     return $title;
   }
 
 

Scan the dump file for double redirects

   #!/usr/bin/perl
   
   #progress information goes to STDERR, a list of double redirects found
   #goes to STDOUT
   
   binmode(STDOUT, ":utf8");
   binmode(STDERR, ":utf8");
   
   use strict;
   use warnings;
   use Parse::MediaWikiDump;
   
   my $file = shift(@ARGV);
   my $pages;
   my $page;
   my %redirs;
   my $artcount = 0;
   my $file_size;
   my $start = time;
   
   if (defined($file)) {
         $file_size = (stat($file))[7];
         $pages = Parse::MediaWikiDump::Pages->new($file);
   } else {
         print STDERR "No file specified, using standard input\n";
         $pages = Parse::MediaWikiDump::Pages->new(\*STDIN);
   }
   
   #the case of the first letter of titles is ignored - force this option
   #because the other values of the case setting are unknown
   die 'this program only supports the first-letter case setting' unless
         $pages->case eq 'first-letter';
   
   print STDERR "Analyzing articles:\n";
   
   while(defined($page = $pages->next)) {
     update_ui() if ++$artcount % 500 == 0;
   
     #main namespace only
     next unless $page->namespace eq '';
     next unless defined($page->redirect);
   
     my $title = case_fixer($page->title);
     #create a list of redirects indexed by their original name
     $redirs{$title} = case_fixer($page->redirect);
   }
   
   my $redir_count = scalar(keys(%redirs));
   print STDERR "done; searching $redir_count redirects:\n";
   
   my $count = 0;
   
   #if a redirect location is also a key to the index we have a double redirect
   foreach my $key (keys(%redirs)) {
     my $redirect = $redirs{$key};
   
     if (defined($redirs{$redirect})) {
       print "$key\n";
       $count++;
     }
   }
   
   print STDERR "discovered $count double redirects\n";
   
   #removes any case sensitivity from the very first letter of the title
   #but not from the optional namespace name
   sub case_fixer {
     my $title = shift;
   
     #check for namespace
     if ($title =~ /^(.+?):(.+)/) {
       $title = $1 . ':' . ucfirst($2);
     } else {
       $title = ucfirst($title);
     }
   
     return $title;
   }
   
   sub pretty_bytes {
     my $bytes = shift;
     my $pretty = int($bytes) . ' bytes';
   
     if (($bytes = $bytes / 1024) > 1) {
       $pretty = int($bytes) . ' kilobytes';
     }
   
     if (($bytes = $bytes / 1024) > 1) {
       $pretty = sprintf("%0.2f", $bytes) . ' megabytes';
     }
   
     if (($bytes = $bytes / 1024) > 1) {
       $pretty = sprintf("%0.4f", $bytes) . ' gigabytes';
     }
   
     return $pretty;
   }
   
   sub pretty_number {
     my $number = reverse(shift);
     $number =~ s/(...)/$1,/g;
     $number = reverse($number);
     $number =~ s/^,//;
   
     return $number;
   }
   
   sub update_ui {
     my $seconds = time - $start;
     my $bytes = $pages->current_byte;
   
     print STDERR "  ", pretty_number($artcount),  " articles; "; 
     print STDERR pretty_bytes($bytes), " processed; ";
   
     if (defined($file_size)) {
       my $percent = int($bytes / $file_size * 100);
   
       print STDERR "$percent% completed\n"; 
     } else {
       my $bytes_per_second = int($bytes / $seconds);
       print STDERR pretty_bytes($bytes_per_second), " per second\n";
     }
   }
 
 

TODO

Support comprehensive dump files
Currently the full page dump files (such as 20050909_pages_full.xml.gz) are not supported.
Optimization
It would be nice to increase the processing speed of the XML files. Current ideas:
Move to arrays instead of hashes for base objects
Currently the base types for the majority of the classes are hashes. The majority of these could be changed to arrays and numerical constants instead of using hashes.
Stackless parsing
Placing each XML token on the stack is probably quite time consuming. It may be better to move to a stackless system where the XML parser is given a new set of callbacks to use when it encounters each specific token.

AUTHOR

This module was created, documented, and is maintained by Tyler Riddle <triddle@gmail.com>.

Fix for bug 36255 "Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace" provided by Amir E. Aharoni.

BUGS

Please report any bugs or feature requests to "bug-parse-mediawikidump@rt.cpan.org", or through the web interface at <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-MediaWikiDump>. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Known Bugs

#38206 "Parse::MediaWikiDump XML dump file not closed on DESTROY"
There is a memory leak in the XML dump file parser that causes an instance of the parser never to be garbage collected, even if it goes completely out of scope. This bug rears its head when you open and close many different dump files or when you are trying to free all memory used by this module. Resolution time for this bug is currently unestimated.
COPYRIGHT & LICENSE

Copyright 2005 Tyler Riddle, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.