User:Sj/wp-api

Wikisnap API v0.1

How to create a Wikipedia snapshot config file:

Basic idea


To generate the list of articles that go into the snapshot, we read the wiki markup of a 'config' wiki page and ignore everything that is not in an ordered or unordered list. This allows liberal commenting, and lets the list be layered on top of whatever other formatting the page uses.
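
The filtering step described above can be sketched as follows. This is only an illustration, assuming that a list line is any line starting with one or more `*` or `#` characters; the function name `extract_list_items` is hypothetical and not part of any existing Wikisnap code.

```python
import re

# Assumption: a list line begins with one or more '*' or '#' markers.
LIST_LINE = re.compile(r"^([*#]+)\s*(.*)$")


def extract_list_items(wikitext):
    """Return the text of every list line; all other lines are ignored."""
    items = []
    for line in wikitext.splitlines():
        match = LIST_LINE.match(line)
        if match and match.group(2).strip():
            items.append(match.group(2).strip())
    return items


sample = """== Science ==
A comment that the script ignores.
* Physics
# Chemistry
** Organic chemistry
"""

print(extract_list_items(sample))
```

Nesting depth is discarded here, so a nested item counts the same as a top-level one.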


Interpreted markup

*
#
''
'''


Supplemental pages


Image blacklist

  • two_MB_of_grass_growing.gif

A description/comment, ignored

That last item would kill "chart.png"


This format would work for any media type.


i18n page

* '''en''' word or phrase
* '''es''' palabra o frase

Each list (lists are separated by a blank line, i.e. a double newline) describes the localizations of one word or phrase. Phrases will be matched before substrings. No language acts as the key for a list, although consistently putting the same language first makes sense as a way of organizing the lists on a page alphabetically.
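
A minimal sketch of reading such a page, assuming each entry has the form `* '''lang''' phrase` and that blank lines separate the lists. The name `parse_i18n` and the returned structure (one dict per list) are illustrative assumptions, not a defined part of the API.

```python
import re

# Assumption: an entry is a list marker, a bold language code, then the phrase.
ENTRY = re.compile(r"^[*#]+\s*'''(?P<lang>[^']+)'''\s*(?P<phrase>.*)$")


def parse_i18n(wikitext):
    """Return one {language: phrase} dict per blank-line-separated list."""
    groups = []
    current = {}
    for line in wikitext.splitlines():
        if not line.strip():
            # A blank line closes the current list, if there is one.
            if current:
                groups.append(current)
                current = {}
            continue
        match = ENTRY.match(line)
        if match:  # non-list lines are comments and are skipped
            current[match.group("lang")] = match.group("phrase").strip()
    if current:
        groups.append(current)
    return groups


sample = """* '''en''' word or phrase
* '''es''' palabra o frase

A comment line, ignored.

* '''en''' Wikipedia
** '''simple''' Wikipedia
*** '''pt''' Wikipédia
"""

for group in parse_i18n(sample):
    print(group)
```

Because only the leading markers are stripped, flat and nested lists produce the same result.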

Having an explicit page like this avoids interlanguage link ambiguities and inconsistencies between the one-way links between two languages. Such a page could easily be seeded by a bot from a list in one language and then tweaked. Note: this page will have roughly one line for every article in the all-language snapshot.

Examples


Ex. 1

  • en Wikipedia
  • simple Wikipedia
  • pt Wikipédia

is preferred to

  • en Wikipedia
    • simple Wikipedia
      • pt Wikipédia

though both are equally valid

Ex. 2

  • es Wikipedia, la enciclopedia libre
  • en Wikipedia, the free encyclopedia
Italics as a flag for case-sensitive/exact matches
  • ar
  • th


Ex. 3

  1. en disambiguation
    1. pt desambigua%C3%A7%C3%A3o

(equivalent to)

    1. pt desambiguação

While MediaWiki creatively interprets the markup (restarting the numbering at 1), the above looks fine from our script's point of view.
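
The equivalence in Ex. 3 between the percent-encoded and plain titles can be checked with Python's standard `urllib.parse.unquote`. The wrapper name `normalize_title` is hypothetical, and decoding as UTF-8 is an assumption.

```python
from urllib.parse import unquote


def normalize_title(title):
    """Decode percent-escapes (assumed UTF-8) so both spellings compare equal."""
    return unquote(title)


encoded = "desambigua%C3%A7%C3%A3o"
plain = "desambiguação"

print(normalize_title(encoded) == normalize_title(plain))
```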

Ex. 4

  1. en Physics
  2. th none

(to blacklist the Thai page)

  1. es Fisica
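
One way the `none` marker from Ex. 4 might be handled is sketched below, assuming a list has already been parsed into a language-to-phrase mapping. The function `split_blacklisted` and the case-insensitive comparison are assumptions made for illustration.

```python
def split_blacklisted(group, marker="none"):
    """Split a {language: phrase} mapping into kept entries and blacklisted languages."""
    kept = {}
    blocked = []
    for lang, phrase in group.items():
        # Assumption: the marker is compared case-insensitively.
        if phrase.strip().lower() == marker:
            blocked.append(lang)
        else:
            kept[lang] = phrase
    return kept, sorted(blocked)


group = {"en": "Physics", "th": "none", "es": "Fisica"}
kept, blocked = split_blacklisted(group)
print(kept)
print(blocked)
```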

Other needed pages


We need pages for:

  1. MediaWiki verbiage (e.g. 'navigation', 'search')
  2. Common in-article verbiage (e.g. 'See also', 'External links')
  3. Our verbiage (e.g. 'OLPC Digital Library')
  4. Our index page article category headers
  5. Header/footer text and formatting, other envelope text, and page design/css per language