turn files with wikipedia syntax to html (simple python script using mediawiki api)

Fr, 10/29/2010 - 20:19 — Draketo

I needed to convert a huge batch of mediawiki-files to html (had a 2010-03 copy of the now dead limewire wiki lying around). With a tip from RoanKattouw in #mediawiki@freenode.net I created a simple python script to convert arbitrary files from mediawiki syntax to html.

Usage:

Download the script and install the dependencies (yaml and python 3).
./parse_wikipedia_files_to_html.py <files>

This script is neither written for speed or anything (do you know how slow a webrequest is, compared to even horribly inefficient code? …): The only optimization is for programming convenience — the advantage of that is that it’s just 47 lines of code :)

It also isn’t perfect: it breaks at some pages (and informs you about that).

It requires yaml and Python 3.x.

#!/usr/bin/env python3

"""Simply turn all input files to html. 
No errorchecking, so keep backups. 
It uses the mediawiki webapi, 
so you need to be online.

Copyright: 2010 © Arne Babenhauserheide
License: You can use this under the GPLv3 or later, 
         if you add the appropriate license files
         → http://gnu.org/licenses/gpl.html
"""

from urllib.request import urlopen
from urllib.parse import quote
from urllib.error import HTTPError, URLError
from time import sleep
from random import random
from yaml import load
from sys import argv

mediawiki_files = argv[1:]

def wikitext_to_html(text):
    """parse text in mediawiki markup to html."""
    url = "http://en.wikipedia.org/w/api.php?action=parse&format=yaml&text=" + quote(text, safe="") + " "
    f = urlopen(url)
    y = f.read()
    f.close()
    text = load(y)["parse"]["text"]["*"]
    return text

for mf in mediawiki_files:
    with open(mf) as f:
        text = f.read()
    HTML_HEADER = "<html><head><title>" + mf + "</title></head><body>"
    HTML_FOOTER = "</body></html>"
    try: 
        text = wikitext_to_html(text)
        with open(mf, "w") as f:
            f.write(HTML_HEADER)
            f.write(text)
            f.write(HTML_FOOTER)
    except HTTPError:
        print("Error converting file", mf)
    except URLError:
        print("Server doesn’t like us :(", mf)
        sleep(10*random())
    # add a random wait, so the api server doesn’t kick us
    sleep(3*random())

Anhang	Größe
parse_wikipedia_files_to_html.py.txt	1.47 KB

Druckversion
Login to post comments

Use Node:

⚙ Babcom is trying to load the comments ⚙

This textbox will disappear when the comments have been loaded.

If the box below shows an error-page, you need to install Freenet with the Sone-Plugin or set the node-path to your freenet node and click the Reload Comments button (or return).

If you see something like Invalid key: java.net.MalformedURLException: There is no @ in that URI! (Sone/search.html), you need to setup Sone and the Web of Trust

If you had Javascript enabled, you would see comments for this page instead of the Sone page of the sites author.

Note: To make a comment which isn’t a reply visible to others here, include a link to this site somewhere in the text of your comment. It will then show up here. To ensure that I get notified of your comment, also include my Sone-ID.

Link to this site and my Sone ID: sone://6~ZDYdvAgMoUfG6M5Kwi7SQqyS-gTcyFeaNN1Pf3FvY

This spam-resistant comment-field is made with babcom.

turn files with wikipedia syntax to html (simple python script using mediawiki api)

Beliebte Inhalte

Heute:

Zuletzt angezeigt:

Draketo neu: Beiträge

Draketo neu: Kommentare

Ein Würfel System

Sep. 12th decides the fate of the internet in the EU!