Hacking on Wikipedia's Link Graph

This was taken from my WikiGraph GitHub project's README file.

I've been meaning to hack on Wikipedia's link data so this is kind of the result of that. It's an ongoing process since there's so much data and many different things to do with it.

Note: I'm running this project on a Thinkpad T410 Intel i5 with 8GB of RAM. If you have less RAM, it's possible that the final graph data won't fit in memory. I'm hoping to spend some time researching & developing a kind of in-memory and on-disk graph structure. A graph database would probably be valuable for this dataset.

Also note: If you don't want to parse and generate gigabytes of data, I uploaded the core graph data that this project uses. You can download the files here.

You'll need both graph.tar.bz2 and keys.tar.bz2 and they need to be extracted to the wikigraph/data/ directory. If you just use those two files, the only script you need to use is graph.py.

Starting From Scratch

This project uses the April 2011 pages-articles.xml.bz2 database dump from http://en.wikipedia.org/wiki/Wikipedia:Database_download

After extracting the XML file, the first thing to do is replace the <mediawiki></mediawiki> tags with <xml></xml> tags so the SAX parser can do its job properly. This is how I did it:

$ perl -pi -e 's/<mediawiki[^>]+>/<xml>/' enwiki.xml
$ perl -pi -e 's/<\/mediawiki>/<\/xml>/' enwiki.xml

enwiki.xml is just a renamed pages-articles.xml file.

The next step is to run the SAX parser on the XML file. Since it's massive, it'll take quite awhile. It takes my T410 Intel i5 laptop about 40 minutes or so. The RAM footprint is negligible since it's a SAX parser. I was running at 100% CPU with 0.1% RAM used.

You'll have to modify the filename at the bottom of saxparser.py since it's hardcoded for my path at the moment.

Once it's done parsing, there will be a file called "pages_links". That's the raw graph data formatted like so:

"page0","link00","link01","link02",...,"link0N"\n
"page1","link10","link11","link12",...,"link1N"\n
...

Each page has a list of links following it from that page. This is a page's adjacency list.

Parse Page Titles

The next script to run is keys.py. It will parse pages_links for page titles by using csv.reader(f, delimiter=',', quotechar='"') and reading line by line. The first element in a given row/line is the page title. Each page title is cleaned up a bit and some pages (Files:/Wikipedia:/Category:) are ignored completely. The page title is then used as a key into a defaultdict(int) object so I only have unique keys. I then sort all the keys and write them to disk in the file "data/keys".

Parse Adjacent Pages/Links

You then need to run values.py to parse each page's links and generate a usable link graph. This script starts by loading the data/keys file so I can do vertex lookups by page title. Each vertex has an adjacency set because there will be some duplicates and a set is an easy way to manage unique indexes.

The script then parses the pages_links file for each page's links and adds them to the appropriate vertex's adjacency set. The adjacency sets use integer indexes looked up from the keytable by page title. This helps keep the memory footprint to a more manageable level. Only links that are stored in the keytable are allowed to be added to an adjacency set for a given vertex.

Each vertex's adjacency set is then written to disk as "data/graph". A single line in the graph file is formatted like so:

v1 v2 v3 v4\n

Each adjacent vertex is an integer and separated by a space. A newline char indicates the end of that adjacency list. Each line maps to a vertex in the graph: line 0 in the graph file maps to vertex 0 in the graph. That means that line 0 contains vertex 0's adjacency list.

Running Graph Algorithms

You can either run graph.py directly:

$ python graph.py

or you can run it in the Python interpretor:

$ python
>>> from graph import WikiGraph
>>> g = WikiGraph()
>>> src = g.keytable["nazism"]
>>> dst = g.keytable["philosophy"]
>>> g.bfs(src)
>>> g.path(src, dst)

Some ideas for future work

1. Determine backlinks of each wikipedia page.

2. Web-based interface for running graph algorithms on the dataset.

3. Some kind of graph database. I calculated that pre-computing all possible paths in the link graph would require storage on the order of 300 TBs. That means for now I have to run algorithms on-the-fly and that can be slow if I'm trying to service web requests. Not to mention that each time a BFS is run, it has to allocate storage for predecessor trees. A few requests in a row would kill a server.

4. ???

5. PROFIT!

 

Applied Single-Source Longest-Path Algorithms

Or, determine the maximum sum of numbers from top to bottom of a triangle by following a path defined by adjacent numbers on the line below.

Problem

Given a triangle like below, you need to write an algorithm that maximizes the cost of traveling from point A to point B, where point A is the top of the triangle and point B is any one of the nodes at the bottom of the triangle.

Sample triangle:

         5*
        / \
       /   \
      9*    6
     / \   / \
    /   \ /   \
   4     6*    8
  / \   / \   / \
 /   \ /   \ /   \
0     7*    1     5

The maximum sum of the path taken above is 5+9+6+7 = 27.

The triangle above is stored in a text file with the following format:

5
9 6
4 6 8
0 7 1 5

Each line is just separated by a '\n' and each number on a line is separated by a space. Nothing too fancy.

Where to start?

One of the tricks to solving this problem is recognizing that it's a graph problem. Some hints in the problem statement make this pretty clear: adjacent numbers, maximum sum/cost, moving from the top number to a bottom number, etc. The first thing you need to do is parse the text file and build the graph representation from the data. The way I do this in Python is:

lines = open(filename).readlines()
lines = [line.strip() for line in lines]

# `lines` is now a list of space-delimited strings of numbers:
# ['5', '9 6', '4 6 8', '0 7 1 5']

# create a graph from the lines:
G = TriangleGraph(lines)

Implied in the problem, the graph is a Directed, Weighted Graph

The next order of business is taking the `lines` list and turning it into a usable directed, weighted graph representation. I created a TriangleGraph class that takes the `lines` list as an init argument and handles all of the graph stuff. Here's the relevant part of the TriangleGraph initialization:

class TriangleGraph(object):

    def __init__(self, lines=[]):
        self.lines = lines
        self.graph = []
        self.numNodes = 0

        # each node has some data associated with it, such as
        # its value, the distance from the source vertex, and
        # its immediate predecessor. A vertex's index i will map
        # to graph, values, distances, and predecessors.
        self.values = []
        self.distances = []
        self.predecessors = []

        # I use adjacency lists to represent the triangle graph
        # and its built during initialization.
        self.build_adjacency_lists()

    def build_adjacency_lists(self):
        """
        Build a graph representation from self.lines using adjacency lists.
        """
        # account for root node first because we need to track the other
        # node's indexes as they're added to the graph.
        self.numNodes = 1

        # add root node to self.values first because the loop below
        # only adds next lines.
        self.values.append(int(self.lines[0]))

        for i in xrange(len(self.lines)-1):
            nextline = self.lines[i+1].split()

            # add next line of values
            self.values.extend([int(n) for n in nextline])

            for j in xrange(len(nextline[:-1])):
                # add adjacent numbers on the next line to the graph
                # for the current vertex.
                self.graph.append({self.numNodes: int(nextline[j]),
                                   self.numNodes+1: int(nextline[j+1])})

                # last adjacent vertex added to current vertex's adjacency
                # list will be the _first_ vertex added to the next vertex's
                # adjacency list, so increment numNodes by only 1.
                self.numNodes += 1

            # A new line is about to be added to the graph, so make sure
            # the first adjacent vertex added to current vertex is indexed
            # as one more than the last adjacent vertex added to the previous
            # vertex. Otherwise, the last vertex on this line will be adjacent
            # to the first vertex on this line.
            self.numNodes += 1

        # The last line of the triangle contains a list of vertices
        # that don't have adjacent vertices, so add empty dictionaries.
        last = self.lines[-1].split()
        self.graph.extend([{} for i in xrange(len(last))])

The graph now has a usable representation in the form of adjacency lists. For the example triangle referenced at the beginning, its graph representation is now this:

# vertex u has adjacent vertices v1 and v2 with values w1 and w2:
# adjacency dict is formatted as u -> {v1: w1, v2: w2}
0 -> {1: 9, 2: 6}
1 -> {3: 4, 4: 6}
2 -> {4: 6, 5: 8}
3 -> {6: 0, 7: 7}
4 -> {8: 1, 7: 7}
5 -> {8: 1, 9: 5}
6 -> {}
7 -> {}
8 -> {}
9 -> {}

With a directed, weighted graph, you need to include each vertex's value as part of the adjacency list. If it wasn't a weighted graph, vertex 0, for example, would have an adjacency list like so: 0 -> [1, 2] because vertex 1 and 2 are adjacent to vertex 0. Using a dict to represent weighted vertices makes sense. The `self.values` list is also a lookup table keyed by vertex index. I use that in my `weight` function to determine the cost of moving from one vertex to another.

Finding the shortest path through a directed, weighted graph

Now I'm not a computer scientist, so I had to reference my copy of Introduction to Algorithms by Cormen, et al. I initially hacked up a Breadth-First Search and was able to solve the first triangle by simply picking the largest adjacent number on the line below. The BFS failed miserably on more complex triangles. I then remembered reading about Single-Source Shortest-Path algorithms on directed, weighted graphs. Dijkstra's algorithm came to mind. There are two important pieces of Dijkstra's algorithm that need to be modified in order to transform it to a Single-Source Longest-Path algorithm.

A normal implementation of Dijkstra's algorithm consists of an initialization function and a relax function. The initialization function initializes the distances and predecessors arrays so they can be used in the actual algorithm.

# I use the heapq library in standard Python because
# Dijkstra calls for a priority queue.
import heapq

def dijkstra(G, s):
    """
    Note: this is valid Python, but it's also pseudo-code
    to describe the algorithm. The actual code is inside a class.
    """
    # initialization
    for v in xrange(len(G)):
        d[v] = float("inf")
        p[v] = -1
    d[s] = 0

    S = set()
    Q = [i for i in xrange(len(G))]
    heapq.heapify(Q)

    while Q:
        u = heapq.heappop(Q)
        S.add(u)
        for v in G[u].keys():
            w = weight(u, v)
            relax(u, v, w)

 

NOTE: Functions below are defined for Single-Source Shortest-Path.

def initialize_single_source(G, s):
    # G is the graph and s is the index of the source vertex
    for v in xrange(len(G)):
        d[v] = float("inf") # infinity
        p[v] = -1
    d[s] = 0

`d[v]` is the distances and `p[v]` is the predecessors. The relax function depends on the `d[v]` values and the edge weight. I define my weight function w(u, v) as `values[u] + values[v]`. That is, the cost of traveling from vertex 0 to vertex 1 is 5+9, because values[0] == 5 and values[1] == 9.

Relaxing an edge just means that if the algorithm can improve the shortest path to some vertex `v` by going through vertex `u`, it will update the distances and predecessors array for vertex `v` to go through `u` instead. The definition of the relax function for shortest-path is:

def relax(u, v, w):
    # `w` is w(u, v) already computed
    if d[v] > d[u] + w:
        d[v] = d[u] + w
        p[v] = u

Neat, but how do I determine the longest path?

There are two simple changes you need to make to the `initialize_single_source` and `relax` functions. First off, rather than initializing the distances to infinity, initialize them to -1. Now you need to modify the relax function to test if the path can be improved by increasing the cost associated with traveling from vertex v to u. The modification is just `if d[v] < d[u] + w`, then reset v's path to go through vertex u.

NOTE: Functions below are modified for Single-Source Longest-Path 

def initialize_single_source(G, s):
    # G is the graph and s is the index of the source vertex
    for v in xrange(len(G)):
        d[v] = -1
        p[v] = -1
    d[s] = 0 
def relax(u, v, w):
    # `w` is w(u, v) already computed
    if d[v] < d[u] + w:
        d[v] = d[u] + w
        p[v] = u

Since there are a number of vertices at the bottom of the triangle, you need to find paths to all of them from the source vertex to determine the longest path. After running `dijkstra(G, 0)` (since vertex 0 is the top of the triangle), you can determine the longest path similar to below:

def path(s, v):
    """
    If there's a valid path from s to v, sum the cost of that path.

    `path(s, v)` follows the path backwards by traversing the
    predecessors of v until it hits the source vertex.
    """
    if s == v:
        return values[s]
    elif p[v] == -1:
        # no path
        return 0
    else:
        return path(s, p[v]) + values[v]

def longest_path():
    total = 0
    # len(G)-len(lines) is the first vertex on the last line,
    # so iterate from that vertex to the last vertex at len(G)
    for i in xrange(len(G)-len(lines), len(G)):
        t = path(0, i)
        if t > total:
            total = t
    return total

If you'd like to see my actual working implementation, it's hosted on Github: Solution to Triangle Graph Problem

Thoughts on Web Development: Automation, Code Generation, and Scheme

Every so often I become frustrated with my tools. I mostly spend my time building web applications and it's painfully obvious how much boilerplate code I write in Python to put live apps on the web. Python isn't alone in this respect. Any language will have some amount of boilerplate. Lately I've been spending some time on automating deployment of web applications. Any web developer worth his or her salt will have some process in place for automated deployments. Another thing worth automating is your initial project generation. The more popular web frameworks have something like this in place. Pylons, Django, Rails, etc are all capable of generating a starting point for your web app.

My own web framework, blueberry, initially had an automated starting point for web apps. It also had some code for managing database migrations. Over time, these features became somewhat constraining so I ripped them out completely. The last few web applications I've worked on used Tornado, which has zero code generation for a starting point. Not having a web app completely generated for me is a big bonus in my opinion. I'd rather build tools for automating web app generation to fit my style, rather than fitting my style to the web framework.

Another topic worth exploring is web form generation. There are tradeoffs to be made in this area and many ways of doing the same thing: building web forms, validating them, and storing the result in a database. Python has some really cool form generation libraries (ToscaWidgets, for example). There's also formencode for validating and inserting defaults/errors into your forms. Using ToscaWidgets and formencode together is pretty cool if you can get it working perfectly. ToscaWidgets is especially complex. At the time I was using it the docs were a little light, so I spent most of my time inside of the code learning how it worked.

Then there's the other side of the coin: Using straight HTML to build your forms. You can 100% use formencode for validating your hand-rolled forms, and it works really well. By the time you get into displaying errors, inserting defaults, etc, you'll start wishing you went with a form generator. Hand-rolled HTML for anything complex is A Bad Idea. I mean, unless you like feeling unproductive when you write your thousandth signup form.

There's also this new-ish trend of using jQuery to generate all of your markup. I'm not sure how I feel about this. Using jQuery is awesome and it's a great piece of technology, but it's possible to make your jQuery markup generation code really complex and it can even take up more of your editor space than your normal markup. That tradeoff may or may not be worth it to you.

Focusing back on server-side technologies, I think there is a lot to be learned from Arc and Scheme in general. I keep coming back to using scheme for my web projects, but a big problem I have is that there's no clear way to do it. With LISP there's Hunchentoot, but even then you have to install all kinds of libraries and it may not even install properly. I used to use PLT Scheme (now Racket) and even wrote a few articles on using it for web development. PLT is in a world of its own. It's an advanced Scheme that has tons of libraries. It even comes with its own web server.

Using languages like Scheme, if done correctly, your web app becomes a set of high-level, short functions that were built from the ground up to limit the amount of boilerplate you have to write. When I'm writing Python I try to incorporate this way of thinking into my own software. To do that really well, you almost have to create an entirely new programming language on top of the language you're using. That's what Arc does. Python doesn't make it easy to do that, but it can be done.

Lately I've been spending some time looking at other Schemes and toying around with Chicken Scheme. I'm hoping to get some web apps running with it one of these days. Just last night I wrote some C code and used SWIG to interface Chicken and libevent. It wasn't a complete interface to libevent though. I made a super simple timing event and called it from Chicken. I'll probably end up writing some simple servers in libevent and using them in Chicken. Maybe even a new web server or tying Chicken Scheme into Mongrel2.