Data Structure Storage Techniques: Copy-On-Write Snapshotting in C

My blog has been leaning more towards data structures and algorithms lately and I'd like to continue with that for now. The last couple posts have been about graph algorithms. I've also been working on storage and retrieval of data structures (hashtrees and graphs).

When researching different ways to store data structures to disk, I came across a technique called copy-on-write snapshotting. The basic idea is you have a data structure in virtual memory that you want to store on disk. You fork(2) a child process and write the data structure to disk. There are a few things you have to do in order to get copy-on-write snapshotting.

The first thing is to use a private, anonymous mmap for storing your data structure in virtual memory.

unsigned long mapsize;
unsigned char *pmap;

/* 1MB */
mapsize = (1 << 20);
pmap = mmap(0, mapsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

If you're on OS X you probably need to use MAP_ANON instead of MAP_ANONYMOUS.

Let's say you have a binary tree T stored in virtual memory and it's time to take a snapshot of it. When you're using private virtual memory in one process and issue a fork() call, the parent process's VM pages aren't copied over to the child process. Instead the pages are shared until one of them writes to a shared page. If one process writes to a shared page, a copy of that page is made for the process doing the updating. The changes are not visible to the other process.

Copy-on-write is useful for when you're still issuing updates to the data structure in the parent process while snapshotting a current (as of the fork call) copy of it to disk in the child process. To take a snapshot, first make a call to fork():

pid_t pid = fork();
if (pid == 0) {
    /* in the child process */
    take_snapshot(T);
} else {
    /* in the parent process */
    ...
}

The take_snapshot() function is where you write a snapshot of the tree to disk:

#define RWRR    S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

static void take_snapshot(struct tree_s *T) {
    int fd = open("snapshot.dat", O_CREAT | O_TRUNC | O_RDWR, RWRR);

    /* Seek to EOF minus one byte and write a single
     * byte at the end to create a valid file map */
    lseek(fd, T->mapsize-1, SEEK_SET);
    write(fd, "", 1);

    /* create file mmap */
    unsigned char *fmap = mmap(0, T->mapsize, PROT_WRITE, MAP_SHARED, fd, 0);
    /* copy bytes from snapshot memory to file map */
    memcpy(fmap, T->pmap, T->mapsize);

    /* transfer all data to disk */
    fsync(fd);

    munmap(fmap, T->mapsize);
    munmap(T->pmap, T->mapsize);
    close(fd);

    /* free the child's copy of the tree */
    free(T);
    /* exit the child process */
    exit(EXIT_SUCCESS);
}

The code above can be thought of as pseudo-code because I haven't defined `struct tree_s` above and it won't work as-is. To see an actual working example, take a look at my GitHub gist on copy-on-write snapshotting of a binary search tree.

I'm also working on Multiversion Concurrency Control so look forward to a blog post about that topic.

Applied Single-Source Longest-Path Algorithms

Or, determine the maximum sum of numbers from top to bottom of a triangle by following a path defined by adjacent numbers on the line below.

Problem

Given a triangle like below, you need to write an algorithm that maximizes the cost of traveling from point A to point B, where point A is the top of the triangle and point B is any one of the nodes at the bottom of the triangle.

Sample triangle:

         5*
        / \
       /   \
      9*    6
     / \   / \
    /   \ /   \
   4     6*    8
  / \   / \   / \
 /   \ /   \ /   \
0     7*    1     5

The maximum sum of the path taken above is 5+9+6+7 = 27.

The triangle above is stored in a text file with the following format:

5
9 6
4 6 8
0 7 1 5

Each line is just separated by a '\n' and each number on a line is separated by a space. Nothing too fancy.

Where to start?

One of the tricks to solving this problem is recognizing that it's a graph problem. Some hints in the problem statement make this pretty clear: adjacent numbers, maximum sum/cost, moving from the top number to a bottom number, etc. The first thing you need to do is parse the text file and build the graph representation from the data. The way I do this in Python is:

lines = open(filename).readlines()
lines = [line.strip() for line in lines]

# `lines` is now a list of space-delimited strings of numbers:
# ['5', '9 6', '4 6 8', '0 7 1 5']

# create a graph from the lines:
G = TriangleGraph(lines)

Implied in the problem, the graph is a Directed, Weighted Graph

The next order of business is taking the `lines` list and turning it into a usable directed, weighted graph representation. I created a TriangleGraph class that takes the `lines` list as an init argument and handles all of the graph stuff. Here's the relevant part of the TriangleGraph initialization:

class TriangleGraph(object):

    def __init__(self, lines=[]):
        self.lines = lines
        self.graph = []
        self.numNodes = 0

        # each node has some data associated with it, such as
        # its value, the distance from the source vertex, and
        # its immediate predecessor. A vertex's index i will map
        # to graph, values, distances, and predecessors.
        self.values = []
        self.distances = []
        self.predecessors = []

        # I use adjacency lists to represent the triangle graph
        # and its built during initialization.
        self.build_adjacency_lists()

    def build_adjacency_lists(self):
        """
        Build a graph representation from self.lines using adjacency lists.
        """
        # account for root node first because we need to track the other
        # node's indexes as they're added to the graph.
        self.numNodes = 1

        # add root node to self.values first because the loop below
        # only adds next lines.
        self.values.append(int(self.lines[0]))

        for i in xrange(len(self.lines)-1):
            nextline = self.lines[i+1].split()

            # add next line of values
            self.values.extend([int(n) for n in nextline])

            for j in xrange(len(nextline[:-1])):
                # add adjacent numbers on the next line to the graph
                # for the current vertex.
                self.graph.append({self.numNodes: int(nextline[j]),
                                   self.numNodes+1: int(nextline[j+1])})

                # last adjacent vertex added to current vertex's adjacency
                # list will be the _first_ vertex added to the next vertex's
                # adjacency list, so increment numNodes by only 1.
                self.numNodes += 1

            # A new line is about to be added to the graph, so make sure
            # the first adjacent vertex added to current vertex is indexed
            # as one more than the last adjacent vertex added to the previous
            # vertex. Otherwise, the last vertex on this line will be adjacent
            # to the first vertex on this line.
            self.numNodes += 1

        # The last line of the triangle contains a list of vertices
        # that don't have adjacent vertices, so add empty dictionaries.
        last = self.lines[-1].split()
        self.graph.extend([{} for i in xrange(len(last))])

The graph now has a usable representation in the form of adjacency lists. For the example triangle referenced at the beginning, its graph representation is now this:

# vertex u has adjacent vertices v1 and v2 with values w1 and w2:
# adjacency dict is formatted as u -> {v1: w1, v2: w2}
0 -> {1: 9, 2: 6}
1 -> {3: 4, 4: 6}
2 -> {4: 6, 5: 8}
3 -> {6: 0, 7: 7}
4 -> {8: 1, 7: 7}
5 -> {8: 1, 9: 5}
6 -> {}
7 -> {}
8 -> {}
9 -> {}

With a directed, weighted graph, you need to include each vertex's value as part of the adjacency list. If it wasn't a weighted graph, vertex 0, for example, would have an adjacency list like so: 0 -> [1, 2] because vertex 1 and 2 are adjacent to vertex 0. Using a dict to represent weighted vertices makes sense. The `self.values` list is also a lookup table keyed by vertex index. I use that in my `weight` function to determine the cost of moving from one vertex to another.

Finding the shortest path through a directed, weighted graph

Now I'm not a computer scientist, so I had to reference my copy of Introduction to Algorithms by Cormen, et al. I initially hacked up a Breadth-First Search and was able to solve the first triangle by simply picking the largest adjacent number on the line below. The BFS failed miserably on more complex triangles. I then remembered reading about Single-Source Shortest-Path algorithms on directed, weighted graphs. Dijkstra's algorithm came to mind. There are two important pieces of Dijkstra's algorithm that need to be modified in order to transform it to a Single-Source Longest-Path algorithm.

A normal implementation of Dijkstra's algorithm consists of an initialization function and a relax function. The initialization function initializes the distances and predecessors arrays so they can be used in the actual algorithm.

# I use the heapq library in standard Python because
# Dijkstra calls for a priority queue.
import heapq

def dijkstra(G, s):
    """
    Note: this is valid Python, but it's also pseudo-code
    to describe the algorithm. The actual code is inside a class.
    """
    # initialization
    for v in xrange(len(G)):
        d[v] = float("inf")
        p[v] = -1
    d[s] = 0

    S = set()
    Q = [i for i in xrange(len(G))]
    heapq.heapify(Q)

    while Q:
        u = heapq.heappop(Q)
        S.add(u)
        for v in G[u].keys():
            w = weight(u, v)
            relax(u, v, w)

 

NOTE: Functions below are defined for Single-Source Shortest-Path.

def initialize_single_source(G, s):
    # G is the graph and s is the index of the source vertex
    for v in xrange(len(G)):
        d[v] = float("inf") # infinity
        p[v] = -1
    d[s] = 0

`d[v]` is the distances and `p[v]` is the predecessors. The relax function depends on the `d[v]` values and the edge weight. I define my weight function w(u, v) as `values[u] + values[v]`. That is, the cost of traveling from vertex 0 to vertex 1 is 5+9, because values[0] == 5 and values[1] == 9.

Relaxing an edge just means that if the algorithm can improve the shortest path to some vertex `v` by going through vertex `u`, it will update the distances and predecessors array for vertex `v` to go through `u` instead. The definition of the relax function for shortest-path is:

def relax(u, v, w):
    # `w` is w(u, v) already computed
    if d[v] > d[u] + w:
        d[v] = d[u] + w
        p[v] = u

Neat, but how do I determine the longest path?

There are two simple changes you need to make to the `initialize_single_source` and `relax` functions. First off, rather than initializing the distances to infinity, initialize them to -1. Now you need to modify the relax function to test if the path can be improved by increasing the cost associated with traveling from vertex v to u. The modification is just `if d[v] < d[u] + w`, then reset v's path to go through vertex u.

NOTE: Functions below are modified for Single-Source Longest-Path 

def initialize_single_source(G, s):
    # G is the graph and s is the index of the source vertex
    for v in xrange(len(G)):
        d[v] = -1
        p[v] = -1
    d[s] = 0 
def relax(u, v, w):
    # `w` is w(u, v) already computed
    if d[v] < d[u] + w:
        d[v] = d[u] + w
        p[v] = u

Since there are a number of vertices at the bottom of the triangle, you need to find paths to all of them from the source vertex to determine the longest path. After running `dijkstra(G, 0)` (since vertex 0 is the top of the triangle), you can determine the longest path similar to below:

def path(s, v):
    """
    If there's a valid path from s to v, sum the cost of that path.

    `path(s, v)` follows the path backwards by traversing the
    predecessors of v until it hits the source vertex.
    """
    if s == v:
        return values[s]
    elif p[v] == -1:
        # no path
        return 0
    else:
        return path(s, p[v]) + values[v]

def longest_path():
    total = 0
    # len(G)-len(lines) is the first vertex on the last line,
    # so iterate from that vertex to the last vertex at len(G)
    for i in xrange(len(G)-len(lines), len(G)):
        t = path(0, i)
        if t > total:
            total = t
    return total

If you'd like to see my actual working implementation, it's hosted on Github: Solution to Triangle Graph Problem