22C:21: Computer Science II: Data Structures
Project 1. Posted 2/16. Due 3/12
Introduction
These days we hear a lot about small-world networks, which are graphs with certain
characteristics that seem to occur in a surprisingly large number of natural phenomena.
Informally speaking, a small-world network is a graph with three properties:
- The average degree of a vertex is small relative to the number of vertices
in the graph.
- For any vertex v, the neighbors of v are very likely to themselves be neighbors.
Graphs with this property are said to be highly clustered.
- The average path length between pairs of vertices, measured in hops, is
small relative the size of the graph.
In other words, small-world networks are sparse, highly clustered graphs with small
average path length.
Small-world networks represent the small-world phenomenon, which is the hypothesis
that everyone in the world can be reached through a short chain of social acquaintances.
The concept gave rise to the famous phrase "six degrees of separation" after a 1967 small
world experiment by social psychologist
Stanley Milgram which suggested that two random US citizens were connected on average by
a chain of six acquaintances.
Other popular examples of small-world networks and the small-world phenomena are
the Kevin Bacon game,
the Erdos number project,
connectivity of the internet,
the power grid in the western part of the U.S.,
gene networks,
online social networking communities, etc.
Wikipedia has many related entries; see
Small-world network,
Small-world phenomenon,
Social network, etc., for
more interesting details and references.
Your task in this project is to experiment with methods for generating large,
random, small-world networks.
You are expected to write code, conduct experiments by running your code,
gather data from your experiments, and finally write a report
summarizing and explaining your findings.
Some Definitions
Let G be a graph with vertex set V and edge set E.
Average degree. The degree of a vertex is the number of neighbors it
has in the graph and the average degree of the graph is the average of all the vertex
degrees. We will use d(v) to denote the degree of v and d(G) to denote the average degree
of G.
Clustering coefficient.
For any vertex v in G, the clustering coefficient of v, denoted C(v),
is the ratio of the number of edges between pairs of neighbors of v to
the total number of pairs of neighbors of v.
For example, say vertex v has 10 neighbors.
There are (10 choose 2) = 45 pairs of neighbors and suppose that of these pairs of neighbors
15 pairs are connected by edges. Then the clustering coefficient of v is 1/3.
The clustering coefficient of G, denoted C(G), is the average
of the clustering coefficients of all the vertices in G.
The value of C(v), for any v, is between 0 and 1 and therefore the value of
C(G) is also between 0 and 1.
Average path length.
The distance between a pair of vertices u and v in G is the number
of hops on a shortest path between u and v.
The average path length of G, denoted L(G), is the average of
the distances between all pairs of vertices in G.
Small-world networks are graphs G that have a relatively small value for d(G)
and L(G) and relatively high value for C(G).
Random Graph Models
It is important to be able generate random input to test algorithms and to prove properties
of these.
For example, if we design a new sorting algorithm, we would want to be able to generate
large sequences of randomly ordered numbers, to test the algorithm on.
Similarly, in order to test graph algorithms, we would like to generate
random graphs and there
is a simple, well-known way to generate random graphs.
Let n be a positive integer and let p be a real number between 0 and 1 (inclusive).
First generate n vertices labeled 1 through n and then for each pair of vertices
i and j, connect the vertices by an edge with probability p.
In other words, we visit every pair {i, j} of vertices and flip a biased coin.
With probability p the coin toss has outcome "heads" and with probability 1-p
the coin toss has outcome "tails."
If the coin toss has outcome "heads" we add edge {i, j}; otherwise we don't.
This model of a random graph was first discussed in a series of papers by
mathematicians Paul Erdos and Alfred Renyi, starting in 1959.
We will call this the Erdos-Renyi model of random graphs.
However, it turns out that the Erdos-Renyi random graphs have very small clustering
coefficient relative to many of the well-known small-world networks.
For example, the Kevin Bacon graph has a clustering coefficient of 0.79,
whereas an Erdos-Renyi random graph with the same number of vertices has
clustering coefficient 0.00027.
This is not surprising because in the Erdos-Renyi model,
two vertices have the same probability of being connected by an edge, whether they have
a common neighbor or not.
The clustering coefficient numbers mentioned above appear in an influential
paper
by Watts and Strogatz that appeared in Nature in June 1998.
In this paper, Watts and Strogatz also suggest a simple, alternate random graph model that
yields graphs with small average degree, high clustering coefficient, and small average path length.
For specific details of the Watts-Strogatz model look at Figure 1 in this paper.
However, the Watts-Strogatz model seems somewhat simplistic and your
job is to experiment with an alternate, maybe more general, way of generating small-world networks.
This is described in detail in the next section.
A More General Model
Let d and n be positive integers and let p be a real number in the range
0 through 1 (inclusive).
The construction of a small-world network occurs in two steps:
- Fix a d by d square in the plane.
Distribute n points uniformly at random into
the square. These are the vertices of the graph being generated.
Connect by edges, pairs of points that are at distance at most 1.
At this point we have a graph that is called a unit disk graph.
This is because for any vertex v, the neighbors of v are exactly the
vertices in the disk of radius one (unit disk) centered at v.
-
Now we will "rewire" some of the edges in the unit disk graph
constructed in Step (1).
Suppose the vertices are labeled 1 through n.
Consider vertices in the order 1 through n and suppose that we are currently considering
a vertex v. Consider each edge {v, w} connecting v to a higher labeled vertex w.
With probability p, replace edge {v, w} by an edge {v, z} where z is picked
picked uniformly at random from the set of all vertices that are
not neighbors of v (including w).
We would like to believe that the random graph generated in the manner described
above is a small-world network.
The construction attempts to model the fact that people are typically connected
to a few other people in physical proximity, but have some "long-range" connections
as well.
Here are some basic observations about the graphs generated by the above
construction.
- The average degree of vertices in the unit disk graph,
constructed in Step (1), is determined by the relative values of n and d.
For example, let n = 10,000.
If we drop 10,000 points uniformly at random in a 100 by 100 square, then
we would expect one point per square unit.
It is also easy to see that we would expect about 3-4 points in any unit
disk and therefore we would expect the average degree of a vertex to be about
3-4.
On the other hand, if we drop 10,000 points uniformly at random in a
10 by 10 square, then we would expect about 100 points per unit square.
Since the square is much smaller, the points are much more densely
distributed.
Further more, in a unit disk we would expect about 315 points, implying
that the average degree of a vertex in the graph would be about 315.
By appropriately choosing d (relative to n) we will be able to generate
sparse graphs (i.e., graphs with small average degree).
-
It is easy to see that the "rewiring" in Step (2) does not change the
total number of edges in the graph.
This implies that the average degree of the graph also does not change due to
Step (2).
Note that we are not claiming that degrees of individual vertices do
not change; in fact, it is highly likely that the number of neighbors of
many vertices will change due to Step (2).
-
When p = 0, Step (2) does not disturb the edges in the graph at all and
our construction just gives us a unit disk graph.
The unit disk graph is known to have fairly high clustering
coefficient and fairly high average path length.
At the other extreme, when p = 1, Step (2) disturbs all the edges
in the graph and we get the Erdos-Renyi graph.
The Erdos-Renyi graph is known to have fairly small clustering
coefficient and fairly small average path length.
The hope is that for values of p strictly in-between p = 0 and p = 1
we will be able to get graphs with high clustering coefficient
and small average path length.
This will be the hypothesis that will guide our experiments.
Experiments
You should run experiments to test the following hypothesis:
As p increases starting from 0, the average path length falls quickly.
The clustering coefficient also falls, but more more slowly.
Hence, we will get to a value of p with small average path length, but fairly high
clustering coefficient.
To run experiments to test this hypothesis, let us fix n = 10,000 and average vertex degree
at 10.
So each vertex is connected to, on average, 1 in 1000 vertices.
First find by a simple "back-of-the-envelope" calculation what value of d
you should be using to get an average vertex degree of 10.
You might want experiment with different values of d to get this right.
Now increase p slowly from 0 to 1.
For each value of p, generate a graph, using the two-step process
described above and calculate its (i) clustering coefficient and (ii)
average path length.
Let G(p) denote the graph generated for a particular value of p.
Plot the ratios C(G(p))/C(G(0)) and L(G(p))/L(G(0))
so that we get a sense of whether our hypothesis is true or not.
You can decide how to vary p.
For example, you may increase p in increments of 1/10 or in increments of 1/100, etc.
But, the more points you have in your plot, the more convincing your results will
be.
You might want to look at Figure 2 of the Watts-Strogatz paper to get an idea of exactly what the x-axis and y-axis of your plot
should be and roughly what shape your plots should have.
To strengthen your experimental results, you might want to run these for other, larger
values of n and other slightly larger values of average degree.
You might also want to see if you can generate graphs for which L(G) and C(G) match
the numbers in Table 1 in
Watts-Strogatz paper
for the film actors graph, power grid graph, and the C. elegans graph.
As you run your experiments you may encounter issues that require tweaking your
parameters. For example, for the values of n = 10,000 and average degree = 10
it may be that the constructed graph is not connected.
In such a case, you might want to
increase the value of the average degree.
Implementation
Here is a high level description of the steps in your implementation.
There are many details to think about before you get started.
-
Start with an adjacency list implementation of the graph class.
This is available here.
For this project the adjacency list representation
is much more efficient than the adjacency matrix representation
because the graphs we are concerned with, are all sparse.
Turn this class into a generic class.
-
Add a constructor to the graph class that
takes arguments d, n, and p and returns a graph constructed
in the way described above.
-
Implement a method in the graph class that takes a vertex and
returns its cluster coefficient.
The header of this method would be
double clusteringCoefficient(vertexType v)
Then implement a method in the graph class that returns the clustering
coefficient of the graph.
This method would have the header:
double clusteringCoefficient()
-
Implements a program called Experiments that runs the
experiments and outputs the data.
This should be in a file called Experiments.java.
-
Implement a method to compute the average path length in the graph.
The header for this method would be
double averagePathLength()
-
Implement a breadth first search method that takes no arguments
and returns a breadth first search tree of the graph.
The header for this method would be
treeType bfs()
What to submit
Working code distributed in various files, a clearly written README, and
a report in pdf summarizing and explaining your results.
All of these in a directory called Project1.
More details on the exact names of the code files, size of the report, etc.
will be posted soon.