Category Archives: Social Network Analysis

Infectious Disease, R, Social Network Analysis

Platform for Developing Mathematical Models of Infectious Disease

14 April 2009 jhj1 3 Comments

Every once in a while someone asks me for advice on the platform to use for developing models of infectious disease. I typically make the same recommendations -- unless the person asking has something very specific in mind. This happened again today and I figured I would turn it into a blog post.

The answer depends largely on (1) what types of models you want to run, (2) how comfortable you are with programming, and (3) what local resources (or lack thereof) you might have to help you when you inevitably get stuck. If you are not comfortable with programming and you want to stick to fairly basic compartmental models, then something like Stella or Berkeley Madonna would work just fine. There are a few books that provide guidance on developing models in these systems. I have a book by Hannon and Ruth that is ten years old now but, if memory serves me correctly, was a pretty good introduction both to STELLA and to ecological modeling. They have a slightly newer book as well. Models created in both systems appear in the primary scientific literature, which is always a good sign for the (scientific) utility of a piece of software. These graphical systems lack a great deal of flexibility and I personally find them cumbersome to use, but they match the cognitive style of many people quite nicely, I think, and probably serve as an excellent introduction to mathematical modeling.

Moving on to more powerful, general-purpose numerical software...

Based on my unscientific convenience sample, I'd say that most mathematical epidemiologists use Maple. Maple is extremely powerful software for doing symbolic calculations. I've tried Maple a few times but for whatever reason, it never clicked for me. Because I am mostly self-taught, the big obstacle for me using Maple has always been the lack of resources either print or internet for doing ecological/epidemiological models in this system. Evolutionary anthropologist Alan Rogers does have some excellent notes for doing population biology in Maple.

Mathematica has lots of advantages but, for the beginner, I think these are heavily outweighed by the start-ups costs (in terms of learning curve). I use Mathematica some and even took one of their courses (which was excellent if a little pricey), but I do think that Mathematica handles dynamic models in a rather clunky way. Linear algebra is worse. I would like Mathematica more if the notebook interface didn't seem so much like Microsoft Word. Other platforms (see below) either allow Emacs key bindings or can even be run through Emacs (this is not a great selling point for everyone, I realize, but given the likely audience for Mathematica, I have always been surprised by the interface). The real power of Mathematica comes from symbolic computation and some of the very neat and eclectic programming tools that are part of the Mathematica system. I suspect I will use Mathematica more as time goes on.

Matlab, for those comfortable with a degree of procedural-style programming, is probably the easiest platform to use to get into modeling. Again, based on my unscientific convenience sample, my sense is that most quantitative population biologists and demographers use Matlab. There are some excellent resources. For infectious disease modeling in particular, Keeling and Rohani have a relatively new book that contains extensive Matlab code. In population biology, books by Caswell and Morris and Doak, both contain extensive Matlab code. Matlab's routines for linear algebra and solving systems of differential equations are highly optimized so code is typically pretty fast and these calculations are relatively simple to perform. There is a option in the preferences that allows you to set Emacs key bindings. In fact, there is code that allows you to run Matlab from Emacs as a minor mode. Matlab is notably bad at dealing with actual data. For instance, you can't mix and match data types in a data frame (spreadsheet-like structure) very easily and forget about labeling columns of a data frame or rows and columns of a matrix. While its matrix capabilities are unrivaled, there is surprisingly little development of network models, a real growth area in infectious disease modeling. It would be really nice to have some capabilities in Matlab to import and export various network formats, thereby leveraging Matlab's terrific implementation of sparse matrix methods.

Perhaps not surprisingly, the best general tool, I think, is R. This is where the best network tools can be found (outside of pure Java). R packages for dealing with social networks include the statnet suite (sna, network, ergm), igraph, graph, blockmodeling, RGBL, etc. (the list goes on). It handles compartmental models in a manner similar to Matlab using the deSolve package, though I think Matlab is generally a little easier for this. One of the great things about R is that it makes it very easy to incorporate C or Fortran code. Keeling and Rohani's book also contains C++ and Fortran code for running their models (and such code is generally often available). R and Matlab are about equally easy/difficult (depending on how you see it) to learn. Matlab is somewhat better with numerically solving systems of differential equations and R is much better at dealing with data and modeling networks. R can be run through Emacs using ESS (Emacs Speaks Statistics). This gives you all the text-editing benefits of a state-of-the-art text editor plus an effectively unlimited buffer size. It can be very frustrating indeed to lose your early commands in a Matlab session only to realize that you forgot to turn on the diary function. No such worries when your run R through Emacs using ESS.

One of the greatest benefits of R is its massive online (and, increasingly, print publishing) help community. I think that this is how R really trumps all the other platforms to be the obvious choice for the autodidacts out there.

I moved from doing nearly all my work in Matlab to doing most work in R, with some in Mathematica and a little still in Matlab. These are all amazingly powerful tools. Ultimately, it's really a matter of taste and the availability of help resources that push people to use one particular tool as much as anything else. This whole discussion has been predicated on the notion that one wants to use numerical software. There are, of course, compelling reasons to use totally general programming tools like C, Java, or Python, but this route is definitely not for everyone, even among those who are interested in developing mathematical models.

Evolution, R, Social Network Analysis

Nearly Neutral Networks and Holey Adaptive Landscapes

29 December 2008 jhj1 2 Comments

My holiday reading on modularity has led me into some of the literature on the evolution of complexity. Some of the most interesting work in theoretical biology that I've read in a while relates to the ideas of nearly neutral networks and holey adaptive landscapes, an area developed by Sergey Gavrilets at the University of Tennessee. The various papers of his to which I refer can be accessed on his website. I find his explanations very clear, but recognize that this work is fairly technical stuff and my understanding of it is greatly facilitated by previous experience with similar models in the context of epidemics on networks (believe it or not). Nonetheless, a reasonably accessible introduction can be found in his 2003 chapter, "Evolution and speciation in a hyperspace: the roles of neutrality, selection, mutation and random drift." I have based much of my discussion here on this paper along with his 1997 paper in JTB.

The father of American population genetics and Modern Evolutionary Synthesis pioneer Sewall Wright first came up with the metaphor of the adaptive landscape in 1932. The basic idea is a kind of topographic map where the map coordinates are given by the genotype and the heights above these coordinates are given by the fitnesses associated with particular genotype combinations. A landscape, of course, is a three dimensional object. It has a length, a width (or latitude and longitude) and height. This particular dimensionality turns out to be very important for this story.

A major concern that arises from the idea of an adaptive landscape is how populations get from one peak to another. In order to do this, they need to pass through a valley of low fitness and this runs rather counter to our intuitions of the way natural selection works. The usual way around this apparent paradox is to postulate that populations are small and that random genetic drift (which will be more severely felt in small populations) moves the population away from its optimal point on the landscape. Once perturbed down into a valley by random forces, there is the possibility that the population can climb some other adaptive peak.

This is a slightly unsatisfying explanation though. Say that we have a finite population of a diploid organism characterized by a single diallelic locus. Finite populations are subject to random drift. The size of the population is $N$ . Assume that the fitnesses are $w_{AA}=1$ , $w_{Aa}=1-s$ , and $w_{aa}=1$ . This is actually a very simple one-dimensional adaptive landscape with peaks at the ends of the segment and a valley in between. Assume that the resident population is all $AA$ . What happens to a mutant $a$ allele? We know from population genetics theory that the probability that a completely neutral (i.e., $s=0$ ) mutant allele reaching fixation is $1/2N$ . Russ Lande has shown that when the $s>0$ this probability becomes:

$U = \frac{1}{2N} \frac{e^{-Ns}\sqrt{4Ns/\pi}}{erf(\sqrt{Ns})},$

where $erf()$ is the error function $erf(t) = 2/\sqrt{\pi} \int_0^t e^{-y^2} dy$ .

When $Ns=20$ (say a population size of 200 and a fitness penalty of $s=0.1$ ), this probability is approximately $U=10^{-8}$ . So for quite modest population size and fitness disadvantage for the heterozygote, the probability that the population will drift from $AA$ to $aa$ is very small. This would seem to spell trouble for the adaptive landscape model.

Gavrilets solved this conundrum -- that moving between peaks on the adaptive landscape appears to require the repeated traversing of very low-probability events -- apparently by thinking a little harder about the Wrightean metaphor than the rest of us. Our brains can visualize things very well in three dimensions. Above that, we lose that ability. Despite all the Punnett squares we may have done in high school biology, real genotypes, of course, are not 2 or 3 dimensional. Instead, even the simplest organism has a genotype space defined by thousands of dimensions. What does a thousand dimensional landscape look like? I haven't the faintest idea and I seriously doubt anyone else does either. Really, all our intuitions about the geometry of such spaces very rapidly disappear when we go beyond three dimensions.

Using percolation theory from condensed matter physics, Gavrilets reveals a highly counter-intuitive feature of such spaces' topology. In particular, there are paths through this space that are very nearly neutral with respect to fitness. This path is what is termed a "nearly neutral network." This means that a population can drift around genotype space moving closer to particular configurations (of different fitness) while nonetheless maintaining the same fitness. It seems that the apparent problem of getting from one fitness peak to another in the adaptive landscape is actually an artifact of using low-dimensional models. In high-dimensional space, it turns out there are short-cuts between fitness peaks. Fitness wormholes? Maybe.

Gavrilets and Gravner (1997) provide an example of a nearly neutral network with a very simple example motivated by Dobzhansky (1937). This model makes it easier to imagine what they mean by nearly neutral networks in more realistic genotype spaces.

Assume that fitness takes on one of two binary values: viable and non-viable. This assumption turns our space into a particularly easy type of structure with which to work (and it turns out, it is easy to relax this assumption). Say that we have a three diallelic loci ( $A$ , $B$ , and $C$ ). Say also that we have reproductively isolated "species" whenever there is a difference of two homozygous loci -- i.e., in order to be a species a genotype must differ from the others by at least homozygous loci. The reproductive isolation that defines these species is enforced by the fact that individuals heterozygous at more than one locus are non-viable. While it may be a little hard to think of this as a "landscape", it is. The species nodes on the cube are the peaks of the landscape. The valleys that separate them are the non-viable nodes on the cube face. Our model for this is a cube depicted in this figure.

Now using only the visible faces of our projected cube, I highlight the different species in blue.

The cool thing about this landscape is that there are actually ridges that connect our peaks and it is along these ridges that evolution can proceed without us needing to postulate special conditions like small population size, etc. The paths between species are represented by the purple nodes of the cube. All the nodes that remain black are non-viable so that an evolutionary sequence leading from one species to another can not pass through them. We can see that there is a modest path that leads from one species to another -- i.e., from peak to peak of the fitness landscape. Note that we can not traverse the faces (representing heterozygotes for two loci) but have to stick to edges of the cube -- the ridges of our fitness landscape. There are 27 nodes on our cube and it turns out that 11 of them are viable (though the figure only shows the ones visible in our 2d projection of the cube).

So much for a three-dimensional genotype space. This is where the percolation theory comes in. Gavrilets and Gravner (1997) show that as we increase the dimensionality, the probability that we get a large path connecting different genotypes with identical fitness increases. Say that the assignment of fitness associated with a genotype is random with probability $p$ that the genotype is viable and $1-p$ that it is non-viable. When $p$ is small, it means that the environment is particularly harsh and that very few genotype combinations are going to be viable. In general, we expect $p$ to be small since most random genotype combinations will be non-viable. Percolation theory shows that there are essentially two regimes in our space. When $p<p_c$ , where $p_c$ is a critical threshold probability, the system is subcritical and we will have many small paths in the space. When $p>p_c$ , we achieve criticality and a giant component forms, making a large viable evolutionary path traversing many different genotypes in the space. These super-critical landscapes are what Gavrilets calls "holey". Returning to our three dimensional cube, imagine that it is a chunk of Swiss cheese. If we were to slice a face off, there would be connected parts (i.e., the cheese) and holes. If we were, say, ants trying to get across this slice of cheese, we would stick to the contiguous cheese and avoid the holes. As we increase the dimensionality of our cheese, the holes take up less and less of our slices (this might be pushing the metaphor too far, but hopefully it makes some sense).

A holey adaptive landscape holds a great deal of potential for evolutionary change via the fixation of single mutations. From any given point in the genotype space, there are many possibilities for evolutionary change. In contrast, when the system is sub-critical, there are typically only a couple of possible changes from any particular point in genotype space.

To get a sense for sub-critical and supercritical networks, I have simulated some random graphs (in the graph theoretic sense) using Carter Butts's sna package for R. These are simple 1000-node Bernoulli graphs (i.e., there is a constant probability that two nodes in the graph will share an undirected edge connecting them). In the first one, the probability that two nodes share an edge is below the critical threshold $p_c$ .

We see that there are a variety of short paths throughout the graph space but that starting from any random point in the space, there are not a lot of viable options along which evolution can proceed. In contrast to the sub-critical case, the next figure shows a similar 1000-node Bernoulli graph with the tie probability above the critical threshold -- the so-called "percolation threshold."

Here we see the coalescence of a giant component. For this particular simulated network, the giant component contains 59.4% of the graph. In contrast, the largest connected component in the sub-critical graph contained 1% of the nodes. The biological interpretation of this graph is that there are many viable pathways along which evolution can proceed from many different parts of the genotype space. Large portions of the space can be traversed without having to pass through valleys in the fitness landscape.

This work all relates to the concept of evolvability, discussed in the excellent (2008) essay by Massimo Pigliucci. Holey adaptive landscapes make evolvability possible. The ability to move genotypes around large stretches of the possible genotype space without having to repeatedly pull off highly improbable events means that adaptive evolution is not only possible, it is likely. In an interesting twist, this new understanding of the evolutionary process provided by Gavrilets's work increases the role of random genetic drift in adaptive evolution. Drift pushes populations around along the neutral networks, placing them closer to alternative adaptive peaks that might be attainable with a shift in selection.

Super cool stuff. Will it actually aid my research? That's another question altogether...

Another fun thing about this work is that this is essentially the same formalism that Mark Handcock and I used in our paper on epidemic thresholds in two-sex network models. I never cease being amazed at the utility of graph theory.

References

Dobzhansky, T. 1937. Genetics and the Origin of Species. New York: Columbia University Press.

Gavrilets, S. 2003. Evolution and speciation in a hyperspace: the roles of neutrality, selection, mutation and random drift. In Crutchfield, J. and P. Schuster (eds.) Towards a Comprehensive Dynamics of Evolution - Exploring the Interplay of Selection, Neutrality, Accident, and Function. Oxford University Press. pp.135-162.

Gavrilets, S., and J.Gravner. 1979. Percolation on the fitness hypercube and the evolution of reproductive isolation. Journal of Theoretical Biology 184: 51-64.

Lande, R.A. 1979. Effective Deme Sizes During Long-Term Evolution Estimated from Rates of Chromosomal Rearrangement. Evolution 33 (1):234-251.

Pigliucci, M. 2008. Is Evolvability Evolvable? Nature Genetics 9:75-82.

Wright, S. 1932. The roles of mutation, inbreeding, crossbreeding and selection in evolution. Proceedings of the 6th International Congress of Genetics. 1: 356–366.

Social Network Analysis

Halloween and Social Capital

31 October 2008 jhj1 Leave a comment

Once again, I am simply blown away by the participation rate in my kids' school Halloween parade this year. In a school of around 500 students, I'd say 98%+ wore a costume today for the Halloween parade. A substantial fraction of the parents wear something, even if just a silly hat, as well. Based on my experience growing up on the North Shore of Boston, each year I find this phenomenon hard to fathom. When I was in school, maybe 10% of kids would come to school in costume. I suppose that this could just be revealing period differences and (yet again) the fact that I'm not as young as I once was. However, I think it's something different.

Palo Alto is a very affluent, highly educated, community. Parents are amazingly involved in school. In fact, coming to Palo Alto has given me tremendous insight into the travesty that is American educational disparity. Why is it that Palo Alto has such a great school system? I'm sure that dedicated teachers and competent administration plays a major role. But there is also the fact that so many parents give a lot of time and money to the school. When our son was in kindergarten, at least two parents volunteered in the classroom each day. They did all the busy work that goes along with teaching kids to read -- getting appropriate level send-home books into kids' backpacks every day, checking back in the read books, etc. -- allowing the professional teacher to, well, teach. When a class is having a party and the teacher sends around an email to the parents asking for help, all the needs are taken care of within a day. "I'll bring cupcakes." "Oh, I can bring mini sandwiches." "Can anyone come and help ladle punch?" "Oh sure, I can do that!" You get the idea. It's also not hard to imagine when this is lacking (because parents don't have the luxury of participating because they are working multiple jobs to stay afloat or a large fraction of kids only have one parent or kids don't have a neighborhood connection to their school or because violence and apathy make school a frightening place or because parents collectively put their material desires above the needs of their children or communities), teachers have to spend much more time doing drudgery. Probably all but the most energetic and dedicated scrimp. Kids and school districts suffer.

We are very lucky here.

My hypothesis is that participation rates in Halloween parades is a marker of social capital. This is the concept popularized by Robert Putnam's terrific book, Bowling Alone , in which social relationships are strengthened by shared civic or other community experience. Such ideas go back to Merton and functionalist social analysis. Bowling has the manifest function of being an enjoyable way to pass the time. It's latent function though is building community by strengthening social capital. Social capital takes on a more explicity instrumental form in Bourdieu's (and others') notion that it is the sum of social resources that an individual can call upon to accomplish social ends.

Coming to school in a costume takes effort. It takes parental cooperation. It means buying or making a costume. It means, if you're a kid, taking the risk that other kids might make fun of you (of course, if enough kids do dress up, then you might get made fun of if you don't dress up, but that's another story). It typically means that parents (at least one, though I'm always blown away by the number of couples there) have to commit to come watch the whole spectacle. My Stanford colleagues Rebecca and Doug Bird have done a bunch of terrific work on the role of costly signaling in mediating social roles, particularly where fluid status hierarchies are involved. Costly signals are honest signals because they are harder to fake. When men do difficult and dangerous things (like hunt large game), they send a signal of their quality. This aids them in securing political allies among other men and mates among women. When men or women share food widely, they signal their good citizenship and probably make it more likely that others will share with them at some later date. For a nice discussion of these and other aspects of signaling theory in Anthropology, see Rebecca's (2005) paper with Eric Smith on the topic.

So, here we are in affluent Palo Alto, where 98% of kids, nearly all the teachers and administrators, and a large fraction of their parents participate in an annual (costly) ritual that strengthens community bonds and signals the health of civic engagement and our collective investment in the future. One can only imagine the types of returns our kids will receive on the social capital thus accrued. This raises major questions about society, democracy, the future. Is strong social capital really necessary for a functioning democracy, as Putnam argues? What are the effects of such collective social expressions on the world views of developing minds? What are the consequences for developing minds of not experiencing such expressions of community? How can we enable the formation of social capital in communities for whom the deck is less positively stacked?

notes on human ecology, population, and infectious disease