A discrete, stochastic model, written in Python

I’m always impressed by a model that reproduces a real life phenomenon, but is so simple that it seems like its creation was inevitable. That’s how I felt when I first saw the standing ovation and flocking birds models. I’ve developed a model that reproduces the process of receiving variable amounts of DNA from ancestors and it could hardly be simpler. It relies on three rules, the latter two of which increase variability in shared DNA between relatives:

  • Parents pass half of their genomes to their children.
  • The half of the genome that a parent gives to a child is done so randomly.
  • Parents pass half of their parents’ DNA somewhat randomly, resulting in half from each on average.

A consequence of these rules is that two relatives can expect to share with each other about half of what they each have from an ancestor. For example, siblings, who each share 50% of a parent’s DNA, should expect about 25 of those percentage points to overlap, i.e. ~25% of my mom’s genes are found in both me and my brother.

For this model, a simple assumption is made that men and women pass DNA to their children in the same manner. In reality, DNA from fathers recombines at lower rates than from mothers. So the variability in shared DNA could be greater from fathers, especially for successively all-male lines. Not taking into account the sex of the parent makes the model vastly simpler to implement and still provides useful approximations, especially for lines that are a fairly even mix of males and females.

A person inherits an average of 25% of a grandparent’s DNA, but this number varies within a few percentage points. Why is that? For the purposes of this model a person inherits exactly 50% of a parent’s DNA. That’s because, out of your 22 autosomal chromosomes, each parent gives you one of two homologues for each chromosome. (This is ignoring the differences in the sex chromosomes X and Y and that mitochondrial DNA comes only from the mother. Chromosome X is in fact only shared equally in females, who get one homologue from each parent. In males, all of the X chromosome is from the mother and all of the Y chromosome is from the father. In everyone, all mitochondrial DNA is from the mother.) Your mom gives you one homologue of your chromosome #1, and that homologue is typically a combination of both of her chromosome #1 homologues — one from each of her parents. This process is called recombination.

One of your chromosome #1 homologues is from your mother and it has exactly half of your mother’s chromosome #1 DNA; but, whichever homologue or combination of homologues she gave to you, it was done somewhat randomly such that there was an equal probability of getting DNA from each of your maternal grandparents. Usually your homologue will be a combination of genes from the two applicable grandparents. And, on average, a mother’s chromosomal homologues are passed to a child with about half of what she has from each grandparent, so her 50% becomes the grandchild’s 25%.

I should note that it isn’t unlikely that you would inherit a full chromosomal homologue from a grandparent, thereby inheriting only one grandparent’s DNA on that chromosomal homologue. The frequencies for these non-recombination events are shown in a table here. An average of 55 autosomal segments are passed from a parent to a child. So there are only about two recombination events per chromosome, on average. For the purposes of this model, all that really matters is that you get a random half of your mother’s two homologues and that that will result in your maternal homologue for a particular chromosome being about half from each of your maternal grandparents.

Using the property of halved DNA per generation, you have an average of 25% of a grandparent’s DNA and 12.5% of your great-grandparent’s DNA. However, we would expect the variability to be greater as the generations go further back. The model results will show how much variability increases per generation.

What if you’ve had your DNA tested, you’re trying to find an ancestor on your maternal grandmother’s side, and you get the idea that you just don’t have enough of her DNA to find people with matching DNA? After all, you probably only have about 25% of her DNA, so there could easily be about 4 times more DNA matches out there than the ones whom you’re seeing. If you can’t get your grandma’s DNA tested, the next best thing is to get as many as possible of her descendants to have their DNA tested. It would be even better if those people were her children. If you get your mother’s DNA tested, then your own DNA is no longer contributing to the percentage. The next best, after your grandmother’s children, would be the first cousins who are related to you via your grandmother.

Here are some percentages of your ancestors’ DNA that we would expect to reproduce:

  • You and a sibling would produce 75% (50% + half of 50%) of your parent’s DNA.
  • You and your aunt would produce about 62.5% of your grandma’s DNA (your aunt’s 50% plus half of the 25% you would expect to add by testing your mother’s DNA). Another way to think about it is that it’s halfway between the 50% that your aunt would reproduce and the 75% that both your aunt and your mom would reproduce.

These percentages all rely on the same principle that was a consequence of the three rules of the model: That the overlap between two children of a given parent will be about half of what they share. For example, if a son receives 50% of his mother’s DNA, and a daughter receives another 50%, chances are that those will neither be the same 50%, nor a completely different 50% — it’s much more likely to be in the middle of those two numbers. On average, they will each have a distinct 25% and a shared 25% of their mother’s DNA.

Here are two more scenarios:

  • You and a first cousin would produce about 43.75% of a shared grandmother’s DNA. This is your 25% plus three-quarters of your first cousin’s 25%. (It isn’t half of your first cousin’s 25% because, luckily, your aunt inherited a different 50% of your grandmother’s DNA than the 50% that your parent inherited. It isn’t all of your cousin’s 25% because your aunt didn’t inherit a completely different 50% than your parent inherited. Rather it’s halfway between half and all of your cousin’s 25%.) Another way to think about it is that 43.75% is halfway between what you’d expect to reproduce yourself and what you’d reproduce by both your aunt and yourself.
  • This scenario is far less likely, but interesting nonetheless. Let’s say you have an aunt who had a lot of children. If you were to get eight of your aunt’s children to test their DNA along with your own, you’d likely reproduce about 62.5% of your grandmother’s DNA. With ten of your aunt’s children plus your own DNA, you’d almost always reproduce greater than 62.4%. As more first cousins are tested, the number approaches the same percentage that you’d expect to reproduce by getting yourself and that aunt tested.

As you can see, these numbers are getting less intuitive and harder to calculate in your head. The model I’ve developed predicts these numbers and far more complicated ones. But, more importantly, it predicts the variability that you’d expect to see.

Before the model could be valid, it had to reproduce, in a training model, the real-life variance in DNA that one would expect to see between people who share a particular ancestor. Those numbers don’t appear to be available. The useful number would be the average variance per generation, but that number certainly wouldn’t be found: the variance in shared DNA is zero when passed from parent to child, as it’s only introduced in the next generation. The only value I could find is the standard deviation of fraction of shared DNA between siblings. That value was 0.036 from a published study that analyzed 4,401 sibling pairs. Luckily, it was very easy to use the model to simulate two children and then compare their shared DNA. Running the model multiple times and then comparing the amount of shared DNA in each model run would provide a standard deviation. That value works just as well as would the standard deviation of the percentage of a grandparent’s DNA. If a model were to simulate the transfer of 50% of DNA from a parent to each child, and it uses a method of simulating the real variance due to recombination, and that model results in a standard deviation of 0.036 of shared DNA between siblings, then that model is valid as a method to simulate the variance due to recombination. It can then be used recursively for multiple generations.

How did I do it?

I started with a model wherein each parent has 100 labeled pieces (each distinct from the other pieces and those of the other parent), or segments, of DNA that are available to pass to a child. It would then be easy to pass a random 50 of those from each parent to a child. Doing so for two children, taking the intersection of those segments, counting the number of those shared segments, and then dividing that by the total number of segments the child now has (100) gives the fraction that the siblings share. The standard deviation that resulted was surprisingly close.

I then got off track a bit. There’s a very informative blog called The Coop Lab that uses Poisson distributions to solve these kind of problems. Some of the posts have conveyed that there are an average of 33 recombination events per generation (although this number is only for mother-to-child relationships). If 23 chromosomes are broken up into segments an average of 33 times, the result would be an average of 56 segments passed per generation. So if my model used 56 segments, that seemed to be a better number than 100. Unfortunately, the resulting standard deviation between siblings (~0.47) was much higher than the target. The reason is that, in real life, a different set of 56 segments would be passed to each child. In my model, different combinations of the same 56 segments were used each time. So using 56 segments underestimates the number of different ways that segments can be passed. (Lower numbers of segments will always result in a greater standard deviation.)

The next method I tried was to use a very large number of segments. I soon realized that there would hardly be a standard deviation at all if too many segments were used. The rule of 50% inheritance with 50% overlap would overcome any possible variance as the number of segments gets too high. A child would always get 50% of parents’ DNA and another child would get an almost perfectly (by half) overlapping different 50%.

It became apparent that the right method would be to use whatever number of segments resulted in a standard deviation of 0.036. The number of segments would have to be updated by the simulation in order for it to converge to the right value. Another problem was that the number of segments that a parent has available needs to be an even whole-integer number if half of them are going to be passed to each child. This constraint wouldn’t allow for the resolution needed to get a fine-tuned standard deviation. The solution for that problem was found by using random numbers to slightly vary the number of segments for a given trial. So, if an odd number of segments turned out to be closer to what was needed, half of the time the simulation could use one number lower and half of the time it could use one number higher.

The next number of segments I tried was 101. Each parent would then pass either 50 or 51 of these segments to a child. The 101 segments would be halved. The decimal portion (0.5) would be stored in a new variable. If a pseudo-randomly generated number from a uniform distribution was greater than the decimal portion, 50 segments would be used for each parent to each child. Otherwise, 51 segments would be used. This was allowed to vary for each trial, but remained constant within a trial. If the standard deviation overshot the target value, the number of segments needed to be increased. Otherwise, it needed to be decreased. To reduce the variability of the standard deviation in a simulation run, I used a high number of trials (200,401), each of which represents a comparison of one pair of siblings.

Using 101 segments moved the standard deviation in the wrong direction. Using the same principle as for 101 segments, decimal numbers could be introduced in the number of segments. If the input number was 99.5, the halved number would be 49.75. Thus, for ~25% of trials the number of passed segments would be 49 and ~75% of the trials would use 50 segments. As the number of segments appeared to converge around the high-90s, I re-adjusted the input number of segments and started the simulation over again, allowing it to change the number of segments by 0.05. The convergence of the number of segments and the resulting standard deviation is a method of the simulation to learn the right input for a given output. The segments were simply labeled for a simulated mother, consisting of m = {m0, m1, m2, …, mn-1} where n is the rounded (up or down) integer of halved segments multiplied by two. And for the father they were labeled p = {p0, p1, p2, …, pn-1}.

Results

The problem of predicting the mean percentage of shared DNA between segments is trivial. The percentage nearly always comes out 50.0%. Whether the input number of segments is initialized as higher or lower, the number of segments tends to converge around 97.5. Since the target standard deviation is 0.036, with only three decimal places, any simulation output of standard deviation above 0.0355 and less than 0.0365 is an accurate prediction. Whether I use an input number of segments of 95 (standard deviation = 0.0364) or 100 (standard deviation = 0.03551), the standard deviation seems to fall within that range. However, using the number of segments that the simulation converges to (which was 97.5 segments) is preferable. If the target standard deviation had more decimal places, I believe that the number of segments could be tuned to even finer resolution, however, the number of siblings pairs would also need to be increased, which would cause the simulation to run for a burdensome length of time.

Is there any significance to the number 97.5? It’s tempting to think that a parent’s genome is broken up into an average of 97.5 segments that are available, half of which end up being passed to a child. But that number is already known to be about 55 segments. The difference is due to the fact that, as I stated above, a different set of real-life segments (chopped in different places) are passed to each child — not just other segments from the same bunch like in my model. Took make the model work in a manner more similar to real life, the available segments would have to be different even before selecting half of them. (But my goal is to make the model simpler than real-life while reporting an accurate result.) In my model, the number of segments in the simulation has to be greater than 55 in order to introduce more variance to account for those different ways of breaking up the segments. The best way I can describe the meaning of the number of segments found by the model is as follows: If a parent’s genome was broken up into a fixed number of pieces, and those same pieces were the ones that were available to pass on each time the parent has a child, i.e. they couldn’t be broken up smaller and parts of segments couldn’t be combined, then a number of available segments averaging (across all parents) almost exactly 97.5 would provide the right amount of variance between all of a person’s descendants. What the number may show is that ~55/97.5 = ~56% of the variance in percentage of inherited genes is due to randomly halving the genome for each child and the other ~44% is due to segments being broken up in different ways each time. Regardless of what the number means, the important thing is that, in finding the number that produces the right variance, the model then becomes useful to perform other tasks.

Simulating the shared percentage of DNA and standard deviation between siblings was considered a calibration step of the model. Most of the same code could then be used as a tool to simulate the mean and standard deviation of DNA passed down from ancestors to descendants. Some examples of how that tool can be used and the subsequent results can be found here.

The next step would be to allow the simulation to treat recombination from mothers and fathers differently for improved accuracy. Presumably, 97.5 segments would lie halfway between the ideal number of segments to use for maternal recombination and that of paternal recombination. However, that would require a large dataset of great-grandparent to great-grandchild relationships. Unfortunately, I think we’re a long way off from having such a dataset, and I’ve already learned that nobody would be willing to share it with me if they had it. However, since the number of paternal and maternal recombination events has been observed from some studies, I may be able to interpolate those numbers in the same way that the number 55 was stretched to 97.5.

Originally published on Medium on 27 June 2018.

Cover photo by Sharon McCutcheon. The results for this model can be found here. I’ve also developed an updated version of this model and an X Chromosome modelFeel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.