Genetic Modeling Background
A couple of years ago I made my first model of genetic inheritance. It was probably about as simple as such a model could be. Rather than having 22 separate autosomal chromosomes, it was more like having one long string of connected chromosomes. And rather than having two homologues, or a second copy of each chromosome, it was more like the string of chromosomes was twice as long as it would otherwise be. On top of that, chromosome ‘segments’ wouldn’t stay in any particular place during the model, so order didn’t matter. I didn’t think about it at the time, but a good analogy for the model is that each parent has a handful of marbles and gives a random half to each child. With this model I found that, under the given assumptions, each parent has an average of 97.5 gene segments to pass to a child. The number in real life is closer to 55. I explained the difference in my original article, but the main reasons are that allowing segments to change places increases variance, while increasing the number of segments decreases variance.
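The marble analogy can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the original model: the function names, the handful size of 194 marbles (so that 97 are passed on), and the overlap statistic for a single parent are all assumptions made for the example.

```python
import random
import statistics

def marble_child(parent_marbles, rng):
    """Give a child a random half of the parent's marbles (order is ignored)."""
    half = len(parent_marbles) // 2
    return set(rng.sample(parent_marbles, half))

def sibling_overlap_fractions(n_marbles, trials, seed=0):
    """For one parent, the fraction of a child's marbles that a sibling also received."""
    rng = random.Random(seed)
    parent = list(range(n_marbles))  # hypothetical handful of distinct marbles
    half = n_marbles // 2
    fractions = []
    for _ in range(trials):
        child_a = marble_child(parent, rng)
        child_b = marble_child(parent, rng)
        fractions.append(len(child_a & child_b) / half)
    return fractions

# 194 marbles is an illustrative choice, so that each child receives 97.
fractions = sibling_overlap_fractions(n_marbles=194, trials=2000)
print(round(statistics.mean(fractions), 3), round(statistics.stdev(fractions), 3))
```

Changing `n_marbles` shows the effect described above: fewer marbles give a wider spread around one half, and more marbles give a narrower one.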
Just recently, I finished a model of X Chromosome inheritance. I started with the same structure as the autosomal model. The X Chromosome is a fairly short one, so it stands to reason that its length is much shorter than the combined length of all 22 autosomal chromosomes. The number of simulated ‘segments’ would be much lower for this model. I found that it was about 4.5. The problem with using so few segments is that the resolution of the results will be very poor. In fact, grandchildren could only share the following percentages of X-DNA with their grandparents, and nothing in between: 0%, 16.7%, 25%, 33.3%, 50%, 66.7%, or 100%.
I had to try something different for the X Chromosome model. There was something that I had wanted to do ever since I made the autosomal model. The solution would be to simulate two homologues rather than one, and for all ‘genes’ to remain at the same loci throughout the model. For the X Chromosome, the number of times segments were recombined, or essentially ‘cut,’ would be determined by a Poisson distribution. I found that the mean of that distribution should be about 1.7, and the number of available segments should be about 14, since it would be highly unlikely for the given distribution to produce a number higher than 13. Once the X Chromosome model was completed, I wanted to use the same model format for an autosomal model.
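Here is a minimal sketch of what one transmission could look like in a two-homologue model with fixed loci. It is not the model's actual code. The function names are made up for the example, and I'm assuming that cuts fall on the boundaries between adjacent segments, with at most one cut per boundary, and that the starting homologue is chosen at random.

```python
import math
import random

def poisson(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def recombine(homologue_a, homologue_b, mean_cuts, rng):
    """Build one gamete by switching between homologues at random cut points."""
    n = len(homologue_a)
    # n segments have n - 1 internal boundaries, so cap the number of cuts there.
    n_cuts = min(poisson(mean_cuts, rng), n - 1)
    cut_points = set(rng.sample(range(1, n), n_cuts))
    source = rng.choice((0, 1))  # assumed: start on either homologue at random
    pair = (homologue_a, homologue_b)
    gamete = []
    for locus in range(n):
        if locus in cut_points:
            source = 1 - source  # cross over to the other homologue
        gamete.append(pair[source][locus])  # every segment keeps its locus
    return gamete

rng = random.Random(1)
maternal = ["M"] * 14  # 14 segments, as in the X Chromosome model
paternal = ["P"] * 14
print("".join(recombine(maternal, paternal, 1.7, rng)))
```

Each printed string is one possible chromosome passed to a child, with runs of M and P marking which homologue each segment came from.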
Training the New Autosomal Model
The new autosomal model uses an average of 55 recombinations when a parent transfers DNA to their child. The number of recombinations for an individual event is calculated by adding 22 to a random Poisson value with a mean of 33. A histogram of a test distribution with one billion values is plotted in Figure 1. The minimum value was 28 and the maximum was 94. That is the range of the number of places where ‘genes’ would be ‘cut’ in the simulation, and it corresponds to anywhere from 1.27 to 4.27 recombinations per chromosome, on average. As with the original autosomal model, I calculated the shared DNA between sibling pairs and again aimed for a standard deviation of 0.036.
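The shifted Poisson draw described above is easy to sketch. The snippet below is an illustration rather than the model's code: the names are hypothetical, the Poisson sampler is a textbook method chosen to avoid extra dependencies, and the sample is far smaller than one billion, so the observed minimum and maximum will not match the 28 and 94 reported above.

```python
import math
import random

N_CHROMOSOMES = 22  # constant added to every draw
POISSON_MEAN = 33   # mean of the random part

def poisson(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def recombination_count(rng):
    """Number of cuts for one parent-to-child transmission."""
    return N_CHROMOSOMES + poisson(POISSON_MEAN, rng)

rng = random.Random(0)
draws = [recombination_count(rng) for _ in range(20_000)]
print(sum(draws) / len(draws))  # should land near 55

# The per-chromosome averages quoted for the observed extremes of 28 and 94:
print(round(28 / N_CHROMOSOMES, 2), round(94 / N_CHROMOSOMES, 2))
```

Adding a constant 22 guarantees that no draw falls below 22 cuts, while the Poisson part supplies the spread around the average of 55.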
Figure 1. Normalized histogram for one billion values of the simulation’s Poisson distribution, which was calculated by adding a value of 22 to random Poisson values with an average of 33, resulting in an average of 55.
The number of available ‘genes’ or ‘gene segments’ wasn’t set in stone, but I knew from my previous models that the number would likely affect the standard deviation of shared DNA between relatives. Any number of 95 or more segments could be used, since using 94 or fewer could result in a Poisson draw that calls for cutting more spots than are available. Varying the number of available segments did change the standard deviation. I tested many values ranging from 90 segments to 800 thousand. At the lower end, the standard deviation was about 0.039. As the number of available segments increased, the standard deviation decreased. Somewhere in the range of a few hundred segments, the standard deviation started to increase again, but it seemingly never got as high as 0.036. Luckily, it had passed through that value early on, when the number of segments was low. For each simulation, I ran 500 thousand trials. Using 98 available segments resulted in a standard deviation of 0.0364, while 100 segments resulted in a value of 0.0358, so it was pretty obvious what value to use. When the model input was 99 available segments, which were cut in anywhere from 28 to 94 places, the standard deviation of the fraction of shared DNA between siblings was exactly 0.0360. Ninety-nine available segments would be the number to use in the model.
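The selection step can be written out as a small sketch. It uses only the three standard deviations reported above; the function names are made up for the example, and the lower bound assumes that cuts fall on the boundaries between adjacent segments, with at most one cut per boundary.

```python
TARGET_SD = 0.036  # standard deviation of shared DNA between siblings to aim for
MAX_CUTS = 94      # largest number of cuts seen in the test distribution

# Standard deviations reported above for three segment counts.
reported_sd = {98: 0.0364, 99: 0.0360, 100: 0.0358}

def min_segments(max_cuts):
    """n segments in a row have n - 1 boundaries that can be cut (assumed)."""
    return max_cuts + 1

def closest_to_target(sd_by_segments, target):
    """Pick the segment count whose standard deviation is nearest the target."""
    return min(sd_by_segments, key=lambda n: abs(sd_by_segments[n] - target))

best = closest_to_target(reported_sd, TARGET_SD)
print(min_segments(MAX_CUTS), best)  # prints: 95 99
```

In a full tuning run, `reported_sd` would be filled in by running the simulation once per candidate segment count.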
What’s normally of interest in a genealogical model is the average amount of DNA that a person shares with various relatives. I had never actually computed those statistics with the original autosomal model. Instead, since those statistics are fairly well-known, I had calculated the average amount of an ancestor’s DNA that a person would have when combined with various relatives. This value is a little higher than one’s own share, such as 75% of a parent’s DNA, on average, when combined with a sibling, or 62.5% of a grandparent’s when combined with an aunt or uncle. That model was used to make this calculator.
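The two averages quoted above can be checked with a line of arithmetic. This sketch assumes that the two relatives inherit their portions of the ancestor’s DNA independently of each other; the function name is hypothetical.

```python
def combined_coverage(share_a, share_b):
    """Expected fraction of an ancestor's DNA held by at least one of two
    relatives, assuming their inherited portions are independent."""
    return 1 - (1 - share_a) * (1 - share_b)

print(combined_coverage(0.5, 0.5))   # you and a sibling, for a parent: 0.75
print(combined_coverage(0.25, 0.5))  # you and an aunt or uncle, for a grandparent: 0.625
```

These are averages only. The spread around them is what the simulation is needed for.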
Now that I had two autosomal models, I wanted to compare them to each other, with the possibility of discovering which one more accurately models real-world statistics. My guess is that, although the older model is less realistic, it’s more accurate for most of the results. It doesn’t attempt to simulate many of the features of recombination, but that’s what makes it elegant. And I doubt that the added complexity of the newer model is necessary. I do, however, think that the newer model could be very accurate if given the right input distribution. Unfortunately, with a dearth of statistics available, finding out which model is better will likely have to wait.
The table in Figure 2 shows the comparisons between the two autosomal models. There is close agreement between the models for most of the simulations.
Figure 2. Comparison of results for two autosomal models. Each row corresponds to a simulation of 500 thousand trials comparing an individual with a particular relative. The rows alternate by model, first with the original model in which gene segments don’t remain in place, next with the new two-homologue model in which gene segments have a fixed position. The lower and upper limits of the 95% confidence interval (CI) are shown on either side of the average. Within the constraints and assumptions of the particular model, there is 95% confidence that shared DNA between the two relatives would fall within that range. The column ‘0% Shared’ refers to the percentage of trial runs that result in relatives not sharing any DNA. This doesn’t occur for very close relations. For brevity and easier handling of program variables, the terminology ‘parent of 2nd cousin’ is used rather than ‘1st cousin, once removed.’ Inputs are based on the number of generations back from the user, so the model input for ‘parent of 2nd cousin’ is ‘gen = 3’ (three generations, for great-grandparents), rather than the ‘gen = 2’ (for grandparents) that would be used from the other perspective. Similarly, one could find their expected shared percentage of DNA with a niece by using ‘gen = 2’ and simulating a comparison to an aunt.
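For readers who want to build a similar table, here is a hedged sketch of how the columns could be computed from a list of trial results. I’m assuming that the 95% range is taken from the 2.5th and 97.5th percentiles of the simulated values; the function name and the toy data are made up for the example.

```python
import math
import statistics

def summarize(shared_fractions):
    """Average, 2.5th and 97.5th percentiles, and share of trials with no shared DNA."""
    ordered = sorted(shared_fractions)
    n = len(ordered)
    lower = ordered[math.floor(0.025 * (n - 1))]
    upper = ordered[math.ceil(0.975 * (n - 1))]
    zero_share = sum(1 for x in ordered if x == 0) / n
    return {
        "lower": lower,
        "average": statistics.mean(ordered),
        "upper": upper,
        "zero_shared": zero_share,
    }

# Toy data: 5 trials with nothing shared and 95 trials sharing 1%.
print(summarize([0.0] * 5 + [0.01] * 95))
```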
The results don’t appear to contradict my prediction that the older, simpler model is more accurate, although it’s probably not possible to verify that until real-world statistics become available. There is one case in which I believe the newer, more complicated model is more accurate. In Figure 2, certain comparisons show a very high percentage of no shared DNA between relatives, namely 2nd cousins, half-second cousins, 3rd cousins, and 3rd great-grandparents. I believe that this comes from the limited resolution imposed by the number of available segments (97.5, on average). With a different model, such as one with 1,000 segments, there may have been many cases in which anywhere from one to ten segments were shared between the above relatives. However, this model is limited to the cases in which 0, 1, 2, or more segments are shared. Those numbers correspond to roughly 0%, 1%, 2%, or more shared DNA, with no percentages possible between 0% and 1% (although, over multiple trial runs, this can average out to decimals in the percentage). As the relatives grow more distant, the number who share 0% DNA in the simulation rather than 1% will grow increasingly large. By contrast, the two-homologue model employs 99 segments and allows relatives to share DNA on either homologue, resulting in 198 possibilities (or results of about 0.5% when one segment is shared). While the simpler model has the possibility of gene segments switching to a different location, this doesn’t change the fact that only approximately whole-number percentages can be shared between relatives.
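The resolution argument comes down to the smallest nonzero share each model can report. A quick check, using the segment counts given above (the function name is made up for the example):

```python
def smallest_nonzero_share(n_units):
    """Share of DNA represented by a single shared segment."""
    return 1 / n_units

old_model = smallest_nonzero_share(97.5)    # 97.5 segments, on average
new_model = smallest_nonzero_share(2 * 99)  # 99 segments on each of two homologues

print(f"{old_model:.2%} {new_model:.2%}")  # prints: 1.03% 0.51%
```

Any true share below that step has to come out as either zero or one full segment in a given trial, which is why distant relatives end up in the ‘0% Shared’ column more often in the older model.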
Other than individual cases in which no DNA is shared, I believe that the simpler model is likely more reliable. The more complicated model could perhaps be tuned to very high accuracy if statistics were available. As always, I implore 23andMe, Ancestry.com, FTDNA, and others who have access to datasets, some of which number in the tens of millions, to release very simple aggregate statistics such as the ones found in Figure 2. The public really shouldn’t have to rely on a person making Python models on a personal computer to get an idea of how related they are to particular relatives.
A next step that I have in mind is to start treating recombination differently based on the sex of a parent for one or both of these models. It’s already known that recombination occurs more often in a mother’s genome than in a father’s. Presumably the shared DNA for maternal grandparent-grandchild relationships would have a lower standard deviation, and the shared percentage of DNA for paternal grandparents would vary more from the expected 25%. After a couple of years of trying to acquire those data, I still don’t have them, but I do have one idea for a standard deviation to aim for.
Cover photo by Scott Graham. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.