A comparison of three models, including updated amount of shared DNA between various relatives and ancestors
Note: this article is relatively old. I’ve updated my model to be far more accurate, in light of recently published standard deviations, and far more realistic at the same time. The newest model result can be found here.
Genetic Modeling Background
I recently published a comparison of autosomal genetic models. The first model, which I made about two years ago, is as simple as can be, but I think it captures the important insights. The new model adds the feature of two homologues per chromosome that you’d find if you peered into the genome of a real human. This allows the simulation of ‘genes’ or ‘segments’ switching places from one homologue to another, potentially multiple times. A constraint on both models is that siblings have to share 50% of their DNA on average, with a standard deviation of 3.6%. A constraint on the newer, two-homologue model is that parents pass, on average, 55 segments of DNA to a child’s genome (the number is actually higher in mothers and lower in fathers, but the specific numbers are unknown), and the random distribution that produces that number can be approximated by a Poisson distribution.
The only thing left to decide was how many ‘genes’ or ‘segments’ would have to be available in order to recombine them so that they’d make, on average, 55 segments, but still have the right standard deviation. A lower bound on that number would be one more than the maximum value you would get from the Poisson distribution. It would be hard to find an upper bound, and a larger number of available segments would very quickly become computationally burdensome for the simulation.
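Because a Poisson distribution has no hard maximum, the lower bound described above can only be estimated from a finite number of draws. Here is a minimal sketch, assuming a plain Poisson distribution with a mean of 55 and a hypothetical run of 100,000 transmissions (the run length and the seed are illustrative choices, not values from my model):

```python
import numpy as np

MEAN_SEGMENTS = 55      # average segments passed per parent (from the text)
N_DRAWS = 100_000       # hypothetical number of simulated transmissions

rng = np.random.default_rng(seed=0)   # seeded only so the sketch is repeatable
draws = rng.poisson(MEAN_SEGMENTS, size=N_DRAWS)

# The pool of available segments must exceed the largest count ever drawn.
lower_bound = int(draws.max()) + 1
print(f"largest draw: {draws.max()}, minimum available segments: {lower_bound}")
```

A longer run can produce a larger maximum, so the bound found this way is only a floor for that particular run.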
A Better Model
I knew at the time that the complicated model wouldn’t be necessarily more accurate. In fact, I noted that the simpler one is likely the more accurate of the two. But there could be additional constraints that, if known, could change that. For example, what’s the standard deviation between a grandparent and grandchild? Or what percentage of third cousins share no DNA with each other? Or my favorite constraint of all …
No sooner had I published that comparison of two models than I found something I had been waiting for since making my first genetic model. Late last year a preprint was released for a study of recombination and variance in genetic relatedness. This paper estimated the variation between grandchildren and a paternal grandparent to be 4%. Surely this value would be easy to come by empirically for someone with a decent dataset, and I had been asking the major direct-to-consumer testing companies for statistics on grandparent-grandchild relationships for two years, but so far none have been interested in helping. I don’t know how accurate the value is from the above paper. In fact, I thought it sounded quite high, but it’s at least a starting point for something I’ve been wanting to model for a couple of years. If the value turns out to be a bit different, I can simply substitute an updated value in the future.
It’s been known for years that genomes recombine differently in men and women: more recombinations in women result in less variation between maternal relatives, and fewer recombinations in men result in more variation between paternal relatives. My earlier models showed that the standard deviation between a grandchild and a grandparent might be about 2.55%. Therefore, if the standard deviation of grandchildren and paternal grandparents is 4%, the standard deviation between grandchildren and maternal grandparents is likely much less than that (in order to average something like 2.55%). This is in line with more recombinations in women and less variation between maternal relatives.
I was ready to create a model that treats recombination rates of mothers differently than those of fathers. Once the model could find the recombination rate for fathers that produces the correct standard deviations, the recombination rate for mothers and the standard deviation for maternal grandparents would be known. Finally, I could compare results of the updated autosomal model with the two previous models.
Training the Model
Initial tests made it clear that, as the number of segments is increased, the input recombination rate for fathers has to increase, approaching an asymptote (see Figures 1 and 2). That paternal rate appears to level off around 35 average recombinations per autosomal genome (corresponding to a Poisson lambda input of 13, since 22 is added to each draw). In order for the average recombination rate to be 55, the rate for mothers would have to be about 75.
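The arithmetic behind those two rates is short enough to write down. This sketch assumes, as the model does, that the paternal and maternal averages must simply average to 55; the function name is only illustrative:

```python
BASE = 22              # fixed amount added to every Poisson draw
TARGET_MEAN = 55       # sex-averaged recombinations per transmission

def maternal_mean(paternal_lambda, base=BASE, target=TARGET_MEAN):
    """Return the maternal average implied by a paternal Poisson lambda."""
    paternal = base + paternal_lambda
    # The two sexes must average to the target: (paternal + maternal) / 2 = target.
    return 2 * target - paternal

print(BASE + 13, maternal_mean(13))   # 35 75
```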
Figure 1. Graph of number of available segments versus the paternal lambda Poisson input, to which 22 must be added to find the average number of paternal recombinations. The colorbar refers to the difference in simulation output from the target standard deviations (3.6% for siblings and 4% for paternal grandparents). Ten times more weight was given to the difference in standard deviation for siblings since that statistic is known to one more decimal point: Diff. from Target = 10*abs(0.036 - std_dev_sib_model) + abs(0.04 - std_dev_pat_gp_model). The curved line is the predicted result from a multivariate polynomial regression of order three.
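The weighting in the Figure 1 caption can be written as a small function. The variable names follow the formula in the caption; the function name, the keyword defaults, and the sample values are only illustrative:

```python
def diff_from_target(std_dev_sib_model, std_dev_pat_gp_model,
                     target_sib=0.036, target_pat_gp=0.04, sib_weight=10):
    """Weighted absolute distance of one simulation run from the two target standard deviations."""
    return (sib_weight * abs(target_sib - std_dev_sib_model)
            + abs(target_pat_gp - std_dev_pat_gp_model))

# Hypothetical run: siblings off by 0.001, paternal grandparents off by 0.002.
print(diff_from_target(0.037, 0.042))
```

With those made-up inputs the result is roughly 0.012, most of it from the sibling term because of the tenfold weight.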
Figure 2. All of the parameters are the same as for Figure 1. The x-axis is extended farther and a few additional points are generated and plotted to show the asymptotic nature of the graph. Unfortunately, the third degree polynomial begins to miss the mark at this point (after about 3,000 segments). An exponential curve would probably fit the data better.
In order to run simulations that don’t take too long, it would have been preferable to use a low number of segments such as 99, which was used in the first two-homologue model. Values close to the targeted standard deviations can be achieved with almost any number of segments, which might suggest that a low number could be used. However, given the asymptotic relationship between the number of segments and the recombination rate, it seems as though it would be best to pick a high number of segments, one that gets the recombination rate close to the asymptote.
When the number of available segments used in the simulation is high enough, those segments are essentially simulating individual genes. There are approximately 20,000 coding genes in the human genome. This would probably be a good number to use in the simulation. There are actually many more genes that don’t code for proteins, suggesting that the simulation should maybe use a higher number. However, recombinations often occur at the same spots. And that’s obviously true for the beginning and end of each chromosome, which this model doesn’t attempt to simulate. Recombination occurring in the same spots suggests using a lower number of available segments for the model. Without more information, 20,000 genes is probably a good mid-point. Unfortunately, regular computers won’t be able to simulate many trials of this model if 20,000 available segments are used. It will have to be a much lower number, but one that gets the recombination rate close to the asymptote.
If more statistics become available to use as constraints, such as the percentage of 3rd cousins who share no DNA, I believe they would pinpoint the best spot on the asymptotic curve to use. If the assumptions of the model are roughly correct, which I think they are given the constraints on this model, that spot would probably give a recombination rate that’s very close to the real-life value.
- The simulated genome consists of one chromosome with two homologues (one homologous pair).
- Recombination occurs via random crossover between pairs. The number of events can be approximately simulated by adding 22 to a random Poisson variable that will equal 33 on average.
- Genomes recombine differently in men and women. The average number of recombinations will be higher in women and lower in men, but the average between the two will be 55.
- There may be a certain number of available ‘segments,’ on average, that get passed from parents to children. That number might be able to be found based on the other constraints of the model.
- The standard deviation of shared percentage of DNA between full siblings is 3.6%.
- The standard deviation of shared percentage of DNA between paternal grandparents and grandchildren is 4%.
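The assumptions in the list above can be sketched in a few lines of NumPy. This is not my actual simulation code. It assumes crossover points are spread uniformly along the chromosome, uses a hypothetical 1,700 segments per homologue, and measures sibling sharing as the average of the paternal-side and maternal-side matches. The lambda inputs of 13 and 53 correspond to the rough averages of 35 and 75 mentioned earlier, and all function names are illustrative:

```python
import numpy as np

N_SEGMENTS = 1700   # hypothetical number of available segments per homologue
BASE = 22           # fixed amount added to each Poisson draw (from the list above)

def make_gamete(hom_a, hom_b, lam, rng):
    """Copy one homologue from a parent, switching templates at each crossover."""
    k = min(BASE + rng.poisson(lam), N_SEGMENTS - 1)
    # Assumption: crossover points are uniform over the gaps between segments.
    points = rng.choice(np.arange(1, N_SEGMENTS), size=k, replace=False)
    switches = np.zeros(N_SEGMENTS, dtype=int)
    switches[points] = 1
    # Running parity of crossovers, plus a random starting homologue.
    source = (np.cumsum(switches) + rng.integers(2)) % 2
    return np.where(source == 0, hom_a, hom_b)

def sibling_share(lam_father, lam_mother, rng):
    """Fraction of DNA two full siblings share, averaged over both parental sides."""
    # Label the four parental homologues with distinct IDs.
    dad = (np.full(N_SEGMENTS, 0), np.full(N_SEGMENTS, 1))
    mom = (np.full(N_SEGMENTS, 2), np.full(N_SEGMENTS, 3))
    kids = [(make_gamete(*dad, lam_father, rng), make_gamete(*mom, lam_mother, rng))
            for _ in range(2)]
    paternal_match = np.mean(kids[0][0] == kids[1][0])
    maternal_match = np.mean(kids[0][1] == kids[1][1])
    return 0.5 * (paternal_match + maternal_match)

rng = np.random.default_rng(seed=1)   # seeded only so the sketch is repeatable
shares = np.array([sibling_share(13, 53, rng) for _ in range(500)])
print(f"mean {shares.mean():.3f}, std dev {shares.std():.3f}")
```

The mean should land near 50%. The standard deviation is the quantity that training adjusts by changing the number of segments and the two lambda inputs.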
I eventually decided to run simulations with 1,700 segments. I wasn’t very happy about how far away that was from the asymptote, but any higher number of segments required simulations that took far too long to run. Perhaps someday better processors or distribution over processor cores will allow a higher number of segments. Or perhaps I should be running these simulations in C++. The best paternal recombination rate to use with 1,700 segments was 33.84, which is probably about one recombination lower than the asymptote. That results in a maternal recombination rate of 76.16. Based on other values I’ve tested, some more accurate combinations to use would probably be 1,800 segments and 33.86 recombinations or 2,000 segments and 33.9 recombinations. However, since the target statistics aren’t known to very precise values (the sibling standard deviation could be anywhere from 3.55% to 3.65% and the paternal grandparent standard deviation could be anywhere from 3.5% to 4.5%), there’s no need to try to predict them more accurately.
Recommended model input based on training model results:
- Number of available segments: 1,700.
- Number of recombinations from fathers: 22 + np.random.poisson(11.84), which averages 33.84.
- Number of recombinations from mothers: 22 + np.random.poisson(54.16), which averages 76.16.
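As a quick sanity check on the recommended inputs, the sketch below draws many values from each expression and confirms that the averages come out near 33.84 and 76.16. It uses NumPy's newer Generator interface in place of np.random.poisson; the sample size and the seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)   # seeded only so the sketch is repeatable
n_draws = 200_000                     # hypothetical sample size

paternal = 22 + rng.poisson(11.84, size=n_draws)
maternal = 22 + rng.poisson(54.16, size=n_draws)

print(f"paternal mean: {paternal.mean():.2f}")
print(f"maternal mean: {maternal.mean():.2f}")
print(f"sex-averaged:  {(paternal.mean() + maternal.mean()) / 2:.2f}")
```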
I’ve noted already that the single homologue model is likely more accurate than the first two-homologue model that I made since the latter didn’t have additional constraints on it to justify its additional complexity. However, the new model presented here has the additional constraint of relatedness to paternal grandparents. I believe that this model is now the most accurate of the three. The tables below will present the statistics for shared DNA between various relatives alongside the results of previous models.
Figure 3. Comparison of results for three autosomal models. Each row corresponds to a simulation of 500 trials of comparison between two siblings. The first row contains results from the original model in which gene segments don’t remain in place, the second row is from the two homologue model in which gene segments have a fixed position, and the third row is from the new model with two homologues but differing recombination rates for mothers and fathers. The lower and upper limits of the 95% confidence interval (CI) are shown on either side of the average. Within the constraints and assumptions of the particular model, there is 95% confidence that shared DNA between the two relatives would fall within that range. The column ‘0% Shared’ refers to the percentage of trial runs that result in relatives not sharing any DNA. This doesn’t occur for very close relations.
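For readers who want to produce the same kinds of columns from their own trial runs, here is one way to summarize a set of simulated shared-DNA fractions. It assumes the 95% range is taken from the 2.5th and 97.5th percentiles of the trials, which is only one reasonable reading of the description above, and the input values are made up for illustration:

```python
import numpy as np

def summarize(shares):
    """Summarize simulated shared-DNA fractions for one relationship."""
    shares = np.asarray(shares, dtype=float)
    # Assumption: the 95% range is the central 95% of trial results.
    lower, upper = np.percentile(shares, [2.5, 97.5])
    return {
        "average": shares.mean(),
        "std_dev": shares.std(ddof=1),
        "ci_lower": lower,
        "ci_upper": upper,
        "pct_zero_shared": 100.0 * np.mean(shares == 0),
    }

# Made-up values for illustration only, not simulation output.
print(summarize([0.0, 0.02, 0.03, 0.03, 0.05]))
```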
Figure 4. Comparison of three autosomal models as in Figure 3. For the new model, two different pairs of grandparents must be compared: paternal grandparents (P GP) and maternal grandparents (M GP). The new and improved model results in more deviation by every measure for grandparents. Even the maternal grandparent standard deviation is higher than that of the previous models.
Figure 5. Comparisons of three autosomal models as in the figures above. Statistics are for shared percentage of DNA between oneself and an aunt or an uncle. The third row is for paternal relatives and the fourth row is for maternal relatives. Contrasting with the results for grandparents, every measure in the new model for aunts or uncles has less deviation than the previous models. Also, the deviation for aunts and uncles is lower than for grandparents by every measure. The lesson must be that, the more times recombination occurs, the closer the results will be to the average.
Figure 6. Comparisons of autosomal models as in the figures above. Statistics are for shared percentages of DNA between oneself and a full 1st cousin. The third row is for 1st cousins who are children of one’s paternal uncle. The fourth row is for children of one’s paternal aunt, the fifth for children of one’s maternal uncle, and the sixth for children of one’s maternal aunt. A child of your paternal aunt should have the same results as a child of your maternal uncle, since from the perspective of the latter, you are the former.
Figure 7. Results for a half-aunt or a half-uncle. Comparison of three autosomal genetic models, as in the figures above. The third row is for a child of one’s paternal grandpa, but not paternal grandma. The fourth row is for a child of one’s paternal grandma, the fifth for a child of one’s maternal grandpa, and the sixth for a child of one’s maternal grandma.
Figure 8. Results for half 1st cousins. Half 1st cousins can result from eight different ancestor/relative combinations. However, some of these are equivalent to each other. For example, the child of your father’s half-sister (by paternal grandfather) should have the same simulation results as the child of your mother’s half-brother (by maternal grandfather), since from the perspective of the latter, you are the former. Those relatives correspond to rows four and seven in the chart, which is why the 95% confidence intervals are identical for those two rows. Similarly, the confidence intervals are the same for rows six and nine.
Figure 9. Results for great-grandparents. Comparisons of three autosomal genetic models. Ancestors compared are paternal paternal great-grandpas (PP G-GP), paternal paternal great-grandmas (PP G-GM), paternal maternal great-grandpas (PM G-GP, or father’s mother’s father), paternal maternal great-grandmas (PM G-GM), maternal paternal great-grandpas (MP G-GP), maternal paternal great-grandmas (MP G-GM), maternal maternal great-grandpas (MM G-GP), and maternal maternal great-grandmas (MM G-GM).
Figure 10. Results for great-aunts/great-uncles. PP great-aunt is your paternal grandfather’s sister, PM great-aunt is your paternal grandmother’s sister, MP great-aunt is your maternal grandfather’s sister, and MM great-aunt is your maternal grandmother’s sister.
Figure 11. Results for 1st cousins, once removed. The third row shows results for a child of a son of your paternal paternal great-grandparents. The fourth row is for the child of a daughter of your same ancestors. It continues like that, with a son and then a daughter of paternal maternal great-grandparents (your paternal grandmother’s parents), maternal paternal great-grandparents, and then maternal maternal great-grandparents. In model programming, I usually call these relatives ‘parents of 2nd cousins.’ That’s because the input number of generations needs to be 3 back from you in order to generate great-grandparents. From the perspective of a 1st cousin once removed, you are a descendant of their grandparents. From your perspective, they share one of your great-grandparent pairs.
Figure 12. Results for 2nd cousins. There are 16 different ways in which you could have a 2nd cousin in terms of the genders of your ancestors and theirs. The third row shows results for a 2nd cousin whose paternal paternal great-grandparents are your paternal paternal great-grandparents (you are both children of sons of sons of the same great-grandparents). As in Figure 6, some of these relationships are equivalent to each other. A good way to identify those rows is by looking for ones in which the 95% confidence intervals are equal.
Figure 12 shows the first relatives (tested here so far) with which you might not share any DNA. I believe that these figures are fairly accurate. I mentioned in a previous article that ‘0% shared DNA’ is the statistic that the simple, single homologue model probably gets wrong. The reason is that the model only uses an average of 97.5 segments available to pass from a parent to a child. The smallest amount of DNA that can be shared between two people is 1/98 segments (~1%), other than zero. In many cases, values that might have otherwise fallen between 0 and 1/98 will end up as 0 in the simple model. In the first two-homologue model, more than twice as many segments are available, allowing for better resolution when very little DNA is shared. In the newest two-homologue model, 3,400 segments are available (after adding the two homologues together). Using something like 20,000 segments would result in even better resolution, but probably only slightly.
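The resolution argument above comes down to one division per model. In this sketch, the total of 198 for the first two-homologue model is my shorthand for 99 segments on each of two homologues, and the function name is only illustrative:

```python
def smallest_nonzero_share(total_segments):
    """Smallest nonzero fraction of DNA that a model with this many segments can report."""
    return 1 / total_segments

for label, total in [("single homologue", 98),
                     ("first two-homologue (2 x 99)", 198),
                     ("newest two-homologue (2 x 1,700)", 3400)]:
    print(f"{label}: {100 * smallest_nonzero_share(total):.3f}%")
```

The smallest reportable share drops from about 1% in the simple model to about 0.03% in the newest one, which is why fewer small matches get rounded down to zero.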
Figure 13. Results for 3rd great-grandparents. As in Figures 6 and 12, some of these relationships are equivalent.
A follow-up post shows all of the results for the new and improved model without the previous model results.
Cover photo by Robin Kumar. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.