
You already know that you get 25% DNA, on average, from a particular grandparent. But we can do better than that.

See the relationship predictor that uses the most accurate ranges of shared DNA available.

Suppose that you and some friends want to place a bet. You each guess how much DNA you share with one of your grandparents, without any of you already knowing, and then you check how much you actually share on a genotyping platform. Your best guess is 25% in this scenario. If you guess 25% and your friends don’t, you’re the most likely person to win.

But consider a slightly different scenario. Let’s say that you and your friends each want to guess two numbers. One will be the amount of DNA you share with one maternal grandparent and the other will be the amount you share with the other maternal grandparent. The winner will be decided by the lowest sum of differences between what was chosen and what was actually shared with grandparents, but you’ll be allowed to match each guess to the actual value it’s closest to. For example, if you picked 24% and 26%, and you actually share 23% with one maternal grandparent and 27% with the other, you’ll compare the 23% to the 24% and the 27% to the 26%. You will have been off by a total of 2 percentage points in this scenario. Not bad.
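The scoring rule for this bet can be sketched in a few lines of Python. This is only an illustration: the function name `bet_score` is made up here, and it assumes that "assign the closest one to each" means you get whichever pairing of guesses to actual values gives the smaller total.

```python
def bet_score(guesses, actual):
    """Total miss, in percentage points, under the better of the two pairings.

    Assumption: the player may match guesses to grandparents in whichever
    order gives the lower sum of absolute differences.
    """
    g1, g2 = guesses
    a1, a2 = actual
    straight = abs(g1 - a1) + abs(g2 - a2)  # guess 1 vs. value 1, guess 2 vs. value 2
    swapped = abs(g1 - a2) + abs(g2 - a1)   # guess 1 vs. value 2, guess 2 vs. value 1
    return min(straight, swapped)


# The example from the text: guesses of 24% and 26%, actual shares of 23% and 27%.
print(bet_score((24, 26), (23, 27)))  # 2
# Guessing 25% twice against the same actual shares does worse.
print(bet_score((25, 25), (23, 27)))  # 4
```

A lower score is better, so in this one example the 24%/26% guess beats the 25%/25% guess.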

It turns out that a person who guesses two unequal numbers, such as 24% and 26%, is onto something. You don’t want to guess 25% and 25% in this new scenario.

At this point, if you don’t like math or statistics, you might just want to scroll down until you get near the tables, which will give you a nearly exact answer to the above problem as well as a few others.

Perhaps you’ve heard of the Monty Hall problem. You’re a contestant on a game show. You choose one of three doors (Door 1), hoping to win a car, but it isn’t opened yet. The host then opens Door 2 to show you a goat. You can then keep your first choice or switch to a different door. Many choose the door they originally picked, thinking that the probabilities are each 50%, but this is a bad choice. The host has selected one of the two bad doors for you when there were only two bad doors to begin with. The probability that you originally chose the right door was 1/3. The probability that the car was behind Door 2 or Door 3 was 2/3. Clearly, your chances would be better if you were allowed to choose Door 2 and Door 3 and win if the car is behind either. Once the host opens Door 2 and you see a goat, the probability that the car is behind Door 1 appears to go up to 1/2, but it actually doesn’t. Remember that the chance that the car is behind Door 2 or Door 3 is 2/3. And it isn’t behind Door 2. That means that there’s now a 2/3 chance that the car is behind Door 3! This is a case in which you can take great advantage of new information. (I actually really like goats and I don’t drive too often, so my odds of winning are 100%, as a contestant can always pick the door that the host opened if they just want a goat. Also, I wonder if the car was ever a Pontiac GTO.)
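If the 1/3 versus 2/3 split in the Monty Hall problem still feels wrong, a quick simulation can check it. The sketch below assumes the standard rules: the host always opens a goat door that you didn’t pick, choosing at random when there are two such doors. The function name, trial count and seed are arbitrary choices for illustration.

```python
import random


def monty_hall(trials=100_000, seed=0):
    """Return the fraction of games won by staying and by switching."""
    rng = random.Random(seed)
    stay_wins = 0
    switch_wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's first choice
        # The host opens a door that is neither the contestant's pick nor the car.
        opened = rng.choice([d for d in range(3) if d != pick and d != car])
        # Switching means taking the one remaining closed door.
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += pick == car
        switch_wins += switched == car
    return stay_wins / trials, switch_wins / trials


stay, switch = monty_hall()
print(f"stay: {stay:.3f}, switch: {switch:.3f}")
```

With enough trials, the stay rate should settle near 1/3 and the switch rate near 2/3.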

While the expected amount of DNA you share with any particular grandparent is 25%, it’s almost impossible for you to share exactly 25% with any particular grandparent. And the game is in your favor by allowing you to align your two guesses with the percentage that’s closest to each. It would be impossible for this rule to hurt your ability to approximate the right percentages and the only case in which it will have no benefit is when you choose two equal percentages, which I’ve already said is a bad choice.

Your percentages of DNA inherited from maternal grandparents add up to 50%, so knowing one value tells you the other. Once you’ve found the first value, the second is fully determined: it’s 50% minus the first. Now we only need to know how much variation there is for maternal grandparent/grandchild relationships. If there’s a lot of variation, then we would guess two numbers that are both pretty far away from 25% but average out to that value. If there isn’t much variation, we might pick something like 24.9% and 25.1%.

The amount of DNA shared for this relationship type forms some kind of statistical distribution in the population. If we had a statistical formula for that distribution, we could probably calculate the expected value for the pair of percentages, one being a and one being (50% – a). Unfortunately, we don’t have a formula for that distribution. It might be ok to assume that the distribution is shaped like a normal distribution. The mean of grandparent/grandchild relationships is 25% and the standard deviation for those on the maternal side is 3.33%. I think I might know how to solve the problem from there: Pick some small delta like 0.01 percentage points. Use formulas for Z-scores to find the probability that a value is between a – 0.01 and a + 0.01. Do the same for (50% – a). Multiply those two new formulas together and then take the derivative and set it equal to zero to find the maximum. I’m not sure if that sounds painful or fun.
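A simpler shortcut than the derivative approach described above is to ask for the expected lower and higher value of the pair. The sketch below takes that different route, and it relies on two assumptions: that the maternal distribution really is normal with the mean of 25% and standard deviation of 3.33% quoted above, and that the two grandparents’ shares add up to exactly 50%. For a normal distribution, the average distance from the mean is the standard deviation times the square root of 2/π, so the expected pair sits that far on either side of 25%. The function name `expected_pair` is hypothetical.

```python
import math

MEAN = 25.0         # percent, from the text
SD_MATERNAL = 3.33  # percent, from the text


def expected_pair(mean, sd):
    """Expected lower and higher share, assuming a normal distribution.

    For X ~ Normal(mean, sd), E|X - mean| = sd * sqrt(2 / pi).
    With the other grandparent at (2 * mean - X), the lower and higher
    values of the pair average to mean -/+ that distance.
    """
    half_gap = sd * math.sqrt(2 / math.pi)
    return mean - half_gap, mean + half_gap


low, high = expected_pair(MEAN, SD_MATERNAL)
print(f"{low:.1f}% and {high:.1f}%")  # 22.3% and 27.7%
```

Under these assumptions the result matches the maternal pair of 22.3% and 27.7% that the simulation reports further down.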

But I already have a tool that can easily solve this problem. It’s a simulation that calculated the cousin statistics found here. And these are the only shared percentage and shared centiMorgan tables that have been validated by peer-reviewed literature (standard deviations found in Veller et al., 2019 and 2020). I was able to get consistent results by running only 50,000 trials of the simulation. Results are shown in the tables below.

Figure 1. The pair of percentages of DNA that you most likely share with paternal and maternal grandparents. Please note that these values are reversible. The two columns of results could be switched with each other and the results would be just as true. Also, the value from one column could be switched with the value from another column of the same row without changing any other rows. I’m simply showing the lower value first and the higher value second, but they could be in any order.

There’s the answer to the question. If you’re taking bets with friends on how much DNA you share with your maternal grandparents, the best guess is that you share 22.3% with one of them and 27.7% with the other, nearly three percentage points away from the average. Since paternal relationships are more variable, you’d want to move even farther away from the mean for paternal grandparents: 21.8% shared with one and 28.2% with the other. Getting values close to these is far more likely than sharing close to 25% with each paternal grandparent.

Here are the results for some farther-back ancestors:

Figure 2. The pair of percentages of DNA that you most likely share with great-grandparents. PP = paternal, paternal; PM = paternal, maternal; etc.

Figure 3. The pair of percentages of DNA that you most likely share with 2nd great-grandparents.

Figure 4. The pair of percentages of DNA that you most likely share with 3rd great-grandparents.

Even disregarding that certain cousins may be more likely to have tested, one shouldn’t be surprised to have more DNA matches from certain parts of their tree than from others. Unequal numbers are more likely than equal numbers.

If you’re wondering how I did these calculations, consider that you have a dataset of grandparent/grandchild relationships. For each grandchild, you have the percentages shared with both grandparents from a particular side (paternal or maternal). You then put them into two columns: one always contains the minimum value for each pair and the other always contains the maximum value, in percent. Then you take the average of each column individually. You can probably tell that the column that has the minimum values will average less than 25% and the one with the maximum values will average more than 25%. You would only need one pair of unequal percentages for that to be true. That’s pretty much what I did. You could find the percentage configurations that I found if you had a very large, quality-controlled dataset.
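The two-column procedure can be sketched as follows. This is not the simulation used for the figures. As stand-in data, it draws one maternal grandparent’s share from a normal distribution with the 25% mean and 3.33% standard deviation mentioned earlier and sets the other to 50% minus that, which is purely an assumption for illustration. The function name `min_max_averages` is hypothetical.

```python
import random
from statistics import mean


def min_max_averages(pairs):
    """Average the lower values and the higher values of each pair separately."""
    lows = [min(a, b) for a, b in pairs]
    highs = [max(a, b) for a, b in pairs]
    return mean(lows), mean(highs)


# Stand-in dataset (assumption): normally distributed shares, not real genotype data.
rng = random.Random(1)
pairs = []
for _ in range(50_000):
    one_grandparent = rng.gauss(25.0, 3.33)
    pairs.append((one_grandparent, 50.0 - one_grandparent))

low_avg, high_avg = min_max_averages(pairs)
print(f"lower column: {low_avg:.2f}%, higher column: {high_avg:.2f}%")
```

With a real, quality-controlled dataset, `pairs` would hold the measured percentages for each grandchild, and the same function would apply.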

Maybe I’ll try to calculate some of these values using Z-scores. The simulation has already been validated by peer-reviewed standard deviations. If I get the same results, it would be evidence that shared DNA for at least some relationships can be approximated by the shape of a normal distribution. 

If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.