Using very simple math to get the most out of multiple kits

Many genetic genealogy enthusiasts have had their own DNA genotyped, as well as that of some of their siblings. As the enthusiast of your family, you might have access to all of these kits. Or, if they’re all on GEDmatch.com, then you can use these kits in tools whether or not you’re the manager. One thing to definitely take advantage of at that site is the Lazarus tool. However, people often find that not enough of the necessary relatives have had their DNA genotyped to successfully make a Lazarus kit.

There’s another way to use multiple kits to great effect, and all it requires is a chromosome browser and a calculator. Below I’ll show you how.

I wrote an article a couple of years ago that shows how much of an ancestor’s DNA you can expect to reproduce when combining the kits of multiple relatives. I even made a calculator that lets you combine various relatives. This concept of DNA coverage has become much more popular in the past few months. For the method I’m going to talk about below, you could combine kits for various relatives, but let’s start with the simple case of siblings. You have 50% of a parent’s DNA; you and a sibling have, on average, 75% of a parent’s DNA; and three siblings have 87.5%, on average. Every additional sibling adds, on average, half as many percentage points as the previous one did. If there are enough children, you essentially have your parent’s DNA kit. Not only does the DNA coverage grow, but the range of coverage becomes narrower with each additional sibling. That last point will be important later.
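The halving pattern described above can be sketched in a few lines of Python. The function name `expected_coverage` is just a label chosen for this illustration, and the result is the average only; real sibling groups will land somewhat above or below it.

```python
def expected_coverage(n_siblings: int) -> float:
    """Expected fraction of one parent's DNA covered by n full siblings combined.

    Each additional sibling is expected to fill in half of what is still
    missing, which gives 1 - (1/2)^n.
    """
    if n_siblings < 0:
        raise ValueError("n_siblings must be non-negative")
    return 1.0 - 0.5 ** n_siblings


# Print the average coverage for one to six siblings.
for n in range(1, 7):
    print(f"{n} sibling(s): {expected_coverage(n):.2%}")
```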

Let’s say that you and a sibling have a match and the amount of shared DNA suggests something like a third cousin or a half second cousin. You might be investigating your father’s side and you’ve ensured that none of the matching segments with this cousin came from your mother. One of the first things you probably did was to check the centiMorgan (cM) value in a relationship predictor. You might think that that’s all you need to do. However, at this point you’re not using much more than the information that’s found in one kit, even if you check each kit separately. You might be able to eliminate one of many potential relationships to the cousin. But imagine a scenario in which every kit shares the same amount: something like 70 cM. Then checking a shared cM table for the possible relationships at 70 cM tells you no more than checking it for a single kit.

We already know that you and your sibling probably have about 75% of your father’s DNA when combined. How would you estimate how much DNA your father would have shared with this match? The math could hardly be simpler. If you invert the fraction 75/100, you get 100/75. Multiplying the percentage or centiMorgans of DNA you and your sibling share with this match by 100/75 would do the trick.

But how do you know how much combined DNA you share with this cousin? You can’t just add the two amounts. If you could, then when calculating how much of your father’s DNA you have, you’d just add 50% for each sibling, in which case two siblings would have 100% of their father’s DNA and three siblings would have 150%. Instead, you have to count up the distinct DNA that you share and not double-count any of it. Getting the distinct segments isn’t hard, and we’ll see that this process is really worth it. If you’re only using AncestryDNA, which is the only large genotyping site without the essential feature known as the chromosome browser, then you’ll have to upload your DNA somewhere else in order to do this. The tools you need for this are totally free at GEDmatch.

The best case scenario is that the DNA you and your sibling share with the applicable cousin is totally distinct. In that case, you and your sibling not only share more DNA with the cousin than you may have expected, but all you have to do is add up the total cM that you each share with the cousin. If the shared segment from one of you can be found completely within a segment from the other, this is also very easy. All you have to do is discard the smaller segment. Finally, if some of the segments partially overlap, one thing you could do is estimate the length of the overlap in cM. For example, if you and your sibling both share a 100 cM segment with the cousin, and half of that appears to overlap, the distinct amount of combined DNA that you share with the cousin is 150 cM—count the first 50 cM, then the overlapping 50 cM only once, and then the last 50 cM. Even better, you could use Jonny Perl’s CM Estimator to find a very accurate number of cM by the start and stop positions. Even better than that, you could use his new tool, where you copy and paste all sets of segments from multiple family members and it will give you the total distinct cM.
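The three cases above (distinct, nested, and partially overlapping segments) are all handled by merging overlapping intervals. Here is a minimal sketch, assuming each segment is already expressed as a start and end position in cM on its chromosome. That is an assumption: match lists usually report base-pair positions, which would first need converting with a genetic map. The function name `distinct_cm` and the example segments are hypothetical.

```python
from collections import defaultdict


def distinct_cm(segments):
    """Total cM covered by a set of segments, counting overlaps only once.

    segments: iterable of (chromosome, start_cm, end_cm) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))

    total = 0.0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or nested: extend the current merged segment.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current merged segment and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Two 100 cM segments on the same chromosome that overlap by 50 cM.
you = [("7", 0.0, 100.0)]
sibling = [("7", 50.0, 150.0)]
print(distinct_cm(you + sibling))  # 150.0
```

Only segments on the same chromosome are merged, so segments on different chromosomes simply add up.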

Once you’ve added up all of the distinct segments and you have a total in cM or percentage, use the applicable fraction or multiple from the table below.

| # of Siblings, Including Yourself | Fraction | Multiple |
|---|---|---|
| 2 | 100/75 | 1.33 |
| 3 | 100/87.5 | 1.14 |
| 4 | 100/93.75 | 1.07 |
| 5 | 100/96.9 | 1.03 |
| 6 | 100/98.4 | 1.02 |
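Applying the multiple is a single division. As a small sketch, with `estimate_parent_shared` as a made-up name and 150 cM as a hypothetical combined total for two siblings:

```python
def estimate_parent_shared(distinct_cm: float, n_siblings: int) -> float:
    """Estimate how much a parent would share with a match.

    distinct_cm: total distinct cM the siblings share with the match.
    n_siblings: number of full siblings whose kits were combined.
    """
    if n_siblings < 1:
        raise ValueError("need at least one sibling")
    coverage = 1.0 - 0.5 ** n_siblings  # 0.75, 0.875, 0.9375, ...
    return distinct_cm / coverage       # same as multiplying by 100/75, etc.


# Two siblings with 150 distinct cM: 150 * 100/75
print(estimate_parent_shared(150.0, 2))  # 200.0
```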

For those of you who have studied statistics, the above method gives the expected value to the problem. If the solution isn’t intuitive, you may be relieved to hear that I have a way to test this method against other methods. Using a simulation that can’t get the averages wrong for the expected amount of shared DNA between relatives, and gets the ranges more accurately than probably anything else available, I can calculate how far off, on average, various methods are for estimating how much a parent would’ve shared with a particular cousin.

For the case below, I simulated three siblings and their half second cousin, on their father’s side, 100,000 times. In each simulation run, I calculated the DNA actually shared between the father and the cousin, estimated that amount with each method using the three siblings’ kits, and took the difference between the actual and estimated values. Below are the results of those comparisons.

In the above table, the first row is for the recommended method based on the expected value, in the statistical sense, for how much DNA the father should share with the cousin. The second row shows how many percentage points, on average, one would be off if they used a method of averaging the shared DNA between the cousin and the three children, and then multiplied that by two in order to approximate the amount the father shared with the cousin. The third row shows the difference between actual and predicted for a method of using the amount of shared DNA for only the sibling who shares the most with the cousin and assuming that the father shares the same amount.

It would be expected that the father shares 3.125% DNA (224 cM at GEDmatch) with this cousin. A child of his would be expected to share 1.56% DNA (112 cM), on average, and it overestimates as often as it underestimates. The recommended method is the best by far. It’s off by 0.28 percentage points (20 cM), on average. The method of using the average shared DNA between the cousin and the three siblings is off by 0.47 percentage points (33.7 cM), on average. This method could be very useful if there’s absolutely no way to get the kits into a chromosome browser because it doesn’t require finding distinct segments. The method of assuming that the father shares the same amount of DNA as the child who shares the most DNA with the cousin does the worst. It’s off by 1.13 percentage points (81 cM), on average.
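To make the three methods concrete, here is a short sketch that computes each estimate for one set of three siblings. All input numbers are hypothetical and are not taken from the simulation, and the function names are labels chosen for this example.

```python
from statistics import mean


def expected_value_estimate(distinct_cm, n_siblings):
    """Recommended method: scale the distinct shared cM by the expected coverage."""
    coverage = 1.0 - 0.5 ** n_siblings  # 0.875 for three siblings
    return distinct_cm / coverage


def doubled_average_estimate(shared_per_sibling):
    """Average the siblings' individual totals and multiply by two."""
    return 2.0 * mean(shared_per_sibling)


def top_sibling_estimate(shared_per_sibling):
    """Assume the father shares what the top-sharing sibling shares."""
    return max(shared_per_sibling)


# Hypothetical inputs: cM each sibling shares with the cousin, and the
# distinct cM the three of them share with the cousin when combined.
shared_per_sibling = [90.0, 120.0, 140.0]
combined_distinct_cm = 190.0

estimates = {
    "expected value": expected_value_estimate(combined_distinct_cm, 3),
    "doubled average": doubled_average_estimate(shared_per_sibling),
    "top sibling": top_sibling_estimate(shared_per_sibling),
}
for name, value in estimates.items():
    print(f"{name}: {value:.1f} cM")
```

Only the first method needs the distinct segments; the other two work from each sibling's total alone.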

Once you’ve approximated the amount of DNA your father shares with this cousin, the last step is to use a relationship predictor to see the possible relationships they could have.

Apart from just combining kits for siblings, you could combine, say, your own kit and that of your aunt. For example, the two of you likely reproduce 62.5% of your grandparent’s DNA. That means that you can approximate how much DNA your cousin shares with a grandparent by multiplying the distinct segments that you and your aunt share with your cousin by 100/62.5 = 1.6. You can get many of the multiples to use from this article. Also, this calculator can be used to get the multiple for many other scenarios. (I have to clear my browser history to get it to work. I do it in a web browser I don’t normally use so I don’t lose my browsing history.)

The method that’s recommended by statistics as well as backed up by simulation works great, but there’s a way to improve upon it a bit, although it would require you to first figure out which of your applicable chromosome copies are maternal and which are paternal. If you can do that, you could use the shared paternal DNA between siblings to calculate what actual percentage of a father’s DNA they have when combined. For example, if two siblings share 60% of their paternal DNA with each other, that’s 30% of their total genome. We know that a child shares 50% DNA with their father, so each sibling must have an additional 20 percentage points of shared DNA with their father. That means that their combined DNA only covers 70% (from adding: 30% + 20% + 20%) of their father’s DNA, rather than the expected 75%.

That was a case where siblings shared more paternal DNA than average, so their combined paternal DNA was less than average. On the other hand, if two siblings share 40% of their paternal DNA with each other, that’s 20% of their total genome. Then they must each share an additional 30 percentage points with their father, meaning that they’re fortunate enough to have a combined 80% (from adding: 20% + 30% + 30%).

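In agreement with what we know from mathematical set theory, the formula for the amount of paternal or maternal DNA covered by two siblings is as follows:

$$\mathrm{Comb}_{1,2} = 50\% + 50\% - \mathrm{Share}_{1,2} = 100\% - \mathrm{Share}_{1,2}$$

where $\mathrm{Comb}_{1,2}$ is the percentage of combined DNA from siblings 1 & 2 and $\mathrm{Share}_{1,2}$ is the paternal or maternal DNA shared between siblings 1 & 2, expressed as a percentage of the total genome (30% and 20% in the two examples above).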
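The two worked examples (60% shared giving 70% coverage, and 40% shared giving 80%) can be checked with a short sketch. The function names are hypothetical, and the input is assumed to be the percentage of the paternal (or maternal) copy that the two siblings share, as in those examples.

```python
def combined_coverage_two(shared_pct_of_parental_copy: float) -> float:
    """Percent of a parent's DNA covered by two siblings combined.

    shared_pct_of_parental_copy: percent of the DNA inherited from that
    parent that the two siblings share with each other (60 in the first
    example, 40 in the second).
    """
    # Half of the shared parental copy is the share of the total genome.
    shared_pct_of_genome = shared_pct_of_parental_copy / 2.0
    # Each sibling has 50%; the shared part must be counted only once.
    return 50.0 + 50.0 - shared_pct_of_genome


def multiple_from_coverage(coverage_pct: float) -> float:
    """Multiple to apply to the siblings' distinct shared cM."""
    return 100.0 / coverage_pct


print(combined_coverage_two(60.0))  # 70.0
print(combined_coverage_two(40.0))  # 80.0
print(round(multiple_from_coverage(combined_coverage_two(60.0)), 2))  # 1.43
```

With average sharing (50% of the parental copy), the result is the familiar 75% and a multiple of 1.33.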

How would you do this if three siblings had their DNA tested? You can use the following formula for that.

The amount of combined paternal or maternal DNA in three siblings based on the amount they share with each other
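As a sketch, the three-sibling case can be written out with the inclusion-exclusion principle from set theory. The symbols below are labels chosen for this sketch: each Share is the DNA from the one parent in question that the named siblings share, expressed as a percentage of the total genome (25% per pair on average), and Share for all three siblings is the part shared by all of them at once.

```latex
\begin{align*}
\mathrm{Comb}_{1,2,3}
  &= 50\% + 50\% + 50\%
     - \mathrm{Share}_{1,2} - \mathrm{Share}_{1,3} - \mathrm{Share}_{2,3}
     + \mathrm{Share}_{1,2,3} \\[4pt]
\mathrm{Share}_{1,2,3}
  &= \tfrac{1}{2}\left(\mathrm{Share}_{1,2} + \mathrm{Share}_{1,3} + \mathrm{Share}_{2,3} - 50\%\right) \\[4pt]
\mathrm{Comb}_{1,2,3}
  &= 125\% - \tfrac{1}{2}\left(\mathrm{Share}_{1,2} + \mathrm{Share}_{1,3} + \mathrm{Share}_{2,3}\right)
\end{align*}
```

The second line holds because a parent has only two copies of each chromosome, so at any position either all three siblings received the same copy or exactly one pair did. That lets the three-way term be written using only the pairwise amounts. With the average 25% for each pair, the last line gives 125% − 37.5% = 87.5%, the expected coverage for three siblings.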

Unfortunately, three siblings is as high as we’re going to go for this little extra improvement upon our calculation. You’ll see why in the figure below from this document. Even if you figured out which parts of the diagram to add to and subtract from each other, which would make up a great many terms, making all of those comparisons between siblings would be much too arduous to be worth your time. It’s no matter, though. As the number of siblings goes up, this method produces very similar results to the theoretical amount of reproduced parental DNA (75%, 87.5%, 93.75%, etc.), which can be used to find the appropriate multiple.

Diagram showing the level of difficulty for finding the universal set for four or more sets

But we can easily make this calculation for two or three siblings. If your mother has her DNA genotyped, and you and your siblings have phased your kits with her in order to make a new kit for the DNA you got from your father, this would be fairly easy and you can improve further upon what was the best method to find the amount of DNA your father would share with a particular cousin. However, this method should be treated with caution. Even a kit phased with a parent’s doesn’t necessarily include all of the SNPs for the other parent due to genotyping errors. For this reason, it might be best to check the method of 100/87.5 as well as using the multiple that you find based on the shared DNA between siblings.

But if you have half-siblings tested then you’re really lucky. I can compare my own DNA against four of my half-siblings. The calculation can be very simple in this case. If you share 27% DNA with a half-sibling, the two of you combined have 100% – 27% = 73% of the DNA of your applicable parent, rather than the expected value of 75%.

Another check that you can do is to make sure that the predicted relationships from whichever method you use are also possible for each sibling-cousin match in the relationship predictor.

And, of course, this doesn’t only work for paternal matches. You could use the same methods to find out how much your mother would share with a cousin if she doesn’t have her DNA tested. And you could improve upon that if your father has tested and you can phase your kit to find your maternal DNA.

This article was part 1 of a series. If you want to keep reading about the new science on this subject:

Part 2: Testing the methods

Part 3: Testing the methods with empirical data
