Using very simple math to get the most out of multiple kits
Many genetic genealogy enthusiasts have had their own DNA genotyped as well as that of some of their siblings. As the enthusiast of your family, you might have access to all of these kits. Or, if they’re all on GEDmatch.com, you can use these kits in tools whether or not you’re the manager. One tool worth taking advantage of at that site is Lazarus. However, people often find that too few of the necessary relatives have been genotyped to make a Lazarus kit successfully.
There’s another way to use multiple kits to great effect, and all it requires is a chromosome browser and a calculator. Below I’ll show you how.
I wrote an article a couple of years ago that shows you how much of an ancestor’s DNA you can expect to reproduce when combining the kits of multiple relatives. I even made a calculator that lets you combine various relatives. This concept of DNA coverage has become much more popular in the past few months. For the method I’m going to talk about below, you could combine kits for various relatives, but let’s start with the simple case of siblings. You have 50% of a parent’s DNA; but you and a sibling have, on average, 75% of a parent’s DNA; and three children have 87.5% of a parent’s DNA, on average. Every time you add a sibling, you add half as much percentage as you added last time. If there are enough children, you essentially have your parent’s DNA kit. Not only does the DNA coverage grow, but the range of coverage becomes narrower with each additional sibling. That last point will be important later.
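The halving pattern described above can be written compactly. The symbol $C_n$ (expected coverage of one parent's DNA from $n$ full siblings) is notation introduced here for illustration, and the formula gives only the average, not the range:

```latex
C_n = 1 - \left(\tfrac{1}{2}\right)^{n},
\qquad C_1 = 0.5,\quad C_2 = 0.75,\quad C_3 = 0.875,
\qquad C_n - C_{n-1} = \left(\tfrac{1}{2}\right)^{n}
```

The last identity is the "half as much as last time" rule: the second sibling adds 25 percentage points, the third adds 12.5, and so on.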
Let’s say that you and a sibling have a match and the amount of shared DNA suggests something like a third cousin or a half second cousin. You might be investigating your father’s side and you’ve ensured that none of the matching segments with this cousin came from your mother. One of the first things you probably did was to check the relationships in a shared centiMorgan (cM) or DNA percentage table. You might think that that’s all you need to do. However, you’re not using much more than the information that’s found in one kit at this point, even if you check both kits separately. You might be able to eliminate one of many potential relationships to the cousin. But imagine a scenario in which both kits share the same amount: something like 70 cM. Then checking a shared cM table for possible relationships for 70 cM isn’t going to be any more useful than it would be for one kit alone.
We already know that you and your sibling probably have about 75% of your father’s DNA when combined. How would you estimate how much DNA your father would have shared with this match? The math could hardly be simpler. If you invert the fraction 75/100, you get 100/75. Multiplying the percentage or centiMorgans of DNA you and your sibling share with this match by 100/75 would do the trick.
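As a minimal sketch of that multiplication, the snippet below divides the combined shared DNA by the expected coverage, which is the same as multiplying by the inverted fraction. The function names and the 150 cM input are hypothetical, chosen only for illustration.

```python
def sibling_coverage(n_siblings: int) -> float:
    """Expected fraction of one parent's DNA carried by n full siblings combined."""
    if n_siblings < 1:
        raise ValueError("need at least one sibling kit")
    return 1.0 - 0.5 ** n_siblings


def estimate_parent_shared(distinct_shared: float, n_siblings: int) -> float:
    """Scale the siblings' distinct shared DNA (cM or %) up to the parent's expected amount."""
    return distinct_shared / sibling_coverage(n_siblings)


# Hypothetical input: two siblings share 150 cM of distinct DNA with the match.
print(round(estimate_parent_shared(150.0, 2), 1))  # 200.0
```

The input must be the distinct (not double-counted) amount, and the result is an average expectation rather than a measurement of the parent's actual kit.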
But how do you know how much combined DNA you share with this cousin? You can’t just add the two amounts. By that logic, when calculating how much of your father’s DNA you have, you’d add 50% each time, in which case two siblings would have 100% of their father’s DNA and three siblings would have 150%. Instead, you have to count up the distinct DNA that you share and not double-count any of it. Getting the distinct segments isn’t hard, and we’ll see that this process is really worth it. If you’re only using AncestryDNA, which is the only large genotyping site without the essential feature known as the chromosome browser, then you’ll have to upload your DNA somewhere else in order to do this. The tools you need for this are totally free at GEDmatch.
The best-case scenario is that the DNA you and your sibling share with the applicable cousin is totally distinct. In that case, you and your sibling not only share more DNA with the cousin than you may have expected, but all you have to do is add up the total cM that you each share with the cousin. If the shared segment from one of you falls completely within a segment from the other, this is also very easy: just discard the smaller segment. Finally, if some of the segments partially overlap, you can estimate the length of the overlap in cM. For example, if you and your sibling each share a 100 cM segment with the cousin, and half of each segment appears to overlap with the other, the distinct amount of combined DNA that you share with the cousin is 150 cM: count the first 50 cM, then the overlapping 50 cM only once, and then the last 50 cM. It may not feel right to approximate how much is overlapping, but this allows you to use a method that achieves exceptional results.
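The three cases above (disjoint, nested, and partially overlapping segments) are all instances of merging intervals. The sketch below assumes each segment has already been expressed as a start and end position in cM along its chromosome. That is a simplification, since chromosome browsers typically report base-pair positions plus a cM length. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def distinct_shared_cm(segments):
    """Total cM covered by at least one segment, counting overlaps only once.

    segments: iterable of (chromosome, start_cm, end_cm) tuples pooled from
    every sibling kit, all on the same parental side.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))

    total = 0.0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or nested: extend the current merged segment.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current merged segment and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Two 100 cM segments on the same chromosome, half overlapping.
print(distinct_shared_cm([("7", 0, 100), ("7", 50, 150)]))  # 150.0
```

Segments on different chromosomes never merge, so they simply add, which matches the best-case scenario described above.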
Once you’ve added up all of the distinct segments and you have a total in cM or percentage, use the applicable fraction or multiple from the table below.
| # of Siblings, Including Yourself | Fraction | Multiple |
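The values for any number of siblings follow from the halving rule described earlier. The sketch below computes them under one assumption about the columns: that "Fraction" means the inverted coverage fraction (for example 100/75 = 4/3) and "Multiple" is the same number as a decimal. The function name is hypothetical.

```python
from fractions import Fraction


def multiplier_rows(max_siblings: int):
    """Return (n, expected coverage, inverted fraction) for 1..max_siblings."""
    rows = []
    for n in range(1, max_siblings + 1):
        coverage = 1 - Fraction(1, 2 ** n)  # expected share of the parent's DNA
        rows.append((n, coverage, 1 / coverage))
    return rows


for n, coverage, inverse in multiplier_rows(6):
    # Print the coverage as a percentage, then the exact fraction and its decimal form.
    print(f"{n} siblings: coverage {float(coverage):.2%}, "
          f"multiply by {inverse} = {float(inverse):.3f}")
```

With two siblings this gives 4/3, about 1.333, and with three it gives 8/7, about 1.143. The multiple approaches 1 as more siblings are added.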
For those of you who have studied statistics, the above method gives the expected value for the problem. If the solution isn’t intuitive, you may be relieved to hear that I have a way to test this method against other methods. Using a simulation that can’t get the averages wrong for the expected amount of shared DNA between relatives, and gets the ranges more accurately than probably anything else available, I can calculate how far off, on average, various methods are for estimating how much a parent would’ve shared with a particular cousin.
For the case below, I simulated three siblings and their half second cousin, on their father’s side, 100,000 times. In each simulation run, I calculated the actual shared DNA between the father and the cousin, estimated that amount with several methods using the three siblings’ kits, and took the difference between the actual and estimated values. Below are the results of those comparisons.
| Method | Average error (percentage points) | Average error (cM) |
|---|---|---|
| Recommended: distinct shared DNA scaled by expected coverage | 0.28 | 20 |
| Average of the three siblings, multiplied by two | 0.47 | 33.7 |
| Amount shared by the sibling who shares the most | 1.13 | 81 |
In the above table, the first row is for the recommended method based on the expected value, in the statistical sense, for how much DNA the father should share with the cousin. The second row shows how many percentage points, on average, one would be off if they used a method of averaging the shared DNA between the cousin and the three children, and then multiplied that by two in order to approximate the amount the father shared with the cousin. The third row shows the difference between actual and predicted for a method of using the amount of shared DNA for only the sibling who shares the most with the cousin and assuming that the father shares the same amount.
The father would be expected to share 3.125% of his DNA (224 cM at GEDmatch) with this cousin, and a child of his would be expected to share 1.56% (112 cM), on average. The recommended method overestimates as often as it underestimates, and it is the best by far: it’s off by 0.28 percentage points (20 cM), on average. The method of using the average shared DNA between the cousin and the three siblings is off by 0.47 percentage points (33.7 cM), on average. This method could be very useful if there’s absolutely no way to get the kits into a chromosome browser because it doesn’t require finding distinct segments. The method of assuming that the father shares the same amount of DNA as the child who shares the most DNA with the cousin does the worst. It’s off by 1.13 percentage points (81 cM), on average.
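The three methods compared above reduce to three short estimators. The sketch below is illustrative only: the function names and the input numbers are hypothetical, not output from the simulation.

```python
from statistics import mean


def estimate_from_distinct(distinct_cm: float, n_siblings: int) -> float:
    """Recommended method: distinct shared DNA divided by expected coverage."""
    return distinct_cm / (1 - 0.5 ** n_siblings)


def estimate_from_average(shared_cm: list) -> float:
    """Fallback without a chromosome browser: twice the siblings' average."""
    return 2 * mean(shared_cm)


def estimate_from_max(shared_cm: list) -> float:
    """Weakest method: assume the parent shares what the top sibling shares."""
    return max(shared_cm)


# Hypothetical kits: three siblings' totals with one cousin, and the
# distinct total after removing overlaps between their segments.
shared = [90.0, 120.0, 150.0]
distinct = 210.0

print(estimate_from_distinct(distinct, 3))  # 240.0
print(estimate_from_average(shared))        # 240.0
print(estimate_from_max(shared))            # 150.0
```

The maximum-based estimate can never exceed what a single child shares, which is one way to see why it tends to fall short of the father's true amount.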
Once you’ve approximated the amount of DNA your father shares with this cousin, the last step is to use a shared cM or DNA percentage chart to see the possible relationships they could have. There will be fewer possibilities this way than if you had just used one kit. For the most accurate results possible, including different ranges depending on the genders of all ancestors involved, use a shared DNA percentage table that matches the standard deviations in the peer-reviewed literature.
Apart from just combining kits for siblings, you could combine, say, your own kit and that of your aunt. For example, the two of you likely reproduce 62.5% of your grandparent’s DNA. That means that you can approximate how much DNA your cousin shares with a grandparent by multiplying the distinct segments that you and your aunt share with your cousin by 100/62.5 = 1.6.
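The 62.5% figure can be reproduced with a small calculation, sketched below under a stated assumption: each relative's share of the ancestor's DNA is inherited independently of the others. That holds for siblings, or for an aunt and her niece or nephew, but not for a parent and their own child. The function name is hypothetical.

```python
from math import prod


def combined_coverage(shares):
    """Expected fraction of an ancestor's DNA covered by several relatives.

    shares: each relative's expected fraction of the ancestor's DNA,
    e.g. 0.5 for a child and 0.25 for a grandchild.
    ASSUMPTION: the relatives inherited their shares independently.
    """
    # The chance that a given stretch is missing from every kit is the
    # product of the chances that it is missing from each kit.
    return 1 - prod(1 - s for s in shares)


aunt_and_you = combined_coverage([0.5, 0.25])
print(aunt_and_you)      # 0.625
print(1 / aunt_and_you)  # 1.6
```

Passing `[0.5, 0.5, 0.5]` gives 0.875, the same value as the three-sibling case.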
The method that’s recommended by statistics as well as backed up by simulation works great, but there’s a way to improve upon it a bit. It would require either kit phasing or for genotyping companies to finally be able to separate the paternal and maternal copies of each chromosome, although it would still take some work to tell which was which. In this case, one could use the shared paternal DNA between the siblings to calculate what actual percentage of a father’s DNA they have when combined. For example, if two siblings have 30% of their father’s DNA in common with each other, they must each have an additional 20% that the other lacks, meaning that their combined DNA only covers 70% (30% + 20% + 20%) of their father’s DNA, rather than the expected 75%. On the other hand, if they have 20% of their father’s DNA in common, they must each have an additional 30%, meaning that they’re fortunate enough to have a combined 80% (20% + 30% + 30%). If your mother has had her DNA genotyped, and you and your sibling have both phased your kits with hers in order to make new kits for the DNA you got from your father, this would be fairly easy and you can greatly improve upon the calculation.
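The adjustment described above amounts to replacing the expected 75% with the coverage actually observed. The sketch below handles the two-sibling case only. The function names and the 150 cM input are hypothetical, and it assumes the paternal sharing between the siblings has already been measured from phased kits.

```python
def observed_coverage(shared_paternal: float) -> float:
    """Fraction of the father's DNA that two siblings cover together.

    shared_paternal: fraction of the father's DNA that both siblings
    inherited (0.25 on average). Each sibling carries 0.5 in total, so
    the combined coverage is 0.5 + 0.5 minus the part counted twice.
    """
    if not 0.0 <= shared_paternal <= 0.5:
        raise ValueError("shared_paternal must be between 0 and 0.5")
    return 1.0 - shared_paternal


def estimate_father_shared(distinct_cm: float, shared_paternal: float) -> float:
    """Scale distinct shared cM by observed rather than expected coverage."""
    return distinct_cm / observed_coverage(shared_paternal)


print(round(observed_coverage(0.30), 2))              # 0.7
print(round(observed_coverage(0.20), 2))              # 0.8
print(round(estimate_father_shared(150.0, 0.30), 1))  # 214.3
```

With the average sharing of 0.25, the function returns 0.75, so the adjusted estimate matches the unadjusted one in the typical case.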
Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.