Comparing kits to a known relative when multiple siblings have tested
A year ago I wrote an article, Part 1, about the theoretical best way to use the kits of multiple siblings to help determine relationships of your DNA matches.
More recently, in Part 2, I tested that method along with several others. The method that had been used the most turned out to be the worst-performing method while the theoretical method turned out to be the best.
In case there is any remaining doubt about how to take advantage of multiple siblings’ kits, I’ll do the same analysis with empirical data. I’m lucky enough that all five of my maternal siblings and I have had our DNA tested. The best-performing methods in Part 2 require the use of a chromosome browser, so it’s also fortunate that we all tested at 23andMe. That will allow me to analyze the performance of several methods of combining our DNA kits and comparing them to a known cousin.
But first, I’d like to make one acknowledgment. Jonny Perl has developed two amazing tools that were essential for this analysis. I entered segment-level DNA data into his distinct segment generator and common segment generator over a hundred times during the course of this study and I have no doubt that all of the results they gave me were accurate. Calculating the amount of DNA that three or more siblings share with each other or the distinct segments that siblings share with a cousin would have taken several months if I had to check the start and stop points of each segment. And, if these tools weren’t available, I would’ve definitely used Jonny’s cM estimator to make the hard way easier.
I discussed how to compute the values for the best-performing method in Part 1 and Part 2, but I’ll go over them again here as a refresher. There are two values to find for the top-performing method in this study: the total amount of DNA shared from distinct segments with a cousin and the total amount of shared maternal DNA between two or three siblings. The distinct segment information is also needed for the other two best-performing methods.
To find the number of distinct segments that you and a sibling share with a cousin (“Distinct Segments” in Table 1), you can copy and paste the two sets of segments from a chromosome browser into the distinct segment generator. For three siblings, you can copy and paste the three sets of segments from the chromosome browser, each one containing the segments that you or one of your siblings shares with the cousin.
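For readers who want to see what “distinct segments” means in practice, here is a minimal Python sketch of the idea: overlapping segments from the siblings’ comparisons to the cousin are merged so that no stretch of DNA is counted twice. It is not a description of how the distinct segment generator works internally. The (chromosome, start, end) tuple format and all of the coordinates are hypothetical, and the sketch works in base-pair positions only; converting the merged segments to cM requires a genetic map, which the tool handles and this sketch does not.

```python
from collections import defaultdict


def distinct_segments(*segment_sets):
    """Merge overlapping segments from any number of sibling-to-cousin comparisons.

    Each segment is a (chromosome, start, end) tuple with base-pair positions.
    Returns a sorted list of non-overlapping (chromosome, start, end) tuples.
    """
    by_chrom = defaultdict(list)
    for segments in segment_sets:
        for chrom, start, end in segments:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the segment being built: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical segments that two siblings each share with the same cousin.
sib1 = [(1, 10_000_000, 30_000_000), (5, 50_000_000, 70_000_000)]
sib2 = [(1, 20_000_000, 45_000_000), (7, 5_000_000, 15_000_000)]

result = distinct_segments(sib1, sib2)
print(result)
```

In this made-up example the two overlapping chromosome 1 segments collapse into a single segment from 10,000,000 to 45,000,000, while the segments on chromosomes 5 and 7 pass through unchanged. A third sibling’s segments can be passed as a third argument.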
If you have maternal half-siblings, the total amount of DNA you share is the same as the maternal amount of DNA you share. But, for a full-sibling, you have to remove the amount that you share on your father’s side. The easiest way to do this, if you have your father tested, would be to make a phased kit at GEDmatch and then compare the maternal kits of you and your sibling. You can then subtract the amount you share from 100% or 7,174 cM.
Finding the amount of maternal DNA that three siblings share (“Sibs. Shared Maternal” in Table 1) is harder, but can be made much easier by using the common segment generator if at least one of the siblings is a half-sibling to the other two. That’s how I did the analysis. The common segment generator tool only has two input boxes, but that’s all you need. In one case, Sib1 was a half-sibling to Sib2 and Sib3 and Sib2 and Sib3 were full-siblings. So I put the shared segments between Sib1 and Sib2 in one box and the shared segments between Sib1 and Sib3 in the other box. You could also compare phased kits for full-siblings and then copy and paste the segment information into the common segment generator.
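The two-box approach above amounts to finding where two sets of segments overlap. Here is a minimal Python sketch of that idea, again assuming a hypothetical (chromosome, start, end) format in base pairs with made-up coordinates. It illustrates the concept only and is not the common segment generator’s own code.

```python
def common_segments(set_a, set_b):
    """Return the overlapping regions between two lists of segments.

    Each segment is a (chromosome, start, end) tuple with base-pair positions.
    """
    overlaps = []
    for chrom_a, start_a, end_a in set_a:
        for chrom_b, start_b, end_b in set_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # keep only true overlaps
                overlaps.append((chrom_a, start, end))
    return sorted(overlaps)


# Hypothetical input: Sib1 is a half-sibling to both Sib2 and Sib3.
sib1_vs_sib2 = [(1, 0, 40_000_000), (2, 10_000_000, 60_000_000)]
sib1_vs_sib3 = [(1, 25_000_000, 80_000_000), (3, 0, 20_000_000)]

shared_by_all_three = common_segments(sib1_vs_sib2, sib1_vs_sib3)
print(shared_by_all_three)
```

Because Sib1 shares only maternal DNA with each of the other two, any region that appears in both comparisons is a region where all three siblings carry the same maternal DNA. In the made-up example that is the single stretch of chromosome 1 from 25,000,000 to 40,000,000.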
Once you’ve found the amount of maternal shared DNA between siblings, the total amount of your mother’s DNA that two or three siblings carry between them (“CM or % of Mother’s DNA” in Table 1) can be found by using the formulas given in Part 1.
Table 1 shows one example of a sibling trio and how all of the needed values were calculated.
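The Part 1 formulas aren’t reproduced in this article, so the following Python sketch rests on an assumption: that they follow the standard inclusion-exclusion principle, with each child carrying 50% of the mother’s DNA and 7,174 cM standing for 100%, as mentioned above. The function names are hypothetical. For two siblings this reduces to 100% minus the maternal DNA they share.

```python
TOTAL_CM = 7174.0  # the figure this article uses for 100%


def coverage_two(shared_12_cm, total_cm=TOTAL_CM):
    """Fraction of the mother's DNA carried by two siblings combined.

    Each sibling carries half of the mother's DNA; the maternal DNA they
    share would otherwise be counted twice, so it is subtracted once.
    """
    return 0.5 + 0.5 - shared_12_cm / total_cm


def coverage_three(shared_12_cm, shared_13_cm, shared_23_cm,
                   shared_all_cm, total_cm=TOTAL_CM):
    """Fraction of the mother's DNA carried by three siblings combined.

    Inclusion-exclusion: add the three halves, subtract each pairwise
    overlap, then add back the DNA shared by all three.
    """
    pairwise = (shared_12_cm + shared_13_cm + shared_23_cm) / total_cm
    return 1.5 - pairwise + shared_all_cm / total_cm


# Sanity check with theoretical averages rather than real data:
# siblings share 25% maternally, and a trio shares 12.5% in common.
quarter = 0.25 * TOTAL_CM
eighth = 0.125 * TOTAL_CM
print(coverage_two(quarter))                              # 0.75
print(coverage_three(quarter, quarter, quarter, eighth))  # 0.875
```

With the theoretical averages as input, the sketch returns 75% and 87.5%, the same coverage figures that appear in the theoretical multiples 100/75 and 100/87.5. Real sibling groups will land above or below those values depending on how much maternal DNA they actually share.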
Table 1. Methods of combining three siblings’ kits and their results in the case when a DNA match is actually their mother’s 2nd cousin once removed (2C1R). The methods are ordered from worst-performing at the top to best-performing at the bottom. Measurement units are differences, in percentage points, from the 2C1R group probability based on the mother’s and 2C1R’s kits. Methods that require using the siblings’ kits as a proxy for the mother, and thus use 2C1R group probabilities, are highlighted in purple. Labels with white highlighting indicate methods that use the probabilities for the 3C group. Rows in grey are only used to calculate values for the method in the bottom row, which is the top-performing method.
If you don’t have maternal half-siblings, then it might not be practical for you to find out how much maternal DNA you share. (Or for paternal DNA, if you don’t have paternal half-siblings.) But I had to do this in order to further test the methods that I’ve already tested in Part 2, now with entirely empirical data. Plus, I enjoy projects like this. Luckily, the “distinct segments times the theoretical multiple” method is very easy to use and also very accurate, so it’s OK if you can’t find the amount of DNA that full-siblings share with only one parent.
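As a quick illustration of the “distinct segments times the theoretical multiple” method, here is a minimal Python sketch. The multiples for two and three siblings (100/75 and 100/87.5) are the ones used in this series. Extending them to more siblings with the expression 1 − 0.5^n is my assumption about the general pattern, and the function names and cM inputs are hypothetical.

```python
def expected_coverage(num_children):
    """Average fraction of a parent's DNA carried by n children combined."""
    return 1 - 0.5 ** num_children


def scaled_cm(distinct_cm, num_children):
    """Scale the siblings' distinct-segment total up by the theoretical multiple."""
    return distinct_cm / expected_coverage(num_children)


# Hypothetical distinct-segment totals, not values from this study.
print(expected_coverage(2), expected_coverage(3))  # 0.75 0.875
print(scaled_cm(120.0, 2))  # 160.0  (120 cM x 100/75)
print(scaled_cm(105.0, 3))  # 120.0  (105 cM x 100/87.5)
```

The scaled value is then used as an estimate of what the parent would share with the cousin, so it is looked up against the probabilities for the parent’s relationship rather than the children’s.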
There are 15 possible pairs among the six maternal siblings in my family, including me, all with DNA at 23andMe. I tested several methods of approximating the DNA my mother shares with her 2nd cousin once removed (2C1R), one pair of siblings at a time. Then I calculated the difference between the probability that this cousin is indeed my mother’s 2C1R and the probability given by each method of combining siblings’ kits. Finally, I averaged the differences for each method, which gives the average number of percentage points by which each method’s prediction is off.
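The scoring step can be sketched in a few lines of Python. This is only an illustration: the function name and the numbers are made up, and I’m assuming here that the differences are taken as absolute values before averaging, so that overestimates and underestimates don’t cancel each other out.

```python
def average_points_off(reference_probability, method_probabilities):
    """Mean absolute difference, in percentage points, between the reference
    probability (from the mother's own kit) and a method's probabilities,
    one per sibling combination."""
    diffs = [abs(p - reference_probability) for p in method_probabilities]
    return sum(diffs) / len(diffs)


# Made-up probabilities in percent, one per sibling pair; not study data.
reference = 60.0
one_method = [55.0, 70.0, 60.0]
print(average_points_off(reference, one_method))  # 5.0
```

Running this once per method gives one number per method, and lower is better.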
In a way, there is only one data point studied here—that of my mother’s match to her 2C1R. But, if we assume that her match is fairly reasonable, we have 15 data points within that one data point that can shed light on the variability of shared DNA that my mother’s children have with her 2C1R. And her shared DNA is fairly reasonable, albeit a bit high. My mother shares 151 centiMorgans (cMs) with this cousin. For these reasons, the 1,000 data points in Part 2 are more trustworthy than the 1 or 15 data points here.
So how do the methods perform for sibling pairs with empirical data?
Table 2. Averaged results for methods of combining two siblings’ kits and their results in the case when a DNA match is actually their mother’s 2nd cousin once removed (2C1R). Methods, measurements, units, and highlighting are the same as in Table 1. Lower values are better.
The methods rank in almost exactly the same order as they did in Table 6 of Part 2, which was for a parent’s 2C rather than 2C1R. The only difference is that the “average of children’s kits” and “distinct segments” methods have swapped places. The “average” method has done significantly worse in this one case than it did in Part 2. These ranks are also very similar to those in Table 9 of Part 2, which was for a parent’s 3C. Here, all of the methods that attempt to approximate the parent’s relationship to the cousin perform better than the ones that don’t. Still, the results in Part 2 are more trustworthy, as 1,000 data points were used in that analysis. Here, in Part 3, we have 15 data points from only one cousin, but these results add further validation to Part 2 by the fact that limited empirical data show the methods ranked in almost the exact same order as in Part 2.
We’ve seen how the methods perform for two children. So what’s next?
How Much Data Will We Have?
Before we examine the performance of these methods for trios of children, I want to make a brief diversion into mathematics because it’s fun. Also, it might answer the rare questions about how much data I was able to use and why. If you really don’t like math, you could skip this section.
One reason I’d like to know how many combinations of siblings’ kits I can use is to make sure I’m not missing any in my analysis. Indeed, I initially missed one for the combination of four children’s kits before calculating the number of combinations I should’ve had. In this case the order in which I list a sibling trio doesn’t matter. For example, the kits of Art, George, and Paul contain the same information when combined as those of Paul, Art, and George. Since the order doesn’t matter, we want to use the formula for combinations rather than permutations.
The below formula shows how to calculate the number of possible trios for six siblings’ kits.
Figure 1. How to find the number of combinations of three that are possible from six total DNA kits.
“Tot. sibs.” is the total number of siblings in the family, in this case 6. “Grp. amt.” is the number of siblings in a group, in this case 3. And the “!” refers to the handy factorial operator. You could also do this with pairs out of the six siblings and you’d see that you get 15, which is what I had in the Sibling Pairs section.
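Written out for trios, the standard combinations formula is 6! / (3! × (6 − 3)!) = 720 / (6 × 6) = 20. Here is a short Python sketch that applies the same formula to every group size from two to six siblings. The function name is mine, and the check against math.comb (available in Python 3.8 and later) is just a safeguard.

```python
from math import comb, factorial


def sibling_combinations(total_sibs, group_size):
    """Number of distinct groups of `group_size` from `total_sibs` kits,
    where the order of the siblings within a group doesn't matter."""
    return factorial(total_sibs) // (
        factorial(group_size) * factorial(total_sibs - group_size)
    )


for group_size in range(2, 7):
    count = sibling_combinations(6, group_size)
    assert count == comb(6, group_size)  # cross-check with the standard library
    print(group_size, count)
# Prints:
# 2 15
# 3 20
# 4 15
# 5 6
# 6 1
```

The counts rise to a peak at trios and fall symmetrically afterwards, which is why pairs and groups of four give the same number.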
So, when I analyze the kits of three children combined out of six to approximate a parent’s shared DNA with a cousin, I have 20 different ways to test the methods for three children’s kits. This would be quite easy if all of the children were half-siblings to each other. It’s a lot more work when they’re all full-siblings to each other. That would be fine if I only had three total kits and had to do it once. But it’s also not so bad to do it for a trio that includes one full-sibling pair and two half-sibling relationships.
With six siblings, the number of possible combinations peaks at trios (20), and so does the number of combinations I’ll be using (16). For my maternal siblings, there are 16 sibling trios that include one or fewer full-sibling pairs. There are actually zero trios that include only half-sibling relationships: all 16 include one full-sibling pair. This is doable.
Let’s see how the methods perform when using the kits of three children.
Table 3. Averaged results for methods of combining three siblings’ kits and their results in the case when a DNA match is actually their mother’s 2C1R. Methods, measurements, units, and highlighting are the same as in Table 1. Lower values are better.
The ranking of the methods in Table 3 above is identical to the ranks in Table 7 of Part 2, which was for three children in the case of a parent’s 2C. Also, the ranks are very similar to those in Table 10 of Part 2, which was for a parent’s 3C. On with the analysis for four children.
The analysis is about to get much easier, but we’re also going to lose the best-performing method. As I discussed in Part 1 and Part 2, a quirk of mathematical set theory is that finding the universal set from four or more sets is ridiculously complicated. So we can no longer reasonably estimate the fraction of a parent’s DNA that the children have, when combined. The good news is that the other methods are far easier and the ones that employ the distinct segments that siblings share with a cousin perform remarkably well with increasing numbers of siblings.
The number of possible combinations of four siblings’ kits is the same here as it is for sibling pairs: 15. But now that we don’t have to calculate how much of a parent’s DNA the four children have, when combined, it’s easy to use all combinations of children, including any number of full-siblings. Please note, you don’t ever have to use that method, even for three or fewer siblings; it’s just that it’s a better-performing method, at least for 2nd cousins and closer.
Table 4. Averaged results for methods of combining four siblings’ kits and their results in the case when a DNA match is actually their mother’s 2C1R. Methods, measurements, units, and highlighting are the same as in Table 1. Lower values are better.
Here the methods rank in the same order as those in Tables 8 and 11 of Part 2, for 2C and 3C respectively. They also rank the same as they did for three children above. Let’s see how the methods perform when the kits of five children are used.
There are six possible combinations of five DNA kits out of six children—each child has one opportunity to not be part of the group of five.
Table 5. Averaged results for methods of combining five siblings’ kits and their results in the case when a DNA match is actually their mother’s 2C1R. Methods, measurements, units, and highlighting are the same as in Table 1. Lower values are better.
In Part 2, I didn’t analyze the methods for cases when five children have their DNA tested. However, since I’m only comparing the kits to one cousin in this case, I thought I would take this analysis as far as I can with all of my maternal siblings’ kits. The results here are in line with those of four or fewer children’s kits for 2C, 2C1R, and 3C. The only difference is that the distinct segments method with no multiple is now the best performing method. This is in line with the graphs in Part 2, which show the distinct segments method gaining in performance with increasing number of children’s kits. The ranks for five children are also the same as they were for three or four children above.
There is only one combination of six siblings out of my tested maternal siblings and myself. So here we’re dealing with only one data point. The data in Part 2 will be more accurate than what’s below, but that doesn’t mean that what’s below isn’t accurate. This is a case of empirical data, which validates the findings in Part 2. And, with the kits of six children, you will have all or almost all of a parent’s DNA. Hence, any variability in the results we would get is going to depend on variability in the amount of DNA that the applicable parent shares with their cousins.
In the vast majority of cases, your parent is going to share a fairly normal amount of DNA with their cousins. That’s inherent in the word normal—that it will have a semblance to most cases. The results below will therefore be very instructive, because I already know that my mother shares a fairly normal amount of DNA with this particular 2C1R of hers. The same can be said for the case of five siblings above, and probably four siblings, etc., even though we only have one data point, i.e. one cousin for this analysis.
Table 6. Results for methods of combining six siblings’ kits and their results in the case when a DNA match is actually their mother’s 2C1R. Methods, measurements, units, and highlighting are the same as in Table 1. Lower values are better.
We can’t compare these results to those in Part 2 because I didn’t use six siblings for that analysis. But the ranking of the above methods is the same here as it was for three, four, and five siblings. I find that pretty convincing, especially combined with the findings of Part 2, that the methods shown as the best-performing are indeed very accurate.
Leah Larkin recommends twinning and recently said that the best-performing methods are “inventing data for the [parent].” What Larkin failed to understand is that twinning not only invents probabilities as well, but does so incorrectly.
I can’t help but think of a new question when seeing all of these results: If twinning is so bad, what happens if I enter other close relatives, besides full-siblings, into a WATO tree? The reason that including close relationships doesn’t work as well is generally thought to be because of a lack of independence in the probabilities. If using siblings doesn’t work well, there’s no question in my mind that entering aunts/uncles/nieces/nephews and other close relationships won’t work well either. Really, I think it’d be best to only enter one match per line from the MRCA. But there’s nothing stopping us from using the methods above to add, say, your own DNA plus that of an aunt. I wrote an article about that years ago and even have a tool that helps you calculate the theoretical DNA coverage.
Here’s something else that’s pretty cool—I solved a 27-year mystery while I was finishing up writing this article. A man wanted to know who his father was and wanted my help. I found that his top match at 23andMe also had a full-sibling on that site. And that’s a great thing about 23andMe—they include the most basic and essential tools like chromosome browsers. One of the first things I did was to combine the distinct segments of the siblings. This gave me a very limited number of relationships that their father could have had to the man I was helping. It took less than 2 and a half hours for me to name the father. It sure feels great to help people, and what a wonderful coincidence that I had an opportunity to use the above methods just as I was wrapping up the study!
I know that you’ll come across sibling pairs in your research, too, if you do this long enough. And maybe you have siblings of your own who have tested or can test. I hope that you try out these methods, especially the ones that only take a few seconds to employ. And it’s true: with Jonny Perl’s amazing distinct segment generator, the method that’s either the best or a close second is extremely fast to use. You get the distinct segments and then multiply the cM by the theoretical multiple (100/75, 100/87.5, etc.). I know that if you try these methods, you’ll find them to be very accurate and helpful for speeding up and improving your research.
If you had access to the most accurate, free, relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.