Improved predictions are now available when siblings have tested

As of today a new relationship predictor is available to genetic genealogists. Previously, you could get the most accurate DNA match predictions by using SegcM. All you had to do was enter the total cMs and number of segments you share with your match and then look at the possible relationships and their probabilities. Now you can enter the amount of DNA that your sibling shares with the same match for much improved predictions. Even if you don’t have a sibling tested, you can enter the amount of DNA you share with two siblings when you aren’t sure how they’re related to you.

This is the first time that a prediction tool will give you improved predictions when you add the shared DNA for a sibling. A previous tool that let you enter the shared DNA for two siblings simply reports probabilities for only the first sibling entered. For another tool, predictions have been shown to decrease in accuracy when you add the shared DNA of a sibling.

How does it work?

The original SegcM tool predicts relationships with very high accuracy by considering both the total cMs and the number of segments. Thousands of people have seen an image similar to the one below and know that it’s associated with the most accurate predictions in genetic genealogy. But what they might not know is that the image actually shows how SegcM is able to separate relationships so well, including by maternal and paternal sides.

sex-specific cMs vs. segments for parent/child to 1st cousins including full- and half-sibling, aunt/uncle/niece/nephew, and grandparent/grandchild

These clusters show how SegcM can tell the difference between close family relationships. For example, the blue paternal grandparent/grandchild cluster at the bottom shows that much of the time a paternal grandparent/grandchild match will get a probability of 100%, with no other relationships possible because none overlap the lower section of that cluster. Similarly, perhaps as much as 50% of all paternal aunt/uncle/niece/nephew relationships (the pink cluster) have no overlap with other paternal relationships (such as paternal half-siblings in purple), which means that if the match is assigned a paternal label you’ll know they’re your aunt or uncle.

Like SegcM, SegcM for Multiple Testers uses machine learning to separate the clusters from each other as accurately as possible. This allows the model to predict relationships and their probabilities. Now, instead of predicting from two input features, it uses four: Sibling 1 cMs, Sibling 1 # of segments, Sibling 2 cMs, and Sibling 2 # of segments. The results are much more accurate than if you entered the values individually into SegcM and then averaged or multiplied them.

Why couldn’t we get improved predictions for two siblings before now?

The best idea anyone could come up with a few years ago for how to tackle this problem was to multiply probabilities together. We know that averaging probabilities wouldn’t be ideal for the following reason: When one piece of information tells us that something is impossible and another piece of information tells us that it has some probability, then it’s impossible.

The minerals azurite and malachite have similar colors, overlapping hardness, and overlapping densities. Azurite has a hardness of 3.5-4 and a density of 3.77-3.89 gm/cc. Malachite has a hardness of 3.5-4 and a density of 3.6-4 gm/cc. Let’s say an unknown rock has the color of azurite or malachite, a hardness of 3.75, and a density of 4 gm/cc. It might then have a 50% probability of being malachite or azurite based on the hardness, but it has a 0% probability of being azurite based on the density. Azurite is impossible because the density is out of its range. The rock has a 100% probability of being malachite, not 75% from averaging 50% and 100%.

Similarly, if there’s a 20% probability that an unknown DNA match is a grandparent to one sibling and a 0% probability for the other sibling, then the probability should be 0%. That’s what you get if you multiply the probabilities together.
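The mineral example can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this post, and the 50/50 split for hardness is the one assumed in the example above. It is not code from any prediction tool.

```python
def average(p_a, p_b):
    """Combine two probability tables by averaging each label's values."""
    return {label: (p_a[label] + p_b[label]) / 2 for label in p_a}


def multiply_and_normalize(p_a, p_b):
    """Combine two probability tables by multiplying, then rescaling to sum to 1."""
    product = {label: p_a[label] * p_b[label] for label in p_a}
    total = sum(product.values())
    if total == 0:
        raise ValueError("every label is ruled out by at least one source")
    return {label: value / total for label, value in product.items()}


# Hardness 3.75 fits both minerals; density 4 gm/cc fits only malachite.
by_hardness = {"azurite": 0.5, "malachite": 0.5}
by_density = {"azurite": 0.0, "malachite": 1.0}

averaged = average(by_hardness, by_density)
multiplied = multiply_and_normalize(by_hardness, by_density)
print(averaged)    # azurite keeps 25% even though density rules it out
print(multiplied)  # azurite drops to 0%, malachite rises to 100%
```

Averaging leaves an impossible answer on the table, while multiplying removes it. That is the appeal of multiplication, and the next paragraphs explain where it goes wrong.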

When you multiply probabilities you’re assuming that the probabilities are independent of each other. But we know that that assumption is very much untrue for DNA matches. If your mother shares more than average DNA with her sister, then there’s a greater than 50% probability that you share more than the average amount with this aunt, and likewise for your sibling. If your mother shares a large enough amount of DNA with her sister, there’s likely a greater than 50% probability that both you and your sibling share more than average with your aunt. These probabilities are very much dependent, and there’s no escaping that for distant cousins, close matches, or anything in between.
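A small simulation can make the dependence concrete. The sketch below is a toy model and nothing more: it assumes each sibling's deviation from the average is a shared family effect plus independent noise, both drawn from the same normal distribution. Those assumptions are mine, chosen for illustration, and they are not a model of real recombination.

```python
import random


def simulate(trials=100_000, seed=1):
    """Toy model only: a shared family effect plus independent noise per sibling.

    The equal-variance normal draws are an assumption made for illustration;
    they are not a model of real recombination.
    """
    rng = random.Random(seed)
    you_high = sibling_high = both_high = 0
    for _ in range(trials):
        family = rng.gauss(0, 1)            # deviation passed down through the shared parent
        you = family + rng.gauss(0, 1)      # your own deviation from average
        sibling = family + rng.gauss(0, 1)  # your sibling's deviation from average
        you_high += you > 0
        sibling_high += sibling > 0
        both_high += (you > 0) and (sibling > 0)
    return you_high / trials, sibling_high / trials, both_high / trials


p_you, p_sibling, p_both = simulate()
print(f"independence would predict {p_you * p_sibling:.3f}")
print(f"simulated joint probability {p_both:.3f}")
```

In this toy setup each sibling is above average about half the time, but both are above average together about a third of the time, not the quarter that multiplying would give. The size of the gap depends entirely on the assumed noise, so only the direction of the effect should be taken from this sketch.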

That’s why tools that multiply probabilities together have been shown to decrease in accuracy when you add a sibling.

How to use the new tool

SegcM for Multiple Testers is currently designed for two full-siblings and their unknown DNA match, but it will work just as well for half-siblings in most cases. It will not work properly when the unknown match is a full-sibling to one of the half-siblings, a full-sibling’s descendant, or a descendant of one of the half-siblings.

SegcM for Multiple Testers currently works only for full- or half-siblings. Why can you enter 1st cousins or other testers from the same generation into other tools but not this one? The reason is that those tools only show the possible relationships, not probabilities, based on what they deem to be in range. But the probabilities will be different for siblings from what they’d be for cousins. Two 1st cousins will usually have a wider range of shared DNA with an unknown tester than two siblings and that would affect the probabilities. Siblings are more likely to share an amount of DNA that reflects the amount their parent shares, but cousins will usually share an amount of DNA with an unknown tester that’s more reflective of their respective parents.

When the unknown match may be a descendant of one of the siblings, two different relationships will be listed in the same row of the probability table. For example, if the unknown tester might be a child of one of the siblings, the possibility “Child, Niece/Nephew” will be shown. The first relationship shown will be for the sibling who shares more with the unknown match and the second one will be for the lower cM sibling. If you enter 1,750 cMs and 40 segments for the first tester and 3,500 cMs and 22 segments for the second tester, the result won’t show “Niece/Nephew, Child” in order to have the relationships line up respectively. The match will be a child of the sibling who shares 3,500 cMs and a niece/nephew of the sibling who shares 1,750 cMs.
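The ordering rule for two-part rows can be written out as a short function. This is a hypothetical helper for reading the table, not part of the tool, and raising an error when both siblings share the same total is my own assumption since that case isn't covered above.

```python
def assign_pair(row_label, sibling1_cm, sibling2_cm):
    """Map a two-part table row such as "Child, Niece/Nephew" onto the siblings.

    The first relationship goes to the sibling who shares more cMs with the
    match, regardless of the order in which the siblings were entered.
    """
    higher_rel, lower_rel = [part.strip() for part in row_label.split(",")]
    if sibling1_cm == sibling2_cm:
        # Assumption: the text does not say what happens on a tie.
        raise ValueError("equal cM totals: the row order cannot be resolved")
    if sibling1_cm > sibling2_cm:
        return {"sibling 1": higher_rel, "sibling 2": lower_rel}
    return {"sibling 1": lower_rel, "sibling 2": higher_rel}


# The example above: 1,750 cMs entered first, 3,500 cMs entered second.
result = assign_pair("Child, Niece/Nephew", 1750, 3500)
print(result)
```

Here the match is labeled a child for sibling 2 and a niece/nephew for sibling 1, even though the row still reads “Child, Niece/Nephew”.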

This tool is designed for DNA matches at Ancestry, FTDNA, GEDmatch, or MyHeritage. These sites only show half-identical region (HIR) DNA rather than the full amount of matching. 23andMe is different in that it uses the total IBD DNA sharing metric, including X-DNA when applicable. Predictions here will be accurate for 23andMe the vast majority of the time, but will have reduced accuracy for 23andMe matches who share X-DNA or have a double relationship. Full-siblings are a type of double relationship, so the tool may not work when the unknown match is a full-sibling at 23andMe. Please subtract any X-DNA from cM values for 23andMe. Including X-DNA in the total is better than ignoring the X-DNA amount, but this tool isn’t designed for that yet.

It’s ok for one of the siblings to not share DNA with the unknown match. Simply enter zeros for the cMs and number of segments in that case. You won’t be able to view probabilities for when both siblings share no DNA with the unknown match—that would make them an unknown non-match. All we know in cases when neither sibling matches is that the unknown person is likely not a 2nd cousin or closer. A relationship predictor can only help us when at least one person matches the unknown tester.
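If you keep your matches in a spreadsheet or script, the rule can be expressed as a quick check before you enter anything. The function name and the sample numbers below are hypothetical; the tool's own validation may differ.

```python
def check_inputs(cm1, segments1, cm2, segments2):
    """Return True when at least one sibling shares DNA with the unknown match."""
    values = (cm1, segments1, cm2, segments2)
    if any(v < 0 for v in values):
        raise ValueError("cMs and segment counts cannot be negative")
    sibling1_matches = cm1 > 0 and segments1 > 0
    sibling2_matches = cm2 > 0 and segments2 > 0
    return sibling1_matches or sibling2_matches


# Made-up values: sibling 1 doesn't match, sibling 2 shares 85 cMs on 4 segments.
print(check_inputs(0, 0, 85, 4))  # one sibling matches, so a prediction is possible
print(check_inputs(0, 0, 0, 0))   # neither matches, so there is nothing to predict
```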

Additional improvements

SegcM for Multiple Testers also takes into account genotyping errors, thanks to the people who have submitted data to this survey. Genotyping errors typically add to the number of segments and decrease the total cMs by splitting segments or truncating them at one of their ends, but they can also decrease the number of segments by breaking a segment into pieces that are too small to be included by the matching algorithm. To my knowledge, this is the first prediction tool to take into account genotyping errors.

When a relationship predictor makes adjustments for genotyping errors, the lines between relationships get blurrier, on top of the fact that relationships are already very blurry because of the large variation in DNA ranges. Because this tool takes into account genotyping errors, which can egregiously inflate the number of segments at AncestryDNA, the probabilities displayed will be conservative, i.e. you might see a probability of 0.1% or 0.5% for a relationship that you think is impossible or should have a lower probability than that. Even though that’s the case, this tool assigns very high probabilities, on average, to the correct relationship.

Another new update to this tool is that it provides predictions for relatives as distant as 14th cousins. Previously, Orogen and SegcM predictors went back to 8th cousins once removed, which were the farthest back predictions you could get. 14th cousins are an interesting point in our distant cousinship because, not only would 14th cousins almost never share DNA according to the most advanced science, we also likely have very few relatives who are 14th cousins and not related to us in some closer way. This makes 14th cousins an ideal stopping point by two different metrics. The new tool groups 7th cousins once removed to 14th cousins into one prediction category.

Population weights are applied to the relationships in this tool. Population weights increase the probabilities for more distant cousins because a person typically has about 5x more cousins from each generation further back. This increases the probability that an unknown match is a more distant cousin for a given cM value.
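One way to picture population weighting is as a multiplier on each relationship before the probabilities are rescaled. The sketch below is my own simplified illustration, not the tool's code: the function name is made up, the likelihood values are placeholders rather than real data, and the factor of 5 per generation is the rough figure mentioned above.

```python
def apply_population_weights(likelihoods, growth=5.0):
    """Weight each relationship by how many such relatives a person has.

    `likelihoods` is ordered from closest to most distant; each step further
    back multiplies the weight by `growth` (about 5 per the text above).
    """
    weighted = [
        (label, value * growth ** step)
        for step, (label, value) in enumerate(likelihoods)
    ]
    total = sum(value for _, value in weighted)
    return {label: value / total for label, value in weighted}


# Placeholder likelihoods, NOT real data: assume the shared DNA is equally
# consistent with three successive cousin levels.
example = [("4th cousin", 1.0), ("5th cousin", 1.0), ("6th cousin", 1.0)]
posterior = apply_population_weights(example)
print(posterior)
```

With equal placeholder likelihoods, the weights alone push most of the probability to the most distant group (25/31, or about 81%), simply because there are so many more relatives at that distance.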

For half 3rd cousins and 3rd cousins once removed to 7th cousins, I expect the new tool to assign lower probabilities than SegcM. The error was in SegcM and earlier predictors, which overestimated how rare it is for 9th to 14th cousins to share DNA. It turns out that they share DNA often enough to make a difference and there are so many of them that they significantly affect predictions. Entering 7 cMs and 1 segment for both siblings into SegcM for Multiple Testers results in a 97.9% probability for 7th cousins once removed to 14th cousins. This is the correct probability. SegcM and Orogen predictions should be updated to reflect that fact.

The accuracy of this tool was greatly improved by people who submitted data to this survey. Thank you so much! Future tools will be even more accurate if you continue to submit data here.

Thank you very much to the many beta testers who helped test SegcM for Multiple Testers. The data they submitted shows an increase in the predictive power over SegcM, which is no easy feat. So far, the relationship group with the most improvement has been the 2nd cousin group, which also happens to be the one with the most submissions. The results indicate that SegcM for Multiple Testers assigns over 29 percentage points more to the 2nd cousin group, on average, than SegcM does with single testers.

And a big thanks to Lee Herman, who introduced me to a much faster machine learning model. I never would’ve been able to compare so many relationships at once with the model I used for the original SegcM.

I hope you enjoy using the new tool!

DNA-Sci: advancing the science of relationship predictions. Please also submit data to this new DNA match survey that will greatly help improve and build new relationship prediction tools. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. You might also like this tool to visualize how much DNA full-siblings share. DNA-Sci is also the original home of DNA coverage calculations.