People have been asking for this for a long time. Relationship predictions that include the number of segments work much better than expected.
It’s here! Today a new relationship predictor is available for you to use for free. You can now enter the total number of segments along with total centiMorgans (cMs) to see how you’re related to your DNA matches. I always thought that this would be useful, but it’s a thousand times better than I expected.
Back in November I thought I would give it a try to see what would happen. With the first value I tried, Python output a prediction that I knew was amazing. I’ve since worked tirelessly to improve the predictions further. It’s been very difficult to keep it a secret. Actually, I did help several people along the way without exactly telling them how I did it.
The new predictor often tells you the exact relationship for the match you’re looking at, e.g. paternal half-sibling, maternal grandparent/grandchild, or paternal grandparent/grandchild. I’ve now seen countless cases in which a person knew that a relationship was, for example, paternal. The predictions based on cMs and number of segments then left only one option, such as paternal half-sibling.
Update, 3 Apr. 2023: Check out this new article with even further validation and some tips for understanding probabilities and what makes a good relationship prediction.
For validation, I used a dataset of 125 known close family to 1st cousin matches. I’ve asked an algorithm in Python to return six random examples from that dataset. I only did this once. Below I’ll show you the predictor output from the one and only time I’ve taken a random sample of the dataset, in the order that the algorithm gave them to me. I’ve denoted the known relationships in red below each prediction output.
In all but one case only three relationship subtypes are listed as possibilities. This cuts down the possible relationships by half from six. (Actually, predictions for 1,986 cMs would’ve also returned full-siblings as an option if we didn’t enter the number of segments, so there’s another benefit of the new tool.) Reducing the number of possible relationships by half drastically reduces the amount of time that people have to spend finding out the true relationship.
But what’s even better is that the correct relationship subtype is usually the very first one listed because of the very high probabilities. And in the one case in which it wasn’t at the top, maternal half-sibling actually had the highest individual probability! In half of the cases, if you knew the relationship was paternal, then you’d have less than 1% chance that the true relationship was anything other than the first one listed. I saw this happen dozens of times in beta tested data and when people were asking for help online. Ancestry tells most of us which side the match is on now, at least once we’ve designated which side is Parent 1 and which is Parent 2, so that makes it quite easy to know the relationship in a lot of cases.
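To make the "if you knew the relationship was paternal" reasoning concrete, here is a minimal Python sketch of narrowing a prediction to one side and rescaling what remains. The function name and the probabilities are made up for illustration; they are not output from the tool.

```python
def condition_on_side(probs, side):
    """Keep only subtypes on the known side and rescale them to sum to 1."""
    kept = {rel: p for rel, p in probs.items() if rel.startswith(side)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError(f"no {side} subtype has a nonzero probability")
    return {rel: p / total for rel, p in kept.items()}

# Hypothetical predictor output (made-up numbers, NOT from the tool).
example = {
    "paternal half-sibling": 0.600,
    "maternal grandparent/grandchild": 0.250,
    "maternal avuncular": 0.145,
    "paternal avuncular": 0.005,
}

paternal_only = condition_on_side(example, "paternal")
for rel, p in paternal_only.items():
    print(f"{rel}: {p:.1%}")
```

With these made-up inputs, knowing the side lifts paternal half-sibling from 60% to roughly 99%, leaving under 1% for the only other paternal option.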
Of course, people still have to do their due diligence to verify the correct match. This should be done genealogically, where possible. And you should still sort your matches into four groups and verify the right matching patterns. For example, if you think your match is a half-sibling, both of you should match through only one grandparent pair. If your match shows as “both sides,” that suggests that you might be the person’s aunt or uncle. But then sometimes there are double connections—all of us have them if you go back far enough. This complicates sorting our matches. When it’s hard to get your matches into four neat groups, what a boon it is to have a predictor that returns such high probabilities for the correct match!
Speaking of that, the average probability assigned to the correct match across the entire dataset of 125 matching pairs was a whopping 45%! It’s a bit higher, actually. In our random sample of six above we have four that are slightly under that and two that are astoundingly good. That’s well in line with my observations of the dataset on the whole. I’ve seen at least six cases in which the correct relationship was given a probability of 96% or more, including some with 100%.
What kind of probabilities should we expect to get for the correct relationship? When six relationship types are possible for Group 2 (the 25% average, or half-sibling, avuncular, and grandparent/grandchild group), evenly splitting them would give us 100%/6, or about 16.7%. So anything higher than that is good. For Group 3 (the 12.5% average, or 1st cousin group), where 12 different relationship subtypes are possible, predicting the correct relationship at 8% or higher would be good. But then there are cM and segment values for which both Group 2 and Group 3 are possible. We would expect some pretty low probabilities assigned to the correct relationship subtype in these ranges.
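The baselines above are just an even split across the possible subtypes. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the 18-subtype case assumes every Group 2 and Group 3 subtype is possible at once, which is only an illustration.

```python
def uniform_baseline(num_subtypes):
    """Percent probability each subtype gets if all are equally likely."""
    if num_subtypes <= 0:
        raise ValueError("num_subtypes must be positive")
    return 100.0 / num_subtypes

group2 = uniform_baseline(6)    # six Group 2 subtypes
group3 = uniform_baseline(12)   # twelve Group 3 subtypes
both = uniform_baseline(18)     # assumption: all 18 subtypes possible at once

print(f"Group 2: {group2:.1f}%  Group 3: {group3:.1f}%  Both: {both:.1f}%")
```

A prediction that beats these numbers for the true subtype is doing better than chance within its group.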
And that’s what makes the new tool so amazing, assigning over 45% probability to the correct subtype. Anyone can check their known values against these predictions and I highly encourage that. The vast majority of people who check a few known close family members will see very high probabilities assigned to the correct relationship.
I’d like to show a few more predictions that weren’t part of the random sample above so you can see what else this predictor can do.
The paternal grandfather match is obviously as good as it gets. Imagine being the person who found an unknown match and plugged these cM and segment values into the tool. The tool handles 1st cousin relationships quite well, too, although the more distant the relationship is, the fewer differences we’ll see in the probabilities. The next match was a known maternal maternal 1st cousin and that relationship is right at the top of the probability list. The third match is a known maternal paternal 1st cousin. Keep in mind that a maternal paternal 1st cousin is the same relationship as a paternal maternal 1st cousin if viewed from the other’s perspective, and so the combined probability for those two would be 30.2%.
All full-sibling and parent/child matches in the empirical dataset were assigned a probability of 100% by the tool. If entering only cMs in two different relationship predictors, two of those full-sibling pairs were given less than 100% probability. Predictions are simply better when including the number of segments.
What led us here?
Really, 23andMe showed us that the number of segments is important back in 2012. Their scatterplot shows pretty distinct values when comparing different relationships.
I had to wonder if we could get good predictions if we further separated relationships by maternal and paternal sides.
Edit 19 Apr. 2023: Kitty Cooper showed back in 2017 that the number of segments can be useful for telling the difference between relationships and even for paternal/maternal sides.
The figure below shows cMs vs. segments for parent/child, full-sibling, and sex-specific plots for half-sibling, grandparent/grandchild, and aunt/uncle/niece/nephew. It also shows the difference between paternal paternal 1st cousins and maternal maternal 1st cousins.
There’s a lot of overlap between relationship subtypes in the scatterplot, far more than in the 23andMe plot, and one has to wonder how theirs got to be so optimistic. Part of the answer might be that the 23andMe plot included far less data. It also appears that the 23andMe plot might only include full-siblings and not half-siblings. The new plot makes it seem as though it would be hard to give a high probability to any one relationship in Group 2 (25% average group). Yet we often see 50% and sometimes see 100% given to one relationship subtype.
I found it difficult to choose colors such that the entire center of the plot wouldn’t look like one brown blob. Adding transparency for some, I can now see six different clusters. Those relationships, from the top, are red for maternal half-sibling, blue for maternal avuncular (aunt/uncle/niece/nephew), pink for paternal avuncular, purple for paternal half-sibling, grey for maternal grandparent/grandchild, and a different shade of blue for paternal grandparent-grandchild. (Paternal half-sibling markers have less transparency in the plot, which can unrealistically make their cM values look higher than grandparent/grandchild.)
As you can see the paternal relationships have fewer segments than the maternal relationships. This is expected because the average maternal recombination rate is about 1.7x higher than the paternal rate. The full-sibling cluster is almost completely separated from the other relationships, but regardless it’s easy to distinguish them based on fully-identical regions (FIRs).
The Orogen predictor was already giving great predictions for grandparent/grandchild relationships as well as paternal half-siblings because those relationships can have very high or very low cMs. So in a way it isn’t surprising that SegcM now does amazingly well for those relationships. Adding in the segments provides enough distinction between clusters that maternal grandparent/grandchild or half-sibling sometimes have over 50% probability, paternal half-siblings can have much more, and values of paternal grandparent/grandchild have already been found to have 100% probability.
The avuncular relationships (both maternal and paternal aunt/uncle/niece/nephew) have much more overlap with each other and with other relationships. For all of the visible overlap in the plot, you can see that the number of segments splits the group into three possibilities for a large chunk of the top part of the graph and a different trio of possibilities for most of the bottom part of the graph. CMs, which we already knew are more important, do the rest of the work.
You might be wondering how I was able to develop relationship predictions from two different independent variables, total cMs and number of segments. If I were into buzzwords I would say I used “machine learning.” But regression and classification algorithms have been around for several decades, and that’s what we called them: regression and classification algorithms. The goal in this case is to separate the different relationship types, or classes, using the best-fitting hyperplane.
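For readers who want to see the general idea, here is a toy Python sketch of a classifier that separates two classes with a straight line (a hyperplane in two dimensions) using total cMs and number of segments. It is not the code or the algorithm behind SegcM: plain logistic regression is swapped in as a stand-in, only two classes are used, and the clusters are made-up numbers rather than Ped-sim output.

```python
import math
import random

random.seed(0)

# Made-up clusters of (total cM, number of segments); NOT real data.
def make_cluster(cm_mean, seg_mean, n=200):
    return [(random.gauss(cm_mean, 150), random.gauss(seg_mean, 4))
            for _ in range(n)]

paternal = make_cluster(1750, 22)   # label 0: hypothetical paternal cluster
maternal = make_cluster(1750, 34)   # label 1: hypothetical maternal cluster
points = paternal + maternal
labels = [0] * len(paternal) + [1] * len(maternal)

# Standardize each feature so gradient descent behaves well.
def mean_std(values):
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return m, s

cm_m, cm_s = mean_std([p[0] for p in points])
sg_m, sg_s = mean_std([p[1] for p in points])

def scale(cm, seg):
    return (cm - cm_m) / cm_s, (seg - sg_m) / sg_s

def sigmoid(z):
    z = max(-50.0, min(50.0, z))    # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

# Batch gradient descent on the logistic loss.
# The decision boundary is the line w_cm*x1 + w_seg*x2 + bias = 0.
w_cm, w_seg, bias = 0.0, 0.0, 0.0
scaled = [scale(cm, seg) for cm, seg in points]
n = len(scaled)
for _ in range(500):
    g_cm = g_seg = g_b = 0.0
    for (x1, x2), y in zip(scaled, labels):
        err = sigmoid(w_cm * x1 + w_seg * x2 + bias) - y
        g_cm += err * x1
        g_seg += err * x2
        g_b += err
    w_cm -= 0.5 * g_cm / n
    w_seg -= 0.5 * g_seg / n
    bias -= 0.5 * g_b / n

def prob_maternal(cm, seg):
    """Probability of the hypothetical maternal class for one match."""
    x1, x2 = scale(cm, seg)
    return sigmoid(w_cm * x1 + w_seg * x2 + bias)

print(f"P(maternal | 1750 cM, 36 segments) = {prob_maternal(1750, 36):.2f}")
print(f"P(maternal | 1750 cM, 20 segments) = {prob_maternal(1750, 20):.2f}")
```

Because both toy clusters share the same average cMs, nearly all of the separation here comes from the segment count, which is the point of adding the second variable. A real predictor would need many classes and realistic simulated data.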
Preparing data, choosing regression parameters, and compiling the output probabilities was a tremendous amount of work, enough to have a noticeable effect on my eyesight, resulting in reading glasses for yours truly. But now that I have all of the code written, saved, and organized, all of the code can be re-run pretty quickly. Python does the really complicated parts automatically. Apart from keeping track of the number of segments, the pre-processing was the same as in Nicholson (2022).
Currently, SegcM only includes predictions for the 3rd cousin group and closer. The benefit of including the number of segments rapidly decreases with lower cMs and it becomes much more difficult for a regression algorithm to separate a dozen groups from each other when they’re all possible over a given range of cMs and segments.
Also, predictions based on the number of segments are currently only available for half-identical region (HIR) sites, which include all sites except 23andMe. This means that, at 23andMe, you can get cM and number-of-segment predictions for any relationship except full-siblings. But you’d have to subtract the X-DNA, just as one would have to for relationship predictions at any site other than DNA-Sci. Sometimes matches at 23andMe aren’t sharing segment information, so we don’t have access to the amount of X-DNA. In that case, you could subtract the average X-DNA for a given relationship type if you know it’s possible for you to share X-DNA with a particular match. Also, I’m sure I’ll have all of the work done for 23andMe predictions based on number of segments eventually.
The data used for the Orogen predictions came from Ped-sim. In this case, the refined genetic map of Bhérer et al. (2017) was used as well as the crossover interference parameters of Campbell et al. (2015). I compiled 500,000 data points for each relationship type.
For a couple of weeks I’ve had the opportunity to see how this tool will be used to help people solve cases of unknown matches. Clearly this will help adoptees find their close family members more quickly. I’ve seen several cases in which people were looking for close family members and this tool could have provided a clear and quick way forward.
Here are the first five combinations of total cMs and number of segments that I came across from people looking for help once I started beta testing, in the order in which I saw them over just a three-day period (i.e. no cherry-picking):
The first person was looking for her father and soon after confirmed that this was a paternal match. There was almost no chance that they were anything other than a paternal half-sibling. As I’ve seen from over 100 empirical data points, the predictions are very accurate. It’s helpful to see the true relationship listed right near the top with a very high probability.
The second person had a match who was much older than her and was on her maternal side. While age isn’t always a good predictor of relationships, this match seemed to be a maternal paternal half- or great-aunt/uncle. Ancestry had them listed as a 1st to 2nd cousin. It’s much more helpful to see the likely relationships listed at the top.
The third person suspected that their father had another child and that this was that person, who was also looking for her father. SegcM suggests that they were right!
The fourth person had a question about a known relative. I asked if they were a maternal maternal 1st cousin and the answer was “yes.”
The fifth person found a match on her father’s side she suspected was a half-sister. They were born two months apart. While it’s always good to confirm that one isn’t the aunt of the other by making sure neither shows as “both sides,” and while this kind of thing definitely happens, it’s less likely that both grandparents or a full-sibling hid their child who was born two months after her. Surprise half-aunts are a bit more common. And does she even have a full-sibling who was old enough to have a child at that time? The predictor suggests that this is her paternal half-sister. Does that sound fairly reasonable?
When I was testing the predictor against known relationships that were sent to me, there was one value for a paternal grandfather with a 0% probability. That concerned me. But I found out a few days later that the relationship was mislabeled. It turned out that the grandfather was actually maternal with a 48.9% probability. So that’s one more benefit of this tool. It might tell you if you need to double-check your notes!
Also, in order for empirical data to help me validate the tool, I needed to know the exact relationship. There were probably a dozen times when I asked the people sending me data “This is a maternal relationship, right?” (or I asked if it was paternal) and they affirmed. There were no cases when I guessed wrong.
Previously, the most accurate predictions we could get were from Orogen. That tool assigned the right relationship subtype an average probability of just over 26%. Using segments along with cMs nearly doubles that, to over 45%. It should be noted that these averages apply to the particular dataset of 125 close matches. If a different dataset included distant cousins or even more first cousins than this one, the averages would be much lower. Similarly, if a dataset included more full-siblings, the averages would be higher. But we can definitely learn by comparing different tools against the same dataset.
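The comparison metric here is the average probability that a tool assigns to the known relationship. A minimal sketch of that calculation follows, using a hypothetical function name and three made-up predictions rather than the 125-match dataset.

```python
def mean_prob_of_truth(predictions, truths):
    """Average probability assigned to each known (true) relationship."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need one prediction per known relationship")
    total = sum(pred.get(true, 0.0) for pred, true in zip(predictions, truths))
    return total / len(truths)

# Made-up predictor outputs for three matches (NOT real data).
predictions = [
    {"paternal half-sibling": 0.60, "paternal avuncular": 0.40},
    {"maternal avuncular": 0.30, "maternal half-sibling": 0.70},
    {"maternal grandparent/grandchild": 0.45, "maternal avuncular": 0.55},
]
truths = [
    "paternal half-sibling",
    "maternal avuncular",
    "maternal grandparent/grandchild",
]

score = mean_prob_of_truth(predictions, truths)
print(f"Average probability of the true subtype: {score:.1%}")
```

Running two tools over the same list of known matches and comparing the two scores is the kind of like-for-like comparison described above.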
We’ve seen how the tools perform when it comes to relationship subtypes. It took me a long time to look into this, but I’ve also recently compared the ability of a few different tools to predict groups. It’s much easier for a tool to say that a relationship is likely in Group 2 (half-sibling, avuncular, etc.).
The functionality of cM-only predictions at SegcM and Orogen is similar to that of a tool at an excellent site called DNA Painter, which uses data that came from simulations conducted by AncestryDNA. The simulations resulted in probabilities that were shown in a graph in their 2016 matching white paper. These probabilities were then copied using an online plot digitizer. Data from Ancestry simulations (DNA Painter) give the correct relationship group an average probability of 97% for the 125 close matches. Orogen predictions gave the true group a probability of 98%. The new predictions at SegcM give the right group an average probability of 99%.
It’s no surprise that relationship predictors get the right group for close family matches. But we’ve seen above many examples of how we can get a greater than 50% probability for the right relationship subtype. It’s great to be living in the future. We’re now thinking about relationship predictions in 2D!
DNA-Sci: advancing the science of relationship predictions. Please also submit data to this new DNA match survey, which will greatly help improve and build new relationship prediction tools. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. You might also like this tool to visualize how much DNA full-siblings share. DNA-Sci is also the original home of DNA coverage calculations.