Key to understanding how new relationship predictors perform is an understanding of how probabilities work

One time I was sitting and waiting for class to begin in one of many advanced statistics courses I took. The professor entered and started class with something of a riddle. I now know that it’s from a famous experiment and I’ll just quote it from here:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

1. Linda is a bank teller.

2. Linda is a bank teller and is active in the feminist movement.

My hand shot up and I answered the question when called upon, possibly ruining the exercise. I could tell that the probability of #2 must be lower than or equal to the probability of #1 because #2 requires more to be true: every bank teller who is active in the feminist movement is also a bank teller. The professor then spent some time trying to convince other students that this was the case. The “and” in #2 decreases its probability.

Probabilities can be tricky.

A couple of new relationship prediction tools have been released this year for genetic genealogists. Predictors have always used centiMorgans (cMs) to determine a relationship. In February, a new tool called SegcM began also using the number of segments to make very accurate relationship predictions. In March, MyHeritage released the first relationship prediction tool that takes the testers’ ages into account.

As expected, both the number of segments and ages help. It’s clear that the probabilities for the correct relationships will be higher when we use more of the available information.

A different probability riddle

What if, in our bank teller example, #2 had used the word “or” instead of “and?” It would look like this:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

1. Linda is a bank teller.

2. Linda is a bank teller or is active in the feminist movement.

Now #2 is more probable because there are more options. Option #2 makes up a group of possibilities that includes the possibility of #1. The probability of #2 will always be greater than or equal to that of #1. This is the type of problem that we’ll be concerned with today.
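Both riddles come down to two basic rules of probability. Here is a minimal sketch in Python. The three starting numbers are made up purely for illustration; they aren't estimates from the original experiment.

```python
# Hypothetical probabilities, chosen only to illustrate the rules.
p_teller = 0.05               # P(Linda is a bank teller)
p_feminist = 0.60             # P(Linda is active in the feminist movement)
p_teller_and_feminist = 0.04  # P(both are true); can't exceed either one alone

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
p_teller_or_feminist = p_teller + p_feminist - p_teller_and_feminist

# "and" can only lower (or keep) the probability; "or" can only raise (or keep) it.
assert p_teller_and_feminist <= p_teller <= p_teller_or_feminist

print(f"P(teller)              = {p_teller:.2f}")
print(f"P(teller and feminist) = {p_teller_and_feminist:.2f}")
print(f"P(teller or feminist)  = {p_teller_or_feminist:.2f}")
```

Whatever numbers you plug in, as long as they're consistent with each other, the ordering in the assert line holds.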

What’s the advantage of these two new predictors?

Both SegcM and the MyHeritage predictor list some individual relationships rather than lumping them into large groups. Naturally, the probabilities decrease when more possibilities exist. The SegcM tool usually rules out some of the close family options, so it was hard to find a match for which all six relationships were possible, like the one shown below on the left.

This is a paternal half-sibling match with 44.8% probability. Maternal grandparent/grandchild has 44.9%. This image was selected because the other four close family relationships have positive probabilities.

The image on the left shows a pretty typical paternal half-sibling match, but it was selected because it’s a little less conclusive than usual. In this case all six close family relationships are possible. Still, it’s easy to see the value here when compared to predictions that simply say 100% probability of the close family group like in the DNA Painter tool on the right.

The DNA Painter tool shows the probabilities from simulations conducted by Ancestry and shown in a graph in their white paper. In this tool, the probability assigned to each of the three (purple) relationship types above is about 33% and the probability assigned to each of the six relationship subtypes is 100% / 6 ≈ 17%.

If 17% assigned to paternal half-sibling on the right doesn’t sound correct to you, consider this. Did the image on the right assign 100% to maternal grandparent/grandchild and 100% to paternal half-sibling? No, that would add up to a 200% probability, which isn’t possible. In fact, there are six relationships shown on the left. The tool on the right can’t assign a 600% probability to the total. You have to divide the probability on the right by the number of relationships represented in the group to see how much probability it assigned to each of the relationships on the left.
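The division described above can be sketched in a few lines of Python. The function name is just something I made up for illustration, and the even split is the same assumption used in the paragraph above: a tool that reports only a group total gives no reason to favor one member of the group over another.

```python
def per_relationship_share(group_probability, relationships_in_group):
    """Split a group's probability evenly across the relationships it lumps together."""
    return group_probability / relationships_in_group

# The close family group is reported as 100%.
print(f"{per_relationship_share(1.0, 3):.1%}")  # three relationship types
print(f"{per_relationship_share(1.0, 6):.1%}")  # six relationship subtypes
```

Multiplying either share back by the number of relationships returns the original 100%, which is the check that no probability was lost or double-counted.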

We have to make these adjustments if we want to compare tools. Or we can just compare groups, but comparing 100% to 100% isn’t very useful.

We’ve already seen from 125 empirical data points how well SegcM performs. These values weren’t cherry-picked; it isn’t a tiny dataset and all of the data available were used (read: not anecdotal). The data were mostly close family but included some from the 1st cousin group.

The average probability assigned to the correct relationship for the whole dataset was 45% for SegcM and 26% for Orogen, a cM-only predictor, but the first to use a peer-reviewed dataset. No other tools were available at the time to compare individual probabilities.
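For anyone who wants to score a tool on their own matches, here is a rough sketch of that kind of average in Python. The three matches and their probabilities are invented for illustration and are not from the 125 data points; the function name is also hypothetical.

```python
from statistics import mean

# Hypothetical predictions for three matches with known relationships.
# Each entry: (known relationship, {relationship: probability from the tool}).
known_matches = [
    ("paternal half-sibling", {"paternal half-sibling": 0.50, "maternal grandparent": 0.40}),
    ("maternal aunt", {"maternal aunt": 0.70, "maternal half-sibling": 0.20}),
    ("1st cousin", {"1st cousin": 0.30, "half aunt": 0.25}),
]

def mean_probability_of_truth(matches):
    """Average the probability that each prediction gave to the known relationship."""
    # A relationship the tool didn't list at all counts as probability 0.
    return mean(probs.get(truth, 0.0) for truth, probs in matches)

print(f"{mean_probability_of_truth(known_matches):.0%}")
```

Higher is better, but only for a tool that lists individual relationships. For a tool that reports groups, the group probability has to be divided among its members first.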

Comparing groups of relationships is less interesting. When I did so, SegcM assigned more than 99% to the correct group, Orogen 98%, and the 2016 AncestryDNA simulated probabilities 97%.

An average of 45% is just so much higher than 17%. 17% is the maximum probability that any DNA Painter tool can assign to an individual relationship type in the close family group. That means that SegcM assigns, on average, more than 2.5 times the probability to the correct close family relationship that the probabilities from Ancestry simulations can.

1st cousins and extended family

The forensic genealogy case workers are using SegcM because the probabilities assigned to the correct relationship are so much higher. It’s easy for anyone to see the advantage of SegcM for close matches. But most of our matches are distant. Orogen predictions, which used only cMs, gave pretty similar predictions to those of other tools, although slightly better and with the benefit of a published methodology and a peer-reviewed data source.

When we add the number of segments to the analysis, does that improve the predictions for matches other than close family?

Roberta Estes recently wrote a blog post evaluating the predictions of the new MyHeritage tool. Fortunately for us, she included the number of segments for each of her matches that she analyzed. It doesn’t look like we can see the ages of both testers for every match, so I can’t get the probability that the MyHeritage tool assigned to the correct relationship.

Unfortunately, we also don’t know which side the matches are on. For example, a paternal maternal 1st cousin is a child of your father’s full-sister. We’ll have to treat all 1st cousins as the same in this case when comparing DNA-Sci tools to the probabilities from 2016 Ancestry simulations.

The table below shows how the three tools performed for Roberta’s dataset.

comparison of three tools with data from Roberta Estes’ blog post

We see the same results in this table that we see in the 125 close family to 1st cousin data from the last post, although the advantage that SegcM has for close family is even more striking. That is, SegcM assigns the highest probability to the correct relationship, then Orogen, then the Ancestry simulations. When looking at groups, the tools rank in the same order, but the differences aren’t as large.

Comparing the tools’ ability to predict the correct individual relationship type, SegcM had the highest probability of the three tools about 50% of the time. Orogen probabilities ranked second about 90% of the time. And the Ancestry simulated data gave the lowest probability 68% of the time.

When predicting group probabilities, SegcM had the highest probability 63% of the time, Orogen ranked second 73% of the time, and the Ancestry simulated data performed the worst 68% of the time. It’s clear that even the Orogen probabilities are better than the Ancestry simulated probabilities, just like we saw for close family relationships.
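If you'd like to run the same kind of ranking on your own matches, here is a minimal sketch in Python. The probabilities below are invented for illustration and are not the values from Roberta's data; the function name is hypothetical, and ties are not handled specially.

```python
from collections import Counter

# Hypothetical probabilities that each tool gave to the correct relationship,
# one dictionary per match. These are NOT values from the table above.
tool_probabilities = [
    {"SegcM": 0.40, "Orogen": 0.25, "Ancestry simulations": 0.20},
    {"SegcM": 0.15, "Orogen": 0.22, "Ancestry simulations": 0.18},
    {"SegcM": 0.55, "Orogen": 0.30, "Ancestry simulations": 0.31},
    {"SegcM": 0.35, "Orogen": 0.28, "Ancestry simulations": 0.21},
]

def rank_shares(matches, rank):
    """Share of matches in which each tool lands at the given rank (0 = highest)."""
    counts = Counter(
        sorted(m, key=m.get, reverse=True)[rank] for m in matches
    )
    return {tool: counts[tool] / len(matches) for tool in matches[0]}

print("Highest:", rank_shares(tool_probabilities, 0))
print("Second: ", rank_shares(tool_probabilities, 1))
print("Lowest: ", rank_shares(tool_probabilities, 2))
```

Counting ranks like this ignores how large the gaps are, which is why it's worth looking at the average probabilities as well.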

Overall, SegcM assigned a 32% higher probability to the correct relationship type and a 10% higher probability to the correct group. Combine that with the benefits of a peer-reviewed data source and clear, published methodologies for DNA-Sci tools.

It’s worth noting that in the few cases when SegcM assigns a lower probability to the correct relationships, that doesn’t mean that SegcM gave a bad prediction. As the predictor that’s much more accurate, SegcM gives a truer probability. That means that in the few cases that the SegcM probability is lower, it’s because the match shares an unusual number of segments.

Users of SegcM are getting very high probabilities for the correct relationship the vast majority of the time. But occasionally someone says that they got a bad prediction. What do those bad predictions look like?

Bad Predictions?

About a week ago a person said that all of their SegcM predictions were bad. I responded that I’d be curious to see them. I said that I’ve seen people say that a “45% paternal half-sibling” prediction was bad when that was the correct relationship. 45% is a monumental improvement over the expected 17%. What was their answer? They got a 48% probability prediction for their paternal half-sibling. So the next time that happens I can cite that statistic instead of the 45% one.

I’ve also seen people say that the MyHeritage predictor only gave them 9% probability for the correct relationship. They asked, “why can’t my half-niece be my half-niece?” The genetic genealogy community has a lot of work to do to make sure it’s understood that a 9% probability event isn’t impossible.
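To put a number on that, here is a small sketch in Python. It assumes that the matches are independent of each other and that a 9% prediction really does come true 9% of the time; both are simplifying assumptions, and the function name is made up.

```python
def chance_of_at_least_one(p, n):
    """Probability that an event with probability p happens at least once in n independent tries."""
    return 1 - (1 - p) ** n

# How likely is it to see at least one "9% relationship" turn out to be true?
for n in (1, 10, 25):
    print(f"{n:>2} matches: {chance_of_at_least_one(0.09, n):.0%}")
```

Under those assumptions, someone with ten matches that each have a relationship at 9% has roughly a 61% chance that at least one of them is the true relationship. Across a whole community of testers, 9% events happen constantly.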

I was also messaging with a podcaster a couple of weeks ago about SegcM. They told me that they had checked several of their close, known family members, and the predictions seemed way off. Fortunately, there’s a blog that goes by the same name as the podcast. In a blog post on that site, they looked at a couple of their predictions with the new MyHeritage tool and they happened to show the number of segments. What an opportunity.

They had a match of 628 cMs and 14 segments for a known half 1st cousin. They said the third probability down at MyHeritage with 22.8% was “an overall very good prediction.” Guess what? SegcM gives it a 38.6% probability.

Hopefully this post clears up some confusion about probabilities.

We also saw a very big difference between the three tools for Roberta Estes’ data. I highly encourage you to check your own data in the same way.

DNA-Sci: advancing the science of relationship predictions. Please also submit data to this new DNA match survey that will greatly help improve and build new relationship prediction tools. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. You might also like this tool to visualize how much DNA full-siblings share. DNA-Sci is also the original home of DNA coverage calculations.

5 Comments

Adam on April 10, 2023 at 10:38 am

I first heard about the study comparing various predictors in the FB group. When I checked the blog post of one of the organisers, it made you look like a villain for locking access to your tool. Now I understand your logic. The good thing is that even though the comments were maligning your work, thanks to them I discovered your tool and your blog, which is on a different level, and as a mathematician I enjoy it very much.
Also, I am helping my friend solve a genealogical mystery. He has a match of 196.1 cM with 11 segments, and it didn’t occur to me before that this may be a half 1C. Thanks to your tool this is now one of my prime hypotheses, and it corroborates well with the locations of the people involved. That is unlike what I read in the aforementioned FB group, to the tune of ‘if the relationship falls on the tail you should probably look elsewhere’. SegcM helped me to realise this is a real possibility.
I think Andy Lee has done your tool justice as well in his recent video, “SegcM vs Shared cM – Which is Better at Predicting Close DNA Relationships?”
I have tested extensive family members and, if you wish, I can give you my dataset of several dozen family members from 1C2R, 2C1R and closer. I think some of them really fall on the tail of the distribution, which goes to show we need to keep our minds open as genetic genealogists!

John on May 13, 2023 at 6:57 pm

Do you have a predictor which includes the longest segment (on another axis)?
- DNA-Sci on May 13, 2023 at 10:25 pm
  
  Hi John,
  
  I don’t have that yet, but it’s something I’d like to look into. As of right now I have no idea how much value it would add to predictions. We can already see that the number of segments and total cMs result in phenomenal predictions. But there’s always room for improvement.

Lee Herman on August 6, 2023 at 1:38 pm

Hi

Are your methods built on publicly-available data? If so, is it possible to get a copy? I’d like to play with it using my machine learning methods.
- DNA-Sci on August 8, 2023 at 10:53 am
  
  Hi Lee,
  
  The data source is publicly available. One just has to set up and run Ped-sim: https://github.com/williamslab/ped-sim
  
  The probabilities for SegcM were generated by a machine learning algorithm in Python based on Ped-sim data.