Should we use weighted or unweighted cMs in a relationship predictor?

Does the Timber algorithm do a good job at reducing cMs from pile-up regions?

A common question in genetic genealogy has been “Should I enter the unweighted or weighted cM value from AncestryDNA when I’m using a relationship predictor?” This refers to either the original centiMorgan (cM) value (unweighted) or the amount after Ancestry’s Timber algorithm has reduced or removed segments it deems to be from pile-up regions (weighted). Pile-up regions are segments of our DNA that many people tend to share with each other because some ancestors in a founder population had them and their descendants are very numerous.

In this article I’ll use data that DNA testers have generously submitted to this survey to answer the question empirically and objectively.

Many people have anecdotally reported that the unweighted value seems to result in better predictions, implying that Timber has removed too many cMs from their matches, either by reducing segments too much or by incorrectly labeling segments as pile-up regions.

Population weights

One bit of terminology that might result in some confusion is that relationship predictors have historically used population weights to improve the accuracy of probabilities. These are unrelated to the weighted or unweighted cM values at Ancestry. Population weights increase the probability given to distant cousins by a predictor because a person typically has about five times as many cousins with each more distant generation. So while you might have about seven 1st cousins, you could expect to have closer to 35 2nd cousins. And this means that distant cousins are more likely to show up in your DNA match list, even for a given cM value.
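To make the arithmetic concrete, here is a minimal sketch of how population weights could shift probabilities toward distant cousins. It assumes the weights act as simple multipliers followed by renormalization, which may not match how any particular predictor applies them. The function names and the likelihood values are hypothetical and made up for illustration.

```python
def cousin_counts(first_cousins=7, growth=5, degrees=4):
    """Expected number of 1st, 2nd, 3rd, ... cousins, multiplying by `growth` each degree."""
    return [first_cousins * growth ** d for d in range(degrees)]


def apply_population_weights(likelihoods, weights):
    """Multiply each relationship's likelihood by its weight, then renormalize to sum to 1.

    Assumption: weights act as plain multipliers; a real predictor may differ.
    """
    weighted = [p * w for p, w in zip(likelihoods, weights)]
    total = sum(weighted)
    return [x / total for x in weighted]


counts = cousin_counts()  # 7 first cousins, 35 second cousins, and so on

# Hypothetical likelihoods for one cM value, ordered 1st cousin to 4th cousin.
likelihoods = [0.10, 0.50, 0.30, 0.10]
reweighted = apply_population_weights(likelihoods, counts)

for degree, (before, after) in enumerate(zip(likelihoods, reweighted), start=1):
    print(f"cousin degree {degree}: {before:.3f} -> {after:.3f}")
```

With these invented numbers the most distant relationship gains probability and the closest one loses it, which is the direction of the effect described above.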

Remember that the purpose of a relationship predictor is to identify unknown matches. That’s much more important than determining how well it works for known matches. (We still have to evaluate predictors based on known matches, but can’t compare a population weighted predictor to an unweighted predictor.) Population weights help us identify unknown matches at the expense of the predictions for known matches.

If none of that is making sense, but you wish it did, please read the article about population weights.

Which tool should we use to evaluate different cM values?

One might instinctively think that we'd enter Ancestry's unweighted and weighted cM values into SegcM and compare the results, since it's the relationship predictor that's by far the most accurate and the only one that comes with the trustworthiness of a methodology published in a science journal.

But we’ll have to use a predictor that doesn’t have population weights instead.

When we compare a relationship predictor’s ability to identify known matches, predictors without population weights have the advantage. Another way to describe what population weights do is that they make it look like you entered a lower cM value than what you actually entered into a relationship predictor. This causes a problem for our current study, where we want to see if weighted (post-Timber) or unweighted (pre-Timber) values give a better prediction.

I already know what’s going to happen in this study if we use the most accurate relationship predictor: unweighted values, which are greater than or equal to weighted values, are going to look better in a population weighted predictor because they’re going to offset what the population weights are doing. While the population weights make it look like you entered a lower cM value, the unweighted values are going to be higher, resulting in higher probabilities for known relationships. Indeed, I tried using a population weighted predictor and saw higher probabilities for unweighted cM values for every relationship.

There’s a simple solution for this. All we have to do is compare the unweighted and weighted cM values with a predictor that doesn’t use population weights. Fortunately, there have been unweighted predictors hosted at this site for several years.

Methodology

The dataset will be fairly limited since Timber is only applied to matches sharing 90 cMs or less. I will only include relationships from the half 2nd cousin and 2nd cousin once removed group to the 6th cousin group. This results in a dataset of several hundred matches, over 87% of which are affected by Timber.

I’ll use the same file that provides probabilities for this unweighted predictor to compare probabilities given to the correct relationship based on both weighted and unweighted cMs. The probabilities shown will be the average across the entire group.
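The comparison described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the probability table, its layout (whole cM value mapped to a probability per relationship group), the group labels, and the match records are all hypothetical, and the actual file and program used for this study are not shown in the article.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical probability table: whole cM value -> {relationship group: probability}.
# The numbers are invented for illustration only.
PROB_TABLE = {
    40: {"3C": 0.30, "4C": 0.40},
    50: {"3C": 0.45, "4C": 0.30},
}


def prob_of(table, cm, group):
    """Probability the table assigns to `group` at the whole cM value nearest `cm`."""
    return table.get(round(cm), {}).get(group, 0.0)


def average_correct_prob(matches, table):
    """Average probability given to the known (correct) group, per group and cM type."""
    collected = defaultdict(lambda: {"weighted": [], "unweighted": []})
    for m in matches:
        group = m["group"]
        collected[group]["weighted"].append(prob_of(table, m["weighted_cm"], group))
        collected[group]["unweighted"].append(prob_of(table, m["unweighted_cm"], group))
    return {
        g: {kind: mean(vals) for kind, vals in kinds.items()}
        for g, kinds in collected.items()
    }


# Hypothetical survey rows: known relationship group plus both cM values.
matches = [
    {"group": "3C", "weighted_cm": 40, "unweighted_cm": 50},
    {"group": "4C", "weighted_cm": 40, "unweighted_cm": 50},
]

averages = average_correct_prob(matches, PROB_TABLE)
for group, kinds in averages.items():
    print(group, kinds)
```

Whichever cM type yields the higher average for a group is the one that gives the better prediction for that group.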

Results

What we see in the results is that matches get higher probabilities for the correct relationship if we enter the weighted (post-Timber) cM value. This is true for all relationships from the 3rd cousins once removed group to 6th cousins.

Comparison of predictions from weighted (post-Timber) vs. unweighted cM

For the 2nd cousin once removed group, probabilities were higher with unweighted values. For the rest of the relationship groups, the weighted value performed better. However, when we’re deciding whether to enter the weighted or unweighted amount into a relationship predictor, it’s because we don’t know the relationship.

A better way to do the analysis is to test by cM ranges. Breaking up the cM values of empirical data pairs into 10 cM bins, we’re left with nine bins. The results below show that for almost all cM ranges, the weighted (post-Timber) cM value results in a prediction that’s the same or better.
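A minimal sketch of the binning step, assuming half-open ranges of the form (low, high] as written in this article, so that an 80 cM match falls in (70, 80] and a 90 cM match falls in (80, 90]. The article doesn't say which of the two cM values defines a match's bin, so the caller chooses here. The function names and the sample rows are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def cm_bin(cm, width=10):
    """Return the (low, high] range of the given width that contains `cm`."""
    high = math.ceil(cm / width) * width
    return (high - width, high)


def average_by_bin(rows, width=10):
    """Average the weighted and unweighted probabilities within each cM bin.

    rows: iterable of (cm, prob_weighted, prob_unweighted), where `cm` is whichever
    cM value you choose to bin on (an assumption left to the caller).
    """
    bins = defaultdict(lambda: ([], []))
    for cm, p_weighted, p_unweighted in rows:
        w_list, u_list = bins[cm_bin(cm, width)]
        w_list.append(p_weighted)
        u_list.append(p_unweighted)
    return {b: (mean(w), mean(u)) for b, (w, u) in sorted(bins.items())}


# Hypothetical rows, invented for illustration only.
rows = [(85.0, 0.20, 0.30), (90.0, 0.30, 0.40), (80.0, 0.50, 0.40)]
by_bin = average_by_bin(rows)
for (low, high), (avg_w, avg_u) in by_bin.items():
    print(f"({low}, {high}]: weighted {avg_w:.2f}, unweighted {avg_u:.2f}")
```

Applied to whole cM values from 1 to 90, `cm_bin` produces nine distinct ranges, matching the nine bins mentioned above.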

Comparison of predictions from weighted (post-Timber) vs. unweighted cMs by cM range

One exception is for the (80, 90] cM range, which corresponds well with what we saw in the first table: 2nd cousins once removed getting better predictions from the unweighted amount. The only other exception is for the range of (30, 40] cMs, which to me looks like an anomaly in the data. In fact, 22% and 23% probabilities may be identical from a statistical standpoint.

The benefits of the weighted cM amount seem to peak around 50 – 70 cMs. As expected, there’s little to no difference in predictions between weighted and unweighted cM amounts as the values get closer to 8 cMs.

Discussion

It seems that Timber works well for distant cousins but not 2nd cousins once removed and closer. What would cause this disparity?

One explanation might be that, while half 2nd cousins and 2nd cousins once removed can share DNA on pile-up regions, they’re more likely to share it from their recent common ancestor(s) than from very distant ancestors from a founder population. For distant cousins, they might be more likely to share segments on pile-up regions because of sharing distant population founder ancestors than because of the most recent shared ancestor.

This would be an argument for lowering the 90 cM threshold for Timber to something like 80 cMs. Or it might be better to prevent Timber from acting upon segments of a certain size, since 2nd to 3rd cousins might share larger segments that happen to contain known pile-up regions within them.

Another explanation could be that distant cousins tend to share more connections. I’m not referring to shared ancestors from a founder population, but to the fact that people share double, triple, or more connections, including all people if you go far enough back. It might be that most of the 4th cousins in the empirical dataset also share a relationship such as 6th cousins in addition to their known relationship.

This would mean that one error is canceling out another, or more accurately an error is canceling out an unwanted artifact: Timber is removing some cMs, bringing matches more in line with theoretical values since no relationship predictors account for the fact that some distant cousins share double connections. The reason that weighted values don’t work as well for 2nd cousins once removed, then, could be that they don’t usually share double connections that contribute as many cMs as a proportion of the known connection.

The results of this study suggest that you should use the weighted cM value for 80 cM matches and below, although it won’t matter too much for 50 cM matches and below. For matches in the range of (80, 90] cMs, it’s better to enter the unweighted value. No matter what, the differences in predictions aren’t going to be huge. But people have been asking this question for a long time and they can finally rest easy entering the weighted cM value for most matches, which is fortunately the one displayed more prominently by Ancestry.

If you’d like to make this analysis more robust, whether you think more data will strengthen the findings here or start to reverse them, please submit DNA match data to the most scientifically relevant DNA match survey here. It’ll be very easy for me to re-run my program as new data arrive.

