Population Weights

What are population weights as applied to DNA match relationship predictors?

Relationship predictors are DNA match tools such as SegcM that produce a table of possible relationships and their probabilities. These tools have historically used population weights to improve the accuracy of results.

Population weights increase the probability given to distant cousins by a relationship predictor because a person typically has about five times as many cousins with each more distant generation. So while you might have about seven 1st cousins, you could expect to have closer to 35 2nd cousins.

Background

In 2012, scientists from 23andMe published the first article about population weights. They developed a formula for approximating the number of cousins a person would have, assuming that 2.5 children survive per family. In 2021 I developed the same formula without realizing that Henn et al. (2012) had already done so. Relationship predictors also include groups that fall halfway between nth cousins, such as cousins once removed or half cousins, so I also had to develop formulas for cousins once removed. These formulas were later published in a scientific article.

Number of cousins

The formulas show that population weights grow very large for distant cousins. This means that, at any given cM value, distant cousins are more likely to show up in your DNA match list than the shared DNA alone would suggest. For 4th cousins and more distant relationships, the population weights drive the probabilities much more than any other factor, such as whether or not the predictor uses a peer-reviewed scientific model. The table below shows the number of cousins one might expect to have, up to 8th cousins once removed, ignoring pedigree collapse and endogamy.

Number of expected cousins based on formulas from Henn et al. (2012) and Nicholson (2022). 1C = 1st cousin; 1C1R = 1st cousin once removed, etc.
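
To make the growth concrete, here is a minimal Python sketch of one simple counting argument. It is an assumption for illustration, not necessarily the exact formula published by Henn et al. (2012) or Nicholson (2022), and the function name is hypothetical. It assumes that every family has the same number of surviving children (2.5 by default), that there is no pedigree collapse or endogamy, and it covers only full nth cousins, not cousins once removed or half cousins. Under those assumptions a person has 2^n ancestral couples at the relevant generation, each ancestor in the direct line has (c - 1) siblings, and each sibling's line grows by a factor of c for n generations.

```python
def expected_full_cousins(degree: int, children_per_family: float = 2.5) -> float:
    """Rough expected number of full nth cousins (hypothetical helper).

    Assumes every family has the same number of surviving children and
    ignores pedigree collapse, endogamy, half and removed relationships.
    """
    if degree < 1:
        raise ValueError("degree must be 1 (1st cousins) or higher")
    c = children_per_family
    ancestral_couples = 2 ** degree      # couples shared with nth cousins
    siblings_of_ancestor = c - 1         # the ancestor's own siblings
    descendants_per_sibling = c ** degree  # each sibling's line, n generations down
    return ancestral_couples * siblings_of_ancestor * descendants_per_sibling


# Print the expected counts for 1st through 8th cousins.
for degree in range(1, 9):
    print(f"{degree}C: {expected_full_cousins(degree):,.1f}")
```

With 2.5 children per family, each step multiplies the count by 2 × 2.5 = 5, which is the roughly fivefold growth per generation described above. The first two values come out slightly above the rounded figures of seven and 35 used in the text.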

Why is this necessary?

The reason for population weights isn’t intuitive for most people. So consider this scenario. Let’s say you have seven 1st cousins and seventeen 1st cousins once removed (1C1R). That’s 2.4x as many 1C1Rs as 1st cousins. The average shared DNA for a 1C1R is 6.25%, while the average is 12.5% for a 1st cousin. Because 1st cousins have a wider range of shared DNA, the point at which 1st cousins and 1C1Rs each have 50% probability is around 8.9%. That’s what would show in a relationship predictor without population weights.

Let’s say you have a new 8.9% match. Knowing nothing other than the amount of DNA you share and how many 1st cousins and 1C1Rs a person typically has, what are the probabilities for a 1st cousin or a 1C1R? Assuming that no other relationships are possible, it’s 2.4x more likely that they’re a 1C1R because you have 2.4x as many 1C1Rs as 1st cousins. The odds are 1 to 2.4, so there’s a 1/3.4 ≈ 29% chance that the match is a 1st cousin and a 2.4/3.4 ≈ 71% chance that they’re a 1C1R. This is what population weights do, and it’s a good thing we have them.
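
The scenario above can be sketched in a few lines of Python. This is only an illustration of the reweighting idea, not the code behind any particular predictor. The function name and the dictionaries are hypothetical, and the inputs are the numbers from the example: equal unweighted probabilities at 8.9%, seven 1st cousins, and seventeen 1C1Rs.

```python
def apply_population_weights(unweighted, weights):
    """Multiply each unweighted probability by its population weight,
    then rescale so the results sum to 1 (hypothetical helper)."""
    weighted = {rel: prob * weights[rel] for rel, prob in unweighted.items()}
    total = sum(weighted.values())
    return {rel: value / total for rel, value in weighted.items()}


# At about 8.9% shared DNA, the unweighted predictor gives 50/50.
unweighted = {"1C": 0.5, "1C1R": 0.5}

# Population weights: how many relatives of each type a person has
# in this example (seven 1st cousins, seventeen 1C1Rs).
cousin_counts = {"1C": 7, "1C1R": 17}

weighted = apply_population_weights(unweighted, cousin_counts)
for rel, prob in weighted.items():
    print(f"{rel}: {prob:.0%}")
```

Only the ratio between the weights matters, because the results are rescaled to sum to 1. The same function works when more than two relationships are possible, as long as each one has an unweighted probability and a weight.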

One important thing to note is that population weights are unrelated to the weighted or unweighted cM values at Ancestry, which come from Ancestry’s attempt to reduce segments from pile-up regions.

Remember that the purpose of a relationship predictor is to identify unknown matches. That’s much more important than determining how well it works for known matches. (We still have to evaluate predictors based on known matches, but we can’t compare a population-weighted predictor to an unweighted predictor that way.) Population weights help us identify unknown matches at the expense of the predictions for known matches.

If you start looking at your matches sorted by the highest amount of shared DNA first, you’re going to see distant matches earlier on than if all degrees of cousins were equally represented in the population. When you come to an unknown cousin, they’re likely more distant than what’s suggested by the cMs. If you skip over all of the matches with an unknown relationship and enter the shared DNA of known relationships into an unweighted predictor, you’ll get better predictions. If you skip all of the known matches and enter the shared DNA for unknown matches into an unweighted predictor, it’ll often make it seem like the matches are closer than they really are.

How do population weights affect the results?

Since the methodology for relationship predictors from DNA matching sites is proprietary, people have been left guessing about whether or not any particular predictor includes population weights. Fortunately, I started building predictors both with and without population weights in early 2021. When you compare other predictors to the weighted or unweighted predictors at DNA Science, it’s very easy to see which one they match.

The image below is taken from a 2023 GFO presentation. It shows probabilities from four different relationship predictors, all at 8 cMs. We can see from the second predictor from the left that, without population weights, the most likely relationship is 4th cousins. This goes against the empirical evidence that any genetic genealogist has seen while working matches. The vast majority of our 8 cM matches are much farther back than 4th cousins. The population-weighted predictor on the far left shows that this is most likely an 8th cousin once removed or farther back.

Weighted predictor vs. three different unweighted predictors, all at 8 cMs.

The Ancestry probabilities shown above, which have been updated since the 2016 matching white paper probabilities, are clearly similar to the unweighted DNA-Sci predictor and not the weighted version. The same goes for the MyHeritage predictor. Although it’s possible that the MyHeritage predictor has some mild population weights, the probabilities are far closer to the unweighted DNA-Sci predictor.

There are also probabilities from Ancestry’s 2016 matching white paper, which are rumored to have population weights. If you look at those probabilities, you can see that they do have population weights, but they’re milder than the Henn et al. (2012) and Nicholson (2022) population weights.

Population weights are an essential component of relationship predictions. It’s best to use a predictor with a clearly published methodology, such as SegcM, so that we know which methodology is being used and can judge its accuracy.

