An analysis of whether or not the longest segment should be included in relationship predictions

In February of 2023 a new tool called SegcM redefined the way genetic genealogists get their relationship predictions. SegcM uses the number of segments along with total centiMorgans (cMs) to predict how two DNA matches are related. The probability that SegcM assigns to the correct relationship is several times higher than the probability from cMs alone (see here or here). Probably the most amazing thing about SegcM is its ability to distinguish between relationships that have the same average shared DNA. We’ve known for years that the cM ranges differ between these relationships, but the number of segments differs quite a bit too, especially for close family relationships.

The longest segment

We’ve seen countless examples of how much the number of segments improves relationship predictions. Could the longest segment be just as useful? For the past few months I’ve been saying “probably not.”

Since the number of segments and the longest segment are fairly closely related variables, we wouldn’t expect much of an improvement from including them both over using one or the other. Also, while the number of segments provides a good characterization of the whole match (i.e. how many recombinations were there?), the longest segment is a much more specific metric that will have more random variability. Imagine trying to predict relationships by using the shortest segment or the number of cMs on Chromosome 3.

Another aspect of the longest segment that could be troublesome is how genotyping errors will affect it. Because it covers more of the genome, the longest segment is more likely to contain a genotyping error than any other single segment. And genotyping errors are very common, especially at Ancestry. I typically see evidence of about two or three genotyping errors per match when comparing empirical data to simulated data.

Consider how genotyping errors would affect matches in the close family group. While genotyping errors typically add 1-3 segments to a 30 segment match and take away just a few cMs from what could be a couple of thousand cMs, they could easily cause the longest segment to drop by 50%.

A grandparent/grandchild match might share four full chromosomes, just as an example. What if one of them is Chromosome 1 and the other three are much smaller chromosomes? A genotyping error on Chromosome 1 could change the longest segment from 280 cMs to 170 cMs, which is about the genetic length of Chromosome 8.

Genotyping errors on the two largest shared chromosomes could bump the longest segment down even farther. In the above example, if both Chromosomes 1 and 8 have genotyping errors, and Chromosome 1 is broken into pieces shorter than 119 cMs (which takes at least two breaks on a 280 cM segment), the longest segment could become Chromosome 14. That means that a match with only a few genotyping errors could show a longest segment of 119 cMs when it should have been 280 cMs. It’s probably a fairly common occurrence that the longest segment drops by 25% or 50% because of genotyping errors.
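The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: each genotyping error splits a segment into two pieces and removes no DNA. The chromosome lengths of 280, 170, and 119 cMs come from the text above, while the fourth chromosome's length and all of the break positions are hypothetical values chosen for the example.

```python
def split_segment(length_cm, breaks_cm):
    """Split one segment at the given break positions (cM from its start)."""
    for position in breaks_cm:
        if not 0.0 < position < length_cm:
            raise ValueError("each break must fall inside the segment")
    edges = [0.0] + sorted(breaks_cm) + [length_cm]
    return [right - left for left, right in zip(edges, edges[1:])]


def longest_segment(segments, errors=None):
    """Longest piece left after applying the breaks listed for each segment."""
    errors = errors or {}
    pieces = []
    for name, length in segments.items():
        pieces.extend(split_segment(length, errors.get(name, [])))
    return max(pieces)


def percent_drop(before, after):
    """Percentage by which the longest segment shrank."""
    return 100.0 * (before - after) / before


# Four fully shared chromosomes. The 60 cM value is a hypothetical stand-in
# for a small chromosome; the other lengths are the ones quoted in the text.
shared = {"chr1": 280.0, "chr8": 170.0, "chr14": 119.0, "small": 60.0}

clean = longest_segment(shared)
# One hypothetical error in the middle of Chromosome 1.
one_error = longest_segment(shared, {"chr1": [140.0]})
# Two hypothetical errors on Chromosome 1 and one on Chromosome 8.
three_errors = longest_segment(shared, {"chr1": [95.0, 190.0], "chr8": [85.0]})

for label, value in [("no errors", clean), ("one error", one_error),
                     ("three errors", three_errors)]:
    print(f"{label}: longest = {value:.0f} cM, "
          f"drop = {percent_drop(clean, value):.1f}%")
```

Under this simplification, a single break on a 280 cM segment always leaves a piece of at least 140 cMs, which is why the 119 cM outcome in the sketch uses two breaks on Chromosome 1.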

For distant matches, the longest segment will be just as unhelpful for relationship predictions as the number of segments is. People who are distantly related in only one way will almost always share only one segment, so the longest segment will simply equal the total cMs. It doesn’t sound like the longest segment will be of much value for relationship predictions.

But let’s look at it anyway.

Methodology

I’ve chosen two machine learning models that work great for shared matching data and made the following six comparisons:

  • Six close family relationships compared to each other
  • Close family relationships compared to the 1st cousin (1C) group relationships
  • Twelve 1C group relationships compared to each other
  • 1C relationship groups compared to the 1C once removed (1C1R) and half 1C group
  • The 1C1R group compared to 2nd cousins

The six relationship comparisons will be tested with the following three combinations of metrics:

  • The longest segment along with total cMs
  • The # of segments along with total cMs
  • The longest segment, the # of segments, and the total cMs

As a baseline, the accuracy scores for a model in which only cMs were used will also be shown.

Whichever of the two models produces the higher accuracy score for a comparison will be the one whose results are displayed in the table here. Each comparison will have a potentially different set of hyperparameters in order to get the highest accuracy score for each one. So different rows in the table can come from a different set of hyperparameters or even a different classification model.
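The shape of this comparison can be sketched as follows. This is not the pipeline used for the article: the two models are not named here, so a small k-nearest-neighbors classifier scored with leave-one-out accuracy stands in for them, and the handful of matches below are made-up toy values with hypothetical labels rather than simulated or empirical data. The point is only to show the same classifier being scored on each combination of metrics.

```python
import math

# Metric combinations from the lists above; "cMs only" is the baseline.
FEATURE_SETS = {
    "cMs only": ("total_cm",),
    "cMs + longest segment": ("total_cm", "longest_cm"),
    "cMs + # of segments": ("total_cm", "n_segments"),
    "all three": ("total_cm", "longest_cm", "n_segments"),
}


def scaled_points(rows, features):
    """Z-score each chosen feature so cMs and segment counts are comparable."""
    columns = []
    for name in features:
        values = [row[name] for row in rows]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
        columns.append([(v - mean) / sd for v in values])
    return list(zip(*columns))


def loo_knn_accuracy(rows, features, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbors majority vote."""
    points = scaled_points(rows, features)
    labels = [row["label"] for row in rows]
    correct = 0
    for i, point in enumerate(points):
        nearest = sorted(
            (math.dist(point, other), labels[j])
            for j, other in enumerate(points) if j != i
        )[:k]
        votes = [label for _, label in nearest]
        predicted = max(sorted(set(votes)), key=votes.count)
        correct += predicted == labels[i]
    return correct / len(rows)


def toy_match(label, total_cm, n_segments, longest_cm):
    return {"label": label, "total_cm": total_cm,
            "n_segments": n_segments, "longest_cm": longest_cm}


# Made-up values for two hypothetical relationship types (illustration only).
TOY_MATCHES = [
    toy_match("relationship_A", 1650, 22, 170), toy_match("relationship_A", 1700, 24, 150),
    toy_match("relationship_A", 1750, 23, 185), toy_match("relationship_A", 1800, 25, 140),
    toy_match("relationship_B", 1650, 37, 120), toy_match("relationship_B", 1700, 39, 155),
    toy_match("relationship_B", 1750, 38, 110), toy_match("relationship_B", 1800, 40, 135),
]

scores = {name: loo_knn_accuracy(TOY_MATCHES, features)
          for name, features in FEATURE_SETS.items()}
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
```

With real data, each combination would also get its own hyperparameter search (here that would just be the choice of k), and the better of the two models would be reported per row.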

Results

The results of the comparisons are shown in the table below.

Accuracy scores for the number of segments and longest segments for six different relationship comparisons

Accuracy scores for six different relationship comparisons and four different metrics. All metrics include total cMs. The rightmost three columns include the longest segment, # of segments, or all three metrics. Comparisons of relationships within groups are in unshaded rows. Comparisons of two different groups are in shaded rows. A particular row will have results from whichever of two models performed the best.

As we’ve seen before, accuracy scores are much higher when differentiating between groups rather than within them. It’s far easier to tell the difference between two groups than between 12 different relationships in the 1C group. Still, close family predictions here are much better than the roughly 17% expected by chance, and 1C predictions are much better than the 8.33% expected by chance, which is what you would get if you divide 100% by the number of relationships in each group.
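The chance baselines quoted here are simple to reproduce. The sketch below assumes that every relationship in a group is equally represented in the data, so that guessing at random is correct 100% divided by the number of relationships of the time. The function names are just for illustration.

```python
def chance_accuracy(n_relationships):
    """Accuracy (in %) of random guessing among equally likely relationships."""
    if n_relationships < 1:
        raise ValueError("need at least one relationship")
    return 100.0 / n_relationships


def lift_over_chance(accuracy_percent, n_relationships):
    """How many times better than random guessing a given accuracy score is."""
    return accuracy_percent / chance_accuracy(n_relationships)


close_family = chance_accuracy(6)    # six close family relationships
first_cousin = chance_accuracy(12)   # 12 relationships in the 1C group
two_groups = chance_accuracy(2)      # any pairwise comparison of two groups

print(f"close family: {close_family:.2f}%")
print(f"1C group: {first_cousin:.2f}%")
print(f"two groups: {two_groups:.2f}%")
```

The much higher 50% baseline for two-group comparisons is part of why between-group scores look so much better than within-group scores.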

Out of the three metrics, except for the case of one tie, the lowest accuracy score always came from the longest segment and total cMs. When trying to differentiate between relationships within the same group, the accuracy scores for the longest segment are much lower. This means that predictions that use the longest segment wouldn’t be nearly as good as those that use the number of segments.

Total cMs alone are about as good as the other metrics at differentiating between groups. But that doesn’t mean that probabilities from Ancestry simulations done in 2016 will be about as good simply because they only used cMs. These scores come from machine learning models, which have been shown to be more accurate than Ancestry simulations when tested against empirical data. If cMs alone are about as good at differentiating between groups, it would be quite easy to build a relationship predictor like SegcM that uses only cMs to differentiate between groups but then provides the individual relationship probabilities based on both cMs and the number of segments.

In one case (comparing the 1C group to the 1C1R group), including the longest segment along with cMs and the number of segments actually decreased the accuracy score. When there’s a gain from including all three metrics, it’s only a very slight increase in the accuracy score.

Discussion

The longest segment doesn’t help very much with relationship predictions. It performs worse than the number of segments, and using both variables doesn’t increase the accuracy scores much at all. Considering that it took the genetic genealogy community a few months to embrace a predictor that triples or quadruples the accuracy scores for close family predictions, genealogists won’t be too eager to move the accuracy score from 87.0% to 87.1% for differentiating 1C1Rs from 2Cs.

We wouldn’t be able to include the longest segment only when it makes predictions better because relationship predictions are meant to be used when we don’t know the relationship, so we wouldn’t necessarily know that it would make predictions better in a particular case. Although, it seems like we could safely get a slight positive effect by including the longest segment when the shared DNA is above about 1,400 cMs in order to have a slightly more accurate model when differentiating between close family relationships.

Unless a more advanced machine learning algorithm is developed that’s far better at teasing apart relationships based on the longest segment, it won’t be worth it to include the longest segment in relationship predictions. But I suspect that it’s just inherent to the properties of the longest segment that it doesn’t provide much information on top of the number of segments.

Consider the figure below that explains how SegcM works and that was released at the same time. It looks to be fairly jumbled. But closely examining the figure reveals some interesting benefits of the number of segments. For example, about half of paternal aunt/uncle/niece/nephew pairs will have no overlap with any other paternal relationship, meaning that if a match is known to be paternal, that can be the only option. Also, the maternal grandparent/grandchild and paternal grandparent/grandchild clusters have very little overlap with any other maternal relationships, resulting in the same kind of benefit. The paternal grandparent/grandchild cluster has many areas where it’s the only option, leading to predictions of 100% probability in some cases.

sex-specific cMs vs. segments for parent/child to 1st cousins including

A similar plot of total cMs versus the longest segment will show whether or not it could be as useful. In a way, the figure below is like a number of segments plot flipped upside down. That makes sense because relationships with a higher number of segments will generally have smaller segments due to more recombinations. But the most important facet of the below plot is that it’s more jumbled together. The clusters have too much overlap to be useful for relationship predictions. And it’s the same relationships (maternal relationships and aunt/uncle/niece/nephew relationships) that have the most overlap, just like with the number of segments. The longest segment could have been useful if it helped distinguish between the relationships that are harder for the number of segments to differentiate.

Probably a more useful metric would be the standard deviation or variance of segment size. Total cMs along with the number of segments and a measure of the spread of segment sizes would be a combination of metrics that captures a bigger-picture view of a DNA match. The only problem with that is that not a single DNA testing site reports the standard deviation or variance of segment sizes to us. But if we could demonstrate its usefulness, perhaps a site such as GEDmatch, that’s more open to customer feedback and typically provides us with much more information about our DNA matches, could start reporting that to us.
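For anyone with a segment list in hand, the spread of segment sizes is easy to compute. The sketch below is one possible way to summarize a match, using the population standard deviation from Python's statistics module. The function name, the dictionary keys, and the two example matches are hypothetical and chosen only to show two matches that agree on total cMs and number of segments but differ in spread.

```python
from statistics import pstdev


def summarize_match(segments_cm):
    """Summary metrics for one DNA match, given its segment lengths in cM."""
    if not segments_cm:
        raise ValueError("a match needs at least one segment")
    return {
        "total_cm": sum(segments_cm),
        "n_segments": len(segments_cm),
        "longest_cm": max(segments_cm),
        # Population standard deviation; 0.0 for a single-segment match.
        "sd_cm": pstdev(segments_cm),
    }


# Two made-up matches: same total cMs and segment count, different spread.
even_match = summarize_match([50, 50, 50, 50])
uneven_match = summarize_match([140, 30, 20, 10])

for name, summary in [("even", even_match), ("uneven", uneven_match)]:
    print(name, summary)
```

Unlike the longest segment, this measure uses every segment in the match, so a single genotyping error on one segment would likely shift it less dramatically.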

The small gains from the longest segment aren’t worth the substantial amount of work that would be necessary. The probability files for SegcM already contain repeats of all of the possible cM values for each number of segments. A second metric such as the number of segments makes the number of rows in the probability files grow proportionally to the square of the original number of rows (approximately, but fortunately there are fewer possible values of the number of segments than of total cMs). Including the longest segment would require repeating all of the cM and number-of-segments pairs for each possible longest segment value, thus approximately cubing the total number of rows in the probability files.
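A rough count shows how quickly the files would grow. The sketch below assumes that a probability file holds one row for every combination of metric values, and the counts of possible values are hypothetical round numbers, not the actual sizes of the SegcM files.

```python
def probability_rows(n_cm_values, n_segment_values=1, n_longest_values=1):
    """Rows needed if every combination of metric values gets its own row."""
    return n_cm_values * n_segment_values * n_longest_values


# Hypothetical counts of distinct values for each metric.
N_CM = 3500        # possible total cM values
N_SEGMENTS = 60    # possible numbers of segments
N_LONGEST = 280    # possible longest segment values (whole cMs)

cm_only = probability_rows(N_CM)
cm_and_segments = probability_rows(N_CM, N_SEGMENTS)
all_three = probability_rows(N_CM, N_SEGMENTS, N_LONGEST)

print(f"cMs only: {cm_only:,} rows")
print(f"cMs + # of segments: {cm_and_segments:,} rows")
print(f"cMs + # of segments + longest segment: {all_three:,} rows")
```

In practice many combinations are impossible (the longest segment can't exceed the total cMs, for example), so the real growth would be smaller than the full product, but the multiplicative pattern is the same.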

Finally, here are some things to keep in mind about this study:

  • The results here show accuracy scores that are sometimes higher than what can be obtained in real life. Sometimes more than two relationship groups are possible for the given input values. For example, there are cM values at which the 1C group, the 1C1R, and the 2C group are all possible, but I only made pairwise comparisons of groups here.
  • Some of the accuracy scores I’ve shown here could be improved upon with additional hyperparameter tuning, but only slightly.
  • The data do not include genotyping errors. Including genotyping errors would likely show the longest segment more unfavorably.

Overall, the longest segment should not be used in relationship predictions, but it could potentially be used to differentiate between relationships in the close family group.


The data used for these predictions came from Caballero et al. (2019). In this case, the refined genetic map of Bhérer et al. (2017) was used as well as the crossover interference parameters of Campbell et al. (2015).


Edit: This article was published on 12 Sep. 2023, but the URL suggests it was published on 6 Sep. 2023. That’s because it was accidentally published for about five seconds on the earlier date when the article was far from finished. I’ll likely correct the URL eventually and then old links will be broken.

DNA-Sci: advancing the science of relationship predictions. Please also submit data to this new DNA match survey, which will greatly help improve and build new relationship prediction tools. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. You might also like this tool to visualize how much DNA full-siblings share. DNA-Sci is also the original home of DNA coverage calculations.