Small segments affect most of our matches, whether we think we’re using them or not.
I don’t know that I would call it a debate, but the discussion about low centiMorgan (cM) DNA segments continues with fervor. And I almost never hear any arguments from the “other side.” That is, there are very outspoken people who don’t want you to spend your time researching small segments, and then there is this phantom group of people who somehow keep inciting the outspoken people into speaking up.
Small segments are generally considered to be below about 7 cMs. I don’t disagree with the people who are against using small segments. I personally don’t spend my time on them, although that would be my only chance of finding DNA matches who were descendants of my ancestors of interest. But there’s a more important aspect of small segments that I’ve never heard anyone mention. Removing “small segments” that aren’t actually very small will result in underestimated counts of total cMs.
A new way of looking at the problem:
The cM cutoff for segments to be included in the total should be such that the total cMs of improperly removed segments equals the total cMs of improperly included segments.
This thesis statement is based on the idea that preventing people from spending their time on segments that may not be valid isn’t as important as ensuring that we get accurate reports of total cMs from genotyping platforms. As it is now, the majority of DNA matches may have artificially low total cM values. I’ve only ever heard people talk about how they don’t want false segments in their totals because that would mess up our relationship predictions, but that isn’t the right goal. After all, if we wanted to be absolutely sure that we kept no false segments, we might have to remove all 20 cM segments and below, or worse.
Trying to keep all false segments out is what would mess up our relationship predictions. There’s a good chance that removing a segment that has an 80% chance of being real will result in an underestimation of total cMs. Although, the cutoff won’t necessarily be right at the cM value where a segment has a 50/50 chance. That’s because there are many more small segments than large segments out there. As stated in this paper, removing all segments under 5 cMs eliminates 99% of true segments.
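To make the 80% example concrete, here is a minimal Python sketch of the expected error from each choice for a single segment. The 6 cM size and the function names are assumptions chosen for illustration; they don’t come from any platform.

```python
def expected_cm_lost_if_removed(cm: float, p_true: float) -> float:
    """Expected true cMs dropped from the total when the segment is discarded."""
    return cm * p_true


def expected_cm_added_if_kept(cm: float, p_true: float) -> float:
    """Expected false cMs added to the total when the segment is kept."""
    return cm * (1.0 - p_true)


# Hypothetical segment: 6 cM with an 80% chance of being real.
lost = expected_cm_lost_if_removed(6.0, 0.8)
added = expected_cm_added_if_kept(6.0, 0.8)
print(f"expected cMs lost if removed: {lost:.1f}")
print(f"expected cMs added if kept:   {added:.1f}")
```

Under these assumed numbers, discarding the segment costs about 4.8 cMs on average, while keeping it adds about 1.2 cMs. This looks at one segment only; the cutoff for a whole dataset also depends on how many segments there are of each size.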
The genotyping platforms could probably tell us exactly where the cutoff would be in order to satisfy our thesis statement, but don’t hold your breath. Fortunately, we do have ways to evaluate this problem. Family Tree DNA (FTDNA) recently released a white paper in which they reported the probabilities of segments being real based on segment size. These probabilities, which are shown below, are far more optimistic than the previously most frequently shared table. So, it was surreal to see the news spun as further evidence against small segments. And not that I mind if someone calls a 6 cM segment “poison,” but it was especially strange to see a post by an admin with that headline in a group in which one of the only rules is to keep things positive about what we can do with the DNA tools available to us rather than what we can’t do.
The formula to use
How do we evaluate the data in order to satisfy our thesis statement? We’d have to pick a cutoff value, say 5 cMs and under, and then test it in this way:
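Improperly removed cMs (true segments at or below the cutoff):
(Number of true 1 cM segments × 1 cM) + (Number of true 2 cM segments × 2 cMs) + (Number of true 3 cM segments × 3 cMs) + (Number of true 4 cM segments × 4 cMs) + (Number of true 5 cM segments × 5 cMs)
compared against
Improperly kept cMs (false segments above the cutoff):
(Number of false 6 cM segments × 6 cMs) + (Number of false 7 cM segments × 7 cMs) + …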
We need to see what lower threshold of cM values results in both sides of the equation being equivalent. That’s pretty simple. And all of the information we need is found in the first three columns of the FTDNA table. Of course, this is an approximation. To get a more accurate value, we’d have to sum up the number of true 4.9 cM segments times 4.9 cMs, and so forth for 4.8, etc. And to get even more accurate, we’d have to extend that to two decimals. But this approximation will be pretty insightful. I’m going to start out with the same cutoff as above. Let’s see what we get.
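The test described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: segments are grouped into whole-number cM bins, each bin has a segment count and a probability of being real, and the function names and toy numbers are made up for illustration. They are not FTDNA’s figures.

```python
from typing import Dict, Optional, Tuple


def removed_and_kept(
    counts: Dict[int, int], p_true: Dict[int, float], cutoff: int
) -> Tuple[float, float]:
    """Return (improperly removed cMs, improperly kept cMs) for one cutoff.

    Segments at or below `cutoff` are discarded; segments above it are kept.
    """
    removed = sum(cm * n * p_true[cm] for cm, n in counts.items() if cm <= cutoff)
    kept = sum(cm * n * (1.0 - p_true[cm]) for cm, n in counts.items() if cm > cutoff)
    return removed, kept


def bracket_balance(
    counts: Dict[int, int], p_true: Dict[int, float]
) -> Optional[Tuple[int, int]]:
    """Find adjacent bin sizes between which the two totals cross over.

    Returns None if no crossover occurs inside the range of the data.
    """
    sizes = sorted(counts)
    for low, high in zip(sizes, sizes[1:]):
        r_low, k_low = removed_and_kept(counts, p_true, low)
        r_high, k_high = removed_and_kept(counts, p_true, high)
        if r_low <= k_low and r_high >= k_high:
            return low, high
    return None


# Made-up toy numbers, NOT real data: segment counts and P(real) per cM bin.
toy_counts = {1: 100, 2: 50, 3: 20, 4: 10}
toy_p_true = {1: 0.1, 2: 0.5, 3: 0.8, 4: 1.0}

for cutoff in sorted(toy_counts):
    removed, kept = removed_and_kept(toy_counts, toy_p_true, cutoff)
    print(f"cutoff {cutoff} cM: removed {removed:.1f} cMs, kept {kept:.1f} cMs")

print("balance lies between:", bracket_balance(toy_counts, toy_p_true))
```

With real counts and probabilities in place of the toy dictionaries, the same two functions would show which pair of whole-number cutoffs brackets the balance point. Finer bins (4.9, 4.8, and so on) would narrow it further, at the cost of needing data at that resolution.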
There’s an enormous difference between the improperly removed cMs and the improperly kept when 5 cM segments and below are discarded. More than three times as many cMs in this dataset are improperly discarded as improperly kept. If we were to assume for a moment that FTDNA’s numbers are correct, I think this difference is far too great to be chalked up to this being an approximation. And the lack of data above 14 cMs will hardly have an impact either: the last two values added to the “improperly kept” series are 351 cMs and 182 cMs—a very small, decreasing amount compared to the total of nearly 20,000. We need a lower cM cutoff according to FTDNA’s numbers.
Now that only 4 cM segments and below are discarded, we have more cMs improperly kept than improperly removed, but now the values are much closer to each other. So, if the FTDNA analysis were sound, the cutoff would need to be somewhere between 4 and 5 cMs in order to get accurately reported cM totals, and probably much closer to 4 cMs than 5 cMs. I’m not saying that genotyping platforms should use a cutoff value of 4.2 cMs for reporting totals; rather, that’s only according to FTDNA’s analysis and there may be issues with their methodology. But if FTDNA has high confidence in their methods, it would be perfectly reasonable for them to use a cutoff around 4.2 cMs for reporting total cMs.
Other DNA sites might pick a different value. An analysis of data from 23andMe by Durand, Eriksson, and McLean (2014) shows probabilities of false matches that are a little less optimistic than FTDNA’s numbers. 23andMe considers two people a match if they share at least a 7 cM segment, but they report additional segments as low as 5 cMs. Could it be possible that the scientists at the genotyping platforms have already thought of all of this? They may have set their thresholds at values such that their databases will be more accurate.
I’ve done the same analysis as I did above with a set of empirical data that isn’t nearly as optimistic about small segments as FTDNA’s table. However, I arrive at an even lower number: somewhere between 3 and 4 cMs.
There’s another way to check if an improper threshold is being used. A cutoff value that’s too high will result in datasets with averages that are too low. If you ever compile a decently sized dataset and the average you calculate for a given relationship type is lower than the theoretical value, there’s a good chance that you can blame it on too high of a cM cutoff value.
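As a rough illustration of that check, here is a small Python sketch that compares a dataset’s average to a theoretical value. The function name and the sample percentages are made up; the only outside fact used is that first cousins are expected to share 12.5% of their DNA on average.

```python
from statistics import mean
from typing import Sequence


def average_shortfall(observed: Sequence[float], theoretical: float) -> float:
    """Return theoretical minus the observed mean.

    A positive result means the dataset average sits below the theoretical value.
    """
    return theoretical - mean(observed)


# Made-up shared-DNA percentages for first cousins, NOT real data.
toy_first_cousins = [11.0, 12.0, 12.5, 10.5]
shortfall = average_shortfall(toy_first_cousins, 12.5)
print(f"average is {shortfall:.2f} percentage points below theory")
```

A positive shortfall in a decently sized dataset would be consistent with a cutoff that’s too high, although a handful of matches like the toy list here is far too few to conclude anything.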