Small segments affect most of our matches, whether we think we’re using them or not.
I don’t know that I would call it a debate, but the discussion about low centiMorgan (cM) DNA segments continues with fervor. And I almost never hear any arguments from the “other side.” That is, there are very outspoken people who don’t want you to spend your time researching small segments, and then there is this phantom group of people who somehow keep inciting the outspoken people into speaking up.
Small segments are generally considered to be below about 7 cMs. I don’t disagree with the people who are against using small segments. I personally don’t spend my time on them, although that would be my only chance of finding DNA matches who were descendants of my ancestors of interest. But there’s a more important aspect of small segments that I’ve never heard anyone mention. Removing “small segments” that aren’t actually very small will result in underestimated counts of total cMs.
A new way of looking at the problem:
The cM cutoff for segments to be included in the total should be such that the total cMs of improperly removed segments equals the total cMs of improperly included segments.
This thesis statement is based on the idea that preventing people from spending their time on segments that may not be valid isn’t as important as ensuring that we get accurate reports of total cMs from genotyping platforms. As it is now, the majority of DNA matches may have artificially low total cM values. I’ve only ever heard people talk about how they don’t want false segments in their totals because that would mess up our relationship predictions, but that isn’t the right goal. After all, if we wanted to be absolutely sure that we kept no false segments, we might have to remove all segments of 20 cM and below, or worse.
Trying to keep all false segments out is what would actually mess up our relationship predictions. There’s a good chance that removing a segment that has an 80% chance of being real will result in an underestimation of total cMs. However, the cutoff won’t necessarily fall right at the cM value where a segment has a 50/50 chance of being real. That’s because there are many more small segments than large segments out there. As stated in this paper, removing all segments under 5 cMs eliminates 99% of true segments.
The genotyping platforms could probably tell us exactly where the cutoff would be in order to satisfy our thesis statement, but don’t hold your breath. Fortunately, we do have ways to evaluate this problem. Family Tree DNA (FTDNA) recently released a white paper in which they reported the probabilities of segments being real based on segment size. These probabilities, which are shown below, are far more optimistic than the previously most frequently shared table. So it was surreal to see the news spun as further evidence against small segments. Not that I mind if someone calls a 6 cM segment “poison,” but it was especially strange to see a post with that headline from an admin of a group in which one of the only rules is to keep things positive about what we can do with the DNA tools available to us rather than what we can’t do.
The formula to use
How do we evaluate the data in order to satisfy our thesis statement? We’d have to pick a cutoff value, say 5 cMs and under, and then test it in this way:
number of true 1 cM segments * 1 cM + number of true 2 cM segments * 2 cM + number of true 3 cM segments * 3 cM + number of true 4 cM segments * 4 cM + number of true 5 cM segments * 5 cM
~=
number of false 6 cM segments * 6 cM + number of false 7 cM segments * 7 cM + …
We need to see what lower threshold of cM values results in both sides of the equation being equivalent. That’s pretty simple. And all of the information we need is found in the first three columns of the FTDNA table. Of course, this is an approximation. To get a more accurate value, we’d have to sum up the number of true 4.9 cM segments times 4.9 cMs, and so forth for 4.8, etc. And to get even more accurate, we’d have to extend that to two decimal places. But this approximation will be pretty insightful. I’m going to start out with the same cutoff as above, 5 cMs and under. Let’s see what we get.
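For anyone who wants to run this bookkeeping themselves, here’s a minimal Python sketch of the idea. The numbers in the table are made-up placeholders, not FTDNA’s values; you’d substitute the first three columns of their table (segment length, number of segments, and the probability that a segment of that length is real).

```python
# Placeholder table: (length in cM, number of segments observed,
# probability that a segment of that length is real). These numbers are
# hypothetical -- substitute the first three columns of the FTDNA table.
table = [
    (1, 100000, 0.05),
    (2, 50000, 0.10),
    (3, 25000, 0.25),
    (4, 12000, 0.45),
    (5, 6000, 0.65),
    (6, 3000, 0.80),
    (7, 1500, 0.90),
    # ... continue for the rest of the table
]

def cm_balance(table, cutoff):
    """Return (improperly removed cM, improperly kept cM) for a given cutoff.

    Segments of length <= cutoff are discarded, so the true ones among them
    are 'improperly removed'. Segments above the cutoff are kept, so the
    false ones among them are 'improperly kept'.
    """
    removed_true_cm = sum(length * count * p_real
                          for length, count, p_real in table
                          if length <= cutoff)
    kept_false_cm = sum(length * count * (1 - p_real)
                        for length, count, p_real in table
                        if length > cutoff)
    return removed_true_cm, kept_false_cm

for cutoff in (4, 5):
    removed, kept = cm_balance(table, cutoff)
    print(f"cutoff {cutoff} cM: improperly removed {removed:,.0f} cM, "
          f"improperly kept {kept:,.0f} cM")
```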
There’s an enormous difference between the improperly removed cMs and the improperly kept cMs when 5 cM segments and below are discarded. More than three times as many cMs in this dataset are improperly discarded as improperly kept. If we were to assume for a moment that FTDNA’s numbers are correct, I think this difference is far too great to be chalked up to this being an approximation. And the lack of data above 14 cMs will hardly have an impact either: the last two values added to the “improperly kept” series are 351 cMs and 182 cMs, a very small and decreasing amount compared to the total of nearly 20,000. We need a lower cM cutoff according to FTDNA’s numbers.
Now that only 4 cM segments and below are discarded, we have more cMs improperly kept than improperly removed, but the values are much closer to each other. So, if the FTDNA analysis were sound, the cutoff would need to be somewhere between 4 and 5 cMs in order to get accurately reported cM totals, and probably much closer to 4 cMs than 5 cMs. I’m not saying that genotyping platforms should use a cutoff value of 4.2 cMs for reporting totals; rather, that’s only according to FTDNA’s analysis and there may be issues with their methodology. But if FTDNA has high confidence in their methods, it would be perfectly reasonable for them to use a cutoff around 4.2 cMs for reporting total cMs.
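If you want a rough estimate of where the two sides balance without rebuilding the table at finer resolution, a simple linear interpolation between the whole-cM results will do. The imbalance values below are placeholders, not the actual numbers from the analysis above.

```python
# Hypothetical interpolation step, assuming the imbalance (improperly
# removed cM minus improperly kept cM) has already been computed at
# whole-cM cutoffs. Both values are made up for illustration.
imbalance_at_4 = -2000.0   # at a 4 cM cutoff: more cM improperly kept
imbalance_at_5 = 13000.0   # at a 5 cM cutoff: more cM improperly removed

# Linear interpolation for the cutoff where the two sides balance:
balanced_cutoff = 4 + (0 - imbalance_at_4) / (imbalance_at_5 - imbalance_at_4)
print(f"Estimated balanced cutoff: {balanced_cutoff:.1f} cM")
```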
Other DNA sites might pick a different value. An analysis of data from 23andMe by Durand, Eriksson, and McLean (2014) shows probabilities of false matches that are a little less optimistic than FTDNA’s numbers. 23andMe considers two people a match if they share at least a 7 cM segment, but they report additional segments as low as 5 cMs. Could it be possible that the scientists at the genotyping platforms have already thought of all of this? They may have set their thresholds at values such that their databases will be more accurate.
I’ve done the same analysis as I did above with a set of empirical data that isn’t nearly as optimistic about small segments as FTDNA’s table. However, I arrive at an even lower number: somewhere between 3 and 4 cMs.
There’s another way to check if an improper threshold is being used. A cutoff value that’s too high will result in datasets with averages that are too low. If you ever compile a decently sized dataset and the average you calculate for a given relationship type is lower than the theoretical value, there’s a good chance that you can blame it on too high of a cM cutoff value.
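As a rough illustration of that check, here’s a small sketch assuming you’ve compiled shared-cM totals for one relationship type. Both the observed values and the expected mean below are hypothetical; the expected value should come from the theoretical figure for that relationship.

```python
# Hypothetical shared-cM totals for a set of matches of one relationship type,
# and a hypothetical theoretical mean for that relationship.
shared_cm_observed = [48, 62, 55, 71, 40, 66, 58]
expected_mean_cm = 61.0

observed_mean = sum(shared_cm_observed) / len(shared_cm_observed)
print(f"Observed mean: {observed_mean:.1f} cM vs expected {expected_mean_cm:.1f} cM")

if observed_mean < expected_mean_cm:
    # A consistently low average across a decent-sized dataset is the kind of
    # signal that a too-high segment cutoff could be deflating the totals.
    print("Observed mean is below the theoretical value.")
```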
If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.
I read the article you mentioned too, and felt like I was in an alternate universe, so thank you for making it clear that the numbers were actually better, not worse. A far more helpful article would have been one that gave tips on how to discern false segments, instead of swearing off small total matches.
Thanks for your comment. The first thing I said when I read the FTDNA paper was, “The people who dedicate so much of their time against small segments are not going to like this.” So I was surprised to see it shared later as evidence against small segments with no mention of the improved probabilities. There are valid complaints against the FTDNA methodology, so I would’ve understood bringing up those. But I have no doubt that someday we’ll have a much better grasp of whether a particular segment is real or not, so I’d like to see less pessimism about that. We do have reasons to be optimistic. This paper found “a 5% to 15% increase in relationship detection power for 7th through 12th-degree relationships.” Their “results demonstrate that [the software package] ERSA 2.0 can detect relationships as distant as 12th degree and has high power to detect relationships as distant as 8th degree from whole-genome sequence data.”
I have a question. I’m working with a large group (approx. 100 kits) and I’m running hundreds of segment comparisons between them, trying to verify a certain somewhat distant lineage, approx. 10 generations. In this group I have 5 sets of siblings, each set has 2 to 6 siblings. While running these comparisons I am seeing definitive patterns in the small segments. In some cases 25 out of the 100 kits will have nearly identical matching 3 cM to 6 cM segments on chromosome 15. Most of the time some of the siblings will each share exactly identical segments with the other kits. What are the chances of this being accidental? I’m not comparing the siblings to each other; I’m comparing each of them to these distant cousins to try and prove the biological connection. I understand the false segments are unreliable when working with a relatively small group. But what about large group comparisons and statistical analysis?
There has to be some validity there.
Hello and thank you for your question. I think triangulation definitely helps the case for a segment being truly IBD, and the more matches from distinct ancestor paths the better. But I have to wonder whether these true IBD segments at such low cM values came from the ancestor you’re investigating. There’s a good chance that these segments come from relationships far more distant than 8th cousins. The farther back we go, the more degrees of cousinship overlap for a particular cM value. And at that genetic distance, we likely have some pedigree collapse, which increases quite a bit with each generation. We then have ancestors who occur in our trees multiple times, and we’re also usually missing large parts of our family trees. It seems that there’s no way to know if a segment that small came from a particular ancestor in the part of our tree we’re investigating, or if it came from the same ancestor in a different part of our tree, perhaps farther back, or if it came from some other ancestor very far back.
Also, I believe there’s a known pile-up region on chromosome 15. I find that I have matches in these regions who are related so far back that I can rarely find the MRCA. Eventually, I give up on adding people to my lists of matches in these regions because it seems that there’s a never-ending supply of them. I also find that these lists often can’t be separated into paternal and maternal groups. Instead, it appears that all of my matches on those segments fall into one large group from very far back.
Interesting. Thank you for this post.
A question: how does one reconcile the (surprising!) statement in the linked article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4104314/
that eliminating segments under 5 cM eliminates 99% of IBD segments with the data in the first column of the FTDNA table, which seems to indicate that a minority of IBD segments will be below 5 cM? I feel like I must be misreading the table: it seems hard to believe that there would be an order of magnitude fewer 1 cM segments than segments 2 cM or larger.
Hi Barry. Thanks for your comment. You’re right—the FTDNA table shows that, at most, 28.3% of IBD segments are greater than 5 cM. It would appear that either the linked article is calling some 1-5 cM segments IBD when they’re not or that the FTDNA analysis is missing some true IBD segments in the 1-5 cM range, or both.
The linked article uses a criterion of 80% overlap between a person’s segment with a match and a parent’s segment with the same match in order to call a segment IBD. They hint at the enormous number of small segments when they say that 67% of segments under 4 cM are false, but 99% of IBD segments are under 5 cM. But I would imagine that the criterion they use would improperly classify a large number of additional small segments as IBD. We don’t know that a parent’s segment with a match was itself true IBD.
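For illustration, here’s a minimal sketch of an overlap test along those lines. The segment coordinates, the function, and the way the threshold is applied are my own assumptions rather than the paper’s actual implementation.

```python
def overlap_fraction(child_seg, parent_seg):
    """Fraction of the child's segment (start, end) covered by the parent's segment."""
    start = max(child_seg[0], parent_seg[0])
    end = min(child_seg[1], parent_seg[1])
    covered = max(0, end - start)
    return covered / (child_seg[1] - child_seg[0])

# Hypothetical segment positions (in cM) shared with the same match:
child_match_segment = (20.0, 26.5)
parent_match_segment = (19.0, 26.0)

# Flag the child's segment as likely IBD by this test if at least 80% of it
# is covered by the parent's segment with the same match.
if overlap_fraction(child_match_segment, parent_match_segment) >= 0.80:
    print("Segment passes the 80% overlap criterion (likely IBD by this test)")
else:
    print("Segment fails the criterion")
```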
There are also issues with the FTDNA paper. It’s hard to believe that the number of true IBD segments goes down as the segment size decreases below 5 cM. I understand that the total number of segments goes up with decreasing cM, but I would also expect more true IBD segments with decreasing cM. Recombination interference prevents very small segments from forming in a single generation, but it does nothing to prevent a new crossover from landing near a breakpoint inherited from a previous generation.
I suspect that the linked article is more correct. But I think that the 99% value has to be treated as a maximum due to the likelihood of erroneously classifying some small segments as true IBD. It should probably read that “up to 99% of true IBD segments may be below 5 cM.”