The Overused CentiMorgan

The number of centiMorgans (cMs) we share with a relative tells us how closely related we are to them. Most people think about cM as a length. In fact, geneticists refer to it as “genetic length.” A smaller subset recognizes that a cM is the probability, in percent, that a crossover will occur between two genetic loci during a given meiosis event. But we can be even more precise about the definition. If it were just a probability, a 100 cM segment would have a 100% chance of recombining. But it doesn’t. A 100 cM segment will have one recombination on average across the population. A 3,500 cM genome, i.e. a 35 Morgan genome, will have 35 recombinations, on average. Why would we use a unit with such a strange definition?
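To make the distinction between "probability" and "average number of crossovers" concrete, here is a minimal sketch. It assumes crossovers follow a Poisson process with no interference (Haldane's model), which is a simplifying assumption of mine and not something the definition above requires. The function names are only illustrative.

```python
import math


def expected_crossovers(length_cm: float) -> float:
    """Average number of crossovers per meiosis: 1 per 100 cM (1 Morgan)."""
    return length_cm / 100.0


def prob_at_least_one_crossover(length_cm: float) -> float:
    """Chance of one or more crossovers, ASSUMING a Poisson count (no interference)."""
    return 1.0 - math.exp(-expected_crossovers(length_cm))


def haldane_recombination_fraction(length_cm: float) -> float:
    """Chance the two ends end up recombined (odd number of crossovers),
    using Haldane's map function, which carries the same no-interference assumption."""
    return 0.5 * (1.0 - math.exp(-2.0 * expected_crossovers(length_cm)))


if __name__ == "__main__":
    for cm in (1, 10, 100, 3500):
        print(
            cm,
            expected_crossovers(cm),
            round(prob_at_least_one_crossover(cm), 4),
            round(haldane_recombination_fraction(cm), 4),
        )
```

Under these assumptions, a 1 cM segment recombines about 1% of the time, so "cM as a percentage" works well for short segments. A 100 cM segment still averages one crossover, but the chance of seeing at least one is only about 63%, not 100%.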

Well, we’d use cMs if we didn’t know the number of base pairs (bp) that match, or if we’re genotyping our DNA based on a limited number of SNPs. This is how it was done long before scientists started sequencing genomes. Fortunately, whole-genome sequencing is getting progressively cheaper and we now have a very good idea about how many base pairs are in the human genome. For our decently close DNA matches, this will do away with the need to make approximations using the genetic linkage metric (cMs). But what do we do now, when most of us who have had our DNA tested only have a small percentage of our SNPs genotyped? It turns out there’s very little reason that the units of shared DNA should have to remain in cMs afterwards (as you can see in most science articles). In fact, there are very good reasons to use something else, as you’ll see below.

Percentages or fractions of shared DNA are much more intuitive concepts to genetic genealogy novices and are preferred by scientists as the most natural metric (from correspondence with a prominent, published geneticist). Additionally, people who understand cMs also understand percentages, while the reverse is far from true.

The cM metric has plenty of benefits and is one of the most used terms/concepts in genetic genealogy. It seems to be used orders of magnitude more than the simple “percentage,” which would work just as well or better in more cases.

People often compare cMs across platforms as if those numbers mean the same thing. This is not the case. AncestryDNA tests a set of single nucleotide polymorphisms (SNPs) such that the maximum number of shared cMs is 6,978. By comparison, 23andMe has a maximum of 7,076 cMs. Below is a table that shows the total autosomal cMs across the largest platforms.

Site	Half (cMs)	Total (cMs)
GEDmatch	3,587.5	7,175
23andMe	3,538	7,076
MyHeritage	3,500	7,000
AncestryDNA	3,489	6,978
FTDNA	3,384	6,768

Comparison of the total number of autosomal cMs possible at the five largest platforms. FTDNA stands for Family Tree DNA. The second column denotes the amount of cMs possible in one copy of the genome, while the third column shows the amount in both copies. Numbers close to these can be found by comparing a kit to itself, a person to their parent, or comparing identical twins, but note that some platforms are going to show only half of the available cMs because they show half-identical regions (HIR), by default, and some sites will show the full amount. MyHeritage numbers come from their FAQ page. All other data come from personal collection.

You might be wondering how GEDmatch can have total cMs greater than any of the individual platforms from which it gets data. That’s a good question. Looking at just one chromosome will be an instructive lesson. Let’s compare 23andMe to GEDmatch.

There’s a theoretical genetic map that contains all of the base pairs in our genome. And there are condensed genetic maps that are a result of only sampling parts of the genome. This is what genotyping companies do—they only test certain SNPs. There are certain SNPs that you could remove without affecting the cMs as much as other SNPs. This is because recombination varies across different parts of the genome. When GEDmatch does a comparison of kits from 23andMe, they know where the kit came from and they could use the genetic map from 23andMe. But instead GEDmatch uses even more SNPs. If you check a parent/child match at 23andMe you’ll likely see about 3,538 autosomal cMs. If you check the same kits at GEDmatch, you’ll see something close to 3,587 cMs. GEDmatch uses their own genetic map when comparing kits, rightly so.

Now that we’re comfortable with the differences across platforms, I’d like to compare some numbers between the one with the highest cMs (GEDmatch) and the one with the lowest cMs (FTDNA). One could also show stark differences between 23andMe and FTDNA, or between any two platforms for that matter.

For close relationships, there’s a real difference between GEDmatch and FTDNA. A paternal grandparent could easily share 35% of their DNA with you. That would be 2,511 cMs at GEDmatch or 2,369 cMs at FTDNA, a difference of 142 cMs. Genetic genealogists have serious arguments over much smaller differences, and people asking for advice are often told that their family has secrets based on discrepancies this large. The differences aren’t as large between other platforms, but are still significant. To compare a cM value from AncestryDNA exactly to a cM value at GEDmatch, one would have to multiply the AncestryDNA value by the fraction 7,175 / 6,978. Essentially, that calculation converts from one platform to a percentage, and then to the other platform.
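The conversion described above can be sketched in a few lines. The totals are the ones from the table in this article. The function names are my own, and the proportional scaling is an on-average approximation, since any individual segment may fall in a region where the platforms' maps differ by more or less than the genome-wide ratio.

```python
# Total autosomal cMs (both copies) per platform, taken from the table above.
TOTAL_CM = {
    "GEDmatch": 7175.0,
    "23andMe": 7076.0,
    "MyHeritage": 7000.0,
    "AncestryDNA": 6978.0,
    "FTDNA": 6768.0,
}


def cm_to_percent(cm: float, platform: str) -> float:
    """Express a cM amount as a percentage of that platform's total."""
    return 100.0 * cm / TOTAL_CM[platform]


def percent_to_cm(percent: float, platform: str) -> float:
    """Express a percentage as cMs on a given platform."""
    return percent / 100.0 * TOTAL_CM[platform]


def convert_cm(cm: float, source: str, target: str) -> float:
    """Convert cMs between platforms by passing through a percentage."""
    return percent_to_cm(cm_to_percent(cm, source), target)


if __name__ == "__main__":
    print(round(percent_to_cm(35, "GEDmatch")))   # grandparent example at GEDmatch
    print(round(percent_to_cm(35, "FTDNA")))      # same share at FTDNA
    print(round(convert_cm(8, "AncestryDNA", "23andMe"), 2))
```

Going through the percentage is what makes the conversion a single, platform-neutral step: only the two totals involved need to be known.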

For this reason, percentages are a universal unit, while values in cM aren’t comparable between platforms. It’s no wonder that scientists prefer fractions or percentages of shared DNA over cMs.

This brings up an interesting problem. Carefully controlled scientific studies apply the same methods to all of the data involved in the study. But datasets that combine cMs from multiple direct-to-consumer genotyping sites have some level of inaccuracy necessarily baked in, as the number of total cM is different for every site. At the very least, one would have to know exactly what proportion of the data came from each platform. And, hopefully, the total cMs didn’t change at any of those platforms during the course of data collection.

For my ranges of shared DNA, which include sex-specific relationships in most cases, I prefer to use percentages. This way, it only takes one step for someone to calculate how many cMs that would imply at a particular platform. And if I didn’t report my statistics in percentages, it would be hard to decide which platform’s total cM numbers I would use to convert those percentages to cMs. Whichever site I chose, people would likely then compare the numbers to other sites without properly converting them. That isn’t a big deal for small matches—an 8 cM segment at Ancestry is about 8.1 cMs at 23andMe, on average. But it can make large matches off by hundreds of cMs.

All of this suggests that, when a genotyping novice asks a group what it means that they have a match at 25%, maybe the appropriate response isn’t for a dozen people to immediately bark at them, “What’s the cM?” It turns out that the novice knew better all along: percentages were perfectly fine, better even. We can easily guide a person, from memory, towards likely relationships for a given percentage of shared DNA.

I admit that percentages for distant relationships are cumbersome. I’d rather see that one of my matches at 23andMe shares 59 cM with me than have it presented as 0.83%. But I’d rather see a 25% match than a 1,769 cM match. There’s a point above which the benefits of percentages outweigh the benefits of cMs. That point could be different for each person, but for me it probably falls somewhere between 1% and 3%, which means that I’d like to see relationships more distant than 2nd cousins reported to me in cMs.
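As a rough illustration of that preference, a display rule might look like the sketch below. The 2% default threshold is a hypothetical value picked from inside the 1% to 3% range mentioned above, and the function name is invented for this example.

```python
def format_match(percent: float, total_cm: float, threshold_percent: float = 2.0) -> str:
    """Show close matches as a percentage and distant ones in cM.

    threshold_percent is a hypothetical cutoff; each reader may prefer another.
    total_cm is the platform's total autosomal cMs (both copies).
    """
    if percent >= threshold_percent:
        return f"{percent:g}%"
    return f"{percent / 100.0 * total_cm:.0f} cM"


if __name__ == "__main__":
    print(format_match(25.0, 7076.0))   # a close match at 23andMe
    print(format_match(0.83, 7076.0))   # a distant match at 23andMe
```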

You might be thinking that the cM metric is necessary for providing a cutoff to eliminate false matches. I can’t stress enough that the world’s leading geneticists know what they’re doing. When they count up the amount of IBD sharing between individuals using the genomic (bp) metric, they obviously can’t include every matching base pair as a match. We know that people who aren’t closely related still share over 99% of their DNA. Geneticists use a cutoff when counting base pairs just like they do when counting cMs. One can even use a cM cutoff when counting bp, i.e. add up all of the matching bp, but only for segments ≥ 7 cMs.
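Here is a minimal sketch of counting base pairs with a cM cutoff. The `Segment` type, the segment values, and the `total_bp` figure are all made up for illustration. In particular, `total_bp` is a round placeholder and not the real size of the human genome.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    length_bp: int     # physical length of the matching segment in base pairs
    length_cm: float   # genetic length of the same segment in cM


def shared_percent(segments: Iterable[Segment], total_bp: int, min_cm: float = 7.0) -> float:
    """Percentage of shared DNA by bp, counting only segments of at least min_cm."""
    kept_bp = sum(s.length_bp for s in segments if s.length_cm >= min_cm)
    return 100.0 * kept_bp / total_bp


if __name__ == "__main__":
    # Hypothetical match: the 4 cM segment falls under the cutoff and is ignored.
    segments = [
        Segment(40_000_000, 35.0),
        Segment(10_000_000, 12.0),
        Segment(3_000_000, 4.0),
    ]
    print(shared_percent(segments, total_bp=1_000_000_000))
```

The cM value is used only as a filter here. The amount that gets reported is still measured in base pairs and expressed as a percentage.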

Out of all the criticisms of percentages that I’ve heard so far, none were good reasons. But there is one issue with reporting percentages. While cMs take into account recombination, the bp metric doesn’t. For close matches, such as first cousins, it’s highly unlikely that the ~12.5% of DNA you share all came from high recombination or low recombination segments of the genome. It’s much more likely that it will be made up of average recombination regions and/or some that are high and some that are low. The cMs for first cousin matches will average out so that bp sharing in percentages is a great measurement, better than cMs even. However, if you only share one 0.14% segment with someone, that could be from a high or low recombination region and therefore isn’t necessarily a 10 cM segment. But, then again, you know that a 0.14% segment is at least 7 cMs if that’s the threshold that was used. Still, it’s best to report distant matches in cMs and close matches in percentage. For this reason, I don’t think either metric will ever go away.

Some people claim that they remember cMs better than percentages. Let’s test that out. If you ever catch me on the spot without any paper or electronics, I’ll happily recite to you that the population averages of shared DNA are 50% for siblings; 25% for half-siblings, grandparents, and aunts/uncles; 12.5% for great-grandparents, great-aunts/uncles, or first cousins, etc. Will you do the same for me, i.e. tell me off the top of your head, that at 23andMe, the averages are 3,538 cMs for full-siblings, 1,769 cMs for the close family group, and 885 cMs for great-grandparents, great-aunts/uncles, and 1st cousins, etc? And will you be able to recite the different averages for each other platform? If you do, it will be because you took great care to memorize the numbers, whereas it took me no effort to learn the percentages.

You may think that using percentages would entail going back to a “dumbed-down” unit, but that isn’t true. Sometimes a simpler solution is actually better, or more accurate, such as in this case. The best solution is to use a genomic (bp) measurement of IBD sharing, but with a low cM cutoff threshold, and then to report close matches in percentage and distant matches in cM. I hope to see more people using percentages in the future.

DNA-Sci: advancing the science of relationship predictions. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared X-DNA, shared atDNA percentages, and shared atDNA centiMorgans. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium. Cover photo by Annie Spratt.
