The number of centiMorgans (cM) we share with a relative tells us how closely related we are to them. A cM is a probability, expressed as a percentage: two genetic loci are 1 cM apart when there’s a 1% chance that a crossover will occur between them during a given meiosis event. The units of cM are right there in the definition: it’s a probability, in percentage. However, if you’re a student of genetics you’ll know, or if you do a quick internet search for “centiMorgans” you’ll see, that almost everyone refers to it as a length or a distance. That’s because the purpose of a cM is to quantify how far apart two different markers are in the genome. Those of us who are very familiar with genetics understand the above definition, but why would we use a unit with such a strange definition?
Well, we’d use cM if we didn’t know the genomic distance in base pairs (bp) or if we’re genotyping our DNA based on a limited number of SNPs. This is how it was done long before scientists started sequencing genomes. Fortunately, whole-genome sequencing is getting progressively cheaper and we now have a very good idea about how many base pairs are in the human genome. For our decently close DNA matches, this will do away with the need to make approximations using the genetic linkage metric (cM). But what do we do now, when most of us who have had our DNA tested only have a small percentage of our SNPs genotyped? It turns out that, even if the genetic linkage metric is used to determine the amount of DNA shared, there’s no reason that the units of shared DNA have to remain in cM afterwards (Veller et al., 2019 and 2020). In fact, there are very good reasons to convert the amount of shared DNA from cM to something else, as you’ll see below.
Percentages or fractions of shared DNA are much more intuitive concepts to genetic genealogy novices and are preferred by scientists as the most natural metric (from correspondence with a prominent, published geneticist). Additionally, people who understand cM also understand percentages, while the reverse is far from true.
The most popular charts for determining genetic relationships use cM. The unit has plenty of benefits and is one of the most used terms/concepts in genetic genealogy. It seems to be used orders of magnitude more often than the simple “percentage,” which would work just as well or better in more cases.
People often compare cM numbers across platforms as if those numbers mean the same thing. This is not the case. AncestryDNA tests a number of single nucleotide polymorphisms (SNPs) such that the maximum shared cM is 6,950. Conversely, 23andMe has a maximum of 7,074 cM. Below is a table that shows the total autosomal cM across the largest platforms.
|Site||Half (cM)||Full (cM)|
Comparison of the total number of autosomal cM possible at the five largest platforms. FTDNA stands for Family Tree DNA. The second column denotes the amount of cM possible in one copy of the genome, while the third column shows the amount in both copies. Numbers close to these can be found by comparing a kit to itself, a person to their parent, or comparing identical twins, but note that some platforms are going to show only half of the available cM because they show half-identical regions (HIR), by default, and some sites will show the full amount. MyHeritage numbers come from their FAQ page. All numbers except for MyHeritage come from here.
You might be wondering how GEDmatch can have total cM greater than any of the individual platforms from which it gets data. That’s a good question. Looking at just one chromosome will be an instructive lesson. Let’s compare 23andMe to GEDmatch. When you compare a parent/child match at 23andMe it will show you a full match on chromosome 6, but it’s called a half-identical match because it’s on only one of the two chromosome copies. 23andMe uses their own genetic map. There’s a theoretical genetic map that contains all of the base pairs in our genome. And there are condensed genetic maps that are a result of only sampling parts of the genome. This is what genotyping companies do—they only test certain SNPs. There are certain SNPs that you could remove without affecting the cM as much as other SNPs. This is because recombination varies across different parts of the genome. When GEDmatch does a comparison of kits from 23andMe, they know where the kit came from and they could use the genetic map from 23andMe. But GEDmatch uses a better genetic map in that they’ve included more SNPs. So why would they write algorithms that have to pull up a genetic map for whatever site a kit is from when theirs is better? And how would they compare kits from different platforms? I’ve checked a parent/child match at 23andMe on chromosome 6. The match was 191.84 cM. Those same kits for the same parent/child match, when uploaded to GEDmatch, result in 194.1 cM at GEDmatch. Likewise, the totals for all chromosomes match the half match totals in the table above. GEDmatch uses their own genetic map when comparing kits, rightly so.
Now that we’re comfortable with the differences across platforms, I’d like to compare some numbers between the one with the highest cM (GEDmatch) and the one with the lowest cM (FTDNA). One could also show stark differences between 23andMe and FTDNA, or between any two platforms for that matter.
For close relationships, there’s a real difference between GEDmatch and FTDNA. A paternal grandparent could easily share 35% DNA with you. That would be 2,511 cM at GEDmatch or 2,369 cM at FTDNA, a difference of 142 cM. Genetic genealogists have serious arguments over much lower values and people asking for advice are often told that their family has secrets based on differences this high. The differences aren’t as large between other platforms, but are still significant. In order to do an exact comparison of cM from AncestryDNA to a cM value at GEDmatch, one would have to multiply by the fraction 7,174 / 6,950. Essentially, a person is converting from one platform, to percentage, and then to another platform when they do that calculation.
For this reason, percentages are a universal unit, while values in cM aren’t comparable between platforms. It’s no wonder that scientists prefer fractions or percentages of shared DNA over cM.
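The two-step conversion described above (platform cM to percentage, then percentage to another platform's cM) can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the totals are the full autosomal figures quoted in this article for AncestryDNA, 23andMe, and GEDmatch, which should be treated as a snapshot since a platform's total can change.

```python
# Full autosomal totals (cM) as quoted in this article. These are a
# snapshot and an assumption of this sketch, not values fetched from the sites.
PLATFORM_TOTAL_CM = {
    "AncestryDNA": 6950.0,
    "23andMe": 7074.0,
    "GEDmatch": 7174.0,
}


def cm_to_percent(cm, platform):
    """Express a shared-cM value as a percentage of that platform's total."""
    return 100.0 * cm / PLATFORM_TOTAL_CM[platform]


def percent_to_cm(percent, platform):
    """Express a shared-DNA percentage in the cM scale of one platform."""
    return percent / 100.0 * PLATFORM_TOTAL_CM[platform]


def convert_cm(cm, source, target):
    """Convert cM between platforms by passing through a percentage."""
    return percent_to_cm(cm_to_percent(cm, source), target)


if __name__ == "__main__":
    # A grandparent sharing 35% with you, on the GEDmatch scale.
    print(round(percent_to_cm(35, "GEDmatch")))
    # An AncestryDNA value of 1,737.5 cM (25%) expressed on the GEDmatch scale.
    print(convert_cm(1737.5, "AncestryDNA", "GEDmatch"))
```

Note that `convert_cm` is just multiplication by the ratio of the two totals, e.g. 7,174 / 6,950 when going from AncestryDNA to GEDmatch; writing it as two steps makes the role of the percentage explicit.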
This brings up an interesting problem. Carefully controlled scientific studies apply the same methods to all of the data involved in the study. But datasets that combine cM from multiple direct-to-consumer genotyping sites have some level of inaccuracy necessarily baked in, as the number of total cM is different for every site. At the very least, one would have to know exactly what proportion of the data came from each platform. And, hopefully, the total cM didn’t change at any of those platforms during the course of data collection.
In my very accurate tables that show ranges of shared DNA, and do so for gender-specific relationships in most cases, I prefer to use percentages. This way, it only takes one step for someone to calculate what cM that would imply at a particular platform. And if I didn’t report my statistics in percentages, it would be hard to decide which platform’s total cM numbers I would use to convert those percentages to cM. Whichever site I chose, people would likely then compare the numbers to other sites without properly converting them.
There’s one other very large flaw with using cM. The autosomal (atDNA) genetic length for males is 2,730 cM, but it’s 4,435 cM for females. (“Length” is the term used by the authors of the linked research article. “Genetic length” is a commonly used phrase by geneticists.) So, for atDNA, a woman matches her mother at 2,218 cM, while a man only matches his father at 1,365 cM. And a woman who matches her half-sister at the exact average for that relationship shares 2,218 cM with her, or 25%. But a genotyping platform would report this as 1,738 cM. What if she matches a half-brother at a perfectly average 1,738 cM? But he matches her at 31.8% of his genome. Does he match her as a perfectly average half-sibling or is it 6.8 percentage points (473 cM, for him) higher? Well, if you insist on using cM, both statements are true.
These are enormous differences! Genetic genealogists give advice daily about how certain they are that someone’s match is in a particular relationship category, but you can’t be so certain when you’re off by 6.8 percentage points. Let’s look at an example that isn’t just average.
Let’s say a genotyping site tells a woman that she shares exactly 1,147 cM with someone, or 16.5% of the sex-averaged genome. This is within the range of about 95% of first cousin relationships. It’s possible, but less likely, to be a paternal grandparent/grandchild relationship, but pretty much the only other options are exotic combinations of relationships. It’s too low to be a half-sibling, maternal grandparent/grandchild, or avuncular relationship. If the match is a man, he shares 21% of his genome with her, while it’s reported to each of them that they share 16.5%. That changes things. As a proportion of his genome, this match is now within 95% of values for half-siblings, grandparents, and avuncular relationships. And now it’s highly unlikely to be a first cousin relationship, which was the most likely relationship from what was reported by the genotyping site.
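The arithmetic behind the half-sibling and 1,147 cM examples above can be checked with a short Python sketch. It assumes, as the percentages in the text imply, that each sex-specific total is twice the quoted per-copy genetic length (2 × 2,730 cM for a man, 2 × 4,435 cM for a woman) and that the reporting platform uses AncestryDNA's 6,950 cM total. The names are hypothetical, chosen only for illustration.

```python
# Per-copy autosomal genetic lengths quoted in the article (cM).
MALE_MAP_CM = 2730.0
FEMALE_MAP_CM = 4435.0

# Assumption: the reporting platform uses AncestryDNA's sex-averaged total.
PLATFORM_TOTAL_CM = 6950.0


def percent_of_total(shared_cm, total_cm):
    """Shared cM as a percentage of a given total genetic length."""
    return 100.0 * shared_cm / total_cm


def percent_of_sex_specific_genome(shared_cm, per_copy_cm):
    """Shared cM as a percentage of a sex-specific genome (two copies)."""
    return percent_of_total(shared_cm, 2.0 * per_copy_cm)


if __name__ == "__main__":
    for shared in (1738.0, 1147.0):
        reported = percent_of_total(shared, PLATFORM_TOTAL_CM)
        male = percent_of_sex_specific_genome(shared, MALE_MAP_CM)
        female = percent_of_sex_specific_genome(shared, FEMALE_MAP_CM)
        print(f"{shared:.0f} cM: reported {reported:.1f}%, "
              f"male genome {male:.1f}%, female genome {female:.1f}%")
```

The same cM value maps to three different percentages depending on the denominator, which is the ambiguity described above.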
I understand that there’s already ambiguity and overlap between these relationships. But the cM method of counting adds much more confusion to a given scenario, and it adds ambiguity to many more scenarios.
When someone asks the genetic genealogy community for help, should they have to specify that they’re a woman matching to a woman or a man matching to a man so that we can use the respective genome length, or otherwise use the sex-averaged genome length? If we retain our irrational preference for cM, then we should be treating all of those cases separately.
Since the physical lengths of the genome are the same for men and women, shared DNA is most accurately reported as a percentage of the physical genome length.
All of this suggests that, when a genotyping novice asks a group what it means that they have a match at 25%, maybe the appropriate response isn’t for a dozen people to immediately bark at them, “What’s the cM?” It turns out that the novice knew better all along: percentages were perfectly fine, better even. So long as the percentage doesn’t include any X-DNA, we can easily guide a person towards likely relationships for a given percentage of shared DNA. And, even if the percentage included X-DNA, it’s totally possible to subtract the percentage (as a proportion of the whole genome) of X-DNA from the total percentage without ever converting to cM. (This, of course, is limited by the functionality of the platform. I don’t believe it’s currently possible even at 23andMe, which has been the pioneer of using percentages. Instead, one must currently find the cM on the X chromosome and subtract it from the total cM.)
I admit that percentages for distant relationships are cumbersome. I’d rather see that one of my matches at 23andMe shares 59 cM with me than have it presented as 0.79%. But there’s a higher point at which the benefits of percentages outweigh the benefits of cM. That point could be different for each person, but for me it probably falls somewhere between 1% and 3%, which means that I’d like to see relationships more distant than 2nd cousins reported to me as cM. But, then, I’ll have to decide whether I want to bother making conversions between platforms. Luckily, the differences aren’t as important at those lower levels. You might be thinking that 1% is a small amount of DNA to share with someone and are therefore worried that the match isn’t real. But 1% isn’t very small. It’s equivalent to 69.5 cM at AncestryDNA. That could be one large segment or a couple of pretty good segments.
You might be thinking that the cM metric is necessary for providing a cutoff to eliminate false matches. I can’t stress enough that the world’s leading geneticists know what they’re doing. When they count up the amount of IBD sharing between individuals using the genomic metric, they obviously can’t include every matching base pair as a match. We know that people who aren’t closely related still share over 99% of their DNA. Geneticists use a cutoff when counting base pairs just like they do when counting cM. One can even use a cM cutoff when counting bp, i.e. add up all of the matching bp, but only for segments ≥ 7 cM.
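Here is a minimal Python sketch of that idea: total the matching base pairs, but only for segments that meet a minimum length in cM. The `Segment` type, the function name, and the toy numbers are all hypothetical, and the genome length is passed in as a parameter rather than assumed, so this should be read as an illustration rather than any platform's actual method.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One matching segment: physical coordinates plus its genetic length."""
    start_bp: int
    end_bp: int
    length_cm: float


def shared_percent(segments, genome_bp, min_cm=7.0):
    """Percentage of the genome shared, counted in base pairs.

    Segments shorter than `min_cm` are dropped before counting, so the
    cM threshold filters likely false matches while bp does the measuring.
    """
    kept_bp = sum(
        s.end_bp - s.start_bp for s in segments if s.length_cm >= min_cm
    )
    return 100.0 * kept_bp / genome_bp


if __name__ == "__main__":
    # Toy data on a made-up 1,000,000 bp genome.
    toy_segments = [
        Segment(0, 100_000, 12.0),        # kept
        Segment(200_000, 230_000, 3.5),   # dropped: below the 7 cM cutoff
        Segment(500_000, 650_000, 7.0),   # kept: exactly at the cutoff
    ]
    print(shared_percent(toy_segments, genome_bp=1_000_000))
```

With the toy data, the 3.5 cM segment is ignored and the remaining 250,000 bp give 25% of the made-up genome.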
Out of all the criticisms of percentages that I’ve heard so far, none were good reasons. But there is one issue with reporting percentages. While cM takes into account recombination, the bp metric doesn’t. For close matches, such as first cousins, it’s highly unlikely that the ~12.5% of DNA you share all came from high recombination or low recombination segments of the genome. It’s much more likely that it will be made up of average recombination regions and/or some that are high and some that are low. The cM for first cousin matches will average out so that bp sharing in percentages is a great measurement, better than cM even. However, if you only share one 0.14% segment with someone, that could be from a high or low recombination region and therefore isn’t necessarily a 10 cM segment. But, then again, you know that a 0.14% segment is at least 7 cM if that’s the threshold that was used. Still, it’s best to report distant matches in cM and close matches in percentage. For this reason, I don’t think either metric will ever go away.
People may think that using percentages would entail going back to a “dumbed-down” unit, but they would be wrong. Sometimes a simpler solution is actually better, or more accurate, such as in this case. The best solution is to use a genomic (bp) measurement of IBD sharing, but with a low cM cutoff threshold, and then to report close matches in percentage and distant matches in cM. I hope to see more people using percentages in the future.
Edit, 7 February 2021: I really don’t want to alarm anyone or undermine anyone’s trust in the genetic genealogy tools that are currently available. I just want improvement in our tools when there’s room for it. I brought this topic up mostly because it seemed that nobody else was talking about it. (Although, after publishing this article, I did find some similar points made here.) I would love to find out that my worries are unfounded. For now, I trust the results that are reported to us to some degree. And, except for where there was already considerable overlap between relationships, I think that most cM amounts that fall solidly within the range of one group of relationships truly belong in that relationship group.
Finally, one last thing I have to relate to you. People comically claim that they remember cM better than percentages. Ok, let’s test that out. If you ever catch me on the spot without any paper or electronics, I’ll happily recite to you that the population averages of shared DNA are 50% for siblings; 25% for half-siblings, grandparents, and aunts/uncles; 12.5% for great-grandparents, great-aunts/uncles, or first cousins, etc. Will you do the same for me, i.e. tell me off the top of your head, that at AncestryDNA, the averages are 3,475 cM for full-siblings, 1,737.5 cM for the next group, and 868.75 cM for great-grandparents, great-aunts/uncles, and first cousins, etc? If you do, it will be because you intentionally memorized the numbers, whereas it took me no effort to learn the percentages.
Cover photo by Annie Spratt. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.