The number of centiMorgans (cM) we share with a relative tells us how closely related we are to them. A cM is the probability, in percentage, that a crossover will occur between two genetic loci during a given meiosis event. The units of cM are right there in the definition: it’s a probability, in percentage. However, if you’re a student of genetics you’ll know, or if you do a quick internet search for “centiMorgans” you’ll see, that almost everyone refers to it as a length or a distance. That’s because the purpose of a cM is to quantify how far apart two different markers are in the genome. Those of us who are very familiar with genetics understand the above definition, but why would we use a unit with such a strange definition?
Well, we’d use cM if we didn’t know the genomic distance in base pairs (bp) or if we’re genotyping our DNA based on a limited number of SNPs. This is how it was done long before scientists started sequencing genomes. Fortunately, whole-genome sequencing is getting progressively cheaper and we now have a very good idea about how many base pairs are in the human genome. For our decently close DNA matches, this will do away with the need to make approximations using the genetic linkage metric (cM). But what do we do now, when most of us who have had our DNA tested only have a small percentage of our SNPs genotyped? It turns out that, even if the genetic linkage metric is used to determine the amount of DNA shared, there’s no reason that the units of shared DNA have to remain in cM afterwards (as you can see in most science articles). In fact, there are very good reasons to convert the amount of shared DNA from cM to something else, as you’ll see below.
Percentages or fractions of shared DNA are much more intuitive concepts to genetic genealogy novices and are preferred by scientists as the most natural metric (from correspondence with a prominent, published geneticist). Additionally, people who understand cM also understand percentages, while the reverse is far from true.
The cM metric has plenty of benefits and is one of the most used terms/concepts in genetic genealogy. It seems to be used orders of magnitude more than simple “percentage,” which would work just as well or better in more cases.
People often compare cM numbers across platforms as if those numbers mean the same thing. This is not the case. AncestryDNA tests a set of single nucleotide polymorphisms (SNPs) such that the maximum shared cM is 6,950. By contrast, 23andMe has a maximum of 7,074 cM. Below is a table that shows the total autosomal cM across the largest platforms.
|Site||Half (cM)||Full (cM)|
Comparison of the total number of autosomal cM possible at the five largest platforms. FTDNA stands for Family Tree DNA. The second column denotes the amount of cM possible in one copy of the genome, while the third column shows the amount in both copies. Numbers close to these can be found by comparing a kit to itself, a person to their parent, or comparing identical twins, but note that some platforms are going to show only half of the available cM because they show half-identical regions (HIR), by default, and some sites will show the full amount. MyHeritage numbers come from their FAQ page. All numbers except for MyHeritage come from here.
You might be wondering how GEDmatch can have total cM greater than any of the individual platforms from which it gets data. That’s a good question. Looking at just one chromosome will be an instructive lesson. Let’s compare 23andMe to GEDmatch.
When you compare a parent/child match at 23andMe, it will show you a full match on chromosome 6, but it’s called a half-identical match because it’s on only one of the two chromosome copies. 23andMe uses their own genetic map. There’s a theoretical genetic map that contains all of the base pairs in our genome, and there are condensed genetic maps that result from sampling only parts of the genome. This is what genotyping companies do: they only test certain SNPs. Some SNPs could be removed without affecting the cM as much as others, because recombination varies across different parts of the genome.
When GEDmatch compares kits from 23andMe, they know where the kits came from and could use the genetic map from 23andMe. But GEDmatch uses a better genetic map in that they’ve included more SNPs. So why would they write algorithms that have to pull up a genetic map for whatever site a kit is from when theirs is better? And how would they compare kits from different platforms?
I’ve checked a parent/child match at 23andMe on chromosome 6. The match was 191.84 cM. Those same kits, when uploaded to GEDmatch, result in 194.1 cM. Likewise, the totals for all chromosomes match the half match totals in the table above. GEDmatch uses their own genetic map when comparing kits, rightly so.
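One way to picture why a denser SNP set can yield a larger total: if a whole-chromosome match is measured from the first tested SNP to the last, a chip that lacks the outermost markers measures a shorter span, and it loses more cM where recombination is high. The following is a minimal sketch under that assumption. The map positions are invented, and `span_cm` is a hypothetical helper, not any platform’s actual algorithm.

```python
# Toy genetic map for one chromosome: (base-pair position, cM position).
# All values are invented for illustration; real maps have far more markers.
DENSE_MAP = [
    (1_000_000, 0.0),
    (2_000_000, 2.0),     # high recombination near the start: 2 cM in 1 Mb
    (80_000_000, 50.0),
    (168_000_000, 98.0),
    (170_000_000, 100.0),
]

# A sparser chip that happens to lack the two outermost markers.
SPARSE_MAP = DENSE_MAP[1:-1]


def span_cm(genetic_map):
    """cM length of a whole-chromosome match, measured between the
    first and last tested markers."""
    positions_cm = [cm for _bp, cm in genetic_map]
    return max(positions_cm) - min(positions_cm)


print(span_cm(DENSE_MAP))   # 100.0
print(span_cm(SPARSE_MAP))  # 96.0
```

The same pair of kits gets a different cM figure purely because of which markers the map contains, which is the effect described for chromosome 6.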
Now that we’re comfortable with the differences across platforms, I’d like to compare some numbers between the one with the highest cM (GEDmatch) and the one with the lowest cM (FTDNA). One could also show stark differences between 23andMe and FTDNA, or between any two platforms for that matter.
For close relationships, there’s a real difference between GEDmatch and FTDNA. A paternal grandparent could easily share 35% DNA with you. That would be 2,511 cM at GEDmatch or 2,369 cM at FTDNA, a difference of 142 cM. Genetic genealogists have serious arguments over much lower values and people asking for advice are often told that their family has secrets based on differences this high. The differences aren’t as large between other platforms, but are still significant. In order to do an exact comparison of cM from AncestryDNA to a cM value at GEDmatch, one would have to multiply by the fraction 7,174 / 6,950. Essentially, a person is converting from one platform, to percentage, and then to another platform when they do that calculation.
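The conversion described above can be sketched in a few lines. The totals are the full autosomal figures quoted in this article for three platforms. The function names are hypothetical, and the 1,750 cM input is an arbitrary example value.

```python
# Full autosomal cM totals quoted in this article (other platforms omitted).
TOTAL_CM = {"AncestryDNA": 6950, "23andMe": 7074, "GEDmatch": 7174}


def cm_to_percent(cm, platform):
    """Express a shared-cM value as a percentage of that platform's total."""
    return 100 * cm / TOTAL_CM[platform]


def percent_to_cm(percent, platform):
    """Express a percentage as cM on a given platform."""
    return percent / 100 * TOTAL_CM[platform]


def convert_cm(cm, source, target):
    """Convert cM between platforms by passing through percentage."""
    return percent_to_cm(cm_to_percent(cm, source), target)


# A grandparent sharing 35% comes out at about 2,511 cM on GEDmatch.
print(round(percent_to_cm(35, "GEDmatch")))  # 2511

# 1,750 cM at AncestryDNA, re-expressed on GEDmatch's scale.
print(round(convert_cm(1750, "AncestryDNA", "GEDmatch"), 1))
```

Going through percentage is the same as multiplying by the ratio of the two totals, here 7,174 / 6,950.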
For this reason, percentages are a universal unit, while values in cM aren’t comparable between platforms. It’s no wonder that scientists prefer fractions or percentages of shared DNA over cM.
This brings up an interesting problem. Carefully controlled scientific studies apply the same methods to all of the data involved in the study. But datasets that combine cM from multiple direct-to-consumer genotyping sites have some level of inaccuracy necessarily baked in, as the number of total cM is different for every site. At the very least, one would have to know exactly what proportion of the data came from each platform. And, hopefully, the total cM didn’t change at any of those platforms during the course of data collection.
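To illustrate the pooling problem, here is a small sketch with made-up records: three matches that each represent exactly 25% sharing, reported by three platforms. The records are hypothetical. Only the platform totals come from this article.

```python
from statistics import mean

# Full autosomal cM totals quoted in this article.
TOTAL_CM = {"AncestryDNA": 6950, "23andMe": 7074, "GEDmatch": 7174}

# Hypothetical records (platform, shared cM), invented for illustration.
# Each one is exactly 25% of its platform's total.
records = [
    ("AncestryDNA", 1737.5),
    ("23andMe", 1768.5),
    ("GEDmatch", 1793.5),
]


def pooled_percent(rows):
    """Normalize each record by its own platform total before averaging."""
    return mean(100 * cm / TOTAL_CM[platform] for platform, cm in rows)


def naive_mean_cm(rows):
    """Average raw cM as if every platform used the same scale."""
    return mean(cm for _platform, cm in rows)


print(pooled_percent(records))  # 25.0
print(naive_mean_cm(records))   # 1766.5
```

The naive average of 1,766.5 cM doesn’t correspond to 25% on any of the three platforms, and it would shift if the mix of platforms in the dataset changed.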
In my very accurate tables that show ranges of shared DNA, and do so for sex-specific relationships in most cases, I prefer to use percentages. This way, it only takes one step for someone to calculate what cM that would imply at a particular platform. And if I didn’t report my statistics in percentages, it would be hard to decide which platform’s total cM numbers I would use to convert those percentages to cM. Whichever site I chose, people would likely then compare the numbers to other sites without properly converting them.
All of this suggests that, when a genotyping novice asks a group what it means that they have a match at 25%, maybe the appropriate response isn’t for a dozen people to immediately bark at them, “What’s the cM?” It turns out that the novice knew better all along: percentages were perfectly fine, better even. So long as the percentage doesn’t include any X-DNA, we can easily guide a person towards likely relationships for a given percentage of shared DNA.
I admit that percentages for distant relationships are cumbersome. I’d rather see that one of my matches at 23andMe shares 59 cM with me than have it presented as 0.79%. But there’s a higher point at which the benefits of percentages outweigh the benefits of cM. That point could be different for each person, but for me it probably falls somewhere between 1% and 3%, which means that I’d like to see relationships more distant than 2nd cousins reported to me as cM. But, then, I’ll have to decide whether I want to bother making conversions between platforms. Luckily, the differences aren’t as important at those lower levels. You might be thinking that 1% is a small amount of DNA to share with someone and are therefore worried that the match isn’t real. But 1% isn’t very small. It’s equivalent to 69.5 cM at AncestryDNA. That could be one large segment or a couple of pretty good segments.
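A reporting rule like the one described here could look something like the sketch below. The 2% threshold is an assumption picked from inside the 1% to 3% range mentioned above, and `describe_match` is a hypothetical helper.

```python
def describe_match(shared_cm, total_cm, threshold_percent=2.0):
    """Report close matches as a percentage and distant ones in cM.

    threshold_percent is an assumed cut point; choose your own.
    """
    percent = 100 * shared_cm / total_cm
    if percent >= threshold_percent:
        return f"{percent:.1f}%"
    return f"{shared_cm:g} cM"


ANCESTRY_TOTAL_CM = 6950  # full autosomal total quoted in this article

print(describe_match(1737.5, ANCESTRY_TOTAL_CM))  # 25.0%
print(describe_match(69.5, ANCESTRY_TOTAL_CM))    # 69.5 cM
```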
You might be thinking that the cM metric is necessary for providing a cutoff to eliminate false matches. I can’t stress enough that the world’s leading geneticists know what they’re doing. When they count up the amount of IBD sharing between individuals using the genomic metric, they obviously can’t include every matching base pair as a match. We know that people who aren’t closely related still share over 99% of their DNA. Geneticists use a cutoff when counting base pairs just like they do when counting cM. One can even use a cM cutoff when counting bp, i.e. add up all of the matching bp, but only for segments ≥ 7 cM.
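Counting base pairs while still filtering by a cM cutoff can be sketched as follows. The segment list is invented, and `shared_bp` is a hypothetical helper. It counts only segments that reach the cutoff, so small, likely false segments are excluded from the bp total.

```python
# Hypothetical matching segments: (length in bp, length in cM).
segments = [
    (40_000_000, 38.2),
    (9_000_000, 10.5),
    (3_000_000, 2.1),   # too short in cM, so it should not be counted
]


def shared_bp(segs, min_cm=7.0):
    """Sum the base pairs of segments that reach the cM cutoff."""
    return sum(bp for bp, cm in segs if cm >= min_cm)


print(shared_bp(segments))  # 49000000
```

Dividing that total by whatever genome length the comparison is based on gives the shared percentage. The denominator is left out here because it depends on the reference used.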
Out of all the criticisms of percentages that I’ve heard so far, none were good reasons. But there is one issue with reporting percentages. While cM takes into account recombination, the bp metric doesn’t. For close matches, such as first cousins, it’s highly unlikely that the ~12.5% of DNA you share all came from high recombination or low recombination segments of the genome. It’s much more likely that it will be made up of average recombination regions and/or some that are high and some that are low. The cM for first cousin matches will average out so that bp sharing in percentages is a great measurement, better than cM even. However, if you only share one 0.14% segment with someone, that could be from a high or low recombination region and therefore isn’t necessarily a 10 cM segment. But, then again, you know that a 0.14% segment is at least 7 cM if that’s the threshold that was used. Still, it’s best to report distant matches in cM and close matches in percentage. For this reason, I don’t think either metric will ever go away.
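The point about single segments can be made concrete with a toy calculation. The segment length and the recombination rates below are invented for illustration, and the model (physical length times a constant local rate) is a deliberate simplification.

```python
def segment_cm(length_mb, rate_cm_per_mb):
    """cM length of a segment, assuming a constant local recombination rate."""
    return length_mb * rate_cm_per_mb


SEGMENT_MB = 8  # hypothetical segment, same physical size in each case

# Hypothetical low, middling, and high recombination regions.
for rate in (0.5, 1.0, 2.0):
    print(rate, segment_cm(SEGMENT_MB, rate))
# 0.5 4.0
# 1.0 8.0
# 2.0 16.0
```

The same physical length gives a fourfold range in cM here. Across many segments, as in a first cousin match, these differences tend to average out.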
You may think that using percentages would entail going back to a “dumbed-down” unit, but you would be wrong. Sometimes a simpler solution is actually better, or more accurate, such as in this case. The best solution is to use a genomic (bp) measurement of IBD sharing, but with a low cM cutoff threshold, and then to report close matches in percentage and distant matches in cM. I hope to see more people using percentages in the future.
Finally, one last thing I have to relate to you. People comically claim that they remember cM better than percentages. OK, let’s test that out. If you ever catch me on the spot without any paper or electronics, I’ll happily recite to you that the population averages of shared DNA are 50% for siblings; 25% for half-siblings, grandparents, and aunts/uncles; 12.5% for great-grandparents, great-aunts/uncles, or first cousins, etc. Will you do the same for me, i.e. tell me off the top of your head that, at GEDmatch, the averages are 3,587 cM for full siblings, 1,794 cM for the next group, and 897 cM for great-grandparents, great-aunts/uncles, and first cousins, etc.? And will you be able to recite the different averages for each other platform? If you do, it will be because you intentionally memorized the numbers, whereas it took me no effort to learn the percentages.
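For anyone who does want the cM versions, they don’t need to be memorized. Here is a short sketch that derives them from the percentages, using the three platform totals quoted in this article. The dictionary and function names are hypothetical.

```python
# Full autosomal cM totals quoted in this article.
TOTAL_CM = {"AncestryDNA": 6950, "23andMe": 7074, "GEDmatch": 7174}

# Population-average sharing, as recited above.
AVERAGE_PERCENT = {
    "full siblings": 50.0,
    "half-siblings, grandparents, aunts/uncles": 25.0,
    "great-grandparents, great-aunts/uncles, first cousins": 12.5,
}


def average_cm(platform):
    """Average shared cM per relationship group on one platform."""
    total = TOTAL_CM[platform]
    return {group: pct / 100 * total for group, pct in AVERAGE_PERCENT.items()}


for platform in TOTAL_CM:
    print(platform)
    for group, cm in average_cm(platform).items():
        print(f"  {group}: {cm:,.2f} cM")
```

One short table of percentages, plus one total per platform, reproduces every cM average.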
Cover photo by Annie Spratt.