That’s a common question I see from people researching their DNA matches. Until now the only answer we could give was “because DNA is variable.”
That’s a correct answer, but we could provide something more useful: a quantitative answer. When someone asks, “How can my sibling match an unknown distant cousin at 35 cM when I don’t match that cousin at all?” we could say that when a 3rd cousin doesn’t match you, they’ll match your sibling between 30 and 40 cM almost 10% of the time. Below you’ll see that we have that answer for the first time.
For normal ranges of shared DNA between relatives, click here, or get probabilities from this relationship predictor. Both pages are built on the only data that reflect the high variability of DNA sharing shown by Veller et al. (2019 & 2020).
There are a few reasons why the discoveries shown here could never have been made with empirical data.
- Empirical data are inaccurate. Not only do they underestimate standard deviations, and therefore underestimate the range of shared DNA, but they also get the averages wrong.
- Empirical datasets tend not to include non-matches. This is the nature of data collection. It’s less likely that a value of zero will be submitted for a non-matching cousin to a crowd-sourced database. That’s why empirical datasets drastically overestimate averages of shared DNA.
- It would be very difficult to compile an empirical dataset of 500,000 matches, let alone 500,000 3rd cousin matches to two siblings, 500,000 4th cousin matches to two siblings, etc.
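The effect of leaving out non-matches can be seen with simple arithmetic. The sketch below uses made-up shared cM values (they are not taken from any dataset or from my simulations) and compares the average over all cousins with the average over only the cousins who would show up as matches.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical shared cM for ten cousins of the same degree (made-up numbers).
# Four of them share no DNA at all and would rarely be submitted to a
# crowd-sourced database.
shared_cm = [0, 0, 0, 0, 8, 12, 15, 20, 25, 40]

true_mean = mean(shared_cm)                       # all cousins, zeros included
reported_only = [cm for cm in shared_cm if cm > 0]
reported_mean = mean(reported_only)               # matches only

print(f"mean with non-matches:    {true_mean:.1f} cM")
print(f"mean without non-matches: {reported_mean:.1f} cM")
```

With these illustrative numbers, dropping the zeros raises the average from 12 cM to 20 cM, which is the direction of the bias described in the list above.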
I’ve provided individual histograms for 3rd cousins up to 4th cousins, once removed. You’ll also see a set of histograms in which all cousins from 3rd to 6th are included in one graph, for use when you don’t know the degree of match.
Let’s take a look at these histograms! Each figure is divided into ten subplots. The subplot title shows the range of match, in centiMorgans (cM), that one sibling has to a cousin of a particular degree. The first subplot is for the case when the first sibling shares no DNA with the cousin; for every other subplot, the first sibling shares an amount of DNA within a 10 cM range. The histogram bars, always in 10 cM bins, show the frequency of shared cM that the cousin would share with the second sibling.
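To make the layout concrete, here is a minimal sketch of how one subplot could be computed from a list of (first sibling cM, second sibling cM) pairs. The function names, the sample pairs, and the bin-edge convention (a separate bin for zero, then (0, 10], (10, 20], and so on) are assumptions for illustration only; this is not the code used to produce the figures.

```python
import math
from collections import Counter

BIN_WIDTH = 10  # cM, matching the 10 cM bins in the figures

def bin_index(cm):
    """Return 0 for a non-match, 1 for (0, 10], 2 for (10, 20], ...

    The edge convention is an assumption made for this sketch.
    """
    if cm <= 0:
        return 0
    return math.ceil(cm / BIN_WIDTH)

def conditional_histogram(pairs, first_bin):
    """Frequencies of the second sibling's bin, given the first sibling's bin."""
    counts = Counter(
        bin_index(second) for first, second in pairs
        if bin_index(first) == first_bin
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: counts[b] / total for b in sorted(counts)}

# Hypothetical sibling pairs matched to the same cousin (made-up values).
pairs = [(0, 0), (0, 35), (0, 12), (0, 0), (25, 0), (25, 28)]

# Subplot (a): the first sibling shares no DNA with the cousin.
print(conditional_histogram(pairs, first_bin=0))
```

Each subplot corresponds to one value of `first_bin`, and the bar heights are the returned frequencies.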
Figure 1. Shared cM frequencies between a sibling and a 3C when the other sibling shares the cM range listed in the subplot title. Ranges for histogram bars as well as those listed in the subplot titles are inclusive on the left and inclusive on the right for all figures on this page.
The first six subplots in Figure 1 (a-f) show that the highest probability histogram bar reflects the amount given in the subplot title. That means that when one sibling shares an amount of DNA under 50 cM with a 3rd cousin (3C), the 10 cM range that contains the same cM is the most probable range for the other sibling. This isn’t surprising because these are not independent events—if you share low cM with a 3C, it’s probably because your parent does too, and thus your sibling is also likely to share a low amount with that 3C. We also see that the mean shared cM for one sibling to a 3C is only in the range of the other sibling when that range is 40-50, 50-60, or 60-70 cM. Average shared cM for 3C, without any conditional probabilities involved, is 54 cM.
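The dependence between siblings can be illustrated with a deliberately crude toy model. It assumes that the parent shares between 0 and 4 segments of 10 cM each with the cousin and that each sibling independently inherits each segment, whole, with probability 1/2. Real inheritance splits segments through recombination, so this is only a sketch of the reasoning, not the simulation behind the figures, and every name and number in it is hypothetical.

```python
import random

def simulate_pairs(trials=20000, seed=1):
    """Toy simulation of shared cM for two siblings with the same cousin."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(trials):
        # Toy assumption: the parent shares 0 to 4 segments of 10 cM each.
        parent_segments = rng.randint(0, 4)
        # Each sibling inherits each parental segment with probability 1/2.
        a = sum(10 for _ in range(parent_segments) if rng.random() < 0.5)
        b = sum(10 for _ in range(parent_segments) if rng.random() < 0.5)
        pairs.append((a, b))
    return pairs

pairs = simulate_pairs()

# Mean for the second sibling over all pairs.
mean_b = sum(b for _, b in pairs) / len(pairs)

# Mean for the second sibling when the first sibling shares nothing.
b_when_a_zero = [b for a, b in pairs if a == 0]
mean_b_given_a_zero = sum(b_when_a_zero) / len(b_when_a_zero)

print(f"second sibling, all pairs:           {mean_b:.1f} cM")
print(f"second sibling, first shares 0 cM:   {mean_b_given_a_zero:.1f} cM")
```

Even in this simplified model, knowing that one sibling shares nothing makes it more likely that the parent shares little, which pulls the other sibling's expected amount down.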
One thing to note is that these probabilities aren’t reversible. The way to read the histograms is as follows: If one sibling shares the cM in the subplot title, the other sibling has the likelihood shown by the histogram bars for the various cM ranges. For example, if one sibling shares less than 10 cM with a 3C (Figure 1b), the other sibling has about a 15% chance of sharing 20-30 cM with that 3C. However, if one sibling shares 20-30 cM with a 3C (Figure 1d), the other sibling has only about a 10-11% chance of sharing less than 10 cM with that 3C.
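The asymmetry comes from the denominators. The number of sibling pairs that fall in both ranges is the same whichever sibling you start from, but it is divided by a different total in each direction. The counts below are hypothetical, chosen only so that the results land near the percentages quoted above.

```python
# Hypothetical counts of sibling pairs (made-up, for illustration only).
pairs_in_both = 150       # one sibling under 10 cM, the other at 20-30 cM
first_under_10 = 1000     # all pairs where the first sibling is under 10 cM
first_20_to_30 = 1400     # all pairs where the first sibling is at 20-30 cM

# P(second sibling at 20-30 cM | first sibling under 10 cM)
p_20_30_given_under_10 = pairs_in_both / first_under_10

# P(second sibling under 10 cM | first sibling at 20-30 cM)
p_under_10_given_20_30 = pairs_in_both / first_20_to_30

print(f"{p_20_30_given_under_10:.1%}")
print(f"{p_under_10_given_20_30:.1%}")
```

Because the two conditions occur with different frequencies, the same joint count gives two different conditional probabilities.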
Figure 2. Shared cM frequencies between a sibling and a 3rd cousin once removed when the other sibling shares the cM range listed in the subplot title.
For 3rd cousins once removed (3C1R), we see again that the highest probability histogram bar for the second sibling contains the cM value shared by the first sibling and listed in the subplot title, not only for the first six subplots (a-f), but also for Figure 2h. We also see non-monotonicity in some of the subplots (d-f and h). This is because of the conditional probabilities. The curves would be smoother if the ranges of shared cM were made smaller, but we would still see non-monotonicity, whereas a histogram of one sibling’s DNA share with a 3C1R would be monotonically decreasing from left to right for all cM values. Figure 2f shows the highest probability bar is the same as that in the subplot title (40-50 cM), while the second highest probability bar (20-30 cM) is the range that contains the normal 3C1R mean (27 cM). It’s clear that there are multiple factors involved in these probabilities.
Figure 3. Shared cM frequencies between a sibling and a 4th cousin when the other sibling shares the cM range listed in the subplot title.
At 4C and more distant, the highest probability histogram bar is no longer the same range as that in the subplot title except when under 10 cM. Now that the average cM (for 4C in general) is only 14 cM, the tendency for low cM sharing simply outweighs the tendency for two siblings to share in the same 10 cM range with a 4C. We still see that the mean cM for the second sibling is in the same range as that listed in the subplot title for the first three subplots with ranges (b-d).
Figure 4. Shared cM frequencies between a sibling and a 4th cousin once removed when the other sibling shares the cM range listed in the subplot title.
Creating all of the above plots was very time-consuming. I’m going to take a break from producing plots for a specific cousin type and show one more figure that’s probably a lot more useful. Below several cousin types (3C-6C) are included in the same histograms, which probably applies to most cases—when we don’t actually know how we’re related to a particular DNA match.
Figure 5. Shared cM frequencies between a sibling and a 3rd to 6th cousin when the other sibling shares the cM range listed in the subplot title.
Now we’re back to the tendency for both siblings to share within the same 10 cM range. The only exceptions are for the 10-20 cM range (Figure 5c) and above 70 cM, where the second sibling shows a lower cM range as the most probable histogram bar when compared to the first sibling’s range given in the subplot title. We also see that the mean shared DNA for the second sibling is within the range of the first sibling for the first six subplots with ranges (b-g).
This analysis concentrated on cousins who share less than 100 cM in total. The higher amounts are part of the calculation (and would thus need to be added in order for the probabilities to total 100%), but aren’t shown. Of course I could also do analyses for cousins who share much higher amounts of DNA than expected. In that case, I might include much closer relative types for comparison. However, I’d see no reason to include siblings: if you know that you share 58% DNA with a full-sibling, the expected value for another sibling is still 50%. Please leave a comment if there’s a particular relationship type and/or cM range that you would have liked to see included. I hope you’ve found these results useful.
Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.