#### Relationship predictions are now updated to include differences in maternal and paternal recombination rates as well as validation of ranges by peer-reviewed standard deviations

*The relationship predictor can be found here.*

I’ve previously published exact averages and very accurate ranges of shared DNA for any genealogical relationship that can be imagined. The model that produces these results is validated by the standard deviations of Veller et al. (2019 & 2020). Since the data that come out of this model are so accurate, and since they can be calculated for sex-specific genealogical relationships, which had never been done for relationship prediction, I knew that this tool had to be created.

**Probability curves for different relationship types**

The most striking thing about the figures shown here is the curve for grandparent/grandchild relationships, which features two distinct peaks. Who would’ve thought that those relationships are so different than avuncular and half-sibling relationships? Genetic genealogists have been treating them all the same. We now see that treating them as a homogeneous group is a gross oversimplification.

**Figure 1**. Probability curves for relationship types 5C1R to full-siblings at AncestryDNA. The y-axis shows the probability of each relationship type relative to all others included. All types here are sex-averaged, although the calculator gives sex-specific probabilities for half-avuncular, 1C, avuncular, half-sibling, and grandparent/grandchild relationships. 1C1R = 1st cousin, once removed; cM = centiMorgan, HIR = half-identical regions. The second cousin (2C) curve is higher because it’s the first curve to be the only one from its group (it has little competition near its center).

The first thing that came to mind when I saw the probability curves in Figure 1, other than surprise, was a discovery that I had made and written about just one week earlier. At that time, I had found that a person is actually more likely to share 22% or 28% DNA with a grandparent than 25%, despite 25% being the expected value. But it turns out that that rule isn’t the reason for the two peaks on the grandparent/grandchild curve, at least not directly. In fact, the two peaks are actually much farther apart than 22% and 28%. And the histogram for grandparent/grandchild relationships only has one peak, as shown in Figure 2.

**Figure 2**. Normalized histogram for 500,000 grandparent/grandchild pairs. These are the same data points that went into the probability calculator. The individuals were simulated as 250,000 paternal grandparent/grandchild pairs and 250,000 maternal grandparent/grandchild pairs, but the fractions of shared DNA for each were not differentiated when creating the histogram. For that reason, despite not being labeled as paternal or maternal, values near 0.25 on the x-axis are more likely to come from maternal grandparent/grandchild pairs and values at the far ends of the histogram are much more likely to be from paternal grandparent/grandchild pairs.

The reason for the two peaks in Figure 1 is that grandparent/grandchild relationships have far more variance than all other relationships (Veller et al., 2019 & 2020). Since this subject of relationship probabilities concerns the *relative* probabilities of relationship types, a gap between two curves has to be filled by one or more other relationship curves. And the largest gaps occur between the group that includes grandparents and the two groups on either side of it. The difference is even more striking when looking at IBD data such as in Figure 3. (IBD stands for identical by descent. It’s the total amount of DNA that two people are reported to share. It can be contrasted with half-identical region (HIR) sharing, which counts fully-identical regions (FIR, or IBD2) as if they are HIR.) Reporting the total amount of DNA that full-siblings share moves the curve for that relationship even farther to the right of grandparent/grandchild relationships.
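To make the variance argument concrete, here is a minimal Python sketch. It assumes two relationship types with the same mean, each approximated by a normal curve; the standard deviations are made-up illustrative values, not figures from Veller et al. or from my simulations. It shows only that the wider curve's relative probability dips at the shared mean and rises toward both tails. In the real figures, the neighboring groups then cut off those tails, which leaves two peaks.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Illustrative values only (assumptions, not published statistics).
MEAN = 25.0       # percent shared; the same for both relationship types
SD_WIDE = 3.5     # hypothetical: stands in for grandparent/grandchild
SD_NARROW = 2.0   # hypothetical: a same-mean type with less variance

def relative_probability_wide(x):
    """Share of the combined density at x that belongs to the wide curve."""
    wide = normal_pdf(x, MEAN, SD_WIDE)
    narrow = normal_pdf(x, MEAN, SD_NARROW)
    return wide / (wide + narrow)

for pct in (19.0, 22.0, 25.0, 28.0, 31.0):
    print(f"{pct:.0f}% shared: {relative_probability_wide(pct):.3f}")
```

The printed values are lowest at 25% and climb symmetrically on either side, which is the shape that produces the dip between the two grandparent/grandchild peaks.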

**Figure 3**. Probability curves for relationship types 5C1R to full-siblings at 23andMe. IBD = identical by descent, which includes both HIR and FIR shared DNA. All other parameters and abbreviations are the same as in Figure 1.

Figure 3 shows a drastic increase in the height of the right-most peak for grandparent/grandchild relationships when compared to Figure 1. The probability of this relationship type peaks at 78.7% around 2,510 cM as would be reported by 23andMe. This is due to moving the full-sibling curve far to the right, from the 37.5%, on average, that would be reported by AncestryDNA to the 50%, on average, that full-siblings actually share. In contrast, half-siblings are only 12.1% likely and avuncular relationships only 3.2% likely at 2,510 cM. An added benefit of IBD sharing platforms is that half-siblings are more easily distinguished from avuncular relationships, which is very apparent from about 2,200 cM to 2,500 cM.
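The 37.5% and 50% averages for full-siblings follow from simple arithmetic, sketched below in Python. The function name and its arguments are my own for illustration. The inputs are the expected fractions of the genome where full-siblings match on exactly one copy (50%) and on both copies (25%).

```python
def shared_fractions(ibd1, ibd2):
    """Return (total IBD, HIR-style) sharing as fractions of all autosomal DNA.

    ibd1: fraction of the genome where exactly one copy matches (half-identical)
    ibd2: fraction of the genome where both copies match (fully identical)
    """
    # Total IBD counts fully-identical regions twice, since both copies match.
    total_ibd = (ibd1 + 2 * ibd2) / 2
    # HIR-style reporting counts fully-identical regions as if half-identical.
    hir_style = (ibd1 + ibd2) / 2
    return total_ibd, hir_style

# Expected values for full-siblings: 50% half-identical, 25% fully identical.
sibling_ibd, sibling_hir = shared_fractions(ibd1=0.50, ibd2=0.25)
print(f"full-siblings: IBD {sibling_ibd:.1%}, HIR-style {sibling_hir:.1%}")

# Grandparent/grandchild pairs have no fully-identical regions (barring
# pedigree collapse), so both reporting styles give the same expected value.
grand_ibd, grand_hir = shared_fractions(ibd1=0.50, ibd2=0.0)
print(f"grandparent/grandchild: IBD {grand_ibd:.1%}, HIR-style {grand_hir:.1%}")
```

Only the full-sibling curve moves between the two reporting styles, which is why the gap to the right of the grandparent group widens on IBD platforms.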

Is it really possible for the likelihood that you’ve found a grandparent at 2,510 cM to be that much greater than a half-sibling, aunt, or uncle? Because of how unlikely it is for half-siblings or avuncular pairs to share 2,510 cM, the answer is yes. The caveat to that is that a grandparent/grandchild might be less likely because of age or representation in the population. But, as time progresses and DNA kits remain in the database, the likelihood of finding grandparents will likely increase. You would have to weigh the probabilities against those other factors. And, of course, there are other relationship types that are possible at this number of cM. It could be 3/4 siblings (ranges, prediction), for example, and the amount of FIR sharing should be analyzed separately in cases such as this.

**Comparison to a previously used probability curve**

I calculated these probabilities presumably the same way that it was done in the AncestryDNA white paper. Their probability curves from that paper have been the most widely used method of determining relationship probabilities. However, in their methodology, relationship types are lumped into groups, and sex-specific probabilities aren’t calculated.

I wasn’t sure what to expect once I developed a way to compare my model results to AncestryDNA’s model results. Very few details are given about their methods or data, including anything that could be used to validate their methods or probability results. I find that the white paper probability curves look very similar to the curves that I plotted. Since the simulation I use is validated by standard deviations from Veller et al. (2019 & 2020), this means that the AncestryDNA numbers are probably fairly good. That’s because they used a simulation. Despite my love for data, in genetic genealogy bad data is the name of the game.

**Figure 4**. Relationship probabilities from my simulations on the left compared to those from AncestryDNA on the right. Units are the same for both graphs. The y-axes for both graphs are on a logarithmic scale. This was done at AncestryDNA in order to show the differences in more distant relationships, which were otherwise bunched-up.

The differences for distant cousins can be accounted for by the fact that the probabilities in my dataset were calculated against other, more distant relationships that are not shown here in order to correspond to the AncestryDNA chart. The 3C1R, 4C, etc. probabilities on my graph now don’t add up to 1. They did when 4C1R, 5C, and 5C1R were included, but those are now left out. For relationship types such as the half-sibling/grandparent group, I was able to add up all of the probabilities to make one curve. I could go back and re-calculate the probabilities for 3C1R, 4C, etc. without including more distant relationships, but I think the comparison of graphs is clear as-is.

**Methodology**

To calculate probabilities for the new tool, 500,000 individual pairs were compared from each relationship type. Each pair shares a certain number of cM. Bins 1 cM wide were created, centered on integer values, and the number of pairs for each relationship type was counted for each bin. Those counts were then used to determine the probability of each relationship type at a given cM value. For 500,000 half-siblings, 250,000 paternal and 250,000 maternal half-sibling pairs were included. That allows half-siblings to be equally weighted against grandparent/grandchild relationships, which share the same mean. First cousins include four different sex-specific paths, so each type consisted of 125,000 pairs. Sex-specific probabilities were calculated for relationships including 1st cousins and closer. Sex-specific probabilities are not as different for more distant relatives, plus the number of sex-specific paths increases exponentially (16 types of 2nd cousins), so those differences weren’t included.
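The binning step described above can be sketched in a few lines of Python. This is a simplified illustration, not the code behind the tool: the function names and the tiny data set are hypothetical, and it assumes the same number of simulated pairs for every relationship type, as in the methodology above.

```python
import math
from collections import Counter

def bin_counts(cm_values):
    """Count pairs per 1 cM bin, with bins centered on integer values."""
    # floor(x + 0.5) places x in the bin [n - 0.5, n + 0.5) centered on n.
    return Counter(math.floor(value + 0.5) for value in cm_values)

def relative_probabilities(bins_by_type, cm):
    """Relative probability of each relationship type in the bin centered on cm."""
    counts = {name: bins[cm] for name, bins in bins_by_type.items()}
    total = sum(counts.values())
    if total == 0:
        return {}  # no simulated pairs of any type landed in this bin
    return {name: count / total for name, count in counts.items()}

# Toy stand-in for the simulated pairs (hypothetical values, 4 pairs per type).
simulated = {
    "half-sibling": [1700.2, 1750.4, 1749.6, 1800.0],
    "grandparent/grandchild": [1750.1, 1749.9, 1750.3, 2100.7],
}
bins_by_type = {name: bin_counts(values) for name, values in simulated.items()}
print(relative_probabilities(bins_by_type, 1750))
```

With unequal sample sizes, each count would first need to be divided by the number of pairs simulated for that type.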

The amount of shared DNA between individuals is highly variable. Smoothing of the data was very much necessary, and it was by far the hardest step of the process. Figure 5 shows how un-smooth the curves are for raw data. These curves are actually less realistic than the smoothed curves. For a given set of assumptions and parameters, even in real life, there is some definite probability for each relationship type at each cM value. It is *not* a fuzzy probability. If I increased the number of individual pairs for each relationship type, perhaps to one million or several million, then the probability curves wouldn’t require smoothing. Imagine trying to get an empirical database that large, which would then contain a lot of erroneous data and/or be missing a lot of data erroneously labeled as “outliers.”

**Figure 5**. Un-smoothed probability curves for relationship types 5C1R to full-siblings at AncestryDNA. The y-axis shows the probability of each relationship type relative to all others included. All types here are sex-averaged, although the calculator gives sex-specific probabilities for half-avuncular, 1C, avuncular, half-sibling, and grandparent/grandchild relationships.

I ensured that the smoothing didn’t flatten the curves. I only applied as much smoothing as was necessary to get the curves monotonic over the applicable ranges and then ensured that the probability values were unchanged from what would be expected if you were to draw a curved line along the center of the above probability curves. It’s easy to see in the un-smoothed graph: Grandparent/grandchild relationships are quite different than avuncular and half-sibling relationships.
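For readers who want to experiment with smoothing, here is one possible approach in Python: a centered moving average applied to each type's bin counts, followed by re-normalizing so that the relative probabilities still sum to 1 in every bin. This is a generic technique chosen for illustration, and it is not necessarily the smoothing used for the tool. The window size and the toy counts are assumptions.

```python
def moving_average(counts, window):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed

def smoothed_relative_probabilities(counts_by_type, window=5):
    """Smooth each type's counts, then convert to relative probabilities per bin."""
    smoothed = {name: moving_average(c, window) for name, c in counts_by_type.items()}
    n_bins = len(next(iter(smoothed.values())))
    result = {name: [] for name in smoothed}
    for i in range(n_bins):
        total = sum(series[i] for series in smoothed.values())
        for name, series in smoothed.items():
            result[name].append(series[i] / total if total > 0 else 0.0)
    return result

# Hypothetical noisy counts for two types over seven consecutive 1 cM bins.
counts_by_type = {
    "type A": [4, 9, 5, 10, 6, 12, 7],
    "type B": [10, 6, 9, 5, 8, 3, 6],
}
curves = smoothed_relative_probabilities(counts_by_type, window=5)
print([round(p, 3) for p in curves["type A"]])
```

A wider window gives smoother curves but risks flattening real features, so it is worth comparing the smoothed curve against the raw one, as described above.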

**Advantages of this probability calculator**

- Some relationship types within a group are too different to be treated the same: Grandparents are far different than half-siblings and avuncular relationships. This calculator treats them differently. This and the next point make this calculator especially accurate for close relatives.

- There are significant differences between paternal and maternal recombination rates. This results in much wider ranges of shared DNA between paternal relatives than for maternal relatives. The probability calculator used here allows for those differences.

- The data for IBD probability curves, such as that for 23andMe data, come from IBD data. This is an exceedingly important point. It is not a good idea to use an AncestryDNA graph to try to distinguish between relationships at 23andMe.

- The data used to calculate the probabilities are from the same model and version that made the most accurate tables of shared DNA currently published.

- The probabilities used in this calculator can’t be influenced by erroneous data, whether mislabeled, affected by endogamy, or potentially including multiple unknown relationships.

There are important differences that can be seen with this tool.

For AncestryDNA data, 1,272 cM is the value at which grandparents and great-grandparents are equally likely, at about 25.6% probability each. Half-avuncular relationships are 18.6% likely, half-siblings are 11.9% likely, and avuncular relationships are 7.8% likely. This makes a total of 46.3% for the group that includes grandparents, half-siblings, and avuncular relationships and leaves 53.7% for the next group. This is similar to the 50/50 split that AncestryDNA reports, except the former values are broken down by multiple relationship types (including paternal and maternal, which aren’t shown in this example but are included in the calculator), and are validated by peer-reviewed statistics. AncestryDNA hasn’t released any kind of statistics to validate their data.

**Other important notes**

All probabilities are for autosomal DNA only. Please subtract any X-DNA before using the calculator. Also, I recommend subtracting any shared DNA from segments less than 7 cM that may have found their way into your total. Family Tree DNA includes very small segments in their total cM calculations.
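If you have a segment list for a match, the clean-up described above can be sketched in Python as follows. The segment format (a list of dictionaries with `chromosome` and `cm` keys) is a hypothetical one for illustration. Real exports from testing sites use their own column names.

```python
def autosomal_total_cm(segments, min_cm=7.0):
    """Sum segment lengths, skipping X-DNA and segments shorter than min_cm."""
    return sum(
        seg["cm"]
        for seg in segments
        if str(seg["chromosome"]).upper() != "X" and seg["cm"] >= min_cm
    )

# Hypothetical segment list for one match.
segments = [
    {"chromosome": 1, "cm": 45.2},
    {"chromosome": 7, "cm": 5.1},     # dropped: shorter than 7 cM
    {"chromosome": "X", "cm": 30.0},  # dropped: X-DNA
    {"chromosome": 12, "cm": 18.3},
]
print(f"{autosomal_total_cm(segments):.1f} cM")
```

The resulting total is the value to enter into the calculator.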

The above probabilities assume no endogamy or other pedigree collapse. Those cases should be treated separately.

Multiple cousin relationships are not included here, but you can see the averages and ranges or use a multiple cousin relationship predictor for double 1st cousins and 3/4 siblings.

Parent/child relationships are not included here. They are easy to distinguish from other relationships, including full-siblings. Parent/child relationships consist of a half-identical match across the whole length of the genome. Full-siblings share 25% fully-identical regions, on average. Genotyping sites will take this into account in their relationship prediction. If a relationship is predicted to be parent/child, full-sibling is not a possible relationship and there is no need to analyze the shared DNA amount here.

Relationships more distant than 1C1R and half-1C are grouped together by those with the same average shared DNA. Also, half-avuncular relationships are treated the same as siblings of grandparents, which are called great- or grand-avuncular relationships. They are treated the same because the curves are the same, as are any other relationship types that share the same curve. For each curve shown in the figure at the bottom of the page, 500,000 pairs were simulated. Therefore, relative probabilities of each relationship type are based on the assumption that an equal number of each are possible in the population. While this assumption isn’t true, it’s the best way to generate probabilities. Age and other factors, such as the likelihood that your unknown great-grandparent or great-grandchild is the DNA match you’ve found, should be taken into consideration. It’s probably more likely that a 1,200 cM match is a half-avuncular relationship than a great-grandparent, despite the fact that, if they were equally likely relatives to find as DNA matches, the cM value alone suggests great-grandparent is more likely.
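One way to take those other factors into account is to multiply each equal-weight probability by your own estimate of how plausible that relationship is, and then re-normalize. The Python sketch below shows the arithmetic. All of the numbers in it are hypothetical, chosen only to illustrate the great-grandparent versus half-avuncular example, and the weights are a judgment call that the calculator does not provide.

```python
def reweight(equal_weight_probs, prior_weights):
    """Combine equal-weight probabilities with prior weights and re-normalize.

    Types missing from prior_weights keep a weight of 1.0.
    """
    weighted = {
        name: prob * prior_weights.get(name, 1.0)
        for name, prob in equal_weight_probs.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}

# Hypothetical calculator output for a match, restricted to two types.
equal_weight_probs = {"great-grandparent": 0.55, "half-avuncular": 0.45}

# Hypothetical judgment: a great-grandparent is one-fifth as likely to be
# in the database as a half-aunt or half-uncle.
prior_weights = {"great-grandparent": 0.2, "half-avuncular": 1.0}

adjusted = reweight(equal_weight_probs, prior_weights)
for name, prob in adjusted.items():
    print(f"{name}: {prob:.1%}")
```

With these made-up inputs, the ranking flips in favor of the half-avuncular relationship even though the cM value alone favored great-grandparent.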

These probabilities are only calculated as far back as 5C1R. The huge advantage of this tool, other than the accuracy of the data, is that it treats close relatives as not being in the same group because the curves are significantly different. For distant relatives, there’s much less certainty about the genealogical relationship for your DNA matches. Matches as low as 8 cM are allowed here, although the relationship may be farther back than 5C1R. Even so, the relative probabilities may be accurate at those low values. Indeed, any of the probabilities shown above are only relative to the other relationships listed, so they’re only meaningful in comparison to the other relationships. And there’s no cM value at 8 cM or above at which even a 4C1R is the most probable relationship. So, while the probability of an 8 cM match may be higher for “4C1R or more distant,” listing each relationship type separately would not result in more useful information. Not only are very low cM values difficult to assign to a recent ancestor, but segments of 20 cM or 30 cM may be on pile-up regions and therefore come from very distant ancestors.

Totals will not always add up to 100%. When multiple relationship types are present, the chance of rounding errors increases. I don’t believe that the totals are ever off by more than 0.2 percentage points.
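A quick Python illustration of why rounded values can miss 100%: it assumes percentages are rounded to one decimal place for display, and it uses artificial cases in which several relationship types are exactly equally likely.

```python
def displayed_total(probabilities):
    """Sum of the percentages after rounding each one to one decimal place."""
    return sum(round(p * 100, 1) for p in probabilities)

three_equal = [1 / 3] * 3   # each displays as 33.3%
six_equal = [1 / 6] * 6     # each displays as 16.7%

print(f"three equal types: {displayed_total(three_equal):.1f}%")
print(f"six equal types:   {displayed_total(six_equal):.1f}%")
```

The first total falls slightly short of 100% and the second slightly exceeds it, even though the unrounded probabilities sum to exactly 1 in both cases.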

This is not the first tool to show relationship probabilities based on a user input of shared DNA. Jonny Perl has done amazing work at DNA Painter, including probability calculations that can be built into your family tree, and Genetic Affairs has also displayed relationship probabilities.

Here’s a full list of the relationship prediction tools now available on this site:

*If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. **Or, try a calculator** that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.*

Interesting stuff Brit, perhaps it can be used by DNA Painter/WATO as an alternative to the probabilities that are currently used?

I might get back to you concerning an upcoming project I am planning; your data might be the right fit for that. Keep up the good work.

Thanks! I’m glad to share the data.

I second EJ’s comment and it would also help the “Your DNA family” app for a new feature that we’ve recently launched that currently only shows centiMorgan values. Using your more accurate prediction would certainly help in adding more clarity to the users as to what relationship is most likely.

Brit, this is brilliant. I have been factoring in atDNA drop-off but did not account for gender, although it has been showing up as a significant factor, particularly female to female. I’d love to correspond (email attached).

Thanks EJ for pointing me to this information. 🙂

Can you put this in book form so I can underline stuff and take it with me when I travel?

Hi Mary. Is it the pop-up with relationship predictions at GEDmatch that you’d like to have on paper? You should be able to print to a PDF or screenshot any webpage if you want a copy.

Great stuff! I’m wondering if using the number, mean length, and length variance of shared segments would be useful to make prediction even more accurate?

Hi Ted. Thanks! Segment information could definitely be useful for predicting paternal and maternal sides. And the largest segment size would help with endogamy. Unfortunately, I haven’t ever kept data on segment size. But I’d be interested in studying that in the future.

The Total cM column under autosomal does not have clickable links. What am I doing wrong?

Hi Barbara. There are a few versions of the One-to-Many tool in which this feature isn’t available yet. For now, it’s only available to Tier 1 members and you have to be using the tool named “One-to-Many – Full Version.”

My question is that the total cM on my profile and that of my brothers shows one DNA relative to be a half-sibling to me but a grandchild or grandparent of my brothers. That just does not seem right at all. So here is the question: how is that even possible?

Hi Angie. Half-sibling and grandparent/grandchild relationships share the same average: 25%. Aunt/uncle/niece/nephew relationships are also in the same group. So a prediction of half-sibling or grandparent/grandchild based on cM is almost always a guess at one of the possibilities. Some of the predictions at DNA testing sites often don’t make too much sense. They might usually be based on age, but if you and your brother are close in age, then I would’ve expected them to give you two the same prediction. If you got that information from my relationship prediction tool, there are almost always possibilities other than the most likely relationship. One thing that’s possible is a value so low or high that grandparent/grandchild is possible but half-sibling isn’t. For example, a match of over 2,500 cM is very unlikely to be a half-sibling or grandparent/grandchild. But if you had to choose between only those two options, half-sibling is almost impossible, making grandparent/grandchild far more likely, despite being very unlikely compared to something like 3/4 or full siblings.

Brit, I have a parent/son relationship that shares 3456 cM in Ancestry, or what I calculate to be 49.73%, which seems reasonable, but the calculator generates an error for values above 46.684% HIR or 3245 cM. The DNA Painter tool does not start generating errors until we get above 50.006%. I wonder if there is a problem with the calculator?

Hi Tim,

I think you’re talking about the predictor on my site (https://dna-sci.com/tools/brit-cim/), right? I didn’t put parent/child relationships into that one from the start for a few reasons. One reason is that I think it’s kind of silly. The same goes for full-siblings most of the time, but I’ve included them. For both relationship types, it’s very easy to see what the relationship is without using a relationship predictor. A match that’s about 50% IBD and entirely comprised of half-identical regions (HIR), i.e. one and only one copy of the entire genome, is a parent/child relationship. A match that’s about 50% IBD or 37.5% HIR, but that includes about 12.5% fully-identical regions (FIR), is a full-sibling match. While there’s some overlap between 3/4 siblings and full-siblings some of the time, the average FIR is much lower (6.25% FIR). For either parent/child or full-sibling relationships, just trust the label given at the original testing site. Or, it’s very easy to see from the One-to-One matching page. Or from the One-to-Many total cM, although self or identical twin will show the same there as for a parent/child.

The reason I included full-siblings is to differentiate from 3/4 siblings, although it isn’t really needed except on the multiple cousin predictor (https://dna-sci.com/tools/multiple-cousin-cim/). But it doesn’t hurt to include full-siblings on all predictors. It had to be one or the other with regard to parent/child or full-sibling, and I think it’s better to include full-siblings. That’s for IBD predictions, because then there’s significant overlap between the two, i.e. the average for full-siblings (50%) is exactly where the parent/child relationships should be. However, for HIR relationship prediction, it’s possible to call anything higher than the range of full-siblings a parent/child relationship. That’s what I’ve done with the new GEDmatch predictions. And I may integrate that into my own relationship predictor soon. But there is no solution for the IBD predictions, which are the default for the 23andMe and percentage input boxes.

The DNA Painter tool includes parent/child because it only works for AncestryDNA data, which is always HIR. So they don’t have the issue of overlap between full-siblings and parent/child. But I’ll note that IBD predictions give much more conclusive results. And I’ll also note that the DNA Painter tool is completely unusable for IBD full-siblings, and thus unusable for 23andMe total cM or percentages for full-siblings (https://dna-sci.com/2021/11/05/has-relationship-prediction-drastically-improved/). So, for now, different predictors bring different things to the table. I’ve chosen what I deem to be the most important ones for the relationship predictors at this site, but I hope to make improvements where possible.

I hope that helps.

Got it, thanks.