Introducing the first relationship predictions to come from a peer-reviewed data source

Two new relationship predictors are now available, one with population weights and one without. The data came from Ped-sim. This is the future of relationship prediction. These innovative tools are collectively named “Orogen,” which derives from a geologic term meaning “mountain building.”

Update, 3 Feb. 2023: You can now get relationship predictions from number of segments and total cMs. This tool often tells you the exact relationship for close family members, including which side the match is on (paternal or maternal).

For the vast majority of our needs, the population weighted predictor is the one to use. Population weights are necessary because we have vastly more distant cousins than close cousins. In the uncommon case that you had a known relative take a DNA test so that you could compare your results, you can use the relationship predictor with no population weights to see if you share a normal amount of DNA for the known relationship type.

Building relationship predictors takes a lot of work, but the process is fairly simple. To get the probabilities of each relationship type relative to other relationship types, one simply creates “bins” for the frequencies of each relationship type. Orogen predictors use 1 centiMorgan (cM) bins. After that, the probabilities for each relationship type can be calculated by dividing the counts of each relationship type in a bin by the total count of all relationship types in that bin. There’s a little more to it, which I’ll describe below, and the work is arduous, but the result of the analysis is a figure like the one below.
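The binning-and-dividing step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind Orogen: the function names, the relationship labels, and the toy cM values are all invented for the example.

```python
from collections import Counter, defaultdict


def bin_counts(samples):
    """Count simulated matches per 1 cM bin.

    samples: dict mapping a relationship label to a list of total shared cM
    values (hypothetical input format).
    Returns a dict mapping bin number (int) to a Counter of labels.
    """
    bins = defaultdict(Counter)
    for rel, values in samples.items():
        for cm in values:
            bins[int(cm)][rel] += 1  # bin n covers [n, n + 1) cM
    return bins


def relative_probabilities(bins):
    """Divide each relationship's count in a bin by the bin's total count."""
    probs = {}
    for b, counts in bins.items():
        total = sum(counts.values())
        probs[b] = {rel: n / total for rel, n in counts.items()}
    return probs


def lookup(probs, cm):
    """Return the relative probabilities for the bin containing `cm`."""
    return probs.get(int(cm), {})


# Toy data: these cM values are made up purely to exercise the functions.
samples = {
    "1C": [850.2, 850.9, 851.4],
    "Half 1C": [850.5, 849.8],
}
probs = relative_probabilities(bin_counts(samples))
print(lookup(probs, 850.7))
```

In the toy data, the 850 cM bin holds two "1C" values and one "Half 1C" value, so the lookup reports two thirds and one third. The real predictors use hundreds of thousands of simulated values per relationship type, which is why the published curves are so much smoother.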

Orogen probabilities from Ped-sim: half-identical (HIR) relative probabilities for AncestryDNA relationship prediction

Figure 1. Probabilities for several relationship types as reported in centiMorgans (cM) at AncestryDNA. These probabilities should be understood to be relative to the other relationship types listed. The curves show what the probability is at a particular cM value, which is what you see when you enter a value into one of the new relationship prediction tools. In the two new tools, relationships go all the way to 8th cousins once removed (8C1R).

Figure 1 above and Figure 2 below look remarkably similar to the probability curves I published in April of 2021. However, there will be some differences in the tools when zoomed in on the probabilities. And one great benefit to the new tools is that the data source has been through the process of peer-review.

AncestryDNA uses the half-identical region (HIR) metric of reporting cMs, which will show full-siblings sharing 37.5% of the total cMs, on average, even though we know the true average is 50%. (Confusingly, they report percentages by the total identical-by-descent (IBD) metric, and as a range.) Below is a figure for 23andMe probabilities. In Figure 2, you can see clearly what a benefit it is to use the IBD metric rather than HIR. When all of the fully-identical regions (FIRs) are included, the full-sibling curve is shifted to the right. This provides more room for the grandparent/grandchild relationship to stand out from the other close family relationships.
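The 37.5% and 50% figures for full siblings follow from simple arithmetic. The sketch below assumes the standard expectation that full siblings are fully identical on a quarter of the genome, half-identical only on half of it, and not identical on the remaining quarter; the variable names are my own.

```python
# Expected fractions of the genome for full siblings (standard expectation).
P_FIR = 0.25        # identical on both copies (fully identical region)
P_HALF_ONLY = 0.50  # identical on exactly one copy
P_NONE = 0.25       # not identical on either copy

# HIR metric: a region is counted once whether it is half or fully identical,
# so every matching region contributes one of its two copies.
hir_fraction = (P_HALF_ONLY + P_FIR) * 0.5

# IBD metric: fully identical regions are counted on both copies.
ibd_fraction = P_HALF_ONLY * 0.5 + P_FIR * 1.0

# A parent and child are half-identical across the whole genome,
# so both metrics give the same answer for them.
parent_child_fraction = 1.0 * 0.5

print(hir_fraction, ibd_fraction, parent_child_fraction)  # 0.375 0.5 0.5
```

Counting the fully identical regions twice is what moves full siblings from 37.5% up to 50%, which is the rightward shift of the full-sibling curve visible in Figure 2.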

Orogen probabilities from Ped-sim: (IBD) relative probabilities for 23andMe relationship prediction

Figure 2. Probabilities for several relationship types as reported in centiMorgans (cM) at 23andMe. These probabilities should be understood to be relative to the other relationship types listed. The curves show what the probability is at a particular cM value, which is what you see when you enter a value into one of the new relationship prediction tools. In the two new tools, relationships go all the way to 8th cousins once removed (8C1R).

There is a great site called DNA Painter that’s full of useful tools. One of those tools is very popular and has been skillfully built by Jonny Perl. The data source for the probabilities came from a plot from simulations done at AncestryDNA. This was groundbreaking when it came out in 2016. The methodology doesn’t discuss much about how the probabilities were generated. The relationship predictor at DNA Painter uses data points that were obtained from the graph using an online plot digitizer. (It probably shouldn’t have to be mentioned that it would’ve been ideal to obtain the original data that made the curves.) Moving forward six years, we now have relationship predictors from Ped-sim, which has been peer-reviewed in a science journal. This makes the predictors at DNA-Sci much more trustworthy than the data from Ancestry simulations. Additionally, at least 18 million data points went into the new relationship predictors versus a mere few dozen that are visible in Figure 3, resulting in much smoother curves for Figures 1 and 2.

AncestryDNA probability curves from the matching white paper

Figure 3. Probability curves from the 2016 AncestryDNA matching white paper.

Methodology

Update, 29 Sept. 2022: The methodology for Orogen is now published in a science journal.

The data used for the Orogen predictions came from Ped-sim. In this case, the refined genetic map of Bhérer et al. (2017) was used as well as the crossover interference parameters of Campbell et al. (2015). I compiled 500,000 data points for each relationship type.

After obtaining all of the data, I applied low cM cutoff values for each of the two DNA sites. For 23andMe, a match with all segments less than 7 cM was discarded. If a match does have segments of 7 cM or more, then only segments below 5 cM were discarded. For AncestryDNA a simple cutoff of 8 cM was used, so any segments less than that were discarded.
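The two cutoff rules above can be written out as a short Python sketch. The function names and the list-of-segment-lengths input are assumptions made for illustration, and the thresholds are the nominal 5, 7, and 8 cM values stated in the text, with no genetic map adjustment applied.

```python
def filter_23andme(segments):
    """Apply the 23andMe-style rule to one match.

    segments: list of segment lengths in cM (hypothetical input format).
    If every segment is under 7 cM, the whole match is discarded.
    Otherwise only segments under 5 cM are dropped.
    """
    if all(s < 7 for s in segments):
        return []  # whole match discarded
    return [s for s in segments if s >= 5]


def filter_ancestry(segments):
    """Apply the AncestryDNA-style rule: drop any segment under 8 cM."""
    return [s for s in segments if s >= 8]


print(filter_23andme([6.0, 5.5]))       # [] (no segment reaches 7 cM)
print(filter_23andme([7.2, 5.5, 4.9]))  # [7.2, 5.5]
print(filter_ancestry([7.9, 8.0, 12.3]))  # [8.0, 12.3]
```

The total shared cM for a match would then be the sum of whatever segments survive the filter.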

I also had to do a conversion. I used the genetic map lengths from this article to convert all segments proportionally to the centiMorgan length of the Bhérer et al. map. The low-cM cutoff values for 23andMe and AncestryDNA also took this into account.
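A proportional conversion between two genetic maps amounts to multiplying by the ratio of their total lengths. The sketch below shows the idea only. The function name is invented, and the two map lengths are round placeholder numbers, not the real lengths of any site's map or of the Bhérer et al. map.

```python
def convert_cm(length_cm, source_map_cm, target_map_cm):
    """Rescale a cM length from one genetic map to another, proportionally."""
    return length_cm * target_map_cm / source_map_cm


# Placeholder total map lengths (assumed values, for illustration only).
SOURCE_MAP_CM = 3500.0
TARGET_MAP_CM = 3400.0

segment = convert_cm(100.0, SOURCE_MAP_CM, TARGET_MAP_CM)
cutoff = convert_cm(8.0, SOURCE_MAP_CM, TARGET_MAP_CM)
print(round(segment, 2), round(cutoff, 2))
```

The same function handles a cutoff value, which is one way a low-cM threshold could be made to "take the conversion into account": the threshold is rescaled by the same ratio as the segments it is compared against.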

The parent/child relationship was the only one for which I wasn’t able to obtain variable data from Ped-sim. For relationship prediction, including parent/child and full-sibling relationships isn’t entirely necessary, as it’s very easy to tell the difference in other ways and the DNA sites will assign the appropriate relationship label for you based on those methods. However, I wanted to include those two relationship types just in case, for example, two people aren’t on the same site, but have both uploaded to GEDmatch. For parent/child relationships, I generated a normal distribution of data that approximates the data found in the figure below.

AncestryDNA graph that shows the distribution of parent/child relationships after genotyping errors

Figure 4. Graph from AncestryDNA in 2020 that shows the range of DNA possible for parent/child matches after genotyping errors.

Once the counts for each relationship type were placed into 1 cM bins, I had to smooth the data curves. It’s an arduous process and I was meticulous in my work to ensure that each final curve was smooth without being altered on the whole. For each relationship type, I generated individual plots showing the original, fuzzy data, and checked to see that the smooth curve was contained within the original data points. The figure below shows what the curves would look like if the smoothing hadn’t been performed.
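The text above doesn't say which smoothing technique was used, so the following Python sketch substitutes a plain centered moving average just to show the general shape of the step, along with a check in the spirit of "the smooth curve is contained within the original data points." The function names, window size, and toy counts are all assumptions.

```python
def moving_average(counts, window=5):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


def within_envelope(raw, smooth, window=5):
    """Check that each smoothed value lies between the lowest and highest
    raw counts in its neighborhood."""
    half = window // 2
    for i, s in enumerate(smooth):
        lo = max(0, i - half)
        hi = min(len(raw), i + half + 1)
        if not (min(raw[lo:hi]) <= s <= max(raw[lo:hi])):
            return False
    return True


# Toy per-bin counts for one relationship type (invented numbers).
raw = [10, 14, 9, 15, 11, 16, 12]
smooth = moving_average(raw, window=3)
print(smooth)
print(within_envelope(raw, smooth, window=3))
```

A moving average always passes this envelope check when the same window is used, so the check is mainly useful as a guard if a different smoother is swapped in.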

Unsmoothed probabilities from Ped-sim: half-identical (HIR) relative probabilities for AncestryDNA relationship prediction

Figure 5. Unsmoothed probabilities for several relationship types based on the AncestryDNA genetic map. This figure can be compared to Figure 1 to see how smoothing is necessary.

After smoothing the curves, I simply had to calculate the probabilities. This is done for each 1 cM bin by dividing the count for each relationship type by the total count of all relationship types. Then I saved a file of probabilities, which can then be looked up by a web-based tool like Orogen. This is how I generated the probabilities for the tool with no population weights. But I also had to create a relationship predictor with population weights.

The way that I added population weights was very simple. Since each person’s family tree is different, we will never know the exact number of cousins for each person. But what’s really important is that everyone has vastly more distant cousins than close cousins. And this isn’t very hard to approximate on average. I’ve used a typical rate, similar to a birth rate, that others have also used for population weights: 2.5 surviving children per family. This results in five times as many cousins with each generation of ancestors. For example, there would be five times as many 2nd cousins as 1st cousins.
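To show how a weight like this changes the probabilities in a single bin, here is a simplified Python sketch. It handles only full cousins of a given degree and uses the factor of five per generation mentioned above. The function names and the toy counts are invented, and the actual formulas used for Orogen (which also cover other relationship types) may differ.

```python
GROWTH_PER_GENERATION = 5  # five times as many cousins per generation


def population_weight(cousin_degree):
    """Weight relative to 1st cousins (degree 1 has weight 1)."""
    return GROWTH_PER_GENERATION ** (cousin_degree - 1)


def weighted_probabilities(bin_counts, degrees):
    """Multiply each count in a bin by its population weight, then normalize.

    bin_counts: dict of relationship label -> simulated count in one bin.
    degrees: dict of relationship label -> cousin degree (hypothetical input).
    """
    weighted = {
        rel: n * population_weight(degrees[rel]) for rel, n in bin_counts.items()
    }
    total = sum(weighted.values())
    return {rel: w / total for rel, w in weighted.items()}


# Toy bin: invented counts for 2nd and 3rd cousins at some cM value.
counts = {"2C": 300, "3C": 100}
degrees = {"2C": 2, "3C": 3}
result = weighted_probabilities(counts, degrees)
print(result)  # {'2C': 0.375, '3C': 0.625}
```

Without weights, the toy bin would favor 2nd cousins three to one. With weights, 3rd cousins become the more likely answer, because there are simply more of them to be matched with.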

Since April of 2021, I’ve been using the formulas found in this article to determine the number of cousins at a given generational distance. These formulas have the same effect as those used in Henn et al. (2012), as I can see that the values in Table 2 are identical to those from my formulas with two exceptions: they round to the nearest tens or hundreds place in most cases and they seem to not include “removed” cousins in their table.

Please note that a “population growth model” doesn’t require a population to grow. We have many more 5th cousins than 1st cousins regardless of whether or not the population is growing. Even if the rate of surviving children per family were only 1.9, we would have population decline, or negative growth, and we would still have way more 5th cousins than 1st cousins. To avoid confusion I call the above process the applying of “population weights.”
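The point can be checked with quick arithmetic. The sketch below assumes that the per-generation multiplier is twice the number of surviving children per family, which matches the statement that 2.5 children gives five times as many cousins per generation. That doubling rule is my reading, not a formula quoted from the article, and the function name is made up.

```python
def cousin_ratio(children_per_family, degree_far, degree_near=1):
    """Approximate how many times more cousins of `degree_far` there are
    than cousins of `degree_near`, assuming a multiplier of twice the
    children-per-family rate for each generation (an assumption)."""
    factor = 2 * children_per_family
    return factor ** (degree_far - degree_near)


growing = cousin_ratio(2.5, 2)    # 2nd cousins vs. 1st cousins
declining = cousin_ratio(1.9, 5)  # 5th cousins vs. 1st cousins
print(growing, round(declining, 1))
```

Even at 1.9 children per family, the multiplier is 3.8 per generation, so under this assumption there would be roughly 200 times as many 5th cousins as 1st cousins despite the population shrinking.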

When I had the idea to do all of this work, even when I had already done it for several other relationship predictors, I knew that going with Ped-sim was the smartest thing to do. This is the future of genetic genealogy. I plan to continually improve the process for generating and displaying the probability curves shown here and would happily work with other scientists to do so.

All of the most advanced relationship predictors use the same peer-reviewed data source. Here are the tools of note:

DNA-Sci: advancing the science of relationship predictions. You can also find mobile apps for relationship predictions in the Apple Store and on Google Play. Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared X-DNA, shared atDNA percentages, and shared atDNA centiMorgans. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.

6 Comments

David A Stumpf, MD, PhD on February 22, 2022 at 1:53 pm

To increase the use of these tools you might consider an API where the cM are in the request and the reply contains a JSON with the results. This would encourage apps to use your good work!
- Brit Nicholson on February 22, 2022 at 3:59 pm
  
  Hi Dr. Stumpf,
  
  I’ve gotten by so far learning the bare minimum of web development. It was never a goal of mine to learn those skills, but I see more and more how handy it would be to have those!
  - James Carne on February 23, 2022 at 8:40 pm
    
    Thanks Brit, great to see this advancement. Curious about the “saddle-dip” around 1700 cM for the GP/GC chart. Seems an anomaly compared to the other curves. Any idea why?
    - Brit Nicholson on February 24, 2022 at 2:46 pm
      
      Hi James,
      
      This is one of my favorite questions! When I first saw these curves a year ago, what you’re describing blew me away. I only slightly understood it at first, but it came to me over the next few minutes. Each relationship type shown in the curve is represented by 500,000 data points. Relationships with higher standard deviations (wider ranges) but the same average will therefore have far fewer of those 500,000 data points near the mean (25%). Grandparent/grandchild relationships pretty much take the cake for variable relationships. That causes low-variability relationships like aunt/uncle/niece/nephew to be better represented near the average, resulting in a higher peak for those. Meanwhile, the grandparent/grandchild relationship, which is the only one in the 25% average category that can reach values far from the mean, will have peaks at the outer limits. In relative probability plots, the curves all have to add up to 100%, so where there would be a gap between two categories, one has to fill in the space. Where there are multiple categories occupying the same space (for the independent variable), the curves will be depressed. When I figured this out I wondered why I hadn’t seen it coming. But nobody did. I hope all of that makes sense.
Tim Forsythe on May 2, 2022 at 1:58 pm

Brit, thanks for the new predictors. FYI, my software can now import pre-built probability matrices based on your weighted and unweighted calculators. This will allow users to validate their relationships using their preferred source. I’ve also provided a link to this article on several pages, and the weighted calculator on my “What Are DNA Charts?” page (https://gigatrees.com/what-are-dna-charts).
- Brit Nicholson on May 2, 2022 at 2:15 pm
  
  Hi Tim,
  
  Thanks for letting me know. I’ve taken a look and I think it’s a great idea!