Introducing the first ever relationship predictions to come from a peer-reviewed data source
Today I’ve released two new relationship predictors, one with population weights and one without weights. The data came from Ped-sim, which you can read about in this article. I believe that Ped-sim is the future of relationship prediction. The new tools are given the name “Orogen,” which derives from a geologic term meaning “mountain building.”
March 9, 2022 update: 18 types of double cousin relationships are now included in yet another tool based on Ped-sim data!
April 27, 2022 update: Orogen tools become the first relationship predictors to include X-DNA data in percentages, which is how 23andMe reports them.
For the vast majority of our needs, the population weighted predictor is the one to use. Population weights are necessary because we have vastly more distant cousins than close cousins. In the uncommon case that you had a known relative take a DNA test so that you could compare your results, you can use the relationship predictor with no population weights to see if you share a normal amount of DNA for the known relationship type.
Building relationship predictors takes a lot of work, but the process is fairly simple. To get the probabilities of each relationship type relative to other relationship types, one simply creates “bins” for the frequencies of each relationship type. I’ve chosen to create 1 centiMorgan (cM) bins for each site included in the tools, which are currently just 23andMe and AncestryDNA. After that, the probabilities for each relationship type can be calculated by dividing the counts by the totals. There’s a little more to it, which I’ll describe below, and the work is arduous, but what you end up with is a figure like the one below.
Figure 1. Probabilities for several relationship types as reported in centiMorgans (cM) at AncestryDNA. These probabilities should be understood to be relative to the other relationship types listed. The curves show what the probability is at a particular cM value, which is what you see when you enter a value into one of the new relationship prediction tools. In the two new tools, relationships go all the way to 8th cousins once removed (8C1R).
Figure 1 above and Figure 2 below look remarkably similar to the probability curves I published in April of 2021. However, the tools will show some differences when you zoom in on the probabilities. And one great benefit of the new tools is that the data source has been through peer review.
AncestryDNA reports cM using the half-identical region (HIR) metric, which shows full-siblings sharing 37.5% on average even though we know that the true average is 50%. Below is a figure for 23andMe probabilities. In Figure 2, you can see clearly what a benefit it is to use the identical-by-descent (IBD) metric for DNA sharing rather than the HIR metric. The probabilities of the grandparent/grandchild relationship stand out much more when they’re given room between full-siblings and the other relationship types. That is, full-siblings are shifted to the right on the plot when all of the fully-identical regions (FIR) are included.
Figure 2. Probabilities for several relationship types as reported in centiMorgans (cM) at 23andMe. These probabilities should be understood to be relative to the other relationship types listed. The curves show what the probability is at a particular cM value, which is what you see when you enter a value into one of the new relationship prediction tools. In the two new tools, relationships go all the way to 8th cousins once removed (8C1R).
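The 37.5% and 50% averages for full-siblings can be checked with a little arithmetic. The sketch below assumes the standard expectation that full-siblings share no DNA on 25% of the genome, one copy on 50%, and both copies (FIR) on 25%. The function names are made up for illustration.

```python
# Expected fractions of the genome where full-siblings share
# one copy (half-identical only) or both copies (fully identical).
IBD1, IBD2 = 0.50, 0.25

def hir_percent(ibd1, ibd2):
    """Percent shared when a fully-identical region is counted once (HIR metric)."""
    return 100 * (ibd1 + ibd2) / 2

def ibd_percent(ibd1, ibd2):
    """Percent shared when a fully-identical region is counted twice (IBD metric)."""
    return 100 * (ibd1 + 2 * ibd2) / 2

print(hir_percent(IBD1, IBD2))  # 37.5
print(ibd_percent(IBD1, IBD2))  # 50.0
```

The division by 2 puts both numbers on the scale where a parent and child share 50%. For a parent/child pair (one copy shared everywhere, no FIR) both functions return 50, which is why the two metrics only diverge for relationships that have fully-identical regions.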
There is a wonderful site called DNA Painter that’s full of very useful tools. One of those tools, a relationship predictor, is really popular and has been skillfully built by Jonny Perl. The data source for its probabilities was a plot from simulations done at AncestryDNA. This was groundbreaking when it came out in 2016, but the methodology says little about how the probabilities were generated. The relationship predictor at DNA Painter uses data points that were obtained from the graph using an online plot digitizer. Moving forward six years, we now have two relationship predictors from Ped-sim, which has been peer-reviewed in a scientific journal. Additionally, 18 million data points went into the two new relationship predictors versus a mere few dozen that are visible in Figure 3, resulting in much smoother curves for Figures 1 and 2.
Figure 3. Probability curves from the 2016 AncestryDNA matching white paper.
The data used for the Orogen predictions came from Ped-sim. In this case, the refined genetic map of Bhérer et al. (2017) was used as well as the crossover interference parameters of Campbell et al. (2015). I compiled 500,000 data points for each relationship type.
After obtaining all of the data, I applied low cM cutoff values for each of the two DNA sites. For 23andMe, a match whose segments were all shorter than 7 cM was discarded entirely. If a match had at least one segment of 7 cM or more, then only its segments below 5 cM were discarded. For AncestryDNA, a simple cutoff of 8 cM was used, so any segments shorter than that were discarded.
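The cutoff rules above can be sketched as two small filters. This is a minimal sketch and not the code used to build Orogen: the function names are hypothetical, and the thresholds are the plain values from the text with no genetic map adjustment applied.

```python
def filter_23andme(segments_cm):
    """Apply the 23andMe-style rule to one match's segment lengths (in cM)."""
    if not any(seg >= 7 for seg in segments_cm):
        return []  # no segment reaches 7 cM, so the whole match is discarded
    # The match is kept, but its segments below 5 cM are dropped.
    return [seg for seg in segments_cm if seg >= 5]

def filter_ancestry(segments_cm):
    """Apply the AncestryDNA-style rule: drop every segment below 8 cM."""
    return [seg for seg in segments_cm if seg >= 8]

match = [12.0, 6.0, 4.0]       # toy segment lengths for one simulated pair
print(filter_23andme(match))   # [12.0, 6.0]
print(filter_ancestry(match))  # [12.0]
print(filter_23andme([6.5, 5.5]))  # []
```

The total cM for a pair would then be the sum of whatever segments survive the filter for that site.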
I also had to do a conversion. I used the genetic map lengths from this article to convert all segments proportionally to the centiMorgan length of the Bhérer et al. map. The low-cM cutoff values for 23andMe and AncestryDNA also took this into account.
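A proportional conversion of this kind amounts to multiplying each segment by the ratio of two total map lengths. The sketch below only illustrates that idea: the function name is hypothetical, the two map totals are made-up round numbers and not the real lengths of any map, and the direction of the conversion is an assumption.

```python
def scale_segment(segment_cm, source_total_cm, target_total_cm):
    """Rescale a segment length in proportion to two total map lengths."""
    return segment_cm * target_total_cm / source_total_cm

# Placeholder totals, chosen only so the arithmetic is easy to follow.
SOURCE_TOTAL_CM = 3000.0
TARGET_TOTAL_CM = 3300.0

print(scale_segment(10.0, SOURCE_TOTAL_CM, TARGET_TOTAL_CM))  # 11.0

# A cutoff can be moved between maps the same way, so that a threshold
# defined on one map is applied consistently to segments measured on the other.
print(scale_segment(7.0, TARGET_TOTAL_CM, SOURCE_TOTAL_CM))
```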
The parent/child relationship was the only one for which I wasn’t able to obtain variable data from Ped-sim. For relationship prediction, including parent/child and full-sibling relationships isn’t entirely necessary, as it’s very easy to tell the difference in other ways and the DNA sites will assign the appropriate relationship label for you based on those methods. However, I wanted to include those two relationship types just in case, for example, two people aren’t on the same site, but have both uploaded to GEDmatch. For parent/child relationships, I generated a normal distribution of data that approximates the data found in the figure below.
Figure 4. Graph from AncestryDNA in 2020 that shows the range of DNA possible for parent/child matches after genotyping errors.
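Generating normally distributed stand-in data for parent/child matches could look something like the sketch below. The mean and standard deviation here are placeholders and were not read from Figure 4, and the function name is hypothetical.

```python
import math
import random
from collections import Counter

def simulate_parent_child(n, mean_cm, sd_cm, seed=0):
    """Draw n total-cM values from a normal distribution (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(mean_cm, sd_cm) for _ in range(n)]

# Placeholder parameters: NOT the values behind the Orogen tools.
samples = simulate_parent_child(n=1000, mean_cm=3470.0, sd_cm=10.0)

# Count the simulated pairs in 1 cM bins, the same way as any other relationship type.
bins = Counter(math.floor(v) for v in samples)
print(len(samples), sum(bins.values()))
```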
Once the counts for each relationship type were placed into 1 cM bins, I had to smooth the data curves. It’s an arduous process and I was meticulous in my work to ensure that each final curve was smooth without being altered on the whole. For each relationship type, I generated individual plots showing the original, fuzzy data, and checked to see that the smooth curve was contained within the original data points. The figure below shows what the curves would look like if the smoothing hadn’t been performed.
Figure 5. Unsmoothed probabilities for several relationship types based on the AncestryDNA genetic map. This figure can be compared to Figure 1 to see how smoothing is necessary.
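The smoothing method itself isn’t specified here, so the sketch below swaps in a plain centered moving average as a stand-in, along with a check in the spirit of the one described above: that each smoothed value stays within the range of the raw values around it. Both function names and the window size are assumptions.

```python
def moving_average(counts, half_window=2):
    """Smooth per-bin counts with a centered moving average (stand-in technique)."""
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        window = counts[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def within_raw_envelope(counts, smoothed, half_window=2):
    """Check that every smoothed value lies between the nearby raw min and max."""
    for i, value in enumerate(smoothed):
        lo = max(0, i - half_window)
        hi = min(len(counts), i + half_window + 1)
        window = counts[lo:hi]
        if not (min(window) <= value <= max(window)):
            return False
    return True

raw = [3, 9, 4, 12, 8, 15, 11, 18, 14]   # toy counts for consecutive 1 cM bins
smooth = moving_average(raw)
print(within_raw_envelope(raw, smooth))  # True
```

A moving average always passes this check by construction. The check matters more for other smoothers, such as fitted curves, which can drift outside the raw data.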
After smoothing the curves, I simply had to calculate the probabilities. This is done for each 1 cM bin by dividing the count for each relationship type by the total count of all relationship types. Then I saved a file of probabilities, which can be looked up by a web-based tool like Orogen. This is how I generated the probabilities for the tool with no population weights. But I also had to create a relationship predictor with population weights.
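The binning and division steps can be sketched in a few lines. This is a minimal illustration with toy numbers rather than Ped-sim output, it skips the smoothing step, and the function names are hypothetical.

```python
import math
from collections import Counter

def bin_counts(total_cm_values):
    """Count simulated pairs per 1 cM bin (bin k covers k <= cM < k + 1)."""
    return Counter(math.floor(v) for v in total_cm_values)

def probabilities_at(bin_index, counts_by_relationship):
    """Divide each relationship's count in one bin by the bin's total count."""
    counts = {rel: c.get(bin_index, 0) for rel, c in counts_by_relationship.items()}
    total = sum(counts.values())
    if total == 0:
        return {}  # no simulated pair landed in this bin
    return {rel: n / total for rel, n in counts.items()}

# Toy totals (cM) for two relationship types, for illustration only.
toy = {
    "1C": [850.2, 850.9, 900.1],
    "Great-grandparent": [850.4, 700.0],
}
counts_by_relationship = {rel: bin_counts(vals) for rel, vals in toy.items()}
probs = probabilities_at(850, counts_by_relationship)
print(probs)  # 1C gets 2/3 and Great-grandparent gets 1/3 in the 850 cM bin
```

Because the counts are divided by the total for the relationship types included, the result is always relative to that list of relationship types.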
The way that I added population weights was very simple. Since each person’s family tree is different, we will never know the exact number of cousins for each person. But what’s really important is that everyone has vastly more distant cousins than close cousins. And this isn’t very hard to approximate on average. I’ve used a typical rate, similar to a birth rate, that others have also used for population weights: 2.5 surviving children per family. This results in five times as many cousins with each generation of ancestors. For example, there would be five times as many 2nd cousins as 1st cousins.
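One way to picture the weighting is to multiply each relationship type’s count in a bin by its expected number of relatives before dividing by the total. The sketch below is a simplified assumption, not the actual Orogen formulas: it covers only full cousins, and it uses the fivefold ratio directly (going back one more generation doubles the number of ancestral couples, and each extra generation of descent multiplies by 2.5).

```python
CHILDREN_PER_FAMILY = 2.5
GENERATION_RATIO = 2 * CHILDREN_PER_FAMILY  # 5 times as many cousins per degree

def weighted_probabilities(counts, cousin_degree):
    """Weight each relationship's bin count by 5**degree, then normalize."""
    weighted = {
        rel: n * GENERATION_RATIO ** cousin_degree[rel]
        for rel, n in counts.items()
    }
    total = sum(weighted.values())
    return {rel: w / total for rel, w in weighted.items()}

# Toy bin in which 1st and 2nd cousins were simulated equally often.
counts = {"1C": 10, "2C": 10}
degree = {"1C": 1, "2C": 2}
weighted = weighted_probabilities(counts, degree)
print(weighted)  # 2C ends up five times as probable as 1C
```

Without weights this bin would be an even split. With them, the more distant relationship wins simply because there are more such relatives to be matched with.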
Since April of 2021, I’ve been using the formulas found in this article to determine the number of cousins at a given generational distance. These formulas have the same effect as those used in Henn et al. (2012), as I can see that the values in Table 2 are identical to those from my formulas with two exceptions: they round to the nearest tens or hundreds place in most cases and they seem to not include “removed” cousins in their table.
Please note that a “population growth model” doesn’t require a population to grow. We have many more 5th cousins than 1st cousins regardless of whether or not the population is growing. Even if the rate of surviving children per family were only 1.9, we would have population decline, or negative growth, and we would still have way more 5th cousins than 1st cousins. To avoid confusion, I call the above process applying “population weights.”
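A quick calculation shows why a shrinking population still has far more distant cousins. The formula in this sketch is a simplified assumption that is consistent with the fivefold ratio at 2.5 children per family. It is not necessarily the exact formula from the linked article, and the function name is hypothetical.

```python
def expected_cousins(degree, children_per_family):
    """Rough expected number of full cousins of a given degree.

    There are 2**degree ancestral couples at the shared generation, each
    ancestor had (children - 1) siblings, and each sibling's line multiplies
    by the number of children for `degree` generations.
    """
    ancestral_couples = 2 ** degree
    siblings_of_ancestor = children_per_family - 1
    return ancestral_couples * siblings_of_ancestor * children_per_family ** degree

for rate in (2.5, 1.9):
    first = expected_cousins(1, rate)
    fifth = expected_cousins(5, rate)
    print(rate, round(first, 2), round(fifth, 1), round(fifth / first, 1))
```

At 1.9 children per family the number of cousins still multiplies by 3.8 with each degree, because the doubling of ancestral couples outweighs the below-replacement rate.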
When I had the idea to do all of this work, even though I had already done it for several other relationship predictors, I knew that going with Ped-sim was the smartest thing to do. This is the future of genetic genealogy. I plan to continually improve the process for generating and displaying the probability curves shown here and would happily work with other scientists to do so.
Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.