The most important aspect of genetic genealogy that nobody talks about
It first occurred to me to write this article about a year ago. I thought that I had much more important articles to write and tools to create, but I’ve recognized that the need for an article about data accuracy has increased with each passing day. It was probably never safe for me to assume that any of the information here is obvious. We’ve never had a set of criteria by which to evaluate the accuracy of any of the datasets we use.
Three simple rules
What we normally want out of a genetic genealogy dataset is an accurate range of shared DNA for a particular relationship type. This may be a 99% confidence interval or any other range. There are three qualities that must be met in order for a genetic genealogy dataset to be accurate:
Rule 1. The averages must be correct. This is the easy one, although popular datasets sometimes don’t meet this requirement.
Rule 2. The standard deviations or variances must be correct. These are easy to calculate for any dataset but traditionally they haven’t been made available.
Rule 3. The histograms or probability density functions must have the right shape. This is the most ambiguous of the three requirements, but we’ll see that it does matter.
We could state these rules in other ways that would be equally valid. But there must be a rule to cover the left-right position of the probability density function, such as an accurate mean or median, as in Rule 1. There also must be rules that cover the left-right spread, such as the standard deviation in Rule 2; and the height of the probability density function or histogram bars, which will be accurate if Rules 2 and 3 are both met. While the range is what we want, there are currently no peer-reviewed science articles with ranges that we could use to validate other datasets. But the standard deviation, which is the square root of the variance, is closely tied to the range and is much easier to work with. More importantly, there are published standard deviations for several relationship types that can be found in science journals. We can compare other standard deviations to those. But first let’s talk a bit more about data distributions and their shapes.
The best known of all distributions is the normal distribution, often called a bell curve. This type of distribution is easily recognizable, as it’s symmetric about the mean and has the shape of a bell. Many genetic relationship types are not normally distributed because they have some probability of no match, so the distributions are squished up against the left side (since there will be no cases of negative shared DNA). Often-cited examples of these relationships are 3rd cousins and more distant. In these cases, the distributions will have a more abrupt cutoff on the left and a long tail to the right, also called positive skew.
Full-siblings are a great example of a relationship type that appears to have a normal distribution, as they share 50% DNA, on average, and there are no known (non-identical twin) sibling pairs who share anywhere near 0% or 100% DNA. If the full-sibling curve is indeed a normal distribution, then we could actually get a range of shared DNA between full-sibling pairs by calculating the 95% confidence interval (95% CI) using this formula: 95% CI = SD * 4, where SD is the standard deviation and the 95% CI is expressed as the width of the interval. We can test that later, as both the SD and 95% CI of full-sibling pairs are already known.
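To see where the factor of 4 comes from, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. The function name `central_width_in_sd` is made up for this illustration, and the calculation assumes an exactly normal curve, which is the very thing being tested.

```python
from statistics import NormalDist


def central_width_in_sd(level: float) -> float:
    """Width of the central `level` interval of a normal curve, in SD units."""
    tail = (1.0 - level) / 2.0
    # Standard normal quantile for the upper cutoff (about 1.96 when level=0.95).
    z = NormalDist().inv_cdf(1.0 - tail)
    return 2.0 * z


if __name__ == "__main__":
    # The 95% width is slightly under 4 SD, so "SD * 4" is a rounded rule of thumb.
    print(round(central_width_in_sd(0.95), 2))
    print(round(central_width_in_sd(0.99), 2))
```

For a normal curve the 95% width works out to about 3.92 SD, so multiplying by 4 slightly overstates the interval.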
If that formula doesn’t work, we could call the full-sibling curve “approximately normally distributed,” as it appears to be a normal distribution but doesn’t meet all of the requirements. For the sake of Rule 3 above, that might be the best we can do—as Rule 3 is the most ambiguous, we can only say whether or not a distribution appears to have the right shape. Rules 1 and 2, however, allow us to immediately label a dataset as inaccurate if the requirements aren’t met.
Means and standard deviations
The averages for relationship types can easily be calculated in your head, with a calculator, or with a pencil and paper. Standard deviations can be found in peer-reviewed science articles for a few different relationship types. The value that empirical datasets seem to reach for the standard deviation of full-siblings is 3.6-3.8%. Veller et al. (2019 & 2020), on the other hand, have calculated standard deviations for several relationship types based on mathematical formulae.
One thing is for sure—it’s very easy to check to make sure that the Rule 1 and 2 requirements are met. That means that we can quickly dismiss for most uses any dataset that has an incorrect mean or standard deviation, although we still might be able to gain some insights by analyzing such a dataset. We should insist that anyone who markets to us a dataset or a tool based on a dataset must also reveal the means and the standard deviations, otherwise we’re left to wonder about the inaccuracy of those statistics. Let’s take a look at some popular genetic genealogy datasets and see how accurate they are.
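A check of Rules 1 and 2 might look something like the sketch below. The function name and the tolerances are hypothetical choices for illustration; the article does not define how close is close enough, so the tolerances would have to be set by whoever runs the check.

```python
from statistics import fmean, pstdev


def check_rules_1_and_2(shared_pct, expected_mean, expected_sd,
                        mean_tol=0.25, sd_tol=0.25):
    """Compare a dataset's mean and SD (in % shared DNA) to reference values.

    mean_tol and sd_tol are illustrative tolerances in percentage points,
    not thresholds taken from any published source.
    """
    observed_mean = fmean(shared_pct)
    observed_sd = pstdev(shared_pct)  # population standard deviation
    return {
        "mean": observed_mean,
        "sd": observed_sd,
        "rule_1_ok": abs(observed_mean - expected_mean) <= mean_tol,
        "rule_2_ok": abs(observed_sd - expected_sd) <= sd_tol,
    }


if __name__ == "__main__":
    # Toy values only; a real check would use thousands of pairs.
    toy_sibling_pairs = [46.0, 48.0, 50.0, 52.0, 54.0]
    print(check_rules_1_and_2(toy_sibling_pairs, expected_mean=50.0, expected_sd=3.7))
```

In the toy example the mean is right but the spread is too narrow, which is the pattern discussed for crowd-sourced data later in the article.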
Comparison of Standard Deviations
Of the three rules for an accurate dataset, the most interesting one is Rule 2 for standard deviations. After quite a few people had requested it for a few years, in March of 2020 the standard deviations were released for the crowd-sourced Shared centiMorgan (cM) Project. Just four months earlier we had seen the first published calculations of what standard deviations should be for a few very close relationships. And the same authors passed peer-review for their standard deviations soon after, in December of 2020. One of the greatest aspects of the calculations from Veller et al. is that they show the sometimes very large difference between paternal and maternal ranges of shared DNA. The only standard deviations that we previously had were sex-averaged statistics for full-siblings, published in 2006 by Visscher et al., and another published by Caballero et al. in 2019.
Table 1 below compares standard deviations of the percentage of shared DNA between popular and peer-reviewed datasets. The four close relationship types included are the only relationship types for which we have standard deviations other than those found here.
Table 1. Standard deviations from various sources for four different relationship types. N.P. = not provided. Pat. = paternal, Mat. = maternal, GP = grandparent/grandchild, Sib. = sibling. Values for the Shared cM Project and Green Chart were converted from cM to percentage by dividing by 6,950 cM and multiplying by 100%. That cM value is the genetic length for AncestryDNA, which likely made up the bulk of the submissions.
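The cM-to-percentage conversion can be sketched as follows. The 6,950 cM total is the AncestryDNA genetic length given in the caption; the function and constant names are just illustrative.

```python
ANCESTRY_TOTAL_CM = 6950.0  # genetic length stated in the Table 1 caption


def cm_to_percent(cm: float, total_cm: float = ANCESTRY_TOTAL_CM) -> float:
    """Convert shared centiMorgans to a percentage of the total genetic length."""
    return cm / total_cm * 100.0


def percent_to_cm(percent: float, total_cm: float = ANCESTRY_TOTAL_CM) -> float:
    """Convert a percentage of shared DNA back to centiMorgans."""
    return percent / 100.0 * total_cm


if __name__ == "__main__":
    print(round(cm_to_percent(3475.0), 1))  # half of 6,950 cM
    print(round(percent_to_cm(25.0), 1))
```

A different testing company would need a different total, so the total is left as a parameter.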
There are some amazing findings in Table 1. The statistics from Veller et al. (2019 & 2020) are the only complete set of peer-reviewed standard deviations for all four of the relationship types shown. The dataset that powers the Brit ciM tool has by far the closest standard deviations to Veller et al. (2019 & 2020). This was intentional, as the natural model was trained to approximate those standard deviations as closely as possible.
The next three standard deviations come from peer-reviewed science articles. In order from highest to lowest, they are SAMAFS, a very small empirical dataset of 1,128 sibling pairs that Caballero et al. (2019) used to validate their simulation; that simulation itself, called “Ped-sim”; and Visscher et al. (2006). Unfortunately, “full-sibling” is the only relationship type for which we have standard deviations from those three sources.
With incredibly low standard deviations for all four relationship types, the next source listed is the Shared cM Project. These low values likely come about for the following reasons: removal of good data mislabeled as “outliers,” reluctance of users to submit data they believe might be erroneous, and the naturally lower number of data submissions for non-matching cousins. There is likely one factor that erroneously increases the standard deviations of the crowd-sourced dataset: the labeling of one relationship type as another. If it weren’t for those errors, the already low standard deviations would be even lower. Another major drawback is the lack of differentiation between paternal and maternal relationships. The differences can be large and are very real, as shown by the three more recent scientific papers cited above.
The lowest standard deviations out of all of the sources considered come from what people refer to as the Green Chart. There are no standard deviations reported there, but the ranges are even narrower than those of the Shared cM Project. Users might find the lower overlap between relationship types to be convenient, but that makes the ranges grossly inaccurate. There will also be some confusion when looking at the full-sibling ranges in the Green Chart and the Shared cM Project. Both sources mix half-identical region (HIR) data with IBD (identical-by-descent, or HIR plus fully-identical regions) data. This leads to the bizarre ranges of 1,613-3,488 cM (~23-50%) for the Shared cM Project and 2,300-3,900 cM (~33-56%) for the Green Chart, where the values at the low ends for both charts are obviously HIR and the high ends are IBD. Using the reported averages and the low ends of the 99% confidence intervals, and assuming the ranges are symmetric about the average, one can approximate the high ends of HIR sharing for both charts: 3,613 cM for the Shared cM Project and 2,950 cM for the Green Chart. A range of 2,300-2,950 cM for full-siblings in the Green Chart is drastically narrower than that of any other dataset, including the Shared cM Project (1,613-3,613 cM), whose range is actually far too wide despite the very low reported standard deviation. I’m not sure if it’s the standard deviation that’s correct, in which case it’s really low; or the range that’s correct, in which case it’s the widest range of the seven sources considered.
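The high-end approximation above amounts to mirroring the low end around the average. A minimal sketch, assuming a symmetric range: the averages used here (2,613 cM and 2,625 cM) are the ones implied by the low and high ends quoted in the paragraph, not figures copied from either chart, and the function name is made up for this example.

```python
def mirrored_high_end(mean_cm: float, low_cm: float) -> float:
    """High end of a range that is assumed symmetric about the mean."""
    return 2.0 * mean_cm - low_cm


if __name__ == "__main__":
    # Means implied by the ranges quoted in the text (an assumption of this sketch).
    charts = {
        "Shared cM Project": {"mean": 2613.0, "low": 1613.0},
        "Green Chart": {"mean": 2625.0, "low": 2300.0},
    }
    for name, stats in charts.items():
        high = mirrored_high_end(stats["mean"], stats["low"])
        width = high - stats["low"]
        print(f"{name}: {stats['low']:.0f}-{high:.0f} cM (width {width:.0f} cM)")
```

The symmetry assumption is reasonable only for relationships like full-siblings whose distribution is roughly bell-shaped; it would not hold for distant cousins with a long right tail.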
Comparison of Means
Getting the right means should be trivial, since you can calculate them in your head: siblings share 50%, you share 25% with an aunt or uncle, 1st cousins share 12.5%, 1st cousins once removed (1C1R) share 6.25%, etc. Slight deviations from these averages are understandable if the sample size isn’t large enough. Visscher et al. (2006) found that the average shared DNA between full-siblings is 48%. I’ve found that 2nd cousins share 3.1225% rather than 3.125% even after 1 million trials on occasion. However, there are some examples of large datasets with averages that are very far away from the theoretical value. It doesn’t appear that any of the averages at the Shared cM Project are exactly right. But the difference between actual and reported averages is more noticeable with increasing distance of relationship in that chart. Table 2 shows the increase of the percent error with increasing distance of relationship.
Table 2. Comparison of theoretical means to those of the Shared cM Project. Empirical values were converted from cM to percentage by dividing by 6,950 cM and multiplying by 100%, the same conversion used for Table 1.
All of the empirical means shown in Table 2 are overestimated. This is a strong indication that non-matching cousins are underrepresented in the crowd-sourced project, as would be expected.
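The theoretical cousin means and the percent error can be computed with a few lines of code. This is a sketch that covers only full cousins (each step away halves the expected share); the function names are illustrative, and the only empirical figure used is the 3.1225% simulation average for 2nd cousins mentioned above.

```python
def expected_cousin_share_pct(degree: int, removed: int = 0) -> float:
    """Theoretical mean % shared by full `degree`-th cousins, `removed` times removed."""
    return 100.0 * 0.5 ** (2 * degree + 1 + removed)


def percent_error(empirical: float, theoretical: float) -> float:
    """Signed percent error; positive means the empirical mean is an overestimate."""
    return (empirical - theoretical) / theoretical * 100.0


if __name__ == "__main__":
    print(expected_cousin_share_pct(1))      # 1st cousins
    print(expected_cousin_share_pct(1, 1))   # 1st cousins once removed
    print(expected_cousin_share_pct(2))      # 2nd cousins
    print(round(percent_error(3.1225, expected_cousin_share_pct(2)), 2))
```

A positive percent error that grows with relationship distance is the pattern one would expect if non-matching cousins are missing from a dataset.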
I’ve already mentioned that, of the three rules for an accurate dataset, Rule 3 on histogram shape is the most difficult to judge. It will likely be some time before anyone can say with high confidence what shape the histogram for each relationship type should have. I would’ve liked to use histograms from Ped-sim, which is a very realistic and accurate model, to compare to other histograms. Surprisingly, I was unable to find any.
The only full-sibling histogram that I know of and can attest to the accuracy of is from my model. I also happen to know how to make a histogram of full-siblings that has the right average, the right standard deviation, and the wrong shape. For the one with the wrong shape, which I’ll call the “unrealistic model,” I made a ridiculous simulation in which there were two autosomal chromosomes rather than 22. I made one of the chromosomes quite large, at 94.6% of the total base pairs, and 99.7% of the recombination occurred on that chromosome. Other than that, the averages for paternal, maternal, and total number of recombinations were very realistic. The standard deviations from this purposely bad simulation were very close to those found in Veller et al. (2019 & 2020), just like when I simulate 22 autosomal chromosomes, and of course the averages are correct for all simulations. So let’s see a comparison of full-sibling data from the unrealistic model to that of the realistic model.
Figure 1. Histogram comparison of two datasets of IBD sharing in full-sibling pairs. The realistic model is shown in purple and the unrealistic model in green. Both datasets have exact means and standard deviations very close to those of Veller et al. (2019 & 2020), but the histogram shapes are different.
Figure 1 shows the importance of histogram shape. Both the realistic and unrealistic models result in the correct standard deviations and averages, but the shape is clearly different. This will lead to different ranges. Data from the unrealistic model show a 99% confidence interval of 40.1-59.8%, while the 99% confidence interval for the realistic model is 39.5-60.5% shared DNA. We have every reason to trust the realistic model over the unrealistic one. The range for the unrealistic model is incorrect. You can say with certainty that a dataset with the wrong standard deviations and averages is inaccurate, but a dataset that meets those requirements isn’t guaranteed to be accurate.
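The same point can be made with two textbook shapes. The sketch below is not either of the simulations described above; it simply compares a normal curve and a flat (uniform) curve that are given the same mean and standard deviation, using the 4.098% full-sibling SD quoted later in this article. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def normal_central_interval(mean: float, sd: float, level: float):
    """Central `level` interval of a normal distribution."""
    tail = (1.0 - level) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def uniform_central_interval(mean: float, sd: float, level: float):
    """Central `level` interval of a uniform distribution with the same mean and SD."""
    half_width = sd * sqrt(3.0)  # uniform on [mean-h, mean+h] has SD h / sqrt(3)
    return mean - level * half_width, mean + level * half_width


if __name__ == "__main__":
    mean, sd = 50.0, 4.098
    for label, func in (("normal", normal_central_interval),
                        ("uniform", uniform_central_interval)):
        low, high = func(mean, sd, 0.99)
        print(f"{label}: {low:.1f}-{high:.1f}%")
```

Both curves pass Rules 1 and 2 by construction, yet their 99% ranges differ by several percentage points, which is why Rule 3 is needed.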
More than Just Ranges
After verifying that a dataset is accurate, it can be used to build very powerful tools. Despite the fact that I created the only relationship prediction tool from a dataset with validated means and standard deviations, I have to confess that I never saw any use in a relationship prediction tool until this past April. (Although I understood the great value of tools that use trees and the shared cM between multiple people.) My logic was that, if you have a cM value, shouldn’t it be easiest to just look at the average that it’s closest to and to also make sure that the value is within range? It was only in building a relationship predictor with more options that I realized how necessary it was. There’s just no way that a person could do any of the following in their head:
- Calculate the differences in maternal and paternal relationships.
- Calculate how much more accurate a prediction of “half-sibling” is than “grandparent/grandchild” when two people match at 25%.
- Calculate how much more likely “grandparent/grandchild” is than “half-sibling” when far from the average of 25%.
- Apply population weights to the data.
It was only after adding these features to my tool that I ever used a relationship predictor. Now I use my own dozens of times per day. If you’re one of those people who never saw the benefit, I agree. But now it has features that we can’t afford to miss.
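To give a feel for the last item in that list, here is a rough sketch of one way population weights could be combined with per-relationship likelihoods. It assumes the weights act as multiplicative priors in the style of Bayes' rule, which is an assumption of this example and not a description of how any particular tool works. The relationship labels, likelihoods, and weights are all hypothetical numbers.

```python
def weighted_probabilities(likelihoods: dict, population_weights: dict) -> dict:
    """Multiply each relationship's likelihood by its weight, then normalize to 1.

    Relationships missing from population_weights get a weight of 1.0.
    """
    unnormalized = {
        rel: likelihoods[rel] * population_weights.get(rel, 1.0)
        for rel in likelihoods
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all weighted likelihoods are zero")
    return {rel: value / total for rel, value in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical inputs: equal likelihoods at some cM value, but the second
    # relationship type is assumed to be three times as common in the population.
    likelihoods = {"closer relationship": 0.2, "more distant relationship": 0.2}
    weights = {"closer relationship": 1.0, "more distant relationship": 3.0}
    print(weighted_probabilities(likelihoods, weights))
```

Even with identical likelihoods, the weights shift the result toward the more common relationship, which is the kind of adjustment that is hard to do mentally.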
Mine isn’t the only relationship predictor out there. There’s also one hosted at DNA Painter based on AncestryDNA simulations. Unfortunately, we can’t evaluate whether or not Ancestry’s simulated data meet any of the requirements set in Rules 1-3 above, as they’ve revealed almost no information about their simulations. The output data from those simulations actually look ok. I can replicate their probability curve graph pretty well with data of my own, shown in Figure 2.
Figure 2. Relationships probabilities from my simulations on the left compared to those from AncestryDNA on the right. Units are the same for both graphs. The y-axes for both graphs are on a logarithmic scale. This was done at AncestryDNA in order to show the differences in more distant relationships, which were otherwise bunched-up.
Figure 2 shows that the AncestryDNA simulated data used at DNA Painter have probability curves that look ok next to a validated dataset, but there’s one thing that’s amiss. The more distant relationships appear to be skewed high in their probabilities. This is exactly what we saw with the Shared cM Project averages above, and the cause is likely the same, too: the simulations at Ancestry likely don’t include non-matching cousins. That makes it appear that more distant cousin relationships are actually more likely than less distant ones. This increased “weight” for distant cousins moves the probabilities part of the way towards what you would see after applying population weights, which is a very important component to include in a relationship predictor. The problem is that this method isn’t grounded in reality and thus isn’t the proper way to do it, plus it only gets the probabilities a small fraction of the way towards what they would be with population weights.
Relationship predictors should use data with validated means, validated standard deviations, and the right histogram shape. It’s also very important to include the option of population weights. If you want to know how two matches may be related and you’re not using this relationship predictor, you’re really shortchanging yourself.
I mentioned at the beginning of this article that I’d use an equation to check if the probability density function for full-siblings is a normal distribution. Let’s try that out. The formula I gave earlier was 95% CI = SD * 4. The 95% confidence interval for my realistic model (the only 95% confidence interval for full-siblings that I know of) shows a range of 39.5-58.0%, a width of 18.5%. The standard deviation for full-siblings from that model is 4.098%. This results in the following comparison: 18.5% ?= 4.098% * 4 = 16.4%. The two values don’t match, so it appears that the full-sibling probability density function isn’t exactly a normal distribution. That’s what I expected, since one study already showed that genetic relationship distributions are normal with respect to the square root of the amount shared.
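The arithmetic can be reproduced in a few lines. This sketch uses only the figures quoted in the paragraph above (a 4.098% SD and a 39.5-58.0% interval), takes 50% as the mean, and relies on Python's `statistics.NormalDist` for the exact normal cutoff; the variable and function names are illustrative.

```python
from statistics import NormalDist

MEAN_PCT = 50.0
SD_PCT = 4.098                  # full-sibling SD quoted in the text
REPORTED_95_CI = (39.5, 58.0)   # 95% interval quoted in the text


def normal_95_interval(mean: float, sd: float):
    """Interval that a perfectly normal curve with this mean and SD would give."""
    z = NormalDist().inv_cdf(0.975)  # about 1.96
    return mean - z * sd, mean + z * sd


reported_width = REPORTED_95_CI[1] - REPORTED_95_CI[0]
rule_of_thumb_width = 4.0 * SD_PCT
normal_low, normal_high = normal_95_interval(MEAN_PCT, SD_PCT)

if __name__ == "__main__":
    print(f"reported width:      {reported_width:.1f}%")
    print(f"SD * 4:              {rule_of_thumb_width:.1f}%")
    print(f"normal-curve bounds: {normal_low:.1f}-{normal_high:.1f}%")
```

Comparing the normal-curve bounds with the reported interval shows which end of the range departs from what a bell curve would predict.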
I’ve established three long-overdue rules for determining if a genetic genealogy dataset is accurate. They could be written in a slightly different manner, but I believe that the rules stated here will be easiest to work with in the future. I’ve evaluated multiple datasets and their adherence to those rules. For some of the datasets, the standard deviations aren’t reported. For others, the statistics are far off the mark. The data found at dna-sci.com correspond nicely to the peer-reviewed literature. This makes the relationship prediction tool exceedingly accurate. Finally, we had a bit of math fun and determined that the full-sibling probability curve isn’t normally distributed, but that’s to be expected given previous research.
If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.