What is an accurate dataset in genetic genealogy?

by DNA-Sci | Aug 19, 2021 | Blog, DNA Science | 6 comments

The most important aspect of genetic genealogy that nobody talks about

It first occurred to me to write this article about a year ago. I thought that I had much more important articles to write and tools to create, but I’ve recognized that the need for an article about data accuracy has increased with each passing day. It was probably never a safe assumption for me to make that any of the information here is obvious. We’ve never had a set of criteria by which to evaluate the accuracy of any of the datasets we use.

Three simple rules

What we normally want out of a genetic genealogy dataset is an accurate range of shared DNA for a particular relationship type. This may be a 99% confidence interval or any other range. There are three qualities that must be met in order for a genetic genealogy dataset to be accurate:

Rule 1. The averages must be correct. This is the easy one, although popular datasets sometimes don’t meet this requirement.

Rule 2. The standard deviations or variances must be correct. These are easy to calculate for any dataset but traditionally they haven’t been made available.

Rule 3. The histograms or probability density functions must have the right shape. This is the most ambiguous of the three requirements but we’ll see that it does matter.

We could state these rules in other ways that would be equally valid. But there must be a rule to cover the left-right position of the probability density function, such as an accurate mean or median, as in Rule 1. There also must be rules that cover the left-right spread, such as the standard deviation in Rule 2; and the height of the probability density function or histogram bars, which will be accurate if Rules 2 and 3 are both met. While the range is what we want, there are currently no peer-reviewed science articles with ranges that we could use to validate other datasets. But the standard deviation, which is the square root of the variance, is closely tied to the range and is much easier to work with. More importantly, there are published standard deviations for several relationship types that can be found in science journals. We can compare other standard deviations to those. But first let’s talk a bit more about data distributions and their shapes.
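As a minimal sketch of how the three rules could be checked in practice, the Python below compares a dataset's mean and standard deviation to reference values and bins the data so the shape can be compared by eye. The function names, the tolerances, and the sample values are illustrative assumptions, not part of any published dataset or tool.

```python
import statistics
from collections import Counter

def summarize(shared_pct):
    """Return the mean and population standard deviation of shared-DNA percentages."""
    return statistics.fmean(shared_pct), statistics.pstdev(shared_pct)

def meets_rules_1_and_2(shared_pct, ref_mean, ref_sd, mean_tol=0.1, sd_tol=0.1):
    """Rules 1 and 2: mean and SD each within a chosen tolerance (percentage points)."""
    mean, sd = summarize(shared_pct)
    return abs(mean - ref_mean) <= mean_tol, abs(sd - ref_sd) <= sd_tol

def histogram(shared_pct, bin_width=1.0):
    """Rule 3 aid: count values per bin, keyed by the lower edge of each bin."""
    counts = Counter(int(x // bin_width) for x in shared_pct)
    return {b * bin_width: counts[b] for b in sorted(counts)}

# Hypothetical sample, for illustration only.
sample = [46.0, 48.0, 50.0, 52.0, 54.0]
rule_1_ok, rule_2_ok = meets_rules_1_and_2(sample, ref_mean=50.0, ref_sd=2.83)
bins = histogram(sample, bin_width=2.0)
```

Rules 1 and 2 reduce to a pass/fail comparison here, while Rule 3 still needs a human (or a separate shape test) to judge the binned counts.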

Normal distributions

The best known of all distributions is the normal distribution, often called a bell curve. This type of distribution is easily recognizable as it’s symmetric about the mean and has the shape of a bell. Many genetic relationship types are not normally distributed because they have some probability of no match, therefore the distributions are squished up against the left side (since there will be no cases of negative shared DNA). Often cited examples of these relationships are 3rd cousins and more distant. In these cases, the distributions will have a more abrupt cutoff on the left and a long tail to the right, also called positive skew.

Full-siblings are a great example of a relationship type that appears to have a normal distribution, as they share 50% DNA, on average, and there are no known (non-identical twin) sibling pairs who share anywhere near 0% or 100% DNA. If the full-sibling curve is indeed a normal distribution, then we could actually get a range of shared DNA between full-sibling pairs by calculating the width of the 95% confidence interval (95% CI) using this rule of thumb: 95% CI = SD * 4, where SD is the standard deviation (the interval spans roughly two standard deviations on either side of the mean). We can test that later, as both the SD and 95% CI of full-sibling pairs are already known.

If that formula doesn’t work, we could call the full-sibling curve “approximately normally distributed,” as it appears to be a normal distribution but doesn’t meet all of the requirements. For the sake of Rule 3 above, that might be the best we can do—as Rule 3 is the most ambiguous, we can only say whether or not a distribution appears to have the right shape. Rules 1 and 2, however, allow us to immediately label a dataset as inaccurate if the requirements aren’t met.

Means and standard deviations

The averages for relationship types can easily be calculated in your head, with a calculator, or with a pencil and paper. Standard deviations can be found in peer-reviewed science articles for a few different relationship types. The value that empirical datasets reach for the standard deviation of full-siblings falls in the range 3.6-3.8%. Veller et al. (2019 & 2020), on the other hand, have calculated standard deviations for several relationship types based on mathematical formulae.

One thing is for sure—it’s very easy to check to make sure that the Rule 1 and 2 requirements are met. If we’re going to frequently rely on a particular dataset for our research, we would want to know that the means and the standard deviations are accurate. Datasets that have an incorrect mean or standard deviation might be useful for some applications and not for others. Let’s take a look at some popular genetic genealogy datasets and see how the means and standard deviations compare.

Comparison of Standard Deviations

Of the three rules for an accurate dataset, the most interesting one is Rule 2, for standard deviations. After quite a few people had requested it for a few years, in March of 2020 the standard deviations were released for the crowd-sourced Shared centiMorgan (cM) Project. Just four months earlier we had seen the first published calculations of what standard deviations should be for a few very close relationships. And the same authors passed peer-review for their standard deviations soon after, in December of 2020. One of the greatest aspects of the calculations from Veller et al. is that they show the sometimes very large difference between paternal and maternal ranges of shared DNA. The only standard deviations that we previously had were sex-averaged statistics for full-siblings, published in 2006 by Visscher et al., and another published by Caballero et al. in 2019.

Table 1 below compares standard deviations of the percentage of shared DNA between popular and peer-reviewed datasets. The four close relationship types included are the only relationship types for which we have standard deviations other than those found here.

Comparison of standard deviations between popular genetic genealogy datasets

Table 1. Standard deviations from various sources for four different relationship types. N.P. = not provided. Pat. = paternal, Mat. = maternal, GP = grandparent/grandchild, Sib. = sibling. Values for the Shared cM Project and Green Chart were converted from cM to percentage by dividing by 6,950 cM and multiplying by 100%. That cM value is the genetic length for AncestryDNA, which likely made up the bulk of the submissions.
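The cM-to-percentage conversion can be sketched in a few lines of Python. The 6,950 cM genetic length is the value given in the caption; the function and constant names are illustrative assumptions.

```python
ANCESTRY_TOTAL_CM = 6950.0  # genetic length cited in the caption for AncestryDNA

def cm_to_percent(cm, total_cm=ANCESTRY_TOTAL_CM):
    """Convert shared centiMorgans to a percentage of the total genetic length."""
    return cm / total_cm * 100.0

def percent_to_cm(pct, total_cm=ANCESTRY_TOTAL_CM):
    """Inverse conversion: percentage of the genetic length back to centiMorgans."""
    return pct / 100.0 * total_cm

half_of_genome = cm_to_percent(3475.0)  # half of 6,950 cM
```

A different testing company would need a different `total_cm`, so the result is only an estimate when submissions come from mixed sources.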

There are some interesting findings in Table 1. The statistics from Veller et al. (2019 & 2020) are the only complete set of peer-reviewed standard deviations for all four of the relationship types shown. The dataset that powers the Brit ciM tool has by far the closest standard deviations to Veller et al. (2019 & 2020). This was intentional, as the natural model was trained to approximate those standard deviations as closely as possible.

The next three standard deviations come from peer-reviewed science articles. In order from highest to lowest standard deviation, they are: SAMAFS, a very small empirical dataset of 1,128 sibling pairs that Caballero et al. (2019) used to validate their simulation; that simulation itself, called “Ped-sim”; and Visscher et al. (2006). Unfortunately, “full-sibling” is the only relationship type for which we have standard deviations from those three sources.

With incredibly low standard deviations for all four relationship types, the next source listed is the Shared cM Project. These low values likely come about for the following reasons: removal of good data mislabeled as “outliers,” reluctance of users to submit data they believe might be erroneous, and the naturally lower number of data submissions for non-matching cousins. There is likely one factor that erroneously increases the standard deviations of the crowd-sourced dataset: the labeling of one relationship type as another. If it weren’t for those errors, the already low standard deviations would be even lower. Another major drawback is the lack of differentiation between paternal and maternal relationships. The differences can be large and are very real, as shown by the three more recent scientific papers linked to above.

The lowest standard deviations out of all of the sources considered come from what people refer to as the Green Chart. There are no standard deviations reported there, but the ranges are even narrower than those of the Shared cM Project. Users might find the lower overlap between relationship types to be convenient, but convenience isn’t what we want if the ranges are actually wider. There will also be some confusion when looking at the full-sibling ranges of the Green Chart and the Shared cM Project. Both sources mix half-identical region (HIR) data with IBD (identical-by-descent, or HIR plus fully-identical regions) data. This leads to the bizarre ranges of 1,613-3,488 cM (~23-50%) for the Shared cM Project and 2,300-3,900 cM (~33-56%) for the Green Chart, where the values at the low ends for both charts are obviously HIR and the high ends are IBD. Using the reported averages and the low ends of the 99% confidence intervals, one can approximate the high ends of HIR sharing for both charts by mirroring the low end across the average: 3,613 cM for the Shared cM Project and 2,950 cM for the Green Chart. A range of 2,300-2,950 cM for full-siblings in the Green Chart is drastically narrower than that of any other dataset, including the Shared cM Project (1,613-3,613 cM), which despite the very low reported standard deviation is actually far too wide a range. I’m not sure if it’s the standard deviation that’s correct, in which case it’s really low; or the range that’s correct, in which case it’s the widest range of the seven sources considered.
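The approximation of the HIR high ends can be reproduced with a short Python sketch that reflects the low end of a range across the average. The averages used below (2,613 cM and 2,625 cM) are not quoted in the text; they are the values implied by the low and high ends given above and should be treated as assumptions.

```python
def mirrored_high_end(mean_cm, low_cm):
    """Reflect the low end of a range across the mean: high = mean + (mean - low)."""
    return 2 * mean_cm - low_cm

# Averages are implied by the figures in the text above (assumption, not quoted values).
shared_cm_project_high = mirrored_high_end(mean_cm=2613, low_cm=1613)
green_chart_high = mirrored_high_end(mean_cm=2625, low_cm=2300)
```

This only makes sense if the HIR distribution is roughly symmetric about its average, which is itself an approximation.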

Comparison of Means

Getting the right means should be trivial, since you can calculate them in your head: siblings share 50%, you share 25% with an aunt or uncle, 1st cousins share 12.5%, 1st cousins once removed (1C1R) share 6.25%, etc. Slight deviations from these averages are understandable if the sample size isn’t large enough. Visscher et al. (2006) found that the average shared DNA between full-siblings is 48%. I’ve found that 2nd cousins share 3.1225% rather than 3.125% even after 1 million trials on occasion. However, there are some examples of large datasets with averages that are very far away from the theoretical value. It doesn’t appear that any of the averages at the Shared cM Project are exactly right. But the difference between actual and reported averages is more noticeable with increasing distance of relationship in that chart. Table 2 shows the increase of the percent error with increasing distance of relationship.

Comparison of actual cM to the Shared cM Project

Table 2. Comparison of theoretical means to those of the Shared cM Project. Empirical values were converted from cM to percentage in the same way as for Table 1, by dividing by 6,950 cM and multiplying by 100%.

All of the empirical means shown in Table 2 are overestimated. This is a strong indication that non-matching cousins are underrepresented in the crowd-sourced project, as would be expected.
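The theoretical means and the percent error discussed in this section can be sketched as follows. The function names are illustrative assumptions, and the only empirical value used is the 3.1225% 2nd-cousin average mentioned above.

```python
def expected_cousin_percent(degree, removed=0):
    """Theoretical mean shared DNA (%) for full cousins.

    1st cousins share 12.5%; each extra cousin degree divides by 4,
    and each generation of removal divides by 2.
    """
    return 12.5 / (4 ** (degree - 1)) / (2 ** removed)

def percent_error(empirical, theoretical):
    """Signed percent error of an empirical mean relative to the theoretical mean."""
    return (empirical - theoretical) / theoretical * 100.0

second_cousin = expected_cousin_percent(2)                   # 3.125
first_cousin_once_removed = expected_cousin_percent(1, 1)    # 6.25
simulation_error = percent_error(3.1225, second_cousin)      # small and negative
```

A positive percent error means the empirical mean is overestimated, which is the pattern described for the crowd-sourced averages.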

Histogram Shape

I’ve already mentioned that, of the three rules for an accurate dataset, this is the most difficult to judge. It will likely be some time before anyone could say with high confidence what shape each histogram for each relationship type should look like. I would’ve liked to use histograms from Ped-sim, which is a very realistic and accurate model, to compare to other histograms. Surprisingly, I was unable to find any.

The only full-sibling histogram that I know of and can attest to the accuracy of is from my model. I also happen to know how to make a histogram of full-siblings that has the right average, the right standard deviation, and the wrong shape. For the one with the wrong shape, which I’ll call the “unrealistic model,” I made a ridiculous simulation in which there were two autosomal chromosomes rather than 22. I made one of the chromosomes quite large, at 94.6% of the total base pairs, and 99.7% of the recombination occurred on that chromosome. Other than that, the averages for paternal, maternal, and total number of recombinations were very realistic. The standard deviations from this purposely bad simulation were very close to those found in Veller et al. (2019 & 2020), just like when I simulate 22 autosomal chromosomes, and of course the averages are correct for all simulations. So let’s see a comparison of full-sibling data from the unrealistic model to that of the realistic model.

IBD Histograms for full-sibs., realistic and unrealistic models

Figure 1. Histogram comparison of two datasets of IBD sharing in full-sibling pairs. The realistic model is shown in purple and the unrealistic model in green. Both datasets have exact means and standard deviations very close to those of Veller et al. (2019 & 2020), but the histogram shapes are different.

Figure 1 shows the importance of histogram shape. Both the realistic and unrealistic models result in the correct standard deviations and averages, but the shape is clearly different. This will lead to different ranges. Data from the unrealistic model show a 99% confidence interval of 40.1-59.8%, while the 99% confidence interval for the realistic model is 39.5-60.5% shared DNA. We have every reason to trust the realistic model over the unrealistic one. The range for the unrealistic model is incorrect. You can say with certainty that a dataset with the wrong standard deviations and averages is inaccurate, but a dataset that meets those requirements isn’t guaranteed to be accurate.
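To illustrate why matching means and standard deviations don't pin down the range, here is a small Python sketch comparing two textbook distributions with identical mean and SD: a normal distribution and a uniform distribution. These are stand-ins chosen for simplicity, not the realistic and unrealistic models from Figure 1, and the mean and SD values are placeholders.

```python
import math
from statistics import NormalDist

MEAN, SD = 50.0, 4.0  # illustrative values only

def normal_central_interval(mean, sd, coverage):
    """Central interval of a normal distribution containing `coverage` of the mass."""
    tail = (1.0 - coverage) / 2.0
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

def uniform_central_interval(mean, sd, coverage):
    """Central interval of a uniform distribution with the given mean and SD."""
    half_width = sd * math.sqrt(3.0)  # uniform on [mean-h, mean+h] has SD h/sqrt(3)
    return mean - coverage * half_width, mean + coverage * half_width

normal_low, normal_high = normal_central_interval(MEAN, SD, 0.99)
uniform_low, uniform_high = uniform_central_interval(MEAN, SD, 0.99)
```

With these placeholder values the normal 99% interval is noticeably wider than the uniform one, even though both distributions satisfy Rules 1 and 2 equally well.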

More than Just Ranges

After verifying that a dataset is accurate, it can be used to build very powerful tools. Despite the fact that I created a relationship prediction tool from a dataset with validated means and standard deviations, I have to confess that I never saw any use in a relationship prediction tool until this past April. My logic was that, if you have a cM value, shouldn’t it be easiest to just look at the average that it’s closest to and to also make sure that the value is within range? It was only when building a relationship predictor with more options that I realized how necessary it was. There’s just no way that a person could do any of the following in their head:

- Calculate the differences in maternal and paternal relationships.
- Calculate how much more accurate a prediction of “half-sibling” is than “grandparent/grandchild” when two people match at 25%.
- Calculate how much more likely “grandparent/grandchild” is than “half-sibling” when far from the average of 25%.
- Apply population weights to the data.

Now I use my own relationship predictor dozens of times per day. If you’re one of those people who never saw the benefit, I can see why you’d think that. But now it has features that we can’t afford to miss.
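The kind of calculation in the list above can be sketched with Bayes' rule: each candidate relationship gets a weight (a population prior) multiplied by how likely the observed sharing is under that relationship. The sketch below is not the method of any actual predictor. It assumes normal distributions for simplicity, and the standard deviations are made-up placeholders chosen only so that the half-sibling curve is narrower than the grandparent/grandchild curve.

```python
from statistics import NormalDist

def relationship_posteriors(shared_pct, models, priors):
    """Bayes' rule: posterior is prior times likelihood, normalized over candidates."""
    weighted = {rel: priors[rel] * dist.pdf(shared_pct) for rel, dist in models.items()}
    total = sum(weighted.values())
    return {rel: w / total for rel, w in weighted.items()}

# Hypothetical placeholder models: both centered at 25%, with made-up SDs.
models = {
    "half-sibling": NormalDist(25.0, 2.0),
    "grandparent/grandchild": NormalDist(25.0, 3.0),
}
flat_priors = {rel: 1.0 for rel in models}  # equal population weights

at_mean = relationship_posteriors(25.0, models, flat_priors)
far_from_mean = relationship_posteriors(33.0, models, flat_priors)
```

With these placeholders, the narrower curve wins at exactly 25% and the wider curve wins far from 25%. Swapping `flat_priors` for unequal weights is how population weights would enter the calculation.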

Mine isn’t the only relationship predictor out there. There’s also one hosted at DNA Painter based on AncestryDNA simulations. Unfortunately, we can’t evaluate whether or not Ancestry’s simulated data meet any of the requirements set in Rules 1-3 above, as they’ve revealed almost no information about their simulations. The output data from those simulations actually look ok. I can replicate their probability curve graph pretty well with data of my own, shown in Figure 2.

Compare probability curves to those from AncestryDNA simulations

Figure 2. Relationships probabilities from my simulations on the left compared to those from AncestryDNA on the right. Units are the same for both graphs. The y-axes for both graphs are on a logarithmic scale. This was done at AncestryDNA in order to show the differences in more distant relationships, which were otherwise bunched-up.

Figure 2 shows that the AncestryDNA simulated data used at DNA Painter have probability curves that look ok next to a validated dataset, but there’s one thing that’s amiss. The more distant relationships appear to be skewed high in their probabilities. This is exactly what we saw with the Shared cM Project averages above, and the cause is likely the same, too: the simulations at Ancestry likely don’t include non-matching cousins. That makes it appear that more distant cousin relationships are actually more likely than less distant ones. This increased “weight” for distant cousins moves the probabilities part of the way towards what you would see after applying population weights, which is a very important component to include in a relationship predictor. The problem is that this method isn’t grounded in reality and thus isn’t the proper way to do it, plus it only gets the probabilities a small fraction of the way towards what they would be with population weights.

Relationship predictors should use data with validated means, validated standard deviations, and the right histogram shape. It’s also very important to include the option of population weights. If you want to know how two matches may be related and you’re not using this relationship predictor, you’re really shortchanging yourself.

Normal Distribution?

I mentioned earlier in this article that I’d use an equation to check if the probability density function for full-siblings is a normal distribution. Let’s try that out. The formula I gave earlier was 95% CI = SD * 4. The 95% confidence interval for my realistic model (the only 95% confidence interval for full-siblings that I know of) shows a range of 39.5-58.0%, a width of 18.5%. The standard deviation for full-siblings from that model is 4.098%. This results in the following comparison: 18.5% ?= 4.098% * 4 = 16.4%. The two sides don’t match, so it appears that the full-sibling probability density function isn’t exactly a normal distribution. That’s what I expected, since one study already showed that genetic relationship distributions are normal with respect to the square root of the amount shared.
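The check above can be reproduced with Python's standard library. The sketch also computes the exact width a normal distribution would give (about 3.92 standard deviations rather than the rounded 4), which is a standard property of the normal distribution and not a figure from the article. The variable names are illustrative.

```python
from statistics import NormalDist

def normal_ci_width(sd, coverage=0.95):
    """Width of the central interval of a normal distribution with the given SD."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.96 for 95%
    return 2.0 * z * sd

sd = 4.098                     # full-sibling SD from the realistic model
observed_width = 58.0 - 39.5   # width of the reported 95% range
rule_of_thumb = 4 * sd         # the article's SD * 4 formula
exact_normal = normal_ci_width(sd)
```

Both the rule of thumb and the exact normal width come out well below the observed 18.5%, so the conclusion doesn't depend on rounding 3.92 up to 4.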

Conclusions

I’ve established three long-overdue rules for determining if a genetic genealogy dataset is accurate. They could be written in a slightly different manner, but I believe that the rules stated here will be easiest to work with in the future. I’ve evaluated multiple datasets and their adherence to those rules. For some of the datasets, the standard deviations aren’t reported. For others, the statistics are far off the mark. The data found at dna-sci.com correspond nicely to the peer-reviewed literature. This makes the relationship prediction tool exceedingly accurate. Finally, we had a bit of math fun and determined that the full-sibling probability curve isn’t normally distributed, but that’s to be expected given previous research.

Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits. I also have some older articles that are only on Medium.

6 Comments

Bonnie Bossert on September 7, 2021 at 6:33 pm

Have you considered doing any analysis that would incorporate the number of shared segments or the length of segments?
- Brit Nicholson on September 11, 2021 at 9:31 pm
  
  Hi, Bonnie. Thanks for the question. This analysis only concentrated on statistics regarding the total shared percentage or cM of matches. The reason for that is that the number of segments doesn’t seem to be a good predictor of the degree of relationship. It’s true that the longest segment between two matches might be useful for endogamous populations. I’m not sure if that’s something you’re interested in. But that would be a whole other study. And, unfortunately, we don’t have any statistics such as averages and standard deviations for endogamous populations. Table 1 above has statistics from seven different sources for non-endogamous populations. None of those sources has released statistics for endogamous populations, so we don’t have any numbers to compare. And I think it will be quite a while before we have that many sources of endogamous statistics.
Malcolm Peach on October 23, 2021 at 10:47 am

Just wondering where the quoted SDs for GP, Pat half-Sib and Full Siblings are given in Veller et al? Caballero et al quote similar SDs for Full-Siblings where a Poisson model has been employed… My understanding is that Ped-Sim has been calibrated to the 23&Me dataset (for the interference model) so has a practical as well as a theoretical basis.
- Brit Nicholson on October 23, 2021 at 12:00 pm
  
  Hi Malcolm,
  
  Grandparent/grandchild and full-sibling standard deviations are in Table 1 of Veller et al., 2020. The standard deviations for paternal half-siblings are at the end of section 2.2 in Veller et al., 2019.
  
  These are the only sex-specific standard deviations that have been published. I had been waiting a couple of years for them. I know from experience that models can get a fixed standard deviation for full-siblings, but can get very different standard deviations for sex-specific relationships. I look forward to seeing empirical sex-specific standard deviations in the literature someday. But I also recognize the messiness of empirical data in genetic genealogy and the difficulty in removing true outliers, not erroneously labeling good data as outliers, and ensuring that no multiple relationships exist where applicable. I think calculations of standard deviations will be very useful going forward.
Bob Reilly on January 8, 2022 at 12:54 pm

Hi Brit,

I apologize. Let me try that query again without the careless errors!

Could you send me the calculations you used to convert the following sibling SD values to your Table 1 values?
1. From the shared cM Project: SD value of 203 to your SD value of 2.935%
2. From Veller et al 2020: the SD values from their Table 1 to your SD value of 4.039%

Again, thanks very much.

Best,
Bob Reilly
- Brit Nicholson on January 8, 2022 at 3:35 pm
  
  Hi Bob,
  
  Thanks for your question. I’ll break the answers down by the associated number:
  
  1. This is just a conversion from cM to percentage. These are only estimations, but they will be very close for datasets across a population. It looks like I was very generous with the value I chose. Most of the data at the SCP are probably from Ancestry. But some are from other sites with much more cM in their genetic maps. Dividing by 70 would’ve put the value at 2.90%, which is exceedingly low for full-siblings.
  
  2. My value isn’t a calculation or a conversion. What I did was develop a natural model of inheritance. It replicates the known properties of recombination, including the average male and female rates from the scientific literature. Variability in the rates is also allowed according to known properties. It also includes recombination interference. Parents pass half of their DNA to a child, of course, but the amount passed from each grandparent is somewhat random based on the recombination properties. The standard deviations are the values that resulted from the simulation. They are fairly close to the values from Veller et al., and so I consider that to indicate validation of the model.
  
  Best regards,
  
  Brit