It will be hard to capture all of the discoveries that have been made and posted at this site in the past twelve months.

A year ago I was using a very basic natural model of genetic inheritance. I had been publishing averages and ranges of shared DNA for relationships that to this day can’t be found anywhere else. I’ll go into a little more detail on that later, but my main goal here is to try to convey the number of discoveries that have been made since then. I don’t think the field of genetic genealogy has ever seen a year like 2021.

Taking advantage of multiple kits from siblings

In December of 2020 I realized that we could maximize our ability to take full advantage of access to multiple DNA kits. When we have multiple siblings tested, we have a lot more information than we had with just our own kit: 75% of a parent’s DNA, on average, with two total kits; an average of 87.5% with three total kits, etc. Common suggestions have been to take the average of the siblings’ match with a cousin, or to use the higher value, or to look at each one in a relationship predictor and eliminate options that don’t appear in both results. These methods fall way short of using all of the information found in both kits. And all that’s needed is a chromosome browser and a simple calculator to really take advantage of the data. Either column in Table 1 can be used to multiply by the total centiMorgans (cM) from distinct segments between a match and multiple tested siblings.

Table 1. Multiples to use after adding up the distinct segments that you and your siblings have with a match. Yes, these multiples could overestimate the amount of DNA your applicable parent would share with a particular match, but they’re also just as likely to underestimate that amount. Not using the multiple is even more likely to underestimate the amount. The multiples in the table get us the expected value of shared DNA.
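As a rough sketch of where multiples like these can come from: if each tested sibling independently inherits half of the parent's DNA, the expected coverage with n kits is 1 - 0.5^n (75% for two kits, 87.5% for three, as stated above). Taking the reciprocal of that coverage as the multiple is an assumption made here for illustration and may not match the values in Table 1 exactly. The function names are hypothetical.

```python
def expected_parent_coverage(num_sibling_kits: int) -> float:
    """Expected fraction of one parent's DNA covered by n sibling kits.

    Assumes each sibling independently inherits half of the parent's DNA.
    """
    if num_sibling_kits < 1:
        raise ValueError("need at least one kit")
    return 1.0 - 0.5 ** num_sibling_kits


def sibling_multiple(num_sibling_kits: int) -> float:
    """Assumed multiple: the reciprocal of the expected coverage."""
    return 1.0 / expected_parent_coverage(num_sibling_kits)


def estimate_parent_cm(distinct_segment_cm: float, num_sibling_kits: int) -> float:
    """Scale the total cM of distinct segments (across all siblings) up to
    an estimate of what the parent would share with the match."""
    return distinct_segment_cm * sibling_multiple(num_sibling_kits)


if __name__ == "__main__":
    for kits in range(1, 6):
        print(kits, expected_parent_coverage(kits), round(sibling_multiple(kits), 3))
```

The input to `estimate_parent_cm` is the total from distinct segments only, so a segment that two siblings both share with the match is counted once.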

CentiMorgans and percentages

Also in December of 2020 I published an analysis of the different genetic maps used by different genotyping sites. It turned out that the difference in cM is quite large, especially for closer relationships with higher shared cM. I showed that there are some advantages to the humble “percentage.” For example, while cM vary widely across sites, percentages are universal. Any differences, such as when 23andMe includes X Chromosome matching in its totals, are not inherent problems with percentages.

Can you add the averages and ranges for double relationships?

In January of 2021 I set out to answer a question that I was repeatedly seeing in forums and am still seeing these days. A typical version would be something like this: “I have a double 2nd cousin. What’s the average and range of DNA that I could expect to share with them?” The most frequent answers I saw recommended adding the averages for any two relationship types, and adding the two minimums together and the two maximums together to get a new range. I wanted to know if this was good advice. It turns out that, if the double cousinship doesn’t arise as a result of anyone’s parents being related to each other, you actually can add the averages of two relationship types. But the real ranges of shared DNA are narrower than if you add the minimums and maximums. For double 2nd cousins, the common recommendation would underestimate the lower limit of the 99% confidence interval (CI) by 92 cM and overestimate the upper limit by 186 cM. The reason for this is that more meiosis events are involved, which decreases variability. Furthermore, I found that you can’t even add the averages together in cases when parents are related to each other. If a double 2nd cousin is the result of someone’s parents being 1st cousins to each other, you would only want to multiply the average for 2nd cousins by about 1.88 in order to get the right value for double 2nd cousins.
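To see why adding minimums and maximums gives a range that is too wide, here is a simplified sketch. It assumes the two relationship paths contribute independently and that shared DNA is roughly normal, which is a simplification and not the model used for the figures above. The standard deviation passed in is a hypothetical placeholder, and the function names are made up for illustration. The 1.88 factor is the one quoted above for double 2nd cousins whose parents are 1st cousins.

```python
import math

Z_99 = 2.576  # approximate two-sided 99% quantile of the normal distribution


def summed_range_half_width(sd_cm: float) -> float:
    """Half-width you get by adding two 99% intervals end to end."""
    return 2 * Z_99 * sd_cm


def independent_sum_half_width(sd_cm: float) -> float:
    """Half-width of the 99% interval for the sum of two independent
    contributions, each with standard deviation sd_cm."""
    return Z_99 * math.sqrt(2) * sd_cm


def double_second_cousin_average(second_cousin_avg_cm: float,
                                 parents_are_first_cousins: bool = False) -> float:
    """Averages add (factor 2) unless the parents are related; 1.88 is the
    factor quoted in the text for parents who are 1st cousins."""
    factor = 1.88 if parents_are_first_cousins else 2.0
    return factor * second_cousin_avg_cm


if __name__ == "__main__":
    sd = 50.0  # hypothetical standard deviation in cM, for illustration only
    print(summed_range_half_width(sd), independent_sum_half_width(sd))
```

Under these assumptions the real interval is narrower than the added one by a factor of about 0.71, whatever standard deviation is used.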

A new autosomal model

January of 2021 may have been the most important month ever for this website. That’s when I completed the most up-to-date version of my natural autosomal model. I immediately set to work updating averages and ranges for various relationship types, including the first data ever available for 3/4 siblings and double 1st cousins. I thought at the time that my published data included the only sex-specific ranges for any relationship types, but I later found out that Philip Gammon had produced ranges for grandparents as well as great-grandparents, described in an article by Roberta Estes. While I have sex-specific data for a lot of relationships published online, I have even more that’s ready to be put online when I find the time.

I soon made multiple discoveries with this model. One is that paternal half-siblings have a greater range of shared DNA than both types of avuncular relationships, but that maternal half-siblings have a narrower range than both. Another result is that 2nd cousins can sometimes share no DNA, albeit very rarely. And yet another is that there’s a discernible difference between 1st cousins once removed and half-1st cousins. I later found that the team at HAPI-DNA have corroborating evidence for all of these discoveries.

DNA coverage with multiple kits

In February of 2021 I revisited what had been the most popular project I had ever undertaken, and one of the very first. The question is this: How much of an ancestor’s DNA do you reproduce when combining the kits of multiple relatives? Effectively, this tells you how good of a Lazarus kit you could make at GEDmatch. In 2018 I wrote the first article to ever answer these questions. I also put a calculator online that lets you choose various relatives to combine and then shows you the result. (Unfortunately, the page is often inaccessible, especially without clearing browser history.) Now armed with a very accurate autosomal model, I wanted to recalculate those ranges. The new values are a little less optimistic about reproducing an ancestor’s DNA, showing that even as many as five children combined would rarely achieve full DNA coverage for one of their parents.

Ethnicity Estimations

In March of 2021 I took a little break from genetic genealogy to write about ethnicity. This is the first and only article I’ve written on that topic. But I had developed a formula back in 2015 that answers a very frequently asked question: “How far back is the ancestor who gave me a particular ethnicity?” Unfortunately, ethnicity estimations are pretty inaccurate. Plus, the amount of DNA that we have from a particular ancestor just a few generations back is highly variable. And the formula assumes that you got all of your French, for example, from one ancestor in a particular generation. My favorite comment to make on ethnicity results is that my third great-grandparents were born in the early 1800s, and that means that I only have 1/32nd of the DNA needed in order to predict where my ancestors were from only a couple hundred years ago. Still, I thought that it would be fun to calculate the following equation, which gives the average number of generations back, g, to an ancestor for a given percentage (perc.) of ethnicity.

Equation 1. Formula for finding how many generations back an ancestor would be, on average, in order to give you a particular percentage of ethnicity.
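The equation itself is not reproduced in this text. As a hedged stand-in, the simplest form consistent with the assumptions listed above is the halving rule: a single ancestor g generations back contributes 100 / 2^g percent on average, so g = log2(100 / perc.). This is an assumption and may differ from the published Equation 1. The function name is hypothetical.

```python
import math


def generations_back(ethnicity_percent: float) -> float:
    """Average number of generations back to a single ancestor who would
    account for the given ethnicity percentage.

    Assumes the whole percentage comes from one ancestor and that the
    expected contribution halves with every generation.
    """
    if not 0 < ethnicity_percent <= 100:
        raise ValueError("percentage must be in (0, 100]")
    return math.log2(100.0 / ethnicity_percent)


if __name__ == "__main__":
    for percent in (50.0, 25.0, 12.5, 3.125):
        print(percent, generations_back(percent))
```

With this rule, 1/32 of your DNA (3.125%) corresponds to five generations back, which lines up with the third great-grandparent example above.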

Using terminology that isn’t misleading for IBD sharing and full-siblings

Another article I wrote in March of 2021 discussed the misleading phrase “23andMe counts fully-identical regions (FIR) twice.” It seems that the majority of people who come across this phrase believe that this is a bad thing. Indeed, I think that’s often the case for even the people who use the phrase. But FIR DNA is a double match, and thus needs to be counted twice in order to get the full identical-by-descent (IBD) amount of shared DNA that we see in peer-reviewed journals, leading to the 50% average that we know is the true amount for full-siblings. AncestryDNA, MyHeritage, and GEDmatch, on the other hand, all display shared cM in half-identical (HIR) amounts, leading to the strange average of 37.5%. I propose a different way to describe the differences between companies: “AncestryDNA counts FIR as HIR.” While writing this article, I also dispelled the myth that AncestryDNA ignores FIR. As alluded to above, they count it, but only once.
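The 50% and 37.5% figures can be checked with a little arithmetic. Full-siblings are expected to be half-identical only over 50% of the genome and fully identical over 25%. The sketch below computes both ways of counting. The function and parameter names are hypothetical.

```python
def hir_percent(half_only_fraction: float, fully_identical_fraction: float) -> float:
    """Shared percentage when FIR is counted once, i.e. treated as HIR."""
    return 100 * (half_only_fraction + fully_identical_fraction) / 2


def ibd_percent(half_only_fraction: float, fully_identical_fraction: float) -> float:
    """Shared percentage when FIR is counted twice, as a double match."""
    return 100 * (half_only_fraction + 2 * fully_identical_fraction) / 2


if __name__ == "__main__":
    # Expected genome fractions for full-siblings.
    half_only, fully_identical = 0.50, 0.25
    print(hir_percent(half_only, fully_identical))  # 37.5
    print(ibd_percent(half_only, fully_identical))  # 50.0
```

The division by two reflects that a person has two copies of each chromosome, so a half-identical region matches on only one of them.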

A relationship for which the expected value doesn’t equal the average?

A third article in March of 2021 quantified a new concept in genetic genealogy. Grandparent/grandchild relationships fascinate me. There are unique properties of shared DNA for that relationship type. One of the most striking is the high variability of shared DNA. This has led some to mention how one might get much different percentages of shared DNA among their four grandparents, which will likely affect the number of matches they have from them. An article I mentioned earlier by Roberta Estes, which I hadn’t seen yet, showed that it’s not all that likely to share 24-26% with a grandparent, despite 25% being the expected value. But I realized via a little thought experiment that I could get the expected value of the split between grandparents.

We know that the amount of DNA we share with our paternal grandparents has to add up to 50% and likewise for our maternal grandparents. What if you had a large database of shared DNA for a grandchild with both paternal grandparents? Imagine you had a spreadsheet in which the first column always had the smaller value and the second column always had the larger value. You could then take the average of each column, and the result definitely wouldn’t be 25% for either one. I found that the average for the first column would be 22% and the average for the second column would be 28%. After rounding, maternal grandparents have the same values. We know that the average shared DNA for a grandparent/grandchild pair is 25%, but the expected value for the split between grandparents is 22%/28%.
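The spreadsheet thought experiment can be sketched in a few lines. This is not the autosomal model behind the 22%/28% result: it simply draws one grandparent's share from a normal distribution centered on 25% with a hypothetical standard deviation, gives the rest of the 50% to the other grandparent, and averages the smaller and larger columns. All names and the standard deviation are assumptions.

```python
import random
import statistics


def grandparent_split_averages(sd_percent: float,
                               num_grandchildren: int = 100_000,
                               seed: int = 0) -> tuple[float, float]:
    """Average of the smaller and of the larger share between two
    grandparents on the same side, using a normal approximation."""
    rng = random.Random(seed)
    smaller_column = []
    larger_column = []
    for _ in range(num_grandchildren):
        share_a = rng.gauss(25.0, sd_percent)
        share_a = min(max(share_a, 0.0), 50.0)  # keep within possible bounds
        share_b = 50.0 - share_a                # the pair must add up to 50%
        smaller_column.append(min(share_a, share_b))
        larger_column.append(max(share_a, share_b))
    return statistics.fmean(smaller_column), statistics.fmean(larger_column)


if __name__ == "__main__":
    # 3.5 is a hypothetical standard deviation, used only for illustration.
    print(grandparent_split_averages(sd_percent=3.5))
```

Whatever spread is assumed, the two column averages sit symmetrically around 25% and add up to 50%, and the gap between them grows with the variability.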

Relationship prediction

Despite the above three articles I wrote in March, I also worked on relationship prediction for most of the month and published my results in early April. This included the most accurate probabilities for relationship types that have ever existed, so I turned them into a relationship prediction tool. Not only does the tool give you probabilities for any cM value or percentage, but it’s the first to include the differences between maternal and paternal relationships, which are often large, up to and including 1st cousins. This is also the only relationship prediction tool to allow for cM entries for companies other than AncestryDNA. Two months later I added population weights to the predictions, making it even more accurate than before. The predictions are much more realistic than any other tool at low cM. (I kept a copy of the tool without population weights, which is necessary in some situations and for some tools.)

Figure 1. Probabilities relative to other relationship types for AncestryDNA data. Note the large differences between groups that have been considered to have the same curves (and do have the same averages). The sometimes large differences between maternal and paternal relatives, which are included in the relationship predictor, are not shown here. I’ve had very intelligent scientists tell me that there’s a problem with my curves—that grandparent/grandchild histograms only have one peak. My response has been that I regard this to be one of the most important discoveries ever made in genetic genealogy. My histograms also have only one peak. But for probability curves that are relative to those of other relationship types, there can be no gaps in between. All probabilities for a given cM value must add up to 1.

Figure 2. Probabilities for a somewhat randomly chosen match of 2,255 cM. I bet you can see that the in-group differences (grandparent/grandchild vs. half-sibling vs. aunt/uncle/niece/nephew) and the paternal/maternal differences are significant enough to pay attention to.

The differences between the relationships in Figure 2 are quite significant. For one thing, full-siblings are easy to identify by FIR. They’ll be labeled as such by the original testing site and there’s almost no need to include them in relationship predictors. If a 2,255 cM match isn’t labeled as a full-sibling, then there’s a good chance (60%) that they’re a grandparent or grandchild. There’s an even higher probability (65.4%) that the match is paternal. It takes a lot of time to investigate the possible relationships, so a researcher can save a lot of time by investigating the most probable relationships first. In addition to all of the above, there are some values over which maternal relationships have effectively zero probability and they disappear from the predictor results.
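The requirement that all probabilities for a given cM value add up to 1 amounts to a normalization step. Here is a minimal sketch, assuming you already have a likelihood (density) for each relationship type at the cM value in question and, optionally, a population weight for each. The function name and the example numbers are made up and are not taken from the tool.

```python
from typing import Optional


def relative_probabilities(densities: dict[str, float],
                           population_weights: Optional[dict[str, float]] = None
                           ) -> dict[str, float]:
    """Turn per-relationship densities at one cM value into probabilities
    that sum to 1, optionally applying population weights first."""
    weights = population_weights or {}
    weighted = {rel: density * weights.get(rel, 1.0)
                for rel, density in densities.items()}
    total = sum(weighted.values())
    if total <= 0:
        raise ValueError("no relationship has a positive likelihood here")
    return {rel: value / total for rel, value in weighted.items()}


if __name__ == "__main__":
    # Hypothetical densities at a single cM value, for illustration only.
    example = {"grandparent": 0.0012, "half-sibling": 0.0005, "avuncular": 0.0003}
    print(relative_probabilities(example))
```

Because each result depends on the densities of the competing relationships, a relative probability curve can have more than one peak even when the underlying histogram has only one.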

Figure 3. Probabilities relative to other relationship types for 23andMe data. Compared to the AncestryDNA curves in Figure 1, the results for 23andMe show even more striking differences between relationship types that were considered to be in the same group.

Figure 4. A comparison of very low cM probabilities between relationship predictors at DNA-SCI (left) and DNA Painter (right). The DNA Painter tool is very well-built. However, the data come from an AncestryDNA simulation about which no methodology has been released. It also isn’t clear if the AncestryDNA data include essential population weights.

Note the differences between the relationship predictor at DNA-SCI and at DNA Painter for a low value of cM. The tool from this site excels at high cM, but it can also be seen here that it’s more reasonable at predicting relationships at low cM. For a 7 cM match, the top category from AncestryDNA simulations (right), which covers 6th to 8th cousins and further, shows a likelihood of 61%. Categories covering the same relationship types for DNA-SCI (left) have the probability at 70.9%, which is much more in line with what longtime genetic genealogists think about low cM matches, i.e. these are likely very distant matches. The DNA-SCI probabilities show that the possibility of being a 3rd cousin is almost zero for values as low as 7 cM.

More relationship prediction: multiple cousins

In May of 2021 I released another tool, the first and to this day the only existing 3/4 sibling and double 1st cousin relationship predictor. It’s highly recommended to upload to GEDmatch or test at 23andMe to compare 3/4 siblings or double 1st cousins. It was later replaced by this tool:

Figure 5. Relative probability curves for full-siblings, 3/4 siblings, double 1st cousins, and other relationships possible over that interval of cM at 23andMe.

Figure 6. Relationship prediction for 2,555 cM that indicates 3/4 sibling is by far the most likely relationship. The cM value was chosen somewhat randomly, but with the intent of showing a value with a high probability of being 3/4 siblings.

More relationship prediction: parents are related

In July of 2021 I released one more tool. This one is designed to help people whose parents are related. As with the multiple cousin predictor, this was the first tool to give probabilities for the possible relationships, and to this day it remains the only one. I recently learned that Borland Genetics also has a tool that will show possible relationships, but not probabilities. The number to enter into this relationship predictor comes from the GEDmatch Are Your Parents Related? (AYPR) tool. A very important thing to note is that one shouldn’t multiply the value they get from GEDmatch by anything before entering it into the relationship predictor.

Figure 7. Relative probability curves for AYPR results at GEDmatch. The relationship type listed in the legend indicates a DNA tester’s father’s relationship to the mother.

Charles II, Habsburg of Spain

I skipped over this one earlier, but after releasing the traditional relationship predictor in April of 2021 I decided to tackle a problem that’s long and frequently seemed to captivate the minds of people interested in history, royalty, or … physical deformities, I guess—the interbreeding of the Habsburgs, culminating in the demise of the dynasty with Charles II. I sat down for probably less than an hour one night to program about seven generations of his tree into a simulation, although it takes a bit longer than that to run the simulation and write up the results. Previous studies estimated that Charles II had about 20% runs of homozygosity (ROH), which refers to the amount of DNA that’s identical on both copies. I’ve now updated that figure to at least 23.1%. This is barely shy of the 25% that would be expected if his parents were full-siblings to each other.
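The 25% benchmark for full-sibling parents comes from the standard path-counting formula for the inbreeding coefficient, not from the simulation described above. A minimal sketch, assuming the common ancestors are not themselves inbred (clearly not true for the Habsburgs, which is why a simulation of the full tree is needed). The function name and path encoding are hypothetical.

```python
def expected_roh_fraction(paths: list[tuple[int, int]]) -> float:
    """Expected fraction of the genome that is identical on both copies.

    Each path is (generations from the father up to a common ancestor,
    generations from the mother up to that same ancestor). Every path
    contributes (1/2) ** (n_father + n_mother + 1).
    """
    return sum(0.5 ** (n_father + n_mother + 1) for n_father, n_mother in paths)


if __name__ == "__main__":
    # Full-sibling parents: two common ancestors, one generation up each side.
    print(expected_roh_fraction([(1, 1), (1, 1)]))  # 0.25
    # Half-sibling parents: one common ancestor.
    print(expected_roh_fraction([(1, 1)]))          # 0.125
    # 1st cousin parents: two common ancestors, two generations up each side.
    print(expected_roh_fraction([(2, 2), (2, 2)]))  # 0.0625
```

When the same ancestors appear through many paths, as in a heavily interrelated tree, the contributions add up, which is how a value close to the full-sibling level can arise without any sibling pairing.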

Additional information for parents who are related

Averages and ranges

Another article that I skipped over earlier includes the only published averages and ranges of shared DNA between relatives in the case of incest. Also included are ROH values for DNA testers whose parents are related. I wrote this article over a year ago, but promptly updated the results after developing a new model in January of 2021. Results from the AYPR tool at GEDmatch can be compared to the averages and ranges in the article, although I recommend the convenient AYPR/ROH probability calculator.

Improving upon the results of the AYPR tool

A year ago I wrote an article about predicting the relationship that parents have to each other when a DNA tester has a significant result from the AYPR tool at GEDmatch. Earlier on I mentioned what I thought was a brilliant method, developed by Kitty Cooper, that recommends multiplying the AYPR result by four before entering it into a relationship predictor. I was able to confirm that this method gives a better result than any other multiple when the DNA tester’s parents have a first degree relationship to each other (full-sibling or parent/child). However, I then discovered that the multiple has to be less than four for any other relationship, and that the right multiple gets lower with increasing genetic distance between the parents. This means that the method, the purpose of which is to find the relationship between the parents, requires a DNA tester to know the relationship before applying it. I had set out to show that the method works, but unfortunately showed that it doesn’t work in the vast majority of cases. I also provided multiples to use for when multiple full-siblings have tested, although I imagine that these multiples also only work in the limited cases described above. Fortunately, I later released the AYPR/ROH relationship predictor, described above, which did away with any need for a multiple.

Conditional probabilities for two tested siblings and a cousin

In July of 2021 I found a way to quantify the answer to a question that I’ve seen frequently asked. The question goes along the lines of this: “I match with a person at 35 cM but my full-sibling doesn’t match at all. How is this possible?” The typical answer that we’ve had to give was that DNA is variable and that full-siblings don’t share all of the same DNA. But I wanted to quantify this and was in a position to do so. The answer turned out to be that, when one sibling shares 30-40 cM with a 3rd-6th cousin, the other will share zero cM about 17.5% of the time, and zero cM is the second most likely option out of any ten cM range or zero shared cM.

Figure 8. Shared cM frequencies between a sibling and a 3rd to 6th cousin when the other sibling shares the cM range listed in the subplot title. Subplot e, with yellow bars, answers the question posed above about a full-sibling who shares 35 cM with a cousin and another sibling who shares none.

Predicted trees at GeneticAffairs

I felt privileged to work with GeneticAffairs in August of 2021, providing probability data for their new AutoKinship tool. This tool creates a predicted tree for you and a cluster of matches based on the output of the AutoCluster tool. You can read more about AutoKinship from Patricia Coleman and Roberta Estes. I’ve used it on my own DNA kit and found that the accuracy of the predicted trees is pretty astounding. Try it yourself!

Three-quarter siblings

Also in August of 2021 I developed an easy method for telling the difference between full-siblings and 3/4 siblings, refuting the common refrain that 3/4 siblings share amounts of DNA anywhere in the range of half-siblings and full-siblings. I had published several months earlier the ranges for 3/4 siblings, which were obviously different. But now I would show exactly how successful we can be at distinguishing between the two sibling types. I decided to test simple cutoff values for HIR, FIR, and identical-by-descent (IBD) sharing. If a value was above the threshold, I would predict “full-siblings.” If it was below, I would predict “3/4 siblings.” The first two metrics performed very well. But then I was surprised to see that IBD sharing (HIR + FIR), which is the best method, predicts the correct sibling type 94.5% of the time.

Figure 9. HIR cM vs. FIR cM for full-siblings and 3/4 siblings. The red line shows the best IBD cutoff value to use in order to differentiate between full-siblings and 3/4 siblings.
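The cutoff test described above is simple enough to sketch. The actual cutoff value isn't stated in this text, so the one used below is a hypothetical placeholder, as are the example sibling pairs and the function names.

```python
def predict_sibling_type(hir_cm: float, fir_cm: float, ibd_cutoff_cm: float) -> str:
    """Predict the sibling type from IBD sharing (HIR + FIR) and a cutoff."""
    ibd_cm = hir_cm + fir_cm
    return "full-siblings" if ibd_cm > ibd_cutoff_cm else "3/4 siblings"


def cutoff_accuracy(labeled_pairs: list[tuple[float, float, str]],
                    ibd_cutoff_cm: float) -> float:
    """Fraction of (hir_cm, fir_cm, true_label) pairs predicted correctly."""
    correct = sum(
        predict_sibling_type(hir_cm, fir_cm, ibd_cutoff_cm) == true_label
        for hir_cm, fir_cm, true_label in labeled_pairs
    )
    return correct / len(labeled_pairs)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not real measurements.
    pairs = [
        (2600.0, 850.0, "full-siblings"),
        (2650.0, 900.0, "full-siblings"),
        (2400.0, 450.0, "3/4 siblings"),
        (2350.0, 400.0, "3/4 siblings"),
    ]
    print(cutoff_accuracy(pairs, ibd_cutoff_cm=3100.0))
```

Scanning a range of candidate cutoffs with `cutoff_accuracy` over simulated or known pairs is one way to find the value that separates the two sibling types best.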

The accuracy of genetic genealogy datasets

A third project in August examined the accuracy of different datasets in genetic genealogy. I established three rules that allow for the easiest evaluation to determine if a dataset is accurate: the averages, standard deviations, and shape of the distributions must all be correct. The results reiterated what we’ve known for a long time—that empirical data are very messy in genetic genealogy and have trouble getting ranges of shared DNA that aren’t wildly inaccurate. Not only that, but the average shared DNA per relationship, which should be the easiest of the three rules to comply with, is often pretty far off in empirical datasets.

X Chromosome data

September saw a huge development in genetic genealogy. I revisited my X Chromosome model and published the first and only ranges of X-DNA to date. We’ve always known that X-DNA is too variable to determine relationship types from small amounts. But I’ve found that many relationships can be eliminated when a match shares more than about 140 cM of X-DNA. Also, there were some surprises, such as the discovery that X-DNA may be better than autosomal DNA at differentiating between maternal half-siblings and maternal aunts. In the article, averages and ranges of X-DNA are now available up to 2nd cousins once removed.

Small segments

The next article in September concerned small segments. Following a lot of buzz about whether or not we should let people spend their time chasing small segments, which are likely to be false or untraceable, I had to wonder if we were even asking the right question. As a data scientist who’s worked with a lot of empirical and other data in a lot of different fields, one of my greatest concerns is whether or not we’re working with accurate datasets on the whole. I determined that we likely have incorrect shared DNA averages in our datasets. More important than what individuals do with their time, people working with large amounts of data are going to have their analyses adversely affected by inaccurate data and everyone will be working with individual data points that are a little more erroneous than they need to be.

Mathematical formulas for genetic genealogy

For my final article over this 12-month period, I decided to share some mathematical formulas that I’ve been using for anywhere from several months to a couple of years. It will probably only appeal to a very small audience for now. But I have no doubt that in the future, as data science becomes more prevalent in genetic genealogy and more attention is paid to maternal and paternal differences between relatives, a lot of algorithms will employ them.

Final remarks

That was a staggering amount of information. I thought that I had accomplished some things at this point a year ago. But, compared to what was to come, I hadn’t accomplished much more than working individual genetic genealogy cases for several years. I doubt that I’ll be able to write an article like this at the end of any other year. I’ve dedicated thousands of hours to making new discoveries, developing tools, and working with incest case workers, making custom simulations to find averages and ranges for various relatives in unique cases that couldn’t be solved any other way. I’ve forgone my passion for genealogy for most of the year, focusing instead on what I can provide to the community. I’ve never accepted a single payment for any of this. But I don’t think this amount of work will be sustainable in the future.

I leave you with a few questions. Have the people in whom you’ve placed your trust been helping you find this information or trying to keep it from you? Do they tell their close followers that there’s no multiple cousin relationship predictor, hoping to make their own before too many people notice? Does all of this mean that they have your interest in mind or is there some other motivation?

I have one very important discovery to share with you soon, perhaps the functionality that has been requested the most since shared DNA matches became available. I don’t know when it will be available: maybe later this year or in early 2022. Whatever the future holds, I don’t think I’ll ever have another year like 2021 for genetic genealogy discoveries.

If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.