Introducing the first ever relationship predictor to include X-DNA data in the percentage input box
The tool can be found here with the new percentage box available as of 27 Apr. 2022.
Why would we include X-DNA in our total shared centiMorgan (cM) or percentage amounts when we use a relationship predictor? The practical answer is “Because 23andMe reports percentages with X-DNA included in addition to autosomal DNA (atDNA),” so any relationship predictor with a percentage box has to include X-DNA in the data used to generate probabilities if it’s going to be used for 23andMe data. Another answer is that, while X-DNA has been unequivocally confounding* to relationship prediction in some cases, it might result in more accurate predictions in some other cases, just like including all of the fully-identical regions (FIRs) helps to better differentiate full-siblings and 3/4 siblings.
By including X-DNA in the data used to create relationship predictions, we’re getting to the point where X-DNA is far less confounding or not confounding at all. For example, a male DNA tester who is checking the possibility that a match is a paternal grandmother would expect to share about 182 fewer cMs than a female tester would, on average, if the match is indeed a paternal grandmother. All we need to do is indicate the sex of the DNA testers and use the applicable data set. A person indicating two female testers would get “paternal grandparent” probabilities for a granddaughter/grandmother pair that were generated from data with an average of 182 more cMs than a person who indicated two male testers or a male and a female tester. This is actually quite easy to implement since Orogen predictions already list paternal/maternal relationships separately.
For now, I’ve only included cases when both testers are female in the data, and thus in the relationship predictions. I had to obtain almost 18M new data points, 500k for each relationship type, in order to calculate the new probabilities with X-DNA. (I should have obtained the X-DNA along with the atDNA when developing the original atDNA Orogen predictions, as I could have left out the X-DNA for the cM input boxes at the time and then included them now.) Before calculating the new probabilities, I applied an X-DNA low cM cutoff threshold of 6 cMs as well as the atDNA cutoffs previously used.
For the next step I’ll have to obtain a whole new dataset for cases with two male matches. That will require almost 18M more data points. The final step will be to obtain almost 18M more data points for male-female matches. While I’m working on that last step, people who encounter that scenario might want to check the female-female probabilities as well as the male-male probabilities to see what’s possible. The probabilities for a male-female match would definitely be between the two.
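The workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the tool: the function name and the data set labels are hypothetical, and at the time of writing only the female-female data set actually exists.

```python
def datasets_to_check(sex_a: str, sex_b: str) -> list[str]:
    """Return the (hypothetical) data set labels worth consulting for a match.

    Assumption: data sets are keyed by the sexes of the two testers.
    For a male-female pair, both same-sex data sets are returned, since
    the article suggests using them to bracket the true probabilities.
    """
    sexes = sorted([sex_a.lower(), sex_b.lower()])
    if sexes == ["female", "female"]:
        return ["female-female"]
    if sexes == ["male", "male"]:
        return ["male-male"]
    if sexes == ["female", "male"]:
        return ["female-female", "male-male"]
    raise ValueError("each sex must be 'male' or 'female'")


print(datasets_to_check("female", "female"))
print(datasets_to_check("male", "female"))
```

Sorting the two inputs first means the order in which the testers are entered doesn't matter.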
Additional work will need to be done in order to re-calculate probabilities with X-DNA for these relationship predictors:
- A tool that includes 18 different types of double cousins
- A relationship predictor without population weights (for known relatives)
How about looking at an example of how the inclusion of X-DNA affects relationship predictions?
Let’s consider a match between two females who share 2,200 cMs of atDNA and say that their relationship could either be paternal grandmother/granddaughter or paternal half-sisters. This choice is random except for the fact that those two relationships are guaranteed to share about 182 cMs of X-DNA at 23andMe. Adding that to the atDNA gives 2,382 cMs of atDNA and X-DNA combined, which would result in about 32% shared DNA. Let’s see how relationship predictions change once we’ve included X-DNA.
The image on the left includes probabilities generated from only atDNA. I’ve also entered only the atDNA amount shared into the input box. Conversely, the image on the right includes X-DNA in both the data used to generate probabilities and in the percentage input box.
Interestingly, the only two relationship types that were guaranteed to share this much X-DNA have both seen significantly increased probabilities (from 38% to 52% and 22% to 30%). Meanwhile, the maternal grandparent/grandchild, maternal half-sister, or paternal avuncular relationships, which could be expected to share about half as much X-DNA, have probabilities that became about half as likely once X-DNA was included. The maternal avuncular relationship, which could be expected to share about 75% of the X-DNA of paternal half-sisters or paternal grandmother/granddaughter, has a probability that dropped to about 75% of what it was before (from 7.3% to 5.6%).
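To check the proportions quoted above, here is a small Python sketch that divides each "after" probability by its "before" probability. The percentages are the ones given in the text; the dictionary labels and the function name are just for illustration.

```python
def change_ratio(before: float, after: float) -> float:
    """Ratio of the probability with X-DNA to the probability without it."""
    return after / before


# (before, after) probabilities in percent, as quoted in the text.
examples = {
    "guaranteed X-DNA, first pair": (38.0, 52.0),
    "guaranteed X-DNA, second pair": (22.0, 30.0),
    "maternal avuncular": (7.3, 5.6),
}

for label, (before, after) in examples.items():
    print(f"{label}: {change_ratio(before, after):.2f}x")
```

The maternal avuncular ratio comes out at about 0.77, close to the 75% mentioned above, and the two relationships with guaranteed X-DNA both rise by a factor of roughly 1.36 to 1.37.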
I wouldn’t expect those values to remain proportional for all scenarios. That’s probably somewhat of a coincidence. But I do think that we’ve seen a real benefit to including X-DNA in the total cM counts in cases in which the matches are either paternal half-sisters or paternal grandparent/grandchild. The probabilities of the true (hypothetical) relationships significantly improved in this example and I doubt that that was a fluke.
The significance of this development
Whereas before the options were very limited, genetic genealogists now have many more relationship prediction features available to them. Anyone is free to request new features to be added to the people’s relationship predictor known as Orogen. We’ve never before had that ability. I can’t describe how excited it makes me to be able to add these new features and I imagine that anyone reading this probably feels the same way. As a data scientist, I’m obsessed with improving upon the tools that we use to solve genetic genealogy cases.
Two months ago a new suite of relationship predictors was released at this site. These tools, dubbed Orogen predictors, were the first to use probabilities that were generated from a peer-reviewed data source. Previously, the most widely used relationship predictions were from AncestryDNA and appeared to be something of a black box. We know that their probabilities came from simulations in 2016, but the methodology didn’t describe much about the simulations or how the probabilities were generated. The methodology for Orogen predictors, on the other hand, is quite clear. And the data source has been published in a science journal.
The data used for the Orogen predictions came from Ped-sim. In this case, the refined genetic map of Bhérer et al. (2017) was used as well as the crossover interference parameters of Campbell et al. (2015).
Another difference between the AncestryDNA and Orogen predictors is the ability to predict full-siblings at 23andMe. AncestryDNA in the past couple of years has begun reporting total IBD percentages as a range (e.g. 42%-48%). This is a good start, but the range is currently too wide to use in relationship predictions. For cMs, Ancestry reports them by the half-identical region (HIR) metric. This leads to a full-sibling average of 37.5% rather than 50% of the total cMs. Because of that, AncestryDNA simulated probabilities will result in a prediction of parent/child for any value close to 50%, whereas it could easily be a full-sibling relationship. One previous example showed how a perfectly reasonable full-sibling match of 3,384 cM will get a 100% parent/child prediction from AncestryDNA simulations.
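The 37.5% versus 50% figures can be reproduced with a little arithmetic. The sketch below assumes the standard expectation for full siblings: about 25% of the genome fully identical (FIR), about 50% half-identical only, and about 25% not shared. The function name is hypothetical.

```python
def sibling_sharing(fir_fraction: float, hir_only_fraction: float) -> tuple[float, float]:
    """Return (HIR-metric percentage, total IBD percentage).

    The HIR metric counts every shared region as half-identical, so a
    fully-identical region contributes no more than a half-identical one.
    Total IBD counts fully-identical regions at full weight.
    """
    hir_metric = 0.5 * (fir_fraction + hir_only_fraction)
    total_ibd = 0.5 * hir_only_fraction + 1.0 * fir_fraction
    return hir_metric * 100, total_ibd * 100


hir_pct, total_pct = sibling_sharing(fir_fraction=0.25, hir_only_fraction=0.50)
print(hir_pct, total_pct)  # 37.5 50.0
```

A parent/child pair, by contrast, is half-identical across the whole genome with no FIRs, so both metrics give 50% for that relationship.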
Now we have all of the previous benefits of the Orogen predictors plus the ability to use percentages from 23andMe. I hope you find the new feature useful!
*Here’s an example of when X-DNA is unequivocally confounding to relationship prediction: Let’s say that there’s a 1st cousin pair, one of whom is a male and the other a female. Orogen predictions already take into account whether a 1st cousin pair is on your father’s brother’s side (paternal paternal), father’s sister’s side (paternal maternal), mother’s brother’s side (maternal paternal), or mother’s sister’s side (maternal maternal).
Currently, because the results are the same for autosomal DNA (atDNA), the paternal maternal and maternal paternal results are lumped together in Orogen predictions. Even if you have the ability to select that one match is a male and one is a female, that still isn’t enough information to determine if X-DNA could even be shared through the known relationship. If the match is a paternal maternal 1st cousin, you’re a female, and the 1st cousin is a male, then you have the opportunity to share about 182 cM of X-DNA in addition to an average of 12.5% atDNA. But if you’re a male and the match is still a paternal maternal 1st cousin, then there’s no opportunity for you to share X-DNA, as your father couldn’t pass X-DNA to a male. In this case, you wouldn’t want to enter your amount of shared DNA into the percentage input box.
If you’re a male who shares DNA with any relative who couldn’t share X-DNA with you based on known inheritance paths, it’s best to multiply your percentage value by 72.57 to get an approximate cM value and use that in an atDNA only input box.
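As a quick illustration of that conversion, here is a minimal Python sketch. The factor of 72.57 is the one given above; the function name is hypothetical and the result is only an approximation.

```python
CM_PER_PERCENT = 72.57  # conversion factor quoted in the text


def percent_to_cm(percent: float) -> float:
    """Approximate cM value for a given shared-DNA percentage."""
    return percent * CM_PER_PERCENT


# Example: a match reported at 12.5% shared DNA.
print(round(percent_to_cm(12.5), 1))  # 907.1
```

The resulting value is the one to enter into an atDNA only cM input box.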
The tools at DNA-Sci are the only relationship predictors to show the differences between paternal and maternal relatives. So X-DNA is still unequivocally confounding in all other predictors. It appears that, in order for X-DNA to not be confounding, perhaps I should separate out the paternal maternal and maternal paternal relationships from each other. But, even then, X-DNA will still be confounding to 1st cousins once removed, half-first cousins, and all other relationships from that group or more distant because, not only is it too difficult to separate distant relationships into paternal/maternal paths, but the results become too difficult to read and the returns diminish exponentially.
The common understanding has been that X-DNA can survive more generations than atDNA. Few people would argue with that, but some disregard the significance of X-DNA’s fewer opportunities for recombination. They say that, although the X Chromosome only recombines half as many times on alternating male-female lines, those account for only a small percentage of relationships. And so only a misunderstanding would result in thinking that X-DNA is more distant, on average, than autosomal DNA at a given cM value.
A good response is that only a small percentage of our matches will have an exclusively matrilineal relationship from both matches to their most recent common ancestor (MRCA). The vast majority of our X-DNA matches will have some combination of intermediate ancestors other than male-female alternating or matrilineal. So, while thinking that many X-DNA matches come from alternating male-female lines would be a misunderstanding, it’s also a misunderstanding to think that most X-DNA matches come from matrilineal paths.
The majority of our X-DNA matches will be on segments that have had fewer opportunities for recombination than atDNA: somewhere between half as many opportunities and the same number. So, on average, X-DNA will have had only about three-quarters of the opportunities for recombination.
Will that cause enough of an effect for X-DNA to be, on average, more distant than a segment of the same cM on any other chromosome? The answer is “yes” according to some initial testing that I’ve done. X-DNA appears to beat out even the largest chromosome (Chromosome 1) in its ability to survive many generations, despite having only about two-thirds the cMs. But that doesn’t tell us whether or not X-DNA is from a more distant MRCA, on average, than atDNA. The answer to that appears to be “no” according to my testing. The total amount of atDNA is so much larger than the X Chromosome (the latter has only about 5% of the cM), that X-DNA just can’t compete with the ability of atDNA to survive through many generations.
With the feature released today and more to come soon, X-DNA is far less confounding to relationship prediction. But we still don’t know if it will ever be ideal to include it. It may be that including the most variable chromosome makes predictions less accurate than they would be otherwise. It would take a lot of testing to find the answer to that question. But I’m beginning to think that adding X-DNA would always be a good idea if it were feasible to list all of the ancestral paths for all cousin types.
Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.