A formula to improve your ability to estimate the genetic relationship between two parents
GEDmatch has had a tool called “Are Your Parents Related” (AYPR) for a few years now. The tool is helpful and easy to use. Some have argued that this information shouldn’t be available to users. However, if the tool finds shared segments between the parents, a link appears to direct people to a genetic counselor. People using the tool have asked if there’s a way to see if any of your ancestors farther back were related. That isn’t so easy because there’s nothing built into your DNA that can determine that. However, shared DNA between cousins who share those ancestors will be inflated, and I do have another way to calculate very accurate probabilities for hypothetical relationships between the ancestor pair.
As far as checking if your parents are related, your genome does contain the information necessary. You probably already know that, for each of your 22 autosomal chromosomes, you have one copy that’s fully from your mother and one that’s fully from your father. All one has to do is check to see if the two copies are the same for a significantly long stretch of DNA, which would mean that your parents are somewhat related. That may be a disconcerting idea to you, but if the matching segments are small, their common ancestors are probably very far back and it’s highly unlikely that you have anything to worry about.
Kitty Cooper has already blogged about a way to improve the AYPR tool and perhaps the method is attributable to her. She includes an informational brochure in her blog, so if you’re here because you’re concerned about what the AYPR tool is telling you and you want to talk to a counselor, please visit her page and read the brochure.
Ms. Cooper’s idea is brilliant. She’s providing a service to those in need, and I must say that the AYPR tool wouldn’t be very useful without her rule-of-thumb. Below, I’ll show how much it improves the accuracy of AYPR. The tool at GEDmatch will output a number of segments that your parents may have shared or it will output no segments. If there are matches, one of the results it gives you will be the cM of each matching segment and another number will be the total cM of all segments. Ms. Cooper says to then multiply the total cM by four.
She explains why you do this and gives an example to make it easier to understand. There’s no reason for me to provide another example. But here’s the important idea: Consider the probability of getting a single small stretch of DNA from your parents if they both have it. Whatever amount of DNA your parents may share, each has two different copies of a particular stretch of DNA, one of which has to end up on each of your two copies of the applicable chromosome. Each of your copies has a 1/2 chance of getting the copy that the respective parent shares with the other parent. Basic probability dictates that two independent events, each of likelihood 1/2, get multiplied by each other to make 1/4, which is the probability that you have both matches. This effect, spread out over the entire genome, makes it even more likely that your result is near the average of 25%, meaning that’s the most likely fraction that you would get out of the DNA that your parents may share. This is the same reason that full-siblings share, on average, 25% fully-identical regions (FIR) with each other. In fact, before I ever wrote an article on the AYPR tool I was using a sibling FIR statistic to estimate the range of DNA that you could under- or over-estimate after multiplying AYPR results by four: Since that 99% confidence interval is 15.2% to 35.8%, multiplying by four could result in 68% to 143% of the DNA your parents may share. So, while the rule-of-thumb is much better than using the raw output of AYPR, it can still be quite far off the mark.
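The 1/4 figure can be checked by simply listing the possibilities. The sketch below is only an illustration of the reasoning above: the labels `shared` and `other` and the function name `estimate_parents_shared_cm` are made up for this example and are not part of GEDmatch.

```python
from fractions import Fraction
from itertools import product

# Each parent carries the shared stretch on one of their two copies.
# The labels are illustrative only.
mother_copies = ["shared", "other"]
father_copies = ["shared", "other"]

# A child receives one copy from each parent, each with equal chance.
outcomes = list(product(mother_copies, father_copies))
both_shared = [o for o in outcomes if o == ("shared", "shared")]
p_both = Fraction(len(both_shared), len(outcomes))
print(p_both)  # 1/4


def estimate_parents_shared_cm(aypr_total_cm: float) -> float:
    """Rule of thumb: scale the AYPR total by 1 / (1/4) = 4."""
    return aypr_total_cm / float(p_both)


# 50.0 is a made-up AYPR total, not a real result.
print(estimate_parents_shared_cm(50.0))  # 200.0
```

This gives only the average; any one person's result can land well above or below it.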
Room for improvement
25% isn’t a very good chance of finding something. How can you increase your odds? That’s where my calculation comes in. As others have realized, you could have your siblings upload kits to GEDmatch if you have any who have had their DNA genotyped. For parents with very close genetic relationships, the prospects of there being any siblings to the kit owner may be very low. However, we know that people are already attempting to do this.
People have used the AYPR tool for themselves, recorded the results, and then used it for one or more siblings. The next question is inevitably going to be, “Do I add up the total cM from both kits and then multiply by four?” The answer is that you definitely should not. Read on for information regarding the dangers of using even a slightly incorrect multiplier. You and another sibling together have much more than 25% of the DNA that your parents share, if any, so multiplying the cM by four would result in two hypothetical people who are related much more than your parents are related. Another idea is to average the values and then multiply them by four. This is better than adding and then multiplying by four, but it still isn’t ideal.
The first thing to do is get rid of any duplicate segments. Hopefully none are partially overlapping. If so, you’ll have to figure out how many cM are on one side of the overlap and how many are on the other, and then add the two sides together to make one segment. The next step is to add up all of the distinct segments only. Do not add any cM from the same segments twice or add the cM they give you for segments that overlap. Now what are we going to multiply the total cM by?
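For those who would rather not do this bookkeeping by hand, here is a minimal sketch of the de-duplication step. It assumes each segment has already been written down as a hypothetical `(chromosome, start_cm, end_cm)` tuple with endpoints in cM on a genetic map. That is an assumption for illustration, since you may first have to convert the positions the tool reports. The function name `distinct_total_cm` is likewise made up.

```python
from collections import defaultdict


def distinct_total_cm(segments):
    """Sum the length of the union of segments, counting overlaps once.

    `segments` is an iterable of (chromosome, start_cm, end_cm) tuples.
    Assumption: endpoints are already expressed in cM on a genetic map.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in segments:
        by_chrom[chrom].append((start, end))

    total = 0.0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Duplicate or overlapping segment: extend the current run.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current run and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Made-up example values, not real AYPR output.
you = [(3, 10.0, 25.0), (7, 40.0, 52.0)]
sibling = [(3, 20.0, 32.0), (7, 40.0, 52.0), (11, 5.0, 14.0)]
print(distinct_total_cm(you + sibling))  # 43.0
```

In this example the duplicate on chromosome 7 is counted once and the partial overlap on chromosome 3 is merged into one 22 cM segment.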
For unrelated parents, a child has exactly 50% of their DNA. When an additional sibling gets their DNA genotyped, the combined kits reproduce even more of a parent’s DNA. You may recall that I wrote an article over two years ago that discussed this and calculated many results that you’d get for the combination of various relatives and ancestors. I even released a calculator that not only computes the averages for you, but the ranges of expected DNA, with probabilities. Calculating DNA coverage by testing additional relatives is a topic that seems to be growing in popularity. The important lesson to take home is that, with the addition of each relative, you add half of what you added from the previous relative. So, for your DNA plus that of three siblings, the result would be 50% + 25% + 12.5% + 6.25% = 93.75% of a parent’s DNA, on average. We’re going to use a similar concept for the addition of kits with some value of runs of homozygosity (ROH).
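The halving pattern is easy to reproduce. This short sketch covers averages only, not the ranges that the calculator provides, and the function name `parent_coverage` is made up for illustration.

```python
def parent_coverage(n_children: int) -> float:
    """Average fraction of one parent's DNA found across n children."""
    # Each added child contributes half of what the previous one added.
    return sum(0.5 ** k for k in range(1, n_children + 1))


for n in range(1, 5):
    print(n, f"{parent_coverage(n):.2%}")
```

The last line printed, for four children in total, is the 93.75% from the example above.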
So, what’s the probability that you and one sibling have, somewhere between the two of you, any segments that your parents both share? Let’s take a case in which two parents are full-siblings to each other. It’s known that their children will share 62.5% (5/8) DNA with either parent, on average. It’s also known that a child will have 1/4 ROH, on average. But there’s something different about FIR and ROH values. They aren’t converted to cM in the way that HIR values can be multiplied by 68 or 72. That’s because, in getting the 25% value, the numerator was the total amount of ROH and the denominator was the amount of DNA in one copy of each chromosome. HIR values are obtained by dividing the amount of shared DNA by twice that value, or the amount of DNA in both copies (a whole genome). So, to convert a ROH value to cM, we’d have to divide it by two one more time. In that sense, children in this example would have 12.5% (1/8) ROH. Since we know that the AYPR tool is giving us a result in cM, we want to use 1/8.
The mathematical series that determines how much ROH the siblings have combined is in the top row of the table below. The second row shows the running total, or the fraction of the shared DNA between parents, on average, that children have with the addition of each sibling. The first fraction in each term of the top row comes from the previous value in the second row.
| Term added | 1/8 + | (1/8)*(5/8) + | (13/64)*(1/4) + | … |
| Running total | 1/8 | 13/64 | 65/256 | … |
Knowing the value for two kits, a person might want to invert the fraction 13/64 and multiply it by the results of the AYPR tool. That would be quite the mistake. The parents in this scenario likely share about 50% DNA. That means we want to multiply 13/64 by a number that results in 1/2. That number is 32/13. This is the number to multiply by the total cM of distinct segments in the results of two kits from full-siblings.
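The arithmetic can be checked with exact fractions. The sketch below only adds up the terms of the series as given above and divides the parents’ expected 1/2 by each running total; the variable names are made up for illustration.

```python
from fractions import Fraction as F

# Terms of the series as given above, for parents who are full-siblings.
terms = [F(1, 8), F(1, 8) * F(5, 8), F(13, 64) * F(1, 4)]
parents_shared = F(1, 2)  # full-siblings share about 50%

running_total = F(0)
multipliers = []
for n_kits, term in enumerate(terms, start=1):
    running_total += term
    # The multiple is whatever turns the running total into 1/2.
    multipliers.append(parents_shared / running_total)
    print(n_kits, running_total, multipliers[-1])
```

For one kit this gives the familiar multiple of four, and for two kits it gives 32/13, or about 2.46.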
The table below shows how much to multiply by for different numbers of siblings, including yourself.
| # of Children | Multiply by … |
An interesting thing worth noting is that testing an unlimited number of full-siblings and plugging the result into the AYPR tool wouldn’t result in an average near 100% coverage of the parents’ genomes. Since the parents are likely to have 25% NIR, the best average you could get across a population, even with unlimited children’s kits, would be 75%. For a value that converts to cM, that’s 37.5% (3/8) of the whole genome. That means that the lower bound on the multiple from an unlimited number of children’s kits would be (8/3) * (1/2) = 4/3. Each multiple after the addition of a new sibling will get lower, starting at four and trending towards 4/3.
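One way to see the trend towards 4/3 is a per-locus calculation. The sketch below is not the model used for this article. It assumes that parents who are full-siblings are 25% FIR, 50% HIR and 25% NIR to each other, and that a child is homozygous by descent with probability 1/2 at a FIR location and 1/4 at a HIR location. The function names are made up for illustration.

```python
from fractions import Fraction as F

# Assumed sharing between parents who are full-siblings.
FIR, HIR = F(1, 4), F(1, 2)  # the remaining 1/4 is NIR


def combined_roh(n_children: int) -> F:
    """Average combined ROH across n children, as a cM-convertible fraction."""
    # A location is found if at least one child is homozygous there.
    found = (FIR * (1 - F(1, 2) ** n_children)
             + HIR * (1 - F(3, 4) ** n_children))
    return found / 2  # halve to convert to a whole-genome fraction


def multiplier(n_children: int) -> F:
    """Multiple that scales combined ROH up to the parents' expected 1/2."""
    return F(1, 2) / combined_roh(n_children)


for n in (1, 2, 3, 10, 50):
    print(n, float(multiplier(n)))
```

Under these assumptions the first two multiples agree with the values above (4 and 32/13), and the later ones fall towards 4/3 without ever reaching it.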
Also of note is that this process could be made even better. If the direct-to-consumer genotyping companies could start separating the two chromosome copies, we could find even more of the DNA two parents might share when we check two siblings. In addition to using the AYPR tool, you’d have to compare your paternal homologues to your sibling’s maternal copies, as well as your maternal to your sibling’s paternal. You would then add the total cM of these segments to the result of the AYPR tool, excluding duplicate segments or parts thereof.
The companies would be able to figure out which copy is maternal and which is paternal pretty easily if you had some of your relatives labeled as such. The big company, which is missing the most important tool for genetic genealogy as a hobby—the chromosome browser—would likely be very quick to automate all of this for you. Hopefully the other companies would, too, although we’d always have the option of manually labeling our chromosome copies.
Separating the two homologues is totally possible. Rather than smashing the chromosomes into little bits before reading them, long-read sequencing (LRS) has been successfully used for over a decade. This would allow for telling the difference between the two chromosome copies. It will likely be only a few years before LRS can be done at prices comparable to what we pay for today’s genotyping.
As genetic genealogists, I don’t think we’ll be disappointed by the future of sequencing.
Checking our work:
How much more accurate are the results when we use the results of a sibling in addition to our own? I noted above that the method of multiplying by four is going to result in anywhere from 68% to 143% of the actual DNA shared by the parents. That comes from the 99% confidence intervals (15.2-35.8%) for shared fully-identical regions (FIR) between siblings, which is centered on 25%. The number of runs of homozygosity (ROH, or amount of a genome that’s identical over both chromosome copies) in a child of two siblings is the same as the FIR between siblings. ROH values such as that and others will be published here. Rather than getting the range of undershooting/overshooting for two siblings’ kits and comparing it to the 68-143% range from one kit, there’s a way to more accurately describe how much the improvement will be, on average, when adding a sibling’s kit.
I already have a model that gets all of the averages right when comparing relatives or ancestors or when combining kits. I can use that model to test the numbers in the table above. I could simulate into existence two full-siblings with their own genomes. They could then have a child together. Checking the amount of ROH in the child, it should be close to 25%, which can then be multiplied by four. Then I could find the difference between that result and the amount of DNA that the parents already shared with each other (HIR only + FIR method, which averages 50% for full-siblings). I would do this 500k times. The difference between the two values (always multiplying the child’s ROH by four before taking the difference) would tell me how far off, on average, this method is from the actual shared DNA between the parents. I would also be able to check values above and below four to make sure that that’s the best multiple to use.
Then I could do the same test with two kits, with the only differences being that I’d only count distinct segments from the AYPR tool and that I’d multiply the result by 32/13 instead of four. The average difference between this value and the actual shared DNA between parents each time should be less than it was using one kit. I could also check values above and below 32/13 to see if they work better.
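To give a feel for this kind of test, here is a heavily simplified sketch. It is not the model described above: it treats the genome as a few hundred independent locations with no linkage or recombination, so the averages are meaningful but the spread is not realistic, and it runs far fewer than 500k trials. The function name `simulate` and all parameter values are made up for illustration.

```python
import random


def simulate(n_children, multiple, n_loci=200, n_trials=500, seed=1):
    """Mean absolute difference between `multiple * combined ROH` and the
    parents' actual sharing, for parents who are full-siblings.

    Simplifying assumption: locations are independent (no linkage).
    """
    rng = random.Random(seed)
    total_error = 0.0
    for _trial in range(n_trials):
        shared = 0.0  # parents' shared DNA, as a whole-genome fraction
        roh_loci = 0  # locations homozygous in at least one child
        for _locus in range(n_loci):
            # Grandfather carries alleles 0/1, grandmother carries 2/3.
            parent_a = (rng.choice((0, 1)), rng.choice((2, 3)))
            parent_b = (rng.choice((0, 1)), rng.choice((2, 3)))
            matches = ((parent_a[0] == parent_b[0])
                       + (parent_a[1] == parent_b[1]))
            shared += matches / 2  # HIR counts half, FIR counts fully
            for _child in range(n_children):
                if rng.choice(parent_a) == rng.choice(parent_b):
                    roh_loci += 1
                    break  # count each location once across the children
        shared /= n_loci
        combined_roh = roh_loci / n_loci / 2  # cM-convertible value
        total_error += abs(multiple * combined_roh - shared)
    return total_error / n_trials


err_one_kit = simulate(1, 4)
err_two_kits = simulate(2, 32 / 13)
print(err_one_kit, err_two_kits)
```

Passing other multiples to `simulate`, such as values a little above or below 32/13, is the toy equivalent of checking whether a nearby multiple works better.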
Here are the results of those comparisons:
The table above shows that accuracy improves with each method above and with the addition of each tested sibling. It’s a vindication for the rule-of-thumb that says to multiply by four and also for the math that shows what fraction to use for additional siblings who have tested. I’ve also tested multiples that are slightly larger and slightly smaller than each of the multiples above. That has assured me that those multiples are the best ones to use.
But we should also check to see what happens when the parents share a more distant genetic relationship. A good example would be parents who are first cousins to each other. In this case, the first child would only have 3.125% ROH, on average, rather than 25%; and the average combined ROH for an unlimited number of children would be 12.5%, versus 37.5% for parents who are siblings.
The math still shows that four is the best multiple to use for one child of first cousins. However, multiplying the combined ROH between siblings by 3.6 results in less of a difference, on average, between that and the shared DNA between parents. I will calculate the best fractions to use for children of first cousins and test those values afterwards.
Why would the theoretical value be four, while the best value, on average, is 3.6? That’s probably because I’m calculating the average difference between combined ROH and shared DNA between the parents. When we have averages that are close to 0% or 100%, the values approaching 0% or 100% are squished, while the values on the other side have room to spread out. This is called a long-tailed distribution and it’s what happens when the average ROH in a child of first cousins is as low as 3.125%. If I were calculating the median difference, the theoretical value should be the same as the best value in practice. But, since I’m using the average, the results are affected greatly by a few of the farthest data points from the mean, i.e. values much higher than 3.125%, while the shared DNA between the parents is still near 12.5%, on average. We still want to use the average, as that’s the expected amount by which you’ll be off the mark when using the tool. And someone who has a much higher than average ROH for the child of first cousins doesn’t know that they have a higher ROH than average. So we should use the best multiple in practice, which is about 3.6.
This is somewhat disappointing because the whole purpose of these methods is to discover what the relationship is between parents, so it would be impossible to find the best multiple to use based on what the relationship is. It would be difficult for people to use the multiple of four, check to see what the likely relationship is, and then re-calculate using something like 3.6 if it appears that their parents are first cousins. It would be far easier for people to just check the expected ROH values that I’ve already calculated. There will be some overlap between scenarios, but there will also be cases in which a person’s ROH value from the AYPR tool is in a distinct category.
If they have additional siblings tested, that would be a good opportunity to find the right multiple to use, in practice, so that they can benefit from the additional coverage from having more siblings tested. Then again, I could also provide charts showing the coverage of two or more siblings and people could just check the charts for those. It seems far easier than checking a lot of different values of multiples.
I hope you’ve found this information helpful.
Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.