Mathematical Formulas for Genetic Genealogy

This post won’t be everyone’s cup of tea. But I’ve used these formulas countless times in programming in order to automate processes, which saves me a lot of time and work and prevents errors.

Ethnicity

The first time I ever wrote a formula for genetics was in 2015. It’s a very simple one and it’s the only one here that I don’t use just about daily to make new discoveries in genetic genealogy. But ethnicity estimations are very popular. So I’ll start out with the one that’s most likely to be used by others.

I’ve previously published the above formula here, along with my derivation for the formula and a whole lot of caveats about the limitations of its use and of ethnicity estimations in general. What this formula shows is that the number of generations back to an ancestor (g) can be found by taking two very simple natural logs (ln) and dividing one by the other. The term perc. is simply the percentage of a given ethnicity that was reported to you from a DNA testing site. Please note that the result is only an average and that the true number of generations will vary quite a bit.
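As a rough sketch of the idea in Python: if we assume that an ancestor’s expected contribution halves with every generation, then perc/100 = (1/2)^g, and solving for g gives a ratio of two natural logs. The function name generations_back is made up for this example, and the exact form of the published formula may differ.

```python
import math

def generations_back(perc):
    """Average number of generations back to a single ancestor of a given ethnicity.

    Assumption for this sketch: the expected share halves each generation,
    i.e. perc / 100 = 0.5 ** g, so g = ln(perc / 100) / ln(0.5).
    """
    if not 0 < perc <= 100:
        raise ValueError("perc must be greater than 0 and at most 100")
    return math.log(perc / 100) / math.log(0.5)

# 25% points to roughly two generations back (a grandparent), on average.
print(generations_back(25))
# 3% points to a little over five generations back, on average.
print(round(generations_back(3), 2))
```

As with the formula itself, the output is only an average; small percentages in particular can come from a wide range of generations.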

Different Types of Cousins Based on Gender Paths

Many will have heard by now that males and females have different recombination rates on average. The difference is significant enough that paternal relatives have noticeably wider ranges of shared DNA because the average maternal recombination rate is about 1.7x higher. So, depending on the sex of the ancestors who separate you and a cousin in your family tree, you have different kinds of cousins. And there’s a larger number of possible paths through your tree for more distant cousins. How many of these different gender paths are possible? The formula below can calculate that.

Here the term g refers to the number of generations back to the shared ancestor pair for nth cousins, where n = g – 1. The same formula is written two different ways. With a slight addition to that, we can get a formula for the number of different gender paths for nth cousins and a certain number of times removed (xrem).

Number of nth cousins x times removed gender paths

When a cousin is the same generation as you (xrem = 0), the formula still works, so the above formula can be used in place of the previous one.
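Here is a minimal Python sketch of that count. It assumes that every person on the path between the two cousins and the shared ancestor pair adds one paternal-or-maternal choice, which gives 2(g – 1) choices for cousins of the same generation and xrem more for removed cousins, with g measured on the shorter side. That reading is an assumption of this sketch, although it does reproduce the four gender paths for 1st cousins. The function name gender_paths is hypothetical.

```python
def gender_paths(g, xrem=0):
    """Number of distinct gender paths for nth cousins (n = g - 1), xrem times removed.

    Assumption: one binary (paternal/maternal) choice per person on the path,
    so the count is 2 ** (2 * (g - 1) + xrem).
    """
    if g < 2 or xrem < 0:
        raise ValueError("need g >= 2 and xrem >= 0")
    return 2 ** (2 * (g - 1) + xrem)

for g in (2, 3, 4):
    print(f"{g - 1}C: {gender_paths(g)} paths, once removed: {gender_paths(g, 1)} paths")
```

With xrem left at 0 the function returns the same value as the same-generation case, so one function covers both.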

Labels

Because of the different types of cousins, I’ve had to figure out a way to distinguish between and keep track of the gender paths. I developed a convention for this a few years ago whereby “P” denotes paternal relationships and “M” denotes maternal relationships. For example, in my programming and sometimes in statistics I report, an “MM 1C” refers to a maternal maternal 1st cousin, i.e. your mother’s sister’s child. A “PM 1C” is a paternal maternal 1st cousin, i.e. your father’s sister’s child. I even have formulas and algorithms for assigning those labels to the appropriate data. The formulas below show how many letter places there will be in the label for cousins of any genetic distance.

For cousins of any genetic distance, not removed.

How many letter places to label a cousin of any genetic distance and gender path

For cousins of any genetic distance, x times removed (xrem).

How many letter places to label a cousin of any genetic distance x times removed and gender path

Well that was simple. And we’ve seen those terms appear in earlier formulas. So for a 1st cousin, who shares an ancestor g = 2 generations back from you, the number of possible letter places is 2¹ = 2. Two places for two different letters gives you four possibilities, just as we saw earlier that there are four gender paths for 1st cousins. The four types are “MM,” “MP,” “PM,” and “PP.” The results from the two above equations can be plugged into the following Java algorithm in order to generate the required labels for each cousin type. I use these algorithms to automatically label statistics from disparate cousin types so I don’t have to treat them with separate analyses or try to keep a count of them sequentially to know which is which.

The combos() method takes in the letters you want to use for labels as the first argument and the number of letter places that will be in each label as the second argument. I also wrote a version of these algorithms in Python, shown below.
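A minimal Python sketch of a combos() function along these lines might look like the following. It is not the original implementation: it leans on itertools.product from the standard library, and label_places() is a hypothetical helper that assumes a label needs 2(g – 1) + xrem letter places.

```python
from itertools import product

def label_places(g, xrem=0):
    """Letter places in a label (assumed here to be 2 * (g - 1) + xrem)."""
    return 2 * (g - 1) + xrem

def combos(letters, places):
    """Return every label of length `places` built from `letters`, in order."""
    return ["".join(p) for p in product(letters, repeat=places)]

# 1st cousins (g = 2): two letter places, four labels.
print(combos("MP", label_places(2)))  # ['MM', 'MP', 'PM', 'PP']

# 2nd cousins (g = 3): four letter places, sixteen labels.
print(len(combos("MP", label_places(3))))  # 16
```

Because the labels come out in a fixed order, they can be zipped directly against statistics that were generated in the same order.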

Number of Cousins of Any Genetic Distance

We’ve seen formulas to find how many different cousin types we can have in a given generation. But how many cousins in total will we have per generation? Well, that depends on the birth rate, which varies widely over time periods and geographically. So our equations are going to have to include birth rate in them. But I’m going to make it even simpler. If we assume the rate of surviving into adulthood (SR) stays constant over a given time period and for the locations of interest, the formulas become pretty simple. Below is a formula for the number of cousins you will approximately have for a given genetic distance and survival rate.

Number of cousins in a given generation based on survival rate
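To make the idea concrete, here is a small Python sketch under stated assumptions. It treats SR as the average number of children per couple who survive into adulthood and have families of their own, and it assumes that number is the same in every generation. Under those assumptions, each of your 2^(g – 1) ancestors in generation g – 1 has SR – 1 siblings, and each sibling has SR^(g – 1) descendants in your generation. The function name and this particular form are illustrative and may not match the formula above term for term.

```python
def cousins_in_generation(g, sr):
    """Approximate number of nth cousins (n = g - 1) for a constant survival rate.

    Assumption: sr is the average number of children per couple who reach
    adulthood and reproduce, constant across generations.
    """
    if g < 2:
        raise ValueError("g must be at least 2 (1st cousins)")
    ancestors = 2 ** (g - 1)          # your ancestors in generation g - 1
    siblings_each = sr - 1            # their surviving siblings
    descendants_each = sr ** (g - 1)  # each sibling's descendants in your generation
    return ancestors * siblings_each * descendants_each

for g in (2, 3, 4):
    print(f"{g - 1}C with SR = 3: about {cousins_in_generation(g, 3)}")
```

With SR = 3 this gives about 12 first cousins and 72 second cousins, which shows how quickly the numbers grow with genetic distance.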

Similarly, for the total number of cousins once removed:

Number of cousins once removed in a given generation based on survival rate

where the first addend is for the generation younger than you and the second addend is for the generation older than you.

The X Chromosome

I originally developed the equations and codes we saw earlier in order to generate paternal and maternal labels for my X Chromosome models. I then started using them for autosomal DNA once I developed a model of sex-specific recombination. But there are some formulas that only apply to the X Chromosome. The inheritance of X-DNA is really fascinating. You can see its unique inheritance pattern as well as the only available ranges of shared X-DNA between relatives here.

I first developed a model of X-DNA sharing in early 2020. At that time, I accidentally discovered (and wrote) that the proportion of a woman’s ancestors in a given generation who can transmit DNA to her approaches, with increasing genetic distance, the number 0.618…, i.e. the inverse of the golden ratio. I then looked it up and found out that I wasn’t the first person to discover that. The fraction of X-contributing ancestors to a male approaches the inverse of the golden ratio divided by two. In order to save time and computational resources with a model that was only originally developed in Python (very user-friendly and full of great tools, but pretty slow), I decided to only calculate the amount of X-DNA that you could share with certain ancestors. After all, many of them cannot have contributed DNA to you via the known path in your tree.
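One place the golden ratio shows up can be sketched in a few lines of Python. The only rule used is that a male gets his X chromosome from his mother, while a female gets one from each parent. Counting the ancestors who can contribute X-DNA generation by generation then produces Fibonacci numbers, and the ratio of one generation’s count to the next tends toward 0.618…. The function name is made up for this example, and how this ratio maps onto the proportions described above is an assumption.

```python
def x_ancestor_counts(sex, generations):
    """Number of X-contributing ancestors in generations 1..generations."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    # Start with the person themselves in generation 0.
    males, females = (1, 0) if sex == "male" else (0, 1)
    counts = []
    for _ in range(generations):
        # Each female adds a father and a mother; each male adds only a mother.
        males, females = females, males + females
        counts.append(males + females)
    return counts

print(x_ancestor_counts("female", 6))  # [2, 3, 5, 8, 13, 21]
print(x_ancestor_counts("male", 6))    # [1, 2, 3, 5, 8, 13]

counts = x_ancestor_counts("female", 30)
print(round(counts[-2] / counts[-1], 3))  # 0.618
```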

To understand why I calculated the following formulas, imagine that the ancestors in each generation of your family tree are numbered from 1 to 2^g from left to right. This is the traditional family tree layout wherein the ancestor on your paternal line will always be numbered “1” for each generation. (I’m not particularly fond of this patriarchal convention, but I try to keep things as simple and easy to understand as possible.) Numbering the ancestors this way might remind you of Ahnentafel numbers, which are very useful. Unfortunately, knowing which generation a particular ancestor is from is quite important for my models, and so I don’t use Ahnentafel numbers. Instead, each ancestor has a particular generation and a number from 1 to 2^g.

Using this numbering convention, I saw that the easiest and biggest computation-saving gain could be made by only calculating shared X-DNA for the first possible contributing ancestor from left to right for each generation, and then each subsequent ancestor. The below formulas determine which ancestor that will be.

First ancestor from the left in any generation for a female

The term a_g above refers to the first ancestor from the left in a given generation, g, who could contribute X-DNA to a female. The term a_g-1 is the first ancestor from the next most recent generation. It might help to see the first couple of iterations solved for this recursive formula.

Dad is the first ancestor from the left in the first generation for a female

The first a_g-1 we encounter is just oneself, the first and only person from the left in an ancestral tree for their own generation. (Just go with it.) The term a₁ here refers to a woman’s father. We’ll remember that value (a₁ = 1) to plug into a_g-1 when we calculate a₂.

Paternal grandmother is first ancestor from the left in the second generation for a female

The term a₂ here refers to a woman’s paternal grandmother. It skips the first ancestor from the left when going two generations back. We would plug in a₂ = 2 as a_g-1 if we were to next calculate a₃. Actually, let’s do it.

Paternal maternal great-grandfather is first ancestor from the left in the third generation for a female

The term a₃ here refers to a woman’s paternal maternal great-grandfather, i.e. her father’s mother’s father. It skips the first two ancestors from the left when going three generations back. Now let’s look at the formula for a male.

First ancestor from the left in any generation for a male

The term a_g immediately above refers to the first ancestor from the left in a given generation, g, who could contribute X-DNA to a male. This formula is very similar to the formula for a female and it’s used in the same way, so I won’t go through the same process of calculating the result for this one.
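The recursion for both sexes can be sketched in Python by walking up the leftmost X-contributing line one generation at a time. With ancestors numbered 1 to 2^g from left to right, the parents of ancestor number a sit at 2a – 1 (father) and 2a (mother) in the next generation back. A female’s leftmost X contributor is her father, and a male’s only X contributor is his mother. This is a reconstruction that agrees with the worked values a₁ = 1, a₂ = 2, and a₃ = 3 for a female; the function name is hypothetical and the closed form of the formulas above may be written differently.

```python
def first_x_ancestor(g, sex):
    """Position (1..2**g) of the leftmost X-contributing ancestor in generation g."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    position = 1   # generation 0: the person themselves
    current = sex  # sex of the person currently at `position`
    for _ in range(g):
        if current == "female":
            # Her father is the leftmost contributor: slot 2a - 1.
            position = 2 * position - 1
            current = "male"
        else:
            # A male's X comes only from his mother: slot 2a.
            position = 2 * position
            current = "female"
    return position

print([first_x_ancestor(g, "female") for g in range(1, 6)])  # [1, 2, 3, 6, 11]
print([first_x_ancestor(g, "male") for g in range(1, 6)])    # [2, 3, 6, 11, 22]
```

The male sequence is the female sequence shifted by one generation, which fits the remark that the two formulas are very similar.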

Of course I could gain more computational efficiency by only calculating shared X-DNA for any possible X-contributing ancestor, but the gain wouldn’t have been as large for the extra programming. Anyway, my newest model is written slightly differently in that it even generates ancestors who can’t contribute X-DNA and it actually calculates shared X-DNA between relatives who can’t share it, in part as a check that it returns no shared X-DNA. Also, the newest version is written in Java and handles all of this just fine, conducting a million trial runs in three minutes.
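For the finer-grained filter, a check on any single ancestor could be sketched in Python as follows. It assumes the same left-to-right numbering, reads the path to ancestor k from the binary digits of k – 1 (0 for a father step, 1 for a mother step), and rejects any path in which a male’s father appears. The function name is hypothetical and this is not the code from either model.

```python
def can_contribute_x(g, k, sex):
    """True if ancestor number k (1..2**g) in generation g can pass X-DNA down."""
    if sex not in ("male", "female"):
        raise ValueError("sex must be 'male' or 'female'")
    if not 1 <= k <= 2 ** g:
        raise ValueError("k must be between 1 and 2 ** g")
    current = sex
    # Read the g binary digits of k - 1 from the most significant down.
    for shift in range(g - 1, -1, -1):
        parent_is_mother = ((k - 1) >> shift) & 1 == 1
        if current == "male" and not parent_is_mother:
            return False  # a father never passes his X to a son
        current = "female" if parent_is_mother else "male"
    return True

# Third generation for a female: five of the eight ancestors qualify.
print([k for k in range(1, 9) if can_contribute_x(3, k, "female")])  # [3, 4, 6, 7, 8]
```

The first number in that list matches a₃ = 3 from the worked example for a female.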

That covers most of the mathematical formulas I’ve developed for genetic genealogy and ethnicity estimations. If you’ve made it through this far, cheers! I hope you get a chance to apply these to a problem someday.

Cover photo from bedneyimages. If you had access to the most accurate relationship predictor, would you use it? Feel free to ask a question or leave a comment. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. Or, try a tool that lets you find the amount of an ancestor’s DNA you cover when combining multiple kits. I also have some older articles that are only on Medium.
