How to automatically identify your most important DNA matches
I recently had this idea for a project in Python that could save me a lot of work: What if a computer program could identify your most important DNA relatives? I was sure that the results wouldn’t be perfect, but I wanted to see how close I could get, and maybe get a useful tool in the process.
In theory, if you have enough DNA relatives, you could list the start and end points of each of your segments on a particular chromosome, laying each out in a sequential manner with very little or no gaps in between any of them¹. Then you could do the same for each of your other chromosomes. Let’s say that all of your ancestors have been in the U.S. for several generations and that the families of your ancestors were typically large. It’s then fairly likely that you share each of your DNA segments with at least one other person on a major DNA matching website.
I should specify what I mean by “most important” DNA relatives. In this case, I’m not talking about your closest relatives as measured by total percentage or total centiMorgans (cM), although you would certainly find those people somewhere in my list of most important DNA matches. Rather, the important matches in this case are the ones who match with you, don’t overlap with each other, and in aggregate cover almost every stretch of your DNA. One of your important DNA matches might be a person who shares 8 cM with you on only one segment. You might have a very close relative who also matches with you on that segment, but less than 8 cM. For the purposes of this project, the close relative isn’t as important of a match on that particular segment. That very close relative would likely be your most important match on a different part of your genome, or several different segments. Describing the most important matches in this way allows us to find the segments that we inherited from our ancestors.
I set out to make an algorithm that could not only produce a good approximation of the start and end points of segments for one homologue (say, paternal) of each chromosome, but could optionally attempt to do so for the other homologues (maternal). The output would be two lists of starting points, end points, and match names. If one of your parents has had their DNA genotyped and you use your own phased kit to substitute for the other parent, then you would run just the first phase of the program. The result in that case would be one list that consists of matches from one of your parents—the one for whom you don’t have sequenced DNA.
My primary goal for this program was to get the segment start and end points right. It would be nice to know, secondarily, that a large portion of the segments in one of the output lists from Phase 2 belong particular homologue, but there is no way to identify maternal and paternal homologues based on the information that DNA-testing companies can currently give you. Luckily for me, I was able to use my mother’s DNA relatives for the input data and check the results against both of my maternal grandparents’ matches.
The program is actually quite fast. Even while taking in far more data than what’s necessary and probably not being very well optimized, it only takes a few seconds to run the code for two phases.
When running the program for both phases, it would be tempting to think that the output from Phase 1 was for one parent, say your father, and that the output for Phase 2 must then be your matches on your mother’s side. However, even if the program could perfectly split matches into paternal and maternal sides, this would only work on an individual chromosome basis. For example, if you found that Phase 1 showed you all of your largest paternal segments on Chromosome 1 and Phase 2 showed you all of your largest maternal segments on that chromosome, it could very well be that Phase 1 shows you your maternal matches for Chromosome 2 and your paternal matches show in the output for Phase 2. Just like all DNA genotyping platforms currently can’t tell you which homologues are paternal and maternal for each chromosome, this program won’t be able to do that either. It would require a bit of investigation to find out whether the Phase 1 output for a particular chromosome was paternal or maternal.
That setback aside, it’s easy to see how this tool would still be useful. If I figured out an algorithm that correctly lined up all of your DNA segments in order, and then did it a second time with different, unrelated segments, and all you had to do was figure out which segments matched with relatives on your father’s side and which on your mother’s side, that would still be a much quicker way to identify your important DNA relatives than looking through all of your matches manually.
There’s one thing that I didn’t want to see in the results, but I was sure would happen. Let’s say that you’re comparing the resulting output lists from Phase 2, particularly two matches over comparable start and end points. If these two people are related to each other and match on that same segment, then they don’t belong on two different lists. For me, the purpose of this program is to supplement a text file that I’ve already filled with very detailed notes. That file can list multiple matches under a segment start and end point, starting with the largest match. The start points are often earlier and the end points often stretch farther than those of the largest match. So the solution to the above problem, for me at least, will be to combine the mutual matches into one segment that may be longer. This is like moving one of the matches from the list that it doesn’t belong on and moving it to the appropriate homologue, whether maternal or paternal, leaving a large gap in the list on which the program originally placed it. If you want to find your largest match for that segment on that side, say maternal, you’ll have to look at other DNA matches in that part of your chromosome and check if they match you on your maternal side. This can be done pretty easily by sorting your original MyHeritage matches first by chromosome, then by starting point, and then by cM. Look at the largest cM matches over that stretch of DNA and keep going down the list until you find a maternal match. When this problem occurs, it still saves an immense amount of time to first run the algorithm and get the approximation seen in the output. That is, running the program is a great start to your research.
It’s even possible to add code for additional phases beyond just two. If the second phase doesn’t fill in gaps created by combining related, adjacent segments, then a third phase could be run. Just remember that there would be a lot more manual combining work to be done after running the program with three phases — at least one of the three overlapping segments for the three phases would be a duplicate segment, i.e. belong with another one on the same homologue, for the entire length of the chromosome. When one’s DNA relatives are predominately on the side of just one parent, this will be a big limitation — each phase of the code would produce a lot of segments that belong to the predominant parent, and combining them would repeatedly leave empty segments for the parent who has few tested DNA relatives.
I have found the code to be quite useful after initial testing. Despite having extensive notes for multiple DNA kits showing which segments I believe are paternal and which are maternal for each kit, sometimes with very good certainty, the output of the program has shown me many DNA matches whom I hadn’t noticed before — whether they were recently added to the database or I hadn’t checked them yet.
The best way I can think of to evaluate the Phase 2 results would be to check the fraction of all chromosomes, by length, that has coverage for both parents. A consolation metric would be the fraction that is covered by one parent. I would prefer to calculate this measurement by SNPs, but there will be a lot of segments with partial overlap. There are online calculators that can give you a number of cM or SNPs based on a start and end point, but it would be very time consuming to calculate all of the overlaps in that way. I have not yet attempted to evaluate the result, but below the code for the program I will post the results for my mother’s DNA relatives along with the known homologue for each match.
One way in which this tool could be improved would be to allow for an ‘exclude file’ with certain segments listed. For example, if the second phase of the algorithm had for some reason made a poor choice in picking the first (largest) segment, that segment could be listed in a text file. The algorithm would then read in that segments and pick the next largest segment that wasn’t listed in the ‘exclude file.’ Perhaps even better would be to add the ability to read in ‘include files’ of known paternal and maternal segments. The first phase of the program could start with the, say, maternal segments, and then fill in the remaining gaps with the best fitting segments. Then the second phase would read in the paternal segments and follow the same procedure.
One obvious concern I have is that a small percentage of the population knows how to run a python program on their personal computers and probably an even smaller percentage of DNA-tested genealogy enthusiasts. I can’t do much more than share a link to a good guide for downloading it on your computer: https://realpython.com/installing-python/. I personally either use a Jupyter notebook after having installed Anaconda or I make .py files in Atom and then run them from a terminal window.
Parents must be removed from input data sets before running the program. Also, any other known relatives should be removed if you’re hoping that this program will help you find relatives farther back on that same line. But I would recommend leaving in those relatives at least the first time you try it.
The source code for the algorithm can be found here:
Also, as of 28 Dec. 2020, I have now added a nearly identical code for use with segment-level data from GEDmatch:
These programs have saved me hundreds of hours or more of work. I usually investigate as many segments as possible for multiple kits. Running these programs gives me huge head start by laying out for me all of the largest segments for a kit, plus pretty good coverage for the other parent’s homologues. This allows me to find relatives of much farther back ancestors than I could do by only exploring matches with the largest total percentage or total cM of shared DNA.
This article was originally posted on Medium, on 10 May 2019. Cover photo by Simon Abrams. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. To see my articles on Medium, click here. And try out a nifty calculator that’s based on the first of my three genetic models. It lets you find the amount of an ancestor’s DNA you have when combined with various relatives. And most importantly, check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match known standard deviations.