Update, 3 March 2020: When I finally thought to share this information, I found out that I wasn’t the only person to have ever tried it. Still, I was encouraged to quickly finish writing up my methodology, as there weren’t any other instructions available. One person who joined the conversation was Dave Vance. He has used Gephi to create some very beautiful cluster images—definitely better than mine. In fact, I realized that I only had one (kind of strange) image saved from years ago, which is the preview photo for this article. Dave Vance has since created a great YouTube video describing his process.
I started using a free application called Gephi to turn my DNA segment-level matches into clusters in early 2017. It didn’t seem as useful once MyHeritage came out with their AutoCluster tool.
However, there are quite a few things you could do with Gephi that I don’t think you can do with any of the automated tools designed solely for DNA matches:
- Change colors for different groups
- Change number of groups to lump matches into
- Make nodes bigger for matches with more cM
- Include centiMorgan (cM) values for any range you choose
- Analyze most matches using different parameters such as closeness, centrality, etc.
- Change the graph type based on preferred view or parameter
GEDmatch is the easiest way to manually collect the input data, but it’s still fairly time-consuming.
This is a fairly rough draft. I’ll try to keep up with adding and changing content. And I’d definitely like to add pictures.
Instructions for Getting Your Input Data
Your matches:
Probably the first thing to know is that you shouldn’t include shared cM values for your parents’ or full siblings’ kits. That said, if a parent has tested, you’d probably want to start from their kit, or from the kit of the farthest-back tested ancestor on the side of the family you’re investigating, rather than from your own. You can also delete any other very close matches if you want to see clusters from farther-back ancestors.
When you go to GEDmatch, you have the option of One-to-Many matching. Click on that and select your kit. You won’t be able to get more than 3,000 matches unless you’re a paid subscriber. Choose the 3,000-match limit, as it will be plenty for our purposes. You can drag your cursor to select the entire table or just the number of matches that you want. After doing so, copy it and paste it into the second row of an empty spreadsheet. I would choose to merge formatting to that of the source document at this point or just before pasting. You can get rid of a lot of columns now. Delete all of them except the first (kit number), second (name), and sixth (total cM, probably labeled ‘F’ in your spreadsheet) columns. This spreadsheet will be your nodes input, but it’s not quite ready yet.
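For anyone who would rather script this trimming step, here is a rough Python sketch. It isn’t part of the spreadsheet workflow described here: the sample rows, the kit numbers, and the `keep_node_columns` helper are all made up for illustration, and the column positions (1, 2, and 6) are simply the ones named above, so check them against the table you actually paste.

```python
import csv
import io

# Made-up rows standing in for a pasted one-to-many table (tab-separated).
# Only the positions matter: column 1 = kit number, column 2 = name,
# column 6 = total cM. The "x" cells are placeholders for unused columns.
PASTED = (
    "A000001\tAlice Example\tx\tx\tx\t212.4\tx\n"
    "B000002\tBob Example\tx\tx\tx\t87.1\tx\n"
    "P000003\tParent Example\tx\tx\tx\t3400.0\tx\n"
)


def keep_node_columns(pasted_text, exclude=()):
    """Return (kit, name, total_cm) tuples, skipping kits listed in `exclude`."""
    kept = []
    for row in csv.reader(io.StringIO(pasted_text), delimiter="\t"):
        if len(row) < 6:
            continue  # skip blank or truncated lines
        try:
            total_cm = float(row[5])
        except ValueError:
            continue  # skip header rows or cells that are not numbers
        kit, name = row[0].strip(), row[1].strip()
        if kit not in exclude:
            kept.append((kit, name, total_cm))
    return kept


# Leave out a parent's kit, as recommended earlier.
matches = keep_node_columns(PASTED, exclude={"P000003"})
print(matches)
```

The `exclude` argument is just one way to drop parents or full siblings; deleting those rows by hand in the spreadsheet does the same job.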
Open a new spreadsheet to make your edges input. Copy and paste your own kit number into the second row of the first column. Now paste it again beneath that, in the third row of the first column. (Don’t drag down the first one to populate the second because it may increment your kit number by 1.) Go back to the nodes spreadsheet you just made. Copy and paste all three of those columns into the second to fourth columns of the edges spreadsheet, beginning in row 2. Now delete the whole names column and let the total cM column move over one column to the left. Type the column labels into the first four columns of the first row: Source, Target, Weight, Type. Under Source, you should currently have your own kit number listed twice; under Target, all of your matches’ kit numbers; under Weight, all of the total cM values. Type “Undirected” into the first cell under Type (row 2). Now select the two cells where your kit number is listed. Drag that selection all the way down to the end of where your 3,000 matches are listed. Don’t bother dragging down the “Undirected” values yet, as we still have more data to put into the spreadsheet.
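The first block of the edges sheet has a very regular shape: one row per match, with your own kit as Source. A minimal Python sketch of that shape follows. The kit numbers and the `edges_from_own_matches` helper are hypothetical, and unlike the spreadsheet steps it fills in the Type value right away instead of leaving it for the end.

```python
MY_KIT = "Z999999"  # hypothetical kit number for the person being analyzed

# Made-up (kit, name, total cM) rows, in the shape of the nodes spreadsheet.
matches = [
    ("A000001", "Alice Example", 212.4),
    ("B000002", "Bob Example", 87.1),
]


def edges_from_own_matches(my_kit, match_rows):
    """One Source/Target/Weight/Type row per match; the names are dropped."""
    return [
        (my_kit, kit, total_cm, "Undirected")
        for kit, _name, total_cm in match_rows
    ]


edges = edges_from_own_matches(MY_KIT, matches)
for row in edges:
    print(row)
```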
Mutual Matches:
The next part will be the most time-consuming. Just remember that it’s far better than what you’d have to do to get mutual match information from any of the major testing companies. And you can include as many mutual matches as you’d like. Try doing just a few and then testing. For good results, though, make sure that you get at least some matches from each of your four pairs of great-grandparents. You may have far fewer relatives on certain sides, so you want to make sure that you get enough mutual matches to include them.
We’re going to be copying and pasting data from GEDmatch, concatenating it to the bottom of the edges spreadsheet. It will be much easier to do this if you open a new spreadsheet to use in an intermediate step.
Open the ‘people who match 1 or 2 of both kits’ tool on GEDmatch. Using your one-to-many results, starting with your closest relatives (again, not your parents), copy and paste one kit number at a time along with your own to find your mutual matches. Also copy that kit number into the first two empty cells of the Source column under the data that are already in your edges spreadsheet. Click ‘enter’ in GEDmatch to get your mutual matches. Drag to select all of the results. Skip your parents, i.e. start dragging from the second row if the first row is one of your parents. Copy those values and paste them into a blank spreadsheet. Merge the formatting to that of the source document. We’re going to delete all but three of the columns. Delete the columns to the right of the ‘Shared’ column, which is for the total cM you share with each match. You’ll know to keep the total cM because it will sometimes have larger values than the largest segment cM column.
Copy and paste the three remaining columns underneath the data in your edges spreadsheet. Select the two cells containing the relative’s kit number and drag them down to the bottom of the mutual match kit numbers, just like you dragged your own kit number to the bottom, next to your own matches. Save your edges spreadsheet and clear all of the cells in your intermediate spreadsheet. Go back to GEDmatch and repeat the process with the next kit number.
In your edges sheet drag the “Undirected” values in the Type column all the way down to the end. Your edges spreadsheet is now complete.
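The mutual-match rounds all follow the same pattern, which can be sketched in Python as below. This is an illustration under assumptions, not a description of what GEDmatch exports: the kit numbers and the `add_mutual_matches` helper are made up, and it assumes the names column is dropped before the rows join the edges sheet, as it was for your own matches, so that only Source, Target, Weight, and Type remain.

```python
def add_mutual_matches(edges, relative_kit, mutual_rows):
    """Append one edge per mutual match, with the relative's kit as Source."""
    for kit, _name, shared_cm in mutual_rows:
        edges.append((relative_kit, kit, shared_cm, "Undirected"))
    return edges


# Edges already built from your own one-to-many matches (made-up values).
edges = [
    ("Z999999", "A000001", 212.4, "Undirected"),
    ("Z999999", "B000002", 87.1, "Undirected"),
]

# Made-up (kit, name, Shared) rows kept after one mutual-match lookup
# run against the relative with kit A000001.
mutual_with_alice = [
    ("B000002", "Bob Example", 87.1),
    ("C000003", "Carol Example", 45.0),
]

add_mutual_matches(edges, "A000001", mutual_with_alice)
for row in edges:
    print(row)
```

Each further relative is one more call to the helper with that relative’s kit number and rows.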
Go back to your nodes spreadsheet. You can now delete all of the total cM values that you used for the edges spreadsheet. In the first row, type into the first two columns “ID” and “Label.” Your nodes spreadsheet is now complete.
Save both spreadsheets as CSV files.
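To make the target format concrete, here is a small Python sketch that produces CSV text with the header rows used in this walkthrough (ID and Label for nodes; Source, Target, Weight, and Type for edges). The rows and the `to_csv_text` helper are made up for illustration.

```python
import csv
import io

# Made-up rows in the shape of the two finished spreadsheets.
nodes = [
    ("A000001", "Alice Example"),
    ("B000002", "Bob Example"),
]
edges = [
    ("Z999999", "A000001", 212.4, "Undirected"),
    ("Z999999", "B000002", 87.1, "Undirected"),
]


def to_csv_text(header, rows):
    """Render a header row plus data rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


nodes_csv = to_csv_text(["ID", "Label"], nodes)
edges_csv = to_csv_text(["Source", "Target", "Weight", "Type"], edges)
print(nodes_csv)
print(edges_csv)
```

To write real files instead, the same `csv.writer` calls work on a file opened with `open("nodes.csv", "w", newline="", encoding="utf-8")`; the file name there is only an example.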
Instructions for Loading Your Data into Gephi
Open Gephi
Select ‘New Project’ in the wizard that pops up or go to File > New Project if there’s no wizard.
Click on the ‘Data Laboratory’ button (it’s the center of three buttons at the top left).
Click on ‘Import Spreadsheet.’ Select the nodes spreadsheet that you already created. Be sure that ‘Comma’ is selected for Separator and that ‘Nodes table’ is selected under ‘Import as:.’ UTF-8 should work as the charset. A preview of your data will be shown below that. It should have GEDmatch kit IDs listed under ‘ID’ and names or nicknames listed under ‘Label.’
Click ‘Next.’ Make sure that both ID and Label are checked and then click ‘Finish.’
The Import report pops up. For Graph Type, click ‘Undirected’ and check ‘Append to existing workspace.’ Click ‘OK.’
You should now see the IDs and labels in the data table listed under the Nodes tab. If you were to click on the Edges tab right now, it would be empty. Click File > Save so you don’t lose your work.
Now import the edges table by clicking ‘Import Spreadsheet’ again. Ensure the separator is still ‘Comma’ and this time select ‘Edges table’ under ‘Import as:.’ You should see Source, Target, Type, and Weight in the preview.
Click ‘Next.’ For Imported columns, all four should be selected. A double precision variable type might be selected under ‘Weight.’ That shouldn’t be necessary. It can be changed to ‘Float’ to conserve computer resources.
The Import report pops up again. Select an undirected graph type again. Be extra careful to check ‘Append to existing workspace’ this time or your edges and nodes won’t exist in the same workspace, which will be pretty useless. I also get the message ‘Parallel edges detected.’ It is far easier to select a merge strategy here that deletes duplicates than to check each row for duplicates during the data entry step. To do that, click ‘More options…’ and change the ‘Edges merge strategy’ field to ‘First’ or ‘Last.’ The duplicated rows should be identical, so it doesn’t matter which of those two selections you make. Click ‘OK.’
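If you’d prefer to remove the duplicates before importing, the idea behind keeping the first row can be sketched in Python. This is an approximation of what a ‘First’ strategy amounts to for undirected data, not Gephi’s own code; the `merge_parallel_edges` helper and the kit numbers are made up. It treats A to B and B to A as the same edge, since the graph is undirected.

```python
def merge_parallel_edges(edges):
    """Keep the first row seen for each unordered pair of kits."""
    seen = set()
    merged = []
    for source, target, weight, edge_type in edges:
        pair = frozenset((source, target))  # order of the two kits is ignored
        if pair in seen:
            continue
        seen.add(pair)
        merged.append((source, target, weight, edge_type))
    return merged


# Made-up edges: the Alice/Bob pair shows up twice, once in each direction.
edges = [
    ("Z999999", "A000001", 212.4, "Undirected"),
    ("A000001", "B000002", 87.1, "Undirected"),
    ("B000002", "A000001", 87.1, "Undirected"),
]

deduplicated = merge_parallel_edges(edges)
print(len(edges), "rows before,", len(deduplicated), "rows after")
```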
All of your data should now be in the data table. Now would be a great time to save your workspace, especially if you haven’t done so yet.
Viewing the Graph
In Gephi, click on the ‘Overview’ button at the top left. You should see a graph, but it won’t be pretty yet.
It’s a good idea to let Gephi get some information about your data now. On the right side-bar, click ‘Statistics.’ I like to run every one of these reports with the default parameters at first, starting at the top, except for the ones under ‘Dynamic,’ which are meant for graphs that change over time and don’t apply here. You can close each report window after it pops up.
Save your work! Gephi sometimes freezes even on very high-powered computers.
Now go to the Appearance tab at the top left. Under ‘Nodes’ and then ‘Partition’ select something like ‘Closeness Centrality’ and then click ‘Apply.’ If you don’t see ‘Partition’ as an option, make sure the farthest left of the four buttons to the right of ‘Nodes’ and ‘Edges’ is selected. Once you hit ‘Apply,’ this will give your graph some color. The nodes tend to start out too big in the graph. While still on the Nodes tab, click on the button with three different sizes of circles. Click ‘Unique’ and a value such as 3 to make all of the nodes the same size. You can also make some nodes larger by selecting ‘Ranking,’ choosing the same parameter you used to color the groups, and selecting a minimum and maximum size.
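For a feel of what sizing nodes by ‘Ranking’ does, here is a rough Python sketch that stretches a metric linearly between a minimum and maximum size. The straight-line scaling is an assumption made for illustration (Gephi’s interpolation can be configured), and the `rank_to_size` helper and the values are made up.

```python
def rank_to_size(values, min_size=3.0, max_size=20.0):
    """Scale each node's metric linearly into the range [min_size, max_size]."""
    lowest, highest = min(values.values()), max(values.values())
    span = highest - lowest
    sizes = {}
    for node, value in values.items():
        if span == 0:
            sizes[node] = min_size  # every node has the same value
        else:
            sizes[node] = min_size + (value - lowest) / span * (max_size - min_size)
    return sizes


# Made-up metric values for three kits.
metric = {"A000001": 0.0, "B000002": 5.0, "C000003": 10.0}
print(rank_to_size(metric))
```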
You can also change the color of the edges in the Appearance tab. Click ‘Edges,’ ‘Ranking,’ and then ‘Weight’ and ‘Apply.’
To display your data in a graph, click on ‘Layout’ at the bottom of the left sidebar. Try ‘ForceAtlas 2’ and then click ‘Run.’ Eventually, you’ll have to click ‘Stop’ when you like how it looks. Try some of the other layouts, too.
It’s much easier to use Gephi with a mouse than a track pad. However, there are reports that, if you never select the center button at the bottom left of the Graph window, you’ll be able to zoom using the slider at the bottom of the page after selecting the drop-down button at the bottom-right of the graph. Make sure you have ‘Global’ selected to zoom in and out for the page. You may also need to change the sizes of your edges and labels, in which case you select the respective button rather than ‘Global.’
Once you have a graph that fits on the screen and you like the way it looks, you can experiment with the number of groups to lump your matches into. This can be done with ‘Modularity’ in the right side-bar under ‘Statistics’ again. When you click ‘Modularity,’ the ‘Modularity settings’ window pops up. When you increase the value for ‘Resolution,’ the number of groups on your graph will decrease. You could try to get your matches down to four groups, which might correspond to your four grandparents if those sides of your family are equally well tested. You have to be very careful with this, though. It may work if all four of your grandparents are from very different populations, but that usually isn’t the case. When Gephi decides whether or not to include a node in a group, it could either leave it out of a major group or lump nodes into a large group that no longer represents the same ancestor or ancestor pair and is therefore fairly meaningless. It’s safer to have more groups, so only reduce the number of groups as far as still makes sense.
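To see why lumping everything into too few groups can be meaningless, it helps to look at the score that modularity-based grouping tries to maximize. The sketch below computes the standard Newman modularity Q for a grouping you supply, on a weighted, undirected edge list. It is only an illustration: it doesn’t reproduce Gephi’s search for groups or its Resolution setting, and the `modularity` helper and the toy data are made up.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Newman modularity Q of a grouping on a weighted, undirected graph."""
    total_weight = sum(weight for _s, _t, weight in edges)
    inside = defaultdict(float)  # edge weight with both ends in the group
    degree = defaultdict(float)  # summed weighted degree of the group's nodes
    for source, target, weight in edges:
        degree[group_of[source]] += weight
        degree[group_of[target]] += weight
        if group_of[source] == group_of[target]:
            inside[group_of[source]] += weight
    return sum(
        inside[group] / total_weight - (degree[group] / (2 * total_weight)) ** 2
        for group in degree
    )


# Two made-up family clusters of three kits each, joined by one weak match.
edges = [
    ("A", "B", 100.0), ("B", "C", 100.0), ("A", "C", 100.0),
    ("D", "E", 100.0), ("E", "F", 100.0), ("D", "F", 100.0),
    ("C", "D", 10.0),
]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {kit: 0 for kit in two_groups}

print("two groups:", round(modularity(edges, two_groups), 3))
print("one group: ", round(modularity(edges, one_group), 3))
```

In this toy example the two-cluster grouping scores well above the single lumped group, which scores zero.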
More than likely, closeness centrality is the metric that would best show the key relative who connects your kit to a particular group. Here’s a list of metrics and how I think they’d perform for DNA matches:
Betweenness centrality could show people who don’t necessarily belong to only one group, so they probably aren’t of much interest here. There are few things more frustrating when analyzing DNA matches than thinking that a really strong match is an important one, only to later find that the person matches on two different branches of the tree and thus isn’t a good predictor of which side a segment came from.
Degree or weighted degree should show the most important person within a group. It’s a local measure, i.e. not necessarily the most important person in the whole population, but I think that a measure of local connectivity is a useful thing to have after finding the population-wide nodes that are most important.
Harmonic closeness appears to essentially be closeness, but scores a little better if it has a high degree. That’s unnecessary, as we will be first using closeness centrality to find the important matches who connect a group to your kit, and then separately using degree to find the important matches within a group.
Other important social network analysis metrics are authority, hub, and page rank*. These are also found in the Gephi output. Hubs and authorities are used for models with directed edges, so they don’t apply here. Page rank is a measure that was originally created to rank web pages in search results, most famously by Google.
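For readers who want to see what a few of these metrics measure, here is a small Python sketch that computes degree, weighted degree, closeness, and harmonic closeness on a toy edge list. It uses textbook definitions based on the number of hops between kits; Gephi’s reports may normalize or weight things differently, so don’t expect the numbers to match its output. The kit labels and helper names are made up.

```python
from collections import deque


def build_adjacency(edges):
    """Map each kit to {neighbor: shared cM} for an undirected edge list."""
    adjacency = {}
    for source, target, weight in edges:
        adjacency.setdefault(source, {})[target] = weight
        adjacency.setdefault(target, {})[source] = weight
    return adjacency


def hop_distances(adjacency, start):
    """Breadth-first search: number of edges from `start` to each reachable kit."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def node_metrics(edges):
    """Degree, weighted degree, closeness, and harmonic closeness per kit."""
    adjacency = build_adjacency(edges)
    metrics = {}
    for node, neighbors in adjacency.items():
        distances = hop_distances(adjacency, node)
        hops = [d for other, d in distances.items() if other != node]
        metrics[node] = {
            "degree": len(neighbors),
            "weighted_degree": sum(neighbors.values()),
            # Closeness: reachable kits divided by the total hops to reach them.
            "closeness": len(hops) / sum(hops) if hops else 0.0,
            # Harmonic closeness: sum of 1/hops over all reachable kits.
            "harmonic": sum(1 / d for d in hops),
        }
    return metrics


# Made-up graph: ME matches A and B, and A also matches C.
edges = [("ME", "A", 100.0), ("ME", "B", 50.0), ("A", "C", 30.0)]
for kit, values in node_metrics(edges).items():
    print(kit, values)
```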
I hope you were able to follow along with the steps I hastily described here. I’d be glad to take comments and suggestions for how I could improve the tutorial. Let me know if you have beautiful graphs of your own clusters that you’d like to display here as examples.
*There doesn’t appear to be eigenvector centrality in the Gephi output. But page rank should be similar to that, where being connected to an important node makes a node more important. I don’t think that’s of interest here. For example, person A with a high closeness centrality may have a close cousin with a slightly lower closeness centrality. I see no reason to give that cousin extra importance. You could use that cousin to find a mutual ancestor for the group if that’s all you’re interested in, but for the purposes of this program (as I’ve said, I want to find the most important people in the population and within a group), I’d rather find the next most important person in that same community who’s on a very different lineage than person A. If this were a metric, it would move centrality in the opposite direction of eigenvector centrality: more like taking the closeness centrality and dividing it by the weight (cM) between a given person and the person with the highest closeness centrality, all within a given modularity class. Something else would need to be done for the third person down as ranked by closeness centrality. Perhaps the closeness centrality of that node would be divided by the average of all weights between it and those with higher closeness centrality. This seems like it could be a useful metric for social network analysis in the future.
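The metric proposed in this footnote can be sketched in Python as follows. This is one possible reading of the idea, not an established measure: within each modularity class, a node’s closeness is divided by the average cM it shares with every higher-ranked node in that class, which for the second-ranked node is just its cM with the top node. Leaving the top node’s closeness unchanged and returning `None` when a node has no edge to any higher-ranked node are both assumptions of the sketch, and the helper name and data are made up.

```python
from collections import defaultdict


def lineage_adjusted_closeness(closeness, group_of, weights):
    """Closeness divided by mean cM shared with higher-ranked group members.

    `weights` maps frozenset({kit_a, kit_b}) to shared cM.
    """
    members = defaultdict(list)
    for node, group in group_of.items():
        members[group].append(node)

    adjusted = {}
    for nodes in members.values():
        ranked = sorted(nodes, key=lambda n: closeness[n], reverse=True)
        for position, node in enumerate(ranked):
            if position == 0:
                adjusted[node] = closeness[node]  # assumption: top node unchanged
                continue
            shared = [
                weights[frozenset((node, higher))]
                for higher in ranked[:position]
                if frozenset((node, higher)) in weights
            ]
            if not shared:
                adjusted[node] = None  # assumption: undefined without an edge
            else:
                mean_cm = sum(shared) / len(shared)
                adjusted[node] = closeness[node] / mean_cm
    return adjusted


# Made-up values: B is a close cousin of top-ranked A; C is only distantly related.
closeness = {"A": 0.8, "B": 0.7, "C": 0.6}
group_of = {"A": 0, "B": 0, "C": 0}
weights = {
    frozenset(("A", "B")): 200.0,
    frozenset(("A", "C")): 20.0,
    frozenset(("B", "C")): 40.0,
}

print(lineage_adjusted_closeness(closeness, group_of, weights))
```

In this toy case C ends up with a higher adjusted score than B, even though B has the higher raw closeness, because C shares far less DNA with the higher-ranked people. The top node’s unadjusted value is on a different scale, so it shouldn’t be compared directly with the others.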
In reading your article, “Auto-Clusters in Gephi Using Data from GEDmatch,” I am confused by one of the instructions in the section on the ‘people who match 1 or 2 of both kits’ tool on GEDmatch.
The confusion is with the last sentence of the following section:
“Click ‘enter’ in GEDmatch to get your mutual matches. Drag to select all of the results. Skip your parents, i.e. start dragging from the second row if the first row is one of your parents. Copy those values and paste them into a blank spreadsheet. Merge the formatting to that of the source document. We’re going to delete all but three of the columns. Delete columns to the right of total cM.”
In the “matches both kits” table, there is no “total cM” column.
GEDMatch must have changed. I assume that the “total cM” is the “shared” column under one’s own kit number. Could you confirm this?
Hi Gary. I’ve gone ahead and edited that for clarification. You’re right that “Shared” means “total cM.” I’m not sure if and when that changed, but I appreciate you alerting me to it. I hope you’re able to get some great visualizations of your clusters!
Thank you, Brit, for updating your instructions. It seems that GEDMatch is constantly improving the tables in its reports, so this disconnect is likely to be something that will happen again in future.
I’ve tried a test subset of my results from GEDMatch and am getting some interesting groupings, even at that low number of data-points. I suspect that this way of visualizing matches will help me to identify key groupings on which I should allocate more of my research time.
I’m also considering how I can extract similar node and edge data from other DNA sites, though I admit that some “hide” it very well.
Using GEPHI is starting to get mentioned on some of the fairly well-known genealogy sites, such as https://familylocket.com. So I expect you will soon see a lot more interest in your article and Dave’s video. Both are at just the right level for a beginner. I hope you’ll write more articles on using such tools in genetic genealogy.
I agree with you that some graphics in the article would be beneficial, especially when it comes to verifying that the resulting nodes and edges tables are in the correct format. That will help users to proceed, even when the GEDMatch site reports have unexpectedly changed.
Your site will be something I have bookmarked, will check and will definitely mention to colleagues.
A couple of screen shots would be very useful. I’m good with Excel but I got lost after “Open the ‘people who match 1 or 2 of both kits’ tool on GEDmatch.”
I, too, could use some sample screen shots. If I “Copy and paste the three remaining columns underneath the data in your edges spreadsheet,” I will still have the kit names in the columns, which are not in the edges spreadsheet. Do I delete the column of names and then just move over the kit numbers and total shared cM? Also, you said “In your edges sheet drag the “Undirected” values in the Type column all the way down to the end,” but I don’t have any values to “drag” down.
My column titles are Source/Target/Weight/Undirected, with Source being the kit numbers, Target being the variety of kit numbers, Weight being the total cM, and Undirected being a column with the heading “Undirected” but nothing underneath, i.e. the column is empty. What am I missing? I am a bit lost now, especially with the “You can now delete all of the total cM values that you used for the edges spreadsheet. In the first row, type into the first two columns “ID” and “Label.””