MIPhy documentation

About MIPhy

Welcome to MIPhy, a tool to quantify phylogenetic instability in trees of multi-species gene families. This page will take you through the process of using MIPhy. It can also be downloaded as a stand-alone local program from GitHub.

MIPhy requires two input files: a rooted gene tree and an information file containing the species relationships and a mapping indicating which species each gene is from. These two sets of example files are available for download, to test the web or the local versions of MIPhy. The info files in particular may be useful as a guide for constructing your own files.

Small example files: ugts_small.nwk and ugts_info.txt
Large example files: vert_cyps.nwk and vert_cyps_info.txt

MIPhy is free to use and download, but if you find it useful it in your research please cite: (Curran et al., 2018).

The algorithm

The most parsimonious sequence of gene events (duplication, loss, and incongruence) is found that reconciles the given gene and species trees, and these events are mapped on to the internal nodes of the gene tree. A score for each node is calculated as the weighted sum of these gene events in that node's descendants; these descendant sequences would be considered to be in the same group under that node. The clustering algorithm then identifies the set of nodes such that each sequence in the tree is contained in exactly one group, and the sum of all group scores is at a minimum. These resulting groups are referred to as minimum instability groups (MIGs), and each sequence in a MIG receives the same instability score.

Tutorial

Submitting the files online

First, download the two small test files above (ugts_small.nwk and ugts_info.txt). Then, open the Home page. In the MIPhy inputs box, you will see two upload buttons. Find where you saved those test files, and upload ugts_small.nwk as the gene tree and ugts_info.txt as the info file.

Leave the weights at their default values, and hit the Upload files and run MIPhy button. Once processing is complete, you should see the text in the Upload summary box change to reflect the files that you uploaded. Click the View results button, which will open the results page in a new tab.

Running MIPhy locally

Download the latest release from the GitHub page, and install it as described in the README file. Once MIPhy is installed, move into the 'miphy_resources' directory of the downloaded package, and then into 'tests'. Begin MIPhy with the command: 'miphy.py ugts_small.nwk ugts_info.txt'. Alternatively, download those test files as described above, navigate to their location in your system, and run MIPhy with the same command.

Explanation of results page

You should see the gene tree displayed in the middle of the page, with the species of each gene indicated by colour in the legend, the tree nodes, and the bar that is around the perimeter of the tree. The MIGs (sequence groups predicted by MIPhy) are shown by the light orange groups on the tree, with the instability score of each indicated by the bar graph around the outside of the tree.

Notice that one MIG is much more unstable than the rest (at the 12:00 position). Hover your mouse over it, and you will see the corresponding entry in the MIGs list light up. Click on the MIG on the tree or in that list and an information window will appear. This particular MIG contains 6 genes from 2 species (4 from green and 2 from brown), which the legend tells us are Caenorhabditis brenneri and C. remanei, respectively. You can also see that reconciling this MIG with the species tree required 4 duplication and 2 gene loss events, and the Relative spread field indicates that the genes in this MIG are 68% more spread out than the median. A reasonable interpretation would be that there is something unique in the environments of C. brenneri and C. remanei compared to the other two nematode species, and the proteins in this cluster may be responding to it. In the case of this particular gene family, the UDP-glucuronosyltransferases, the response is likely to be the degredation of some environmental toxin.

Notice the Parameter weights panel on the left side of the screen, which modifies the impact of each event when calculating the instability scores. The default parameters tend to be quite robust, but if you raise the duplication weight to 2.3 and click Recluster you will see the unstable MIG split into three. Click Reset parameters to return to the originally specified values, and then click Previous several times to switch back and forth between the parameter sets. You may also notice that with the higher weight the instability bars appear to increase for every MIG. This is only because the bar heights are relative to the most unstable MIG, and there is less separation between them now. In general, increasing the duplication or decreasing the loss weights will cause unstable MIGs to split up, as will increasing incongruence or spread weights, though these tend to have less of an impact.

Vertebrate Cytochrome P450 example

For a real-world example, run MIPhy on the two large test files (vert_cyps.nwk and vert_cyps_info.txt). This is a larger data set, consisting of 628 cytochrome P450 (Cyp) proteins from 10 species of vertebrate. It was originally generated by (Thomas, 2007), and is analyzed in some depth in the MIPhy publication.

You should see that these sequences are grouped into 47 MIGs, with an average MIG score of 8.20, and an average sequence score of 26.66. That these two averages are so different indicates that the score distribution of the MIGs is quite skewed; in this case we have many small clusters with very low scores, and a small number of large clusters with very high scores. This skew is expected of this particular gene family, as the Cyps are known to be split relatively evenly between stable and unstable sequences, but your gene family may differ.

Let us first look at the human protein Cyp2C9, which is important in the metabolism of the drug ibuprofin and other NSAIDs. You will notice that MIPhy has hidden the sequence names for this larger data set, in order to better visualize the structure of the tree itself. If you wish, you can enable the labels by scrolling down to the Display options panel and increasing the 'Tree font size'; you would likely have to also increase the 'Tree width' in order to stop the labels from overlapping. For now leave the 'Tree font size' at 0, type '2c9' into the search box in the top-right corner of the tree pane, and press Enter or click the magnifying glass icon. You should see that there is only 1 hit to this query, and you can see it highlighted in green at the bottom of the tree.

You will notice that it is part of the most unstable MIG in the tree, so let us take a closer look at what's going on. Zoom in on this sequence by scrolling with your mouse or using the + button in the top-right corner of the tree, and drag the view around with your mouse so that you can see the MIG and the tree legend at the same time. It may also help to reduce the 'Tree node size' from 2 to 1 in the Display options panel. Notice that Cyp2C9 is part of three doubles of human and macaque sequences next to each other, with no other species between them. An interpretation could be that the ancestral mammalian species possessed the ancestor of these proteins. This ancestral population split into two groups: the descendants of the first include humans and macaques, and the descendants of the second include the other animals in the tree. The ancestral protein was duplicated twice in the human/macaque population at some point after that split, but before the population split again to become modern humans and macaques. One hypothesis from these observations is that these 6 proteins provide a function that is important to the primates, but not to the other mammals.

You can see a similar situation to the left of Cyp2C9 (still in the same MIG), with the grouping of 9 mouse and 2 rat sequences. These proteins may have originated from the same ancestor protein as Cyp2C9, but the short internal branches of this section of the tree make it difficult to be certain. Regardless, some protein was duplicated many times in mouse to give rise to these 9 sequences. It isn't clear whether these duplications happened in the ancestor of mouse and rat and were then lost in rat, or whether they happened only in mouse in the relatively short period of time separating these rodents.

Click on this MIG, and notice that while it contains 100 sequences, they come from only 8 species; there are none from purple or white, which the legend indicates are the two species of fish. If you then examine the MIG immediately to the left (score 6.69), you will see it contains one sequence each from the fish, and 6 from frog. It isn't clear whether this MIG should be on its own (meaning that the terrestrial vertebrates lost their version of the protein after the branching off of the amphibians) or whether it should be merged with the MIG containing Cyp2C9. It is likely that there is some combination of weights that would instead suggest this alternate history.

As a final example, find Cyp2W1 in the tree, and notice that except for 3 sequences on the right side of the MIG (1 from dog and 2 from rat), the mammalian sequences look quite stable (even though the cow sequence is displaced, which could suggest it is a pseudogene or contains annotation errors). Besides these, we have 3 sequences from chicken, 16 from frog, and 4 and 10 from the two species of fish. This expansion in these particular species may indicate these genes are involved with living in an aquatic environment, and is an example of the type of hypothesis MIPhy was designed to generate.

FAQ

My gene tree is unrooted, is that ok?

Sort of. The visualization library that MIPhy uses to draw the trees requires that your tree is rooted, so the MIPhy results page can't be loaded if it isn't. However, you can still analyze it with the local version of MIPhy if you include the "-r" flag to save the MIG data, which skips the results visualization page entirely.

How can I make MIPhy run faster?

MIPhy proceeds in two phases: the main clustering algorithm which is generally very fast, followed by an optional refinement step that can be much slower (it involves calculating a pairwise distance matrix, which takes exponential time on the size of the tree). As an example, we analyzed 5,500 sequences from 58 species on a 2.7GHz laptop. The first phase took 30 seconds, the refinement phase took 7 minutes, and opening the results page in Chrome took 90 seconds.

How many sequences can MIPhy process?

The online server has two built-in limits: it will disable the optional refinement step if you upload a tree with more than 3,000 sequences, and will not run at all for trees with more than 10,000 sequences. The local version has no limits. Note that with either version it may take several minutes to load the results page for > 5,000 sequences in a web browser, and the performance of the interactive controls may start to break down.

What does MiphyValidationError mean?

This error means there was an issue reading one of the two input files. One common reason is that some popular tree viewing software (such as FigTree) add single quotes around the names of some or all of the sequences. You can use miphy-tools 'clean' to remove these.

When closing the local version of MIPhy, what does "[Errno 32] Broken pipe" mean?

The MIPhy process communicates with a small local server it initializes to host the results page, which in turn communicates with your browser window. When you close the browser all of the processes should close in a normal sequence, but as they are independent, sometimes one exits slightly before it was expected to. It should be safe to ignore, especially if it only seems to appear sporadically.

/ˈmaɪˈfaɪ/ - Minimizing Instability in Phylogenetics