1. Introduction to Standard & Generalized MKT
The comparison of
patterns of polymorphism and divergence is one of the most powerful
approaches to investigating selection at the DNA level. The McDonald-Kreitman
test (MKT) is a nearly indispensable step within this approach. It
compares the amount of variation within a species to the divergence
between species at two types of site, one of which is putatively neutral
and used as the reference to detect selection at the other type of site. As
the test was initially described (McDonald and Kreitman 1991), these
sites were synonymous (putatively neutral) and non-synonymous sites in a
coding region. However, the test for selection can potentially be
extended to any two types of sites, provided that one of them is assumed
to evolve neutrally and that both types of sites are linked in the
genome. Furthermore, the rationale of the test can be used to analyze
multiple loci in the genome simultaneously provided that appropriate
multi-locus statistical tests are applied.
The Standard and
Generalized MKT website is the first Web-based resource where users can
perform standard or generalized McDonald-Kreitman tests using different
interfaces:
The Standard MKT allows users to
create 2x2 contingency tables by comparing two types of coding sites,
such as synonymous and non-synonymous
changes in a coding region.
The
Advanced MKT is an extension of the standard test which
allows comparing any two linked regions in the genome, including
non-coding DNA. Note that for the analysis to make sense the
two regions must be tightly linked in the genome. The Neutral Class
selected will be
compared to any other types of sites to test for selection.
The Multi-locus MKT is an
interface where users can analyze multiple coding regions in a single
multi-locus MKT.
The Main Parameters page contains
selectable parameters that apply to any of the three distinct interfaces
for the MKT website.
The MKT Menu on the left allows you to navigate through the different
pages of the Web site:
- The Main page, from which you can input your data.
- The Help page (this page).
- The Example page, which contains a pre-computed example for each type of test.
- The Contact us page, where you can find information about the authors.
The Resources menu contains quick links to some related resources.

2. Web interfaces
Standard MKT
NAME OF THE ANALYSIS: Enter a
name to identify the current analysis.

(1)
TYPES OF SITE TO ANALYZE:
Select here the
types of sites you want to analyze. The classical McDonald-Kreitman Test
compares synonymous and non-synonymous changes in the coding region (selected by
default), but you can choose to analyze other classes of sites.
For a single analysis
you can only select those classes of sites that are mutually exclusive, i.e.:
- Synonymous sites can only be compared to non-synonymous sites or non-degenerate sites.
- Non-synonymous sites can only be compared to synonymous sites or four-fold degenerate sites.
- Four-fold degenerate sites can only be compared to non-synonymous sites or two-fold and/or non-degenerate sites.
- Two-fold degenerate sites can only be compared to four-fold and/or non-degenerate sites.
- Non-degenerate sites can only be compared to synonymous sites or four-fold and/or two-fold degenerate sites.
(2)
SET AS NEUTRAL:
Select here which class of site will be taken as the neutral reference. In the classical McDonald-Kreitman Test, non-synonymous changes are compared
to synonymous changes, the latter being the neutral reference. However, if you use the
Advanced MKT interface to compare one gene to its pseudogene, you
will want to choose the pseudogene sequence as the neutral reference.
You can also compare four-fold degenerate sites of a coding region to a
nearby non-coding region; in this case you will probably want to choose
four-fold degenerate sites as the neutral reference.
(3)
PASTE THE SEQUENCES:
Paste your sequences here, either unaligned or already aligned. The format must be FASTA
(or gapped FASTA if the sequences are already aligned). You need to
enter ≥ 2 sequences for at least one species (from which polymorphism will be
calculated) and at least 1 sequence for the other species (for divergence
estimates). However, you can also include polymorphism data for both
species, in which case the polymorphism counts of the two species are added together. You
can also upload the sequences in two separate FASTA files.
(4)
ANNOTATIONS: When you enter the sequences in the corresponding species box, the annotation box is automatically
filled as follows: 'Sequence X --> 1..n' where X is the sequence number
(following the order in the species box) and n is the sequence length.
You can modify these annotations in order to specify which part of each
sequence you want to include in the analysis.
If you uploaded a file with the sequences, then you have to enter the
annotations manually. For each sequence enter a line as follows:
'Sequence X --> n..m',
where X is the number of the sequence and the expression n..m defines
the base range that will be analyzed (if you want to join different parts
of the sequence you can enter 'Sequence
X --> n..m,o..p,q..r', where o..p and q..r
specify different base ranges in the sequence).
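The annotation syntax described above can be illustrated with a short Python sketch. The function names `parse_annotation` and `extract` are hypothetical and are not part of the MKT website. The sketch also assumes that positions are 1-based and inclusive, which the default annotation '1..n' for a sequence of length n suggests but the text does not state explicitly.

```python
import re

def parse_annotation(line):
    """Parse 'Sequence X --> n..m,o..p' into (X, [(n, m), (o, p)])."""
    match = re.fullmatch(r"\s*Sequence\s+(\d+)\s*-->\s*(.+?)\s*", line)
    if match is None:
        raise ValueError(f"not a valid annotation line: {line!r}")
    number = int(match.group(1))
    ranges = []
    for part in match.group(2).split(","):
        start, end = part.strip().split("..")
        ranges.append((int(start), int(end)))
    return number, ranges

def extract(sequence, ranges):
    """Join the annotated ranges (ASSUMPTION: 1-based, inclusive positions)."""
    return "".join(sequence[start - 1:end] for start, end in ranges)
```

For instance, the annotation 'Sequence 2 --> 1..3,7..9' would select the first and the third codon of the second sequence under these assumptions.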

Advanced MKT
The advanced MKT form is similar
to the standard MKT (and the statistical procedure for the
calculations is exactly the same) but here you can analyze two separate regions
that can be either coding or non-coding. Note that for the
analysis to make sense the two regions must be tightly linked in
the genome.
There are two boxes similar to
the box in the standard MKT form in which you can enter the two different regions.
In each box you will find:
-
NAME OF THE REGION TO COMPARE:
Enter the name of the
region.
-
TYPE OF SITES TO ANALYZE:
Select the types of sites you want
to analyze. In this case, you first have to determine if
your sequences are coding or non-coding. If they are coding, you have to select which classes of
sites you want to analyze (see above).
In this form, there is a new category that adds together the
numbers of synonymous and non-synonymous changes. Remember
that only comparisons involving two mutually exclusive types
of site are allowed, so this category cannot be compared to
any other type of site from the same region.
-
SET AS NEUTRAL: Determine the class of
site you want to use as the neutral reference (see
above).
Note that only one neutral class can be selected, in either
the first OR the second region, and that
this neutral class will be compared to all other classes of sites
selected from the two regions.
-
PASTE THE SEQUENCES:
Paste
or upload your sequences in FASTA format (see
above).
-
ANNOTATIONS:
Write here annotations corresponding to the bases on your
sequences you want to analyse (see above).

Multi-locus MKT
Here you can analyze
multiple loci in a single multi-locus MKT. You can analyze only Coding
Regions in a form very similar to the Standard MKT or Coding and/or Non-Coding Regions
in a form very similar to the Advanced MKT, but in both cases sequences must be entered in a new FASTA-based format
that supports multi-locus data.
This new FASTA-based format contains
two different types of heading lines: lines starting with '>>' give the name of a locus, and lines starting with '>' give the name of a sequence within that locus.
Sequences must thus be entered
in a certain order, e.g.:
>>Name_of_locus_1
>Sequence_1
actactactacta...
>Sequence_2
actactactacta...
>Sequence_3
actactactacta...
>>Name_of_locus_2
>Sequence_1
ggggcgcgtat...
>Sequence_2
ggggcgcgtat...
Please note that each locus name in
species 1 must match exactly a locus name in species 2 (names are
case-sensitive!), and that the order of the input loci must be
the same in both species. In the form where you can also analyze
non-coding regions, the locus name must match exactly not only for the
two species, but also for the two regions analyzed.
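To make the format concrete, here is a minimal Python sketch of how such input could be read and checked. It is only an illustration: the function names are hypothetical, and the way the MKT website itself parses the input is not documented here.

```python
def parse_multilocus_fasta(text):
    """Return {locus_name: [(sequence_name, sequence), ...]} in input order."""
    loci = {}
    current = None  # records of the locus currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):       # locus heading line
            current = loci.setdefault(line[2:].strip(), [])
        elif line.startswith(">"):      # sequence heading line
            if current is None:
                raise ValueError("sequence header found before any '>>' locus line")
            current.append([line[1:].strip(), ""])
        else:                           # sequence data, possibly over several lines
            if not current:
                raise ValueError("sequence data found before any '>' header")
            current[-1][1] += line
    return {name: [tuple(rec) for rec in recs] for name, recs in loci.items()}

def same_loci(species_1, species_2):
    """True if both inputs list the same locus names (case-sensitive) in the same order."""
    return list(species_1) == list(species_2)
```

Running `same_loci` on the parsed input of both species before submitting is one way to catch mismatched or misordered locus names early.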
See other parameters
above.


Main Parameters
Finally, a set of main parameters
applies to any of the forms described above:

(1)
EXCLUDE LOW FREQUENCY VARIANTS:
you can exclude variants under a given threshold frequency (i.e. rare
polymorphisms).
(2)
CHOOSE THE GENETIC CODE: available genetic codes include (from
NCBI):
-
The Universal Code.
-
The Vertebrate Mitochondrial Code.
-
The Yeast Mitochondrial Code: for
Saccharomyces cerevisiae, Candida glabrata, Hansenula saturnus
and Kluyveromyces thermotolerans.
-
The Mold, Protozoan and Coelenterate
Mitochondrial Code and the Mycoplasma/Spiroplasma Code: for
Mollicutes (Entomoplasmatales and Mycoplasmatales),
Fungi (Emericella nidulans, Neurospora crassa, Podospora anserina,
Acremonium, Candida parapsilosis, Trichophyton rubrum, Dekkera/Brettanomyces,
Eeniella and Ascobolus immersus, Aspergillus amstelodami,
Claviceps purpurea, Cochliobolus heterostrophus), other
Eukaryotes (Gigartinales among the red algae, and the
protozoa Trypanosoma brucei, Leishmania tarentolae, Paramecium
tetraurelia, Tetrahymena pyriformis and Plasmodium
gallinaceum) and Metazoa (Coelenterata: Ctenophora and
Cnidaria).
-
The Invertebrate Mitochondrial Code: for
Nematoda (Ascaris and Caenorhabditis), Mollusca (Bivalvia and
Polyplacophora), Arthropoda/Crustacea (Artemia) and Arthropoda/Insecta
(Drosophila, Locusta migratoria and Apis mellifera).
-
The Ciliate, Dasycladacean and Hexamita
Nuclear Code: for Ciliata (Oxytricha and Stylonychia,
Paramecium, Tetrahymena, Oxytrichidae and Glaucoma chattoni),
Dasycladaceae (Acetabularia and Batophora), and Diplomonadida (Hexamita
inflata, Diplomonadida ATCC50330 and ATCC50380).
-
The Echinoderm and Flatworm Mitochondrial
Code: for Asterozoa (starfishes), Echinozoa (sea urchins) and
Rhabditophora among the Platyhelminthes.
-
The Euplotid Nuclear Code: for Ciliata (Euplotidae).
-
The Bacterial and Plant Plastid Code: for
Bacteria, Archaea, prokaryotic viruses and chloroplast proteins.
-
The Alternative Yeast Nuclear Code: for
Candida albicans, Candida cylindracea, Candida melibiosica, Candida
parapsilosis and Candida rugosa.
-
The Ascidian Mitochondrial Code: for a
phylogenetically diverse sample of tunicates (Urochordata).
-
The Alternative Flatworm Mitochondrial Code:
for Platyhelminthes (flatworms).
-
The Blepharisma Nuclear Code: for
Blepharisma.
-
The Chlorophycean Mitochondrial Code: for
Chlorophyceae and Spizellomyces punctatus.
-
The Trematode Mitochondrial Code: for
Trematoda.
-
The Scenedesmus obliquus Mitochondrial Code:
for Scenedesmus obliquus.
-
The Thraustochytrium Mitochondrial Code:
for Thraustochytrium aureum.
(3)
ALIGN SEQUENCES: if this checkbox is selected,
sequences will be aligned before the analysis using the selected parameters.
(4)
ALIGNMENT ORDER: you can also select the alignment order:
-
Species independently, then join: Sequences from species 1
and 2 will be aligned independently, and then the two alignments
will be joined together. This option is recommended when sequences are divergent.
-
All sequences at the same time: All sequences from species 1 and 2
will be aligned together in a single step.
(5)
CHOOSE THE ALIGNMENT PROGRAM: you can choose between two alignment
programs: Muscle (Edgar 2004) or ClustalW2.0 (Larkin et al. 2007).
(6)
SET THE PARAMETERS: set parameters for each alignment program:
-
MUSCLE
-
Output Format
The alignments can be obtained in these formats:
ClustalW, FASTA, HTML or Phylip sequential. If you want to select more than one format use
Ctrl or Ctrl+Alt.
-
Create a Log file
Select this checkbox if you want to get a log
file. This file contains information about when the program started and finished,
any error messages and warnings. It also contains the command line that has been
executed, the internal parameters and
the progress messages.
-
Output Tree
You can get the tree resulting from the first or the second iteration.
-
Penalties
-
Gap Open
-
Gap Extend
-
Center
-
Diagonals
Instead of aligning sequences
in pairs, the program looks for "diagonals" (short regions of
high similarity between two sequences).
When this option is activated, the accuracy decreases but the speed increases.
This option is recommended for large groups of related sequences. With 'diags',
diagonals are always activated. With 'diags1',
diagonals are activated for the first iteration only; the main objective of the first iteration is to rapidly construct a multiple alignment
to improve the distance matrix, but it is not very sensitive to the quality of the alignment. With 'diags2',
diagonals are activated for the second iteration only; the objective of the second iteration is to make the best possible progressive multiple
alignment.
-
Maximum Trees
This is the maximum number of new trees that are created in the second iteration.
If the value is >1 (the default value), the process will be repeated until it reaches a convergent result or
the number specified.
-
Maximum Iterations
As this value decreases, the accuracy also decreases but the speed increases.
It can range from 1 to 16. When 1-3 is chosen, the program performs the number of iterations selected.
For values ≥4, the program continues iterating until it reaches a convergent result or
the number specified. For huge alignments, 2 is recommended.
CLUSTALW2.0
-
FAST PAIRWISE ALIGNMENT
Control the speed-sensitivity of the initial alignments.
-
Ktup
The size of the exactly matching fragment that is used. Can
range from 1 to 4 for DNA. Increase this to increase speed;
decrease to improve sensitivity.
-
Window Length
The number of diagonals around each "top" diagonal that are considered.
Decrease for speed; increase for greater sensitivity.
-
Score Type
The similarity scores may be expressed as raw scores (number of identical residues
minus a "gap penalty" for each gap) or as percentage scores. If sequences are of very different lengths, percentage
scores make more sense.
-
Top Diagonals
The number of best diagonals in the imaginary dot-matrix plot that are considered.
Decrease to increase speed; increase to improve sensitivity.
-
Pairgap
The number of matching residues that must be found in order to introduce a gap.
This should be larger than Ktup. This has little effect on speed or sensitivity.
-
MULTIPLE ALIGNMENT
Control the gaps in the final multiple alignments.
-
Gap Open
Reduce this to encourage gaps of all sizes; increase it to discourage them.
Terminal gaps are penalized the same as all others except for END GAPS not being selected. BEWARE of making this too small (approx 5 or so);
if the penalty is too small, the program may prefer to align each sequence opposite one long gap.
-
No End Gaps
Here you can select if you want the terminal gaps to be penalized or not.
-
Gap Extension
Reduce this to encourage longer gaps; increase it to shorten them.
Terminal gaps are penalized the same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small,
the program may prefer to align each sequence opposite one long gap.
-
Gap Distances
Penalization for the distance between gaps. Gaps that are less than this distance
apart are penalized more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like
appearance of the alignment.
-
Transition Weight
Gives transitions a weight between 0 and 1. A weight of 0 means that
transitions are scored as mismatches; a weight of 1 gives transitions the match score.
For distantly related DNA sequences, the weight should be near 0; for closely related sequences it can be useful to assign a
higher score.
-
Delay Divergent Sequences
This switch delays the alignment of the most distantly related sequences until after
the most closely related sequences have been aligned. The
setting shows the percent identity level required to delay the addition
of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.
-
Iteration
With TREE you can iterate at each step of the progressive alignment. With
ALIGNMENT you can iterate just on the final alignment.
-
Number Iterations
The default number of iterations is 3. If you increase this value, the program will
iterate until the score converges or until the maximum number of iterations is reached.
-
OUTPUT FORMAT
The alignments can be obtained in one of these formats:
Clustal with numbers, Clustal without numbers, GCG,
GDE, Phylip, PIR or Nexus.

3. Analysis of the Sequences
Coding sequences are analyzed codon by
codon. If one codon has a gap in any of the sequences, this codon
is totally excluded from the analysis. If one codon is a stop codon, counts are performed as if it was another amino acid, but a warning
is shown in the output page.
Non-coding sequences are analyzed position by position. If one position has a gap in any of the sequences, this position
is totally excluded from the analysis.
In all cases, divergence counts are
corrected with the Jukes-Cantor method (Jukes and Cantor 1969) and results
are shown with and without this correction.
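For reference, the Jukes-Cantor correction converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The Python sketch below shows this formula. How the MKT website applies it to the counts in the contingency table is not described here, so the helper `corrected_divergence` (a hypothetical name) simply assumes that the corrected count is the corrected distance multiplied by the number of sites analyzed.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

def corrected_divergence(differences, sites):
    """ASSUMPTION: corrected count = corrected distance x number of sites."""
    return sites * jukes_cantor_distance(differences / sites)
```

The correction always increases the divergence count, and the increase grows with the observed divergence, because multiple substitutions at the same site become more likely.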
Treatment of sites or codons with multiple changes
When performing the test, counts in the contingency
table can be either:
Sites: positions in the alignment which are
either polymorphic within a species or divergent among species. Each
site can be counted only once. E.g., if a site
has more than two variants in one species, it will be counted as 1
polymorphic site. For the same reason, when the same site is polymorphic and divergent
at the same time, since it cannot be categorized into a single class,
this site is not taken into account.
Changes: the estimated number of mutations
that have occurred in the position since the ancestor of both analyzed
species. When changes are analyzed, one site (position) might eventually
involve several changes (mutations) if it has more than two variants,
and thus that site will be counted more than once in the contingency
table. In this case, there is no statistical problem when the same site is divergent and polymorphic
at the same time.
EXAMPLE:
Species 1   ATG TTC CTA GTT
            ATG TTA CTA GTT
            ATG TTT CTT GTT
Species 2   ATG TTC CTG GTT
In the example, the third position of the second codon has 3 variants;
this is counted as 1 SITE but as 2 CHANGES. The third position of the
third codon is polymorphic and divergent at the same time; this position
is thus not taken into account for SITES, but it is counted as 1
POLYMORPHIC CHANGE + 1 DIVERGENT CHANGE.
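The distinction between SITES and CHANGES in the example can be sketched in Python for a single alignment column. This is a simplified illustration, not the website's implementation: the function name is hypothetical, and the sketch assumes that a position is divergent when the two species share no variant and that the number of polymorphic changes is the number of variants minus one in each species.

```python
def classify_column(species_1, species_2):
    """Count one alignment column as SITES and as CHANGES.

    species_1 / species_2: strings holding the base of each sequence at this position.
    """
    alleles_1, alleles_2 = set(species_1), set(species_2)
    polymorphic = len(alleles_1) > 1 or len(alleles_2) > 1
    divergent = not (alleles_1 & alleles_2)  # ASSUMPTION: no shared variant = divergent

    # SITES: a site belongs to one class only; sites in both classes are dropped.
    sites = {"polymorphic": 0, "divergent": 0}
    if polymorphic and not divergent:
        sites["polymorphic"] = 1
    elif divergent and not polymorphic:
        sites["divergent"] = 1

    # CHANGES: minimum number of mutations needed to explain the variants.
    changes = {
        "polymorphic": (len(alleles_1) - 1) + (len(alleles_2) - 1),
        "divergent": 1 if divergent else 0,
    }
    return sites, changes
```

Calling `classify_column("CAT", "C")` corresponds to the third position of the second codon in the example, and `classify_column("AAT", "G")` to the third position of the third codon.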
For analyses including synonymous or non-synonymous changes,
only changes are computed. For any other type of analysis, two
different tests are performed: one for changes and
one for sites.

Estimation of the number of synonymous and non-synonymous changes
The numbers of synonymous and non-synonymous polymorphic changes
are estimated for species 1 and 2 independently, and then both counts
are added up together. Then the numbers of synonymous and non-synonymous divergent changes
are estimated.
We use a
maximum parsimony criterion for these estimates. For each codon in the
alignment, we get all the different codons represented in a species and compute the shortest paths
that connect all the codons; among these, we keep the
path that involves the fewest replacements (sometimes different
paths involve the same number of replacements and are equally
parsimonious; the corresponding numbers of replacements are added to
the contingency table).
Some special cases apply. First, when one of the
codons at the same position from the other species equals an
intermediate codon in a path, this path is chosen as the most
parsimonious one. Second, stop codons within the alignment are treated
as another amino acid, but a warning is shown in the results when a stop
codon appears in the alignment; stop codons are only excluded if positioned at
the last codon in the alignment. Furthermore, any paths involving
intermediate stop codons are excluded except if both end codons are stop
codons.
For example:
ATC ATT
The most conservative path is: ATC (Ile) → ATT (Ile): 1 synonymous change.
AGT AGC AGA AGG
One of the most conservative paths (tied with others) is: AGT (Ser) → AGC (Ser) → AGA (Arg) → AGG (Arg): 2 synonymous changes and 1 non-synonymous change.
CCC CAG
One of the most conservative paths (tied with others) is: CCC (Pro) → CCG (Pro) → CAG (Gln): 1 synonymous change and 1 non-synonymous change.
But if the second species contains an intermediate codon of another path (CAC), we
will assume that the most parsimonious path in this case is the one
that includes that codon: CCC (Pro) → CAC (His) → CAG (Gln): 2 non-synonymous changes.
AAT AGG ACT
One of the most conservative paths (tied with others) is: AAT (Asn) → ACT (Thr) → ACG (Thr) → AGG (Arg): 1 synonymous change and 2 non-synonymous changes.
The approach is the same for
divergence. We compute all the possible paths from each codon of species
1 to each codon of species 2 and choose the path with the fewest
replacements.
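The path search between two codons can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the website's code: it handles only a pair of codons with the standard genetic code, the function name is hypothetical, and it does not implement the special case in which a codon observed in the other species determines the chosen path. It enumerates every order in which the differing positions can be changed, discards paths through intermediate stop codons (unless both end codons are stops) and keeps the path with the fewest non-synonymous replacements.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids.
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def most_conservative_path(start, end, code=STANDARD_CODE):
    """Return (synonymous, non_synonymous) for the best shortest path, or None."""
    differing = [i for i in range(3) if start[i] != end[i]]
    both_stop = code[start] == "*" and code[end] == "*"
    candidates = []
    for order in permutations(differing):
        codon, syn, nonsyn, valid = start, 0, 0, True
        for i in order:
            nxt = codon[:i] + end[i] + codon[i + 1:]
            if nxt != end and code[nxt] == "*" and not both_stop:
                valid = False  # path passes through an intermediate stop codon
                break
            if code[nxt] == code[codon]:
                syn += 1
            else:
                nonsyn += 1
            codon = nxt
        if valid:
            candidates.append((syn, nonsyn))
    return min(candidates, key=lambda counts: counts[1], default=None)
```

With this sketch, `most_conservative_path("CCC", "CAG")` selects the path through CCG (1 synonymous and 1 non-synonymous change) rather than the path through CAC (2 non-synonymous changes).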

Estimation of the number of degenerate sites/changes
To estimate the number of
degenerate sites/changes we take the first sequence as a reference.
For each codon we determine the degree of degeneracy as represented in the
table below and compute the number of sites/changes for all the sequences:

In this representation of the standard genetic code,
N stands for any nucleotide (T, C, A or
G), Y for any pyrimidine (T or C),
and R for any purine (A or G). The
H in the set of codons for isoleucine (Ile) stands for “not-G” (T, C or A).
Degeneracies are as follows: N represents a four-fold degenerate site, and
Y and R represent two-fold degenerate sites. The
H is considered a two-fold
degenerate site, as are the first nucleotides in four leucine codons (TTA,
TTG, CTA, and CTG) and four arginine codons (CGA, CGG, AGA, and AGG).
All other nucleotides are non-degenerate.
For example:
Sequence 1:
ATG TTA TCA CAA
Degree of degeneracy: 000 202 004 002
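The degeneracy rules above can be reproduced with a short Python sketch for the standard genetic code. The function names are hypothetical, and the rule used is an approximation of the description above: a position is four-fold degenerate if all four nucleotides give the same amino acid, two-fold if two or three do (which covers the H of isoleucine), and non-degenerate otherwise. The treatment of stop codons is an assumption, since the text does not cover it.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids.
STANDARD_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codon_degeneracy(codon, code=STANDARD_CODE):
    """Return the degree of degeneracy (0, 2 or 4) of each codon position as a string."""
    digits = []
    for i in range(3):
        # Number of nucleotides at position i (including the original) coding the same amino acid.
        same = sum(code[codon[:i] + b + codon[i + 1:]] == code[codon] for b in BASES)
        digits.append("4" if same == 4 else "2" if same >= 2 else "0")
    return "".join(digits)

def sequence_degeneracy(seq):
    """Degeneracy of every complete codon of a coding sequence, space-separated."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return " ".join(codon_degeneracy(c) for c in codons)
```

Applied to the example sequence, `sequence_degeneracy("ATGTTATCACAA")` gives the same string as shown above.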

Estimation of the number of sites/changes in non-coding regions
We count the number of polymorphic and divergent sites/changes
for each position in the alignment.

4. MKT Output
Standard and Advanced MKT:
The output of the analysis
includes:
A table with a summary of the comparisons performed
-
Information about the main input parameters:
-
The Genetic Code used
-
If low-frequency variants are excluded, the
threshold value is indicated
Basic information about the input sequences
at each region
(in the Advanced MKT this information is repeated for each
region):
- The number of sequences for each species
- The length of the alignment
- The percentage of gaps within the alignment. Note that end gaps
are not taken into account. There is a warning when the percentage is
>30%
- A JalView button to visualize the aligned sequences
- The aligned input sequences in any selected formats.
When species are aligned independently, both alignments
are also shown
A 2x2 contingency table for each comparison
performed:

When a comparison includes two classes of sites
from which the program has
analyzed both changes and sites, the results show the two
analyses: one for
changes and another for sites.
From this table, the following estimates are computed:
-
Neutrality Index (NI): Indicates the extent to which the levels of amino acid polymorphism depart from those
expected under the neutral model (Rand and Kann 1996). It is computed as NI = (Pn/Ps) / (Dn/Ds).
- Under neutrality, Dn/Ds equals Pn/Ps and thus NI = 1
- If NI < 1, there is an excess of fixation of non-neutral replacements due to positive selection (Dn is higher than expected)
- If NI > 1, negative selection is preventing the fixation of harmful mutations (Dn is lower than expected)
-
α: Proportion of adaptive substitutions (Smith and Eyre-Walker 2002), which
ranges from -∞ to 1 and is estimated as 1 - NI.
-
χ2
-
p-value
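The estimates listed above can be computed from the four cells of the contingency table as in the following Python sketch. It is a sketch under assumptions: the function name is hypothetical, the χ2 statistic is the plain Pearson statistic for a 2x2 table without continuity correction (the website may use a different variant), and the p-value uses the closed form of the χ2 distribution with 1 degree of freedom.

```python
import math

def mkt_estimates(pn, ps, dn, ds):
    """Return (NI, alpha, chi2, p_value) for a 2x2 MKT table.

    pn, ps: non-synonymous and synonymous polymorphism counts
    dn, ds: non-synonymous and synonymous divergence counts
    """
    ni = (pn / ps) / (dn / ds)
    alpha = 1 - ni
    n = pn + ps + dn + ds
    # Pearson chi-square for a 2x2 table (ASSUMPTION: no continuity correction).
    chi2 = n * (pn * ds - ps * dn) ** 2 / ((pn + ps) * (dn + ds) * (pn + dn) * (ps + ds))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.
    return ni, alpha, chi2, p_value

# Illustrative counts only, not taken from any analysis on the website.
ni, alpha, chi2, p_value = mkt_estimates(pn=2, ps=42, dn=7, ds=17)
```

With these illustrative counts NI is below 1 and α is positive, the pattern interpreted above as an excess of fixed non-neutral replacements.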
Both the contingency table and the estimates
are computed with the divergence corrected by Jukes&Cantor and
without any correction for divergence. The default results shown
are corrected by Jukes&Cantor, but the results without
correction can be viewed by selecting the button 'Without any
correction for divergence' in the output page.

Multi-locus MKT:
The output of the analysis
includes:
-
A table with a summary of the comparisons performed
-
Information about the main input parameters:
-
The Genetic Code used
-
If low-frequency variants are excluded, the
threshold value is indicated
Basic information on the input sequences for each
locus:
- The number of sequences for each species
- The length of the alignment
- The percentage of gaps within the alignment.
Note that end gaps
are not taken into account. There is a warning when the percentage is
>30%
- A JalView button to see the aligned sequences
- The aligned input sequences in any selected formats.
When the species are aligned independently, both alignments
are also shown
A 2x2 contingency table for each comparison
performed and locus:

When a comparison includes two classes of sites
from which the program has
analyzed both changes and sites, the results show the two
analyses: one for
changes and another for sites.
From this table, the following estimates
are computed:
-
The Mantel-Haenszel Test of Homogeneity indicates
whether there is homogeneity among the loci. When the p-value is
significant, the loci are heterogeneous and
combining them in a single 2x2 contingency table
is not appropriate.
-
The Mantel-Haenszel Estimator is
equivalent to the Neutrality Index
(Rand and Kann 1996)
shown for one-locus tests and indicates the extent to
which the levels of amino acid polymorphism depart from those expected
under the neutral model.
-
Mean α:
the mean proportion of adaptive substitutions (Smith and Eyre-Walker 2002),
ranging from -∞
to 1
and being estimated as:

Note that if only one locus is entered in this interface, the output estimates
will be the same as in the Standard or the Advanced MKT,
and a warning will be displayed.
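As an illustration of why a single locus gives the same result as the one-locus tests, the standard Mantel-Haenszel estimator of a common odds ratio can be sketched in Python. The function name and the orientation of the table are assumptions chosen so that the estimator reduces to NI = (Pn x Ds) / (Ps x Dn) for one locus; the exact formula used by the website is not reproduced here.

```python
def mantel_haenszel_ni(tables):
    """Mantel-Haenszel estimate of a common Neutrality Index.

    tables: iterable of (pn, ps, dn, ds) tuples, one 2x2 table per locus.
    """
    numerator = denominator = 0.0
    for pn, ps, dn, ds in tables:
        n = pn + ps + dn + ds          # total count of the locus
        numerator += pn * ds / n
        denominator += ps * dn / n
    return numerator / denominator
```

Each locus contributes in proportion to its total count, so loci with few changes have little influence on the combined estimate.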
Both the contingency tables and the estimates
are computed with the divergence corrected by Jukes&Cantor and
without any correction for divergence. The default results shown
are corrected by Jukes&Cantor, but the results without
correction can be viewed by selecting the button 'Without any
correction for divergence' in the output page.