MS/MS Results Interpretation

Other help pages describe the format and content of the various result reports. In particular, refer to Result Report Overview and Summary Reports for MS/MS. This page attempts to explain some of the underlying concepts, especially those relating to protein inference.

Ions score significance thresholds

In Mascot, the ions score for an MS/MS match is based on the calculated probability, P, that the observed match between the experimental data and the database sequence is a random event. The reported score is -10Log(P), where the logarithm is to base 10. So, during a search, if 1500 peptides fell within the mass tolerance window about the precursor mass, and the significance threshold was chosen to be 0.05 (a 1 in 20 chance of being a false positive), an individual match would need P below 0.05/1500, which translates into a score threshold of 45.
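
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above, not Mascot's own code; the function name identity_threshold is hypothetical.

```python
import math

def identity_threshold(num_candidates: int, significance: float = 0.05) -> float:
    """Score a match must reach so that the chance of a random match,
    across all candidate peptides, stays below the significance level."""
    # One candidate matching at random with probability P scores -10*log10(P).
    # With num_candidates candidates in the mass tolerance window, each match
    # must satisfy P <= significance / num_candidates.
    return -10.0 * math.log10(significance / num_candidates)

threshold = identity_threshold(1500, 0.05)
print(round(threshold))  # 45
```

Note how the threshold falls as the number of candidates falls: a narrower precursor tolerance or a smaller database gives a lower absolute threshold.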

If the quality of an MS/MS spectrum is poor, particularly if the signal to noise ratio is low, a match to the "correct" sequence might not exceed this absolute threshold. Even so, the best match could have a relatively high score, which is well separated from the distribution of 1500 random scores. In other words, the score is an outlier. This would indicate that the match is not a random event and, if tested using a method such as a target-decoy search, such matches can be shown to be reliable. For this reason, Mascot also attempts to characterise the distribution of random scores, and provide a second, lower threshold to highlight the presence of any outlier. The lower, relative threshold is reported as the homology threshold while the higher threshold is reported as the identity threshold.

The identity threshold is still useful because it is not always possible to estimate a homology threshold. If the instrument accuracy is very high or the database is very small, there may only be a small handful of candidate sequences, so that it is not possible to say whether a match is an outlier.

For a search of at least 1000 spectra, where an automatic decoy search was used, you can choose to process the Mascot scores through Percolator. This uses machine learning to re-rank the matches, so as to obtain an optimum false discovery rate. The revised probabilities are converted to scores for reporting purposes, together with a single score threshold to indicate significance.

Protein scores

The protein score in the result report from an MS/MS search is derived from the ions scores. For a search that contains a small number of queries, the protein score is the sum of the highest ions score for each distinct sequence, excluding the scores of duplicate matches, which are shown in parentheses. A small correction is applied to reduce the contribution of low-scoring random matches. This correction is a function of the total number of molecular mass matches for each query, and is usually very small, except in no enzyme searches.

This protein score works well for small searches, and provides a logical order to the report. If multiple queries match to a single protein, but the individual ions scores are below threshold, the combined ions scores can still place the protein high in the report. However, the standard protein score is less satisfactory for searches with very large numbers of queries, such as MudPIT data sets. For each MS/MS query, Mascot retains up to 10 peptide matches. When the number of queries is comparable with the number of entries in the database, this means that there can be random, low-scoring matches for every entry. Although the average number of random matches per entry might be low, the actual number will follow a distribution, and some entries will have large numbers of low scoring matches, leading to large protein scores.

While it is obvious from a detailed study of the report that these are meaningless matches, it would be better to eliminate them entirely. So, if the ratio between the number of queries and the number of entries in the database exceeds a pre-determined threshold, the basis for calculating the protein score is changed. Only those ions scores that exceed one or both significance thresholds contribute to the score, so that low scoring, random matches have no effect. This gives a much cleaner report for a large scale search. This threshold is 0.001 by default, and can be changed on a global basis in the configuration file, mascot.dat, or changed for a single report by using the format controls at the top of the report. Note that, when calculating this threshold, if a taxonomy filter is being used, the number of entries in the database is the number remaining after the taxonomy filter.
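
The two scoring modes and the switch between them can be sketched as follows. This is a simplified illustration under stated assumptions: the small correction for random matches is omitted, a single threshold is supplied with each match, "exceed" is read as strictly greater than, and the large-search score is taken to be a plain sum of the qualifying ions scores, since the exact formula is not given on this page. The function names use_mudpit_scoring and protein_score are hypothetical.

```python
def use_mudpit_scoring(num_queries, num_entries, ratio_threshold=0.001):
    """Decide whether to switch scoring mode. num_entries should be the
    number of database entries remaining after any taxonomy filter."""
    return num_queries / num_entries > ratio_threshold

def protein_score(matches, significant_only):
    """matches: list of (sequence, ions_score, threshold) tuples for one protein.
    Sums the highest ions score for each distinct sequence, so duplicate
    matches to the same sequence do not contribute twice."""
    best = {}
    for sequence, score, threshold in matches:
        if significant_only and score <= threshold:
            continue  # large-search mode: sub-threshold matches have no effect
        if score > best.get(sequence, float("-inf")):
            best[sequence] = score
    return sum(best.values())

matches = [
    ("PEPTIDEK", 52, 40),
    ("PEPTIDEK", 47, 40),   # duplicate match, lower score
    ("LOWSCR", 12, 40),     # random, low-scoring match
    ("ANOTHERK", 18, 40),   # random, low-scoring match
]
print(protein_score(matches, significant_only=False))  # 82
print(protein_score(matches, significant_only=True))   # 52
print(use_mudpit_scoring(num_queries=50000, num_entries=20000))  # True
```

In the standard mode, the two random matches add 30 to the protein score; in the large-search mode, they contribute nothing.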

Protein Inference

When MS/MS spectra are searched against a sequence database, we are matching peptides, not proteins. In most cases, the matched peptides will not be unique to a single protein. Yet, we usually want to know which proteins were present in the sample. So, we are faced with the challenge of protein inference: given a set of peptide matches, which proteins do we believe were present in the sample?

The usual approach is based on the "Principle of Parsimony". We report the minimum set of proteins that account for the observed peptide matches. If we had four peptide matches, two of which occurred in protein A and two in protein B but all four were found in protein C, we would report that protein C had been identified. Proteins A and B might be listed as "sub-set" proteins. It is perfectly possible that our sample actually contained a mixture of proteins A and B, but there is no evidence for this.

The Peptide Summary and Select Summary use a very simple algorithm. First, we take the protein with the highest protein score, and call this hit number 1. We then take all other proteins that share the same set of peptide matches or a sub-set and include these in the same hit. In the report, they are listed as same-set and sub-set proteins. With these proteins removed from the list, we now take the remaining protein with the highest score and repeat the process until all the significant peptide matches are accounted for.
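
A minimal sketch of this greedy grouping is given below, using the proteins A, B and C from the parsimony example plus an unrelated protein D. The data structures and the function name group_hits are assumptions made for illustration; each protein is reduced to the set of significant peptide matches it contains.

```python
def group_hits(protein_peptides, protein_scores):
    """protein_peptides: dict accession -> set of significant peptide matches.
    protein_scores: dict accession -> protein score.
    Returns a list of hits in report order."""
    # Proteins without any significant match cannot account for anything.
    remaining = {acc: peps for acc, peps in protein_peptides.items() if peps}
    hits = []
    while remaining:
        # The highest scoring remaining protein becomes the next hit.
        lead = max(remaining, key=lambda acc: protein_scores[acc])
        lead_peptides = remaining.pop(lead)
        same_set = sorted(a for a, p in remaining.items() if p == lead_peptides)
        sub_set = sorted(a for a, p in remaining.items() if p < lead_peptides)
        for acc in same_set + sub_set:
            del remaining[acc]
        hits.append({"lead": lead, "same_set": same_set, "sub_set": sub_set})
    return hits

peptides = {
    "A": {"p1", "p2"},
    "B": {"p3", "p4"},
    "C": {"p1", "p2", "p3", "p4"},
    "D": {"p5"},
}
scores = {"A": 90, "B": 80, "C": 170, "D": 40}
for number, hit in enumerate(group_hits(peptides, scores), start=1):
    print(number, hit)
```

Here, hit 1 is protein C with A and B as sub-set proteins, and hit 2 is protein D. Note that this sketch, like the reports it imitates, makes no attempt to detect intersections.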

This sounds simple enough, and works well for small datasets, but larger search results create difficulties:

  • What if two proteins have many strong matches in common but one has an additional weak match? Should we treat one as the outright winner, and relegate the other to the status of sub-set?
  • What if we have intersections? That is, a protein is not a sub-set of any single other protein, but all its matches can be found in a set of proteins, each of which has additional matches.
  • In many cases, the exact sequence of the protein that was analysed is not in the database. All the peptide sequences are present, but spread across several homologous proteins, which might be splice variants or represent different combinations of SNPs.

The Protein Family Summary tries to address these difficulties by clustering proteins into families. The algorithm works as follows:

  1. Create a list of proteins, ordered by protein score
  2. Take the highest scoring protein
  3. Find all the family members for this protein:
    • select all matches with a score at or above the homology threshold
    • for each match, select all the other proteins that contain this match (using the score as a test to include matches that are identical matches though not identical sequences, e.g. I to L substitution or other differences that have no impact on the score)
    • for each new protein, select all new matches with a score at or above the homology threshold
    • loop until all related proteins and matches have been found
    Note that this grouping into families is based on significant matches. Non-significant matches are ignored.
  4. Report this family as a single hit. All these proteins can be removed from the list
  5. For each protein in the family, make a list of the distinct peptide sequences. That is, ignore differences in score, modifications, charge, etc. Where there are duplicate matches, use the highest score
  6. Divide and group the proteins into same-set proteins and sub-set proteins; sub-sets include intersections
    • Where there are same-set proteins, collapse into a single family member
    • Move any proteins that are sub-sets or intersections to the sub-sets list
  7. Perform hierarchical clustering on the family members, using the score excess over threshold of the non-shared matches as the distance metric
  8. Loop from step 2 until no more proteins remain that contain matches with homology score or better
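
Steps 1 to 4 and the outer loop of step 8 amount to finding groups of proteins connected through shared significant matches. The sketch below illustrates this under simplifying assumptions: a single homology threshold is used for all matches, and matches are identified by peptide sequence alone. The same-set and sub-set handling of steps 5 and 6 and the clustering of step 7 are not shown. The function name build_families and the data layout are hypothetical.

```python
def build_families(protein_matches, protein_scores, homology_threshold):
    """protein_matches: dict accession -> dict of peptide -> ions score.
    Returns a list of families, each a list of accessions, best score first."""
    # Grouping is based on significant matches only.
    significant = {
        acc: {pep for pep, score in matches.items() if score >= homology_threshold}
        for acc, matches in protein_matches.items()
    }
    by_match = {}
    for acc, peps in significant.items():
        for pep in peps:
            by_match.setdefault(pep, set()).add(acc)

    def by_score(acc):
        return (-protein_scores[acc], acc)

    # Step 1: list of proteins, ordered by protein score.
    ordered = sorted((a for a in significant if significant[a]), key=by_score)
    assigned, families = set(), []
    for seed in ordered:  # steps 2 and 8
        if seed in assigned:
            continue
        family, frontier = {seed}, [seed]
        while frontier:  # step 3: loop until nothing new is found
            acc = frontier.pop()
            for pep in significant[acc]:
                for other in by_match[pep]:
                    if other not in family:
                        family.add(other)
                        frontier.append(other)
        assigned |= family  # step 4: remove these proteins from the list
        families.append(sorted(family, key=by_score))
    return families

protein_matches = {
    "P1": {"a": 60, "b": 55},
    "P2": {"b": 55, "c": 50},
    "P3": {"c": 50, "d": 20},
    "P4": {"e": 70},
    "P5": {"d": 20},
}
protein_scores = {"P1": 115, "P2": 105, "P3": 70, "P4": 70, "P5": 20}
print(build_families(protein_matches, protein_scores, homology_threshold=40))
# [['P1', 'P2', 'P3'], ['P4']]
```

P1 and P3 end up in the same family even though they share no match, because each shares a match with P2. P5 has only a non-significant match, so it is ignored.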

The goal is to present the possible protein assignments clearly, so that someone with knowledge of the biology can make an informed decision as to which proteins are present. In most cases, there will be some ambiguity about precisely which proteins are present. As mentioned earlier, the exact sequence of an analyte may not be in the database, and peptide matches may be distributed across multiple, homologous database entries. If it is essential to characterise the complete protein sequence, or to choose between splice variants, or to confirm a SNP, it is likely that additional, targeted experiments will be required.

Hierarchical Clustering

To cluster proteins into families, we use the score of the non-shared matches as the distance between two proteins. More precisely, we use the score excess over the significance threshold, since a score below significance threshold could be random, and should not be taken as evidence for two different proteins being present. This means that matches below threshold play no part in the clustering process. Each distinct peptide sequence is represented once by the match with the highest score. Matches to the same sequence with different charge states or with different modifications are considered duplicates.

If two proteins have the same set of peptide matches, the distance between them is zero. If they have just a single shared match, the distance between them is the sum of the score excesses of all the non-shared matches in one protein, since discarding these would make the protein a sub-set of the other, based on the single shared match.

There are some subtleties to this procedure. Consider the case of two proteins which have different peptide matches to the same query with the same score. Only one of these matches can be correct, but we don't know which. One obvious example is where the two sequences differ only in exchange of I and L. In terms of the mass spectrum, these sequences are identical. Unless the mass accuracy is high, the same is true for exchange of Q and K or F and oxidised M. Clearly, a sequence containing F at a particular position is very different, in biological terms, from one containing M at the same position. But, if the scores are the same, there is simply no evidence from the mass spectrometry data for two proteins. In terms of a distance matrix, we must treat it as if there was no match to either peptide.

Now, consider the case where we have two proteins with different peptide matches to the same query and the scores are not the same. Assume the threshold is 40 and one has a score of 50 and the other has a score of 60. Again, only one of these matches can be correct; it is not the same as if they were independent matches to different queries. Extending the logic that matches to the same query with the same score correspond to a distance of zero, matches to the same query with different scores correspond to a distance that is the score difference. In this example, the distance would be 10. If the two matches came from different queries, and could be treated independently, the distance would be (60 - 40) + (50 - 40) = 30
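
The distance rules described above can be sketched as a small function. This is an illustration rather than Mascot's implementation, and it makes several assumptions: a single significance threshold applies to every match, each protein holds at most one match per query after duplicates have been collapsed, and a sub-threshold score is given an excess of zero. The names excess and protein_distance are hypothetical.

```python
def excess(score, threshold):
    """Score excess over threshold; sub-threshold matches count for nothing."""
    return max(0.0, score - threshold)

def protein_distance(a, b, threshold):
    """a, b: dict of query id -> (sequence, ions score) for two proteins."""
    distance = 0.0
    for query in a.keys() | b.keys():
        match_a, match_b = a.get(query), b.get(query)
        if match_a is not None and match_b is not None:
            # Both proteins match this query. A shared match, or different
            # sequences with equal scores, adds zero; otherwise only the
            # difference in score counts.
            distance += abs(excess(match_a[1], threshold) - excess(match_b[1], threshold))
        else:
            # Independent, non-shared match: the full score excess counts.
            _, score = match_a if match_a is not None else match_b
            distance += excess(score, threshold)
    return distance

# Different peptide matches to the same query, scores 60 and 50
same_query = protein_distance({"q1": ("SEQA", 60)}, {"q1": ("SEQB", 50)}, threshold=40)
# The same two matches, but to different queries
different_queries = protein_distance({"q1": ("SEQA", 60)}, {"q2": ("SEQB", 50)}, threshold=40)
print(same_query, different_queries)  # 10.0 30.0
```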

To create the dendrogram, we first compute a distance matrix, which is the distance between each pair of proteins. The two proteins separated by the smallest distance are joined to create a node, with the length of the branches from the node being the score distance between the proteins. The two joined proteins are removed from the list and replaced by the node, and the distances between the new node and all other remaining proteins (or nodes) are computed. The process is repeated until only one node remains.
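
The joining procedure can be sketched as below. One detail is not specified on this page: how the distance between a new node and the remaining proteins is computed. The sketch assumes the smaller of the two child distances (single linkage), purely for illustration; Mascot may use a different rule. The function name build_dendrogram and the nested tuple representation (left, right, distance) are also assumptions.

```python
from itertools import combinations

def build_dendrogram(labels, distances):
    """labels: list of protein names.
    distances: dict of (name, name) -> distance, one entry per pair.
    Returns nested tuples of the form (left, right, distance)."""
    d = {frozenset(pair): dist for pair, dist in distances.items()}
    nodes = list(labels)
    while len(nodes) > 1:
        # Join the two items separated by the smallest distance.
        a, b = min(combinations(nodes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b, d[frozenset((a, b))])
        nodes = [n for n in nodes if n != a and n != b]
        for other in nodes:
            # ASSUMPTION: single linkage. The source does not state the rule.
            d[frozenset((merged, other))] = min(
                d[frozenset((a, other))], d[frozenset((b, other))]
            )
        nodes.append(merged)
    return nodes[0]

tree = build_dendrogram(
    ["A", "B", "C"],
    {("A", "B"): 10, ("A", "C"): 30, ("B", "C"): 25},
)
print(tree)  # ('C', ('A', 'B', 10), 25)
```

In this example, A and B are joined first at a distance of 10, and the resulting node is then joined to C.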

When the dendrogram (or tree) is drawn, the order is chosen to avoid any branches crossing. There is no other significance to the order of the branches, and there are many possible ways to order the branches so as to avoid crossings. In the tabular part of the report, proteins are sorted in order of decreasing score, and this will often be different from the dendrogram order.

Note that, if you select a pair of family members from a large family, it is perfectly possible that they will have no shared matches. Each family member will have shared matches with at least one other family member, or they would not have been grouped into the same family, but this doesn't mean that there are going to be shared matches between every pair.

Large Search Results and the Peptide and Select Summary Reports

The Protein Family Summary is designed for large search results. If, for some reason, you wish to view results using the Peptide Summary or Select Summary reports, this section contains some tips.

The format controls near the top of the report can help streamline the results from a large search by eliminating most of the "junk". These options can also be selected by adding URL switches to the report URL.

MudPIT Protein Scoring: By default, large searches will switch to using more aggressive protein scoring. This removes many of the junk protein hits, which have high protein scores but no high scoring peptide matches. Do not be tempted to switch back to standard scoring.

Require Bold Red: The Peptide Summary and Select Summary reports do not detect intersections. In these reports, red and bold typefaces are used to highlight the most logical assignment of peptides to proteins. The first time a peptide match to a query appears in the report, it is shown in bold face. Whenever the top ranking peptide match appears, it is shown in red. Thus, a bold red match is the highest scoring match to a particular query listed under the highest scoring protein containing that match. This means that protein hits with many peptide matches that are both bold and red are the most likely assignments. Conversely, a protein that does not contain any bold red matches is an intersection of proteins listed higher in the report.

Requiring a protein hit to include at least one bold red peptide match is a good way to filter homologous proteins from a report. You can turn this on using a checkbox in the format controls. The down-side is that you may sometimes throw out the wrong protein! For example, imagine you are searching with a taxonomy of mammals but are mainly interested in yeti proteins. If the same strong peptide matches are found in a yeti protein and also in the human homologue, and one or more junk peptide matches prevent the two proteins collapsing into a single hit, but give the human protein a slightly higher score, that is the one that will feature in the report.
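
The bold and red rules, and the effect of requiring a bold red match, can be sketched as follows. This is an illustration only. It assumes that "a peptide match to a query" is identified by the pair of query and sequence, and that the hits are supplied in report order. The function names and accessions are hypothetical.

```python
def flag_bold_red(hits):
    """hits: list of (accession, [(query, sequence, rank), ...]) in report order.
    Returns the same structure with (query, sequence, bold, red) rows."""
    seen = set()
    flagged = []
    for accession, matches in hits:
        rows = []
        for query, sequence, rank in matches:
            bold = (query, sequence) not in seen  # first appearance in the report
            seen.add((query, sequence))
            red = rank == 1                       # top ranking match for the query
            rows.append((query, sequence, bold, red))
        flagged.append((accession, rows))
    return flagged

def require_bold_red(flagged):
    """Keep only protein hits with at least one match that is bold and red."""
    return [acc for acc, rows in flagged
            if any(bold and red for _, _, bold, red in rows)]

hits = [
    ("HUMAN_X", [(1, "AAAK", 1), (2, "CCCK", 1), (4, "JUNKK", 1)]),
    ("YETI_X", [(1, "AAAK", 1), (2, "CCCK", 1)]),
    ("OTHER_Y", [(3, "DDDK", 1)]),
]
print(require_bold_red(flag_bold_red(hits)))  # ['HUMAN_X', 'OTHER_Y']
```

The yeti protein is dropped because every one of its matches has already appeared under the human homologue, which was listed first thanks to one extra junk match.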

Ignore Ions Score Below: You can minimise the previous problem by judicious use of the Ions score cut-off field. By setting this to (say) 20, you cut out all of the very low scoring, random peptide matches. This means that homologous proteins are more likely to collapse into a single hit, avoiding the need to choose between them.

Suppress the pop-ups: The JavaScript pop-up windows, which show the top 10 peptide matches for each query, are very useful, but they make the HTML report much larger and slower to load in a web browser. If you have a report that never seems to load, or is very slow to scroll, try using the radio buttons to suppress pop-ups.

Copyright © 2010 Matrix Science Ltd. All Rights Reserved.