Incorporating Hidden Markov Model for Identifying Protein
Kinase-specific Phosphorylation Sites
Protein phosphorylation, which is an important mechanism in post-translational modification, affects essential cellular processes such as metabolism, cell signaling, differentiation and membrane transportation. Proteins are phosphorylated by a variety of protein kinases. In this investigation, we develop a novel tool to computationally predict catalytic kinase-specific phosphorylation sites. The known phosphorylation sites from public domain data sources are categorized by their annotated protein kinases. Based on the concepts of profile Hidden Markov Model (HMM), computational models are learned from the kinase-specific groups of the phosphorylation sites. After evaluating the learned models, we select the model with highest accuracy in each kinase-specific group and provide a web-based prediction tool for identifying protein phosphorylation sites. The main contribution here is that we develop a kinase-specific phosphorylation site prediction tool with both high sensitivity and specificity.
Protein phosphorylation, performed by a group of enzymes known as kinases and phosphotransferases (Enzyme Commission classification 2.7), is a post-translational modification essential to correct functioning within the cell 1 . The post-translational modification of proteins by phosphorylation is the most abundant type of cellular regulation. It affects a multitude of cellular signal pathways, including metabolism, growth, differentiation and membrane transport 2 . The enzymes must be sufficiently specific and act only on a defined subset of cellular targets to ensure signal fidelity. Proteins can be phosphorylated at serine, threonine and tyrosine residues.
Because of its importance in cellular control, it is desirable to have a computational tool for quickly and efficiently identifying phosphorylation sites in protein sequences, as well as the catalytic kinases involved in the phosphorylation. This will increase the efficient characterization of new protein sequences 1 . Therefore, in this investigation, we designed and implemented a prediction tool that can facilitate the identification of the phosphorylation sites and the related catalytic kinases.
PhosphoBase 3 is a database of experimentally verified phosphorylation sites. The entries supply the annotations about the phosphoprotein and the exact position of its phosphorylation sites. Furthermore, part of the entries contain information about kinetic data obtained from enzyme analyzes on specific peptides. The Swiss-Prot 4 is a comprehensively annotated protein database. Both experimentally validated and putative phosphorylation annotations can be obtained from the post-translation modification annotation in the database.
NetPhos 2 presents an artificial neural network method that predicts the phosphorylation sites in independent protein sequences with a sensitivity in the range from 69% to 96%. DIPHOS 5 is a web-based tool for the prediction of protein phosphorylation sites. In this study, the position-specific amino acid frequencies and disorder information are used to improve the discrimination between phosphorylation and non-phosphorylation sites. Berry et al 1 employ back-propagation neural networks (BPNNs), the decision tree algorithm C4.5 and the reduced bio-basis function neural networks (rBPNN) to predict phosphorylation sites. NetPhosK 6 is an artificial neural network algorithm to predict protein kinase A (PKA) phosphorylation sites with 100% sensitive and 40% specific in their experiments.
Most of the previous studies on the phosphorylation site prediction have concentrated on only the substrate specificity. In this investigation, the catalytic kinases of the protein phosphorylation are taken into account. The known phosphorylation sites from data sources in public domain are categorized by their annotated protein kinases. In order to increase the sensitivity of the models, the sequences in larger groups of the phosphorylated sites can be further clustered and split into subgroups by Maximal Dependence Decomposition (MDD) 7 . Based on the concepts of profile Hidden Markov Model (HMM), computational models are learned from the kinase-specific groups of the phosphorylation sites. After evaluating the learned models by the k-fold cross-validation or leave-one-out cross-validation, we select the best performed model in each kinase-specific group and provide a web-based prediction tool to facilitate the identification of protein kinase-specific phosphorylation sites.
The PhosphoBase 3 consists of 1,083 experimentally verified phosphorylation sites within 436 protein entries. As given in Table 1, the number of serine sites, threonine sites and tyrosine sites are 713, 164 and 206, respectively. The Swiss-Prot 4 (release 45 of October 2004) maintains 163,500 protein entries of which 3,614 entries are annotated as phosphorylated. The entries contain residues annotated as ‘phosphorylation' in the ‘MOD_RES' fields are extracted. The number of serine sites, threonine sites and tyrosine sites are 1005, 281 and 321, respectively. Those sites annotated as “by similarity”, “potential” or “probable” are considered as test set. Generally, the serine, threonine and tyrosine, which are not annotated as phosphorylation residues, within the experimentally validated phosphorylated proteins are selected as negative sets, i.e., the non-phosphorylated sites. Therefore, two negative (non-phosphorylated) data sets are extracted from PhosphoBase and Swiss-Prot based on the phosphorylation annotation.
* The entries are annotated as “by similarity”, “potential” or “probable”. The data set will be considered as test set in the Discussion section.
Table 2. The statistics of the catalytic kinase-specific phosphorylation sites.
The statistics of kinase-specific phosphorylated sites in the data sources are given in Table 2. The present study confirms that the existence of two major types of protein kinases phosphorylating either at serine/threonine residues or at tyrosine residues. The collected data sets show that the majority of serine/threonine specific protein kinases have a preference for serine residues 8 . The number of serine phosphorylation sites is, in most case, 3-10 times more numerous than threonine sites in both PhosphoBase and Swiss-Prot. Comparing this to the natural situation for the different proteins annotated in the Swiss-Prot database, the ratio is 1.3:1 for serine and threonine; we and others are unable to explain this 9 .
Figure 1. The flow of the proposed method.
The flow of the proposed method is shown in Fig. 1. We firstly extracted the phosphorylated sites as positive sets, non-phosphorylated sites as negative sets and the catalytic kinase annotations from PhosphoBase and Swiss-Prot. The positive sets are then categorized by catalytic kinases. Alternatively, in larger positive groups, the sequences of the phosphorylated sites can be clustered into subgroups by Maximal Dependence Decomposition (MDD) 7 . Thereupon, we incorporate the concept of profile Hidden Markov Model (HMM) to learn computational models from positive sets of the phosphorylation sites. In order to evaluate the learned models, k-fold cross-validation and leave-one-out cross-validation are carried out. After evaluating the models, the model with highest accuracy in each data set is selected. Each step in the proposed method will be introduced below.
Constructing Positive Sets and Negative Sets
We construct the phosphorylated sites as the positive sets and the non-phosphorylated sites as the negative sets. As given in Table 3, the positive sets, PB_Pos and SP_Pos, are constructed from the phosphorylation sites extracted from PhosphoBase and Swiss-Prot, respectively. PB_Pos and SP_Pos are merged into a non-redundant positive set, namely Com_Pos. Similarly, the negative sets, PB_Neg and SP_Neg, are the non-phosphorylation sites extracted from PhosphoBase and Swiss-Prot, respectively. PB_Neg and SP_Neg are merged into a non-redundant data set, namely Com_Neg. Specially, a few phosphorylated residues and non-phosphorylated residues located at the ends of the protein sequences lead to incompleteness of the 9-mer phosphorylated sites or non-phosphorylated sites. These sites are eliminated from the constructed data sets.
Table 3. Data sets constructed in this study.
We define the position 0 as the phosphorylated residue and the positions (-4 ~ -1) and (+1 ~ +4) designated the residues surrounding the phosphorylation residue, such as serine, threonine and tyrosine. For the sake of the observation of the sequence distribution surrounding the phosphorylated residues, we make up the 9-mer sequence logos 10,11 of the phosphorylation sites and the non-phosphorylation sites. The sequence logos 10,11 are a graphical representation of an amino acid or nucleotide multiple sequence alignment. Each logo consists of stacks of symbols, one stack presents each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of the symbols within the stack indicates the relative frequency of each amino or nucleotide at that position.
As shown in Fig. 2, we present the sequence logos built by the three types of the 9-mer phosphorylated sites and the 9-mer non-phosphorylated sites. By observing the sequence logos, the sequences surrounding the phosphorylated sites are more conserved than the ones surrounding the non-phosphorylated sites. Especially, the sequence logos of serine, threonine and tyrosine sites categorized by protein kinases are made and given in Fig. 3, Fig. 4 and Fig.5, respectively. Previous studies have confirmed that the most of the serine and threonine protein kinases can be divided into three classes, namely the basophilic kinases (i.e., PKA, PKC, PKG, and CaM-II), the acidophilic kinases (i.e., CKI and CKII) and the proline-directed protein kinases (i.e., cdc2 and MAPK) 8 . For the tyrosine kinases, most of the tyrosine kinases are acidophilic kinases.
Figure 2. The logos of the phosphorylated sites and the non-phosphorylated sites.
Figure 3. The logos of the phosphorylated serine sites of different catalytic kinases.
Figure 4. The logos of the phosphorylated threonine sites of different catalytic kinases.
Figure 5. The logos of the phosphorylated tyrosine sites of different catalytic kinases.
Phosphorylation Sites are Clustered by Maximal Dependence Decomposition (MDD)
The site sequences in the positive sets with a larger size can be alternatively clustered by MDD method in order to increase the predictive sensitivity and specificity of the models. The Maximal Dependence Decomposition (MDD) 7 is a methodology to group an set of aligned signal sequences to moderate a large group into subgroups that capture the most significant dependencies between positions. Furthermore, MDD is firstly proposed to group the splice sites during the identification process of splice site prediction 7 . However, in our study, we group protein sequences instead of nucleotides. In order to reduce the data complexity of the phosphorylated sites when doing MDD, we categorize the twenty types of amino acids into five groups such as neutral, acidic, basic, aromatic and imino groups, as the mapping given in Table S1 (see Supplementary Materials). Then, we implement the MDD algorithm for amino acids groups and apply it to the large sets of the phosphorylated sites for sequence clustering.
The MDD is a recursively process to divide the positive sets into tree-like subgroups. When applying MDD to cluster the sequences of a positive set, a parameter, i.e., the minimum-cluster-size, should be set. If the size of a subgroup is less the minimum-cluster-size, the subgroup will not be divided any more. The MDD process terminates until all the subgroup sizes are less the minimum-cluster-size.
Profile Hidden Markov Models (HMMs) are learned from the site sequences in the positive sets. An HMM describes a probability distribution over a potentially infinite numbers of sequences
. It can be used to detect distant relationships between amino acids sequences. Here, we use the software package HMMER
Evaluating the Learned Models
After the models are learned, it is necessary to evaluate whether the models are fitted or not. The following measures of the predictive performance of the models are then calculated: Precision (Prec) = TP / (TP+FP), Sensitivity (Sn) = TP / (TP+FN), Specificity (Sp) = TN / (TN+FP) and Accuracy (Acc) = (Sn + Sp) / 2, where TP, TN, FP and FN are true positive, true negative, false positive and false negative predictions, respectively. Especially, we make the equal sizes of the positive samples and the negative samples during the cross-validation processes.
To evaluate the learned models, two cross-validation methods, such as k-fold cross-validation and leave-one-out cross-validation, are applied in this study. For a large positive set, i.e., the number of a positive set of the phosphorylated sites is greater or equal to thirty sites, the 5-fold cross-validation is used to evaluate the model learned from the data set. The size of the negative set, which is constructed by randomly selected from the corresponding non-phosphorylation sites, is equal to the size of positive set. The experiments are repeated for 20 times and the average precision, sensitivity, specificity and accuracy are calculated. Furthermore, in order to avoid the skew sampling during the cross-validation process, for a small positive set (less than 30), the leave-one-out cross-validation is alternatively applied. Similarly, the negative set in this cross-validation is constructed by the same strategy as the 5-fold cross-validation.
Selecting the Learned Models
For each pair of the positive set and the negative set, we compare the performance of the models learned by HMMER from different data sources and with different grouping strategies, such as the raw phosphorylation sites, the kinase-specific phosphorylation sites and MDD-clustered phosphorylation sites. For each positive set, the model with highest accuracy is selected.
Prediction by the Selected Models
For each kinase-specific positive set of the phosphorylated sites, the best performed model is selected and used to identify the phosphorylation sites within the input protein sequences by HMMsearch 12 . To search the hits of a model, HMMER returns both a HMMER score and an expectation value (E-value). The score is the base two logarithm of the ratio between the probability that the query sequence is a significant match and the probability that it is generated by a random model. The E-value represents the expected number of sequences with a score greater than or equal to the returned HMMER bit scores. While decreasing the E-value threshold favors finding true positives, increasing the E-value threshold favors finding true negatives. We select the HMMER score as the criteria to define a HMM match. A search with the HMMER score greater than 0 is taken as a match (positive prediction), i.e., a HMM recognizes a phosphorylation site.
Especially, in a MDD-cluster data set, for instance, MDD-clustered PKA catalytic serine (S_PKA), models are learned separately from the subgroups resulted by MDD. A prediction made by the group of models is positive if at least one of the models makes positive prediction, whereas a negative prediction of the group of models is defined as all models make negative predictions.
The Web Interfaces
The users can submit their uncharacterizing protein sequences to the query interface and make a choice for the appropriate models to predict for kinase-non-specific or kinase-specific serine, threonine or tyrosine, as shown in Fig. S1 (see Supplementary Materials). As shown in Fig. S2 (see Supplementary Materials), our system provides the positions of the candidate phosphorylation sites, as well as the catalytic protein kinases involved in. The web service is freely available at http://KinasePhos.mbc.NCTU.edu.tw/.
Table 4. The model evaluation of the raw data sets separately from PhosphoBase and Swiss-Prot.
The models perform well by incorporating kinase annotations. The three phosphorylated residue types in PB_Pos and PB_Neg are separately used to learn the models and to be evaluated by cross-validations. Similarly, the same experiment is applied to SP_Pos and SP_Neg. The experiment results are given in Table 4. For instance, the precision, sensitivity, specificity and accuracy of the models learned from phosphorylation serine from PhosphoBase are 0.70, 0.54, 0.77 and 0.65, respectively. As to the consideration of kinase annotations, several experiments are also done in the kinase-specific groups in PB_Pos/PB_Neg and SP_Pos/SP_Neg data set. As given in Table 5, the Prec, Sn, Sp and Acc in the model of S_PKA (kinase PKA catalytic serine) data set are 0.83, 0.81, 0.84 and 0.83, respectively. The average Prec, Sn, Sp and Acc in serine of PB_Pos/PB_Neg are 0.87, 0.74, 0.89 and 0.81, respectively. By comparing the results of the two experiments above, we find that the models learned from the kinase-specific groups of the phosphorylation sites perform well than the raw phosphorylated data sets. Moreover, the results of the experiments in Com_Pos/Com_Neg data sets are partially provided in Table S2 (see Supplementary Materials).
Table 5. The model evaluation of the kinase-specific data sets.
The models perform well in MDD-clustered data sets. Especially in larger data set (greater then 100 site), we apply the MDD to group the sequences of the phosphorylated sites into several subgroups, which are separately taken as training sets, and the model of each subgroup are generated. Partial experiment results are given in Table S3, Table S4 and Table S5. (see Supplementary Materials). Fig. 6 and Fig. 7 show the model comparisons between the original data sets and the MDD-cluster data sets. For the raw phosphorylation sets in PB_Pos, SP_Pos and Com_Pos, all the models learned from MDD-clustered data sets have higher sensitivity than the ones learned from the data sets not applied MDD, but the models lose a little specificity, as shown in Fig. 6. As for the kinase-specific groups of the phosphorylation serine sites, applying MDD in S_PKA, S_PKC and S_CKII groups can increase the sensitivity of the learned models.
Figure 6. Model comparisons between the original data sets and the MDD-cluster data sets.
Figure 7. Serine model comparisons between the original data sets and the MDD-cluster data sets.
By comparing the results given in Fig. 6 and Fig. 7, the accuracy of the models learned from the combined data sets in the raw sequences set experiments and the kinase-specific group experiments are better than the models learned from only PhosphoBase or Swiss-Prot data sets. Except for S_PKA and S_PKC in PB_Pos and SP_Pos, the sizes of the kinase-specific groups are relatively small. For small data sets like S_PKG, S_CaM-II and S_CKII, it is quietly necessary to merge the kinase-specific groups of phosphorylation sites from two different data sources into one group to increase the training set size. All the combinations of the data sets are constructed for learning computational models, which are then evaluated. Alternatively, the MDD can be applied to the data sets which are large enough. For each kinase-specific group, the model with the highest accuracy is selected, as given in Table 6. For instance, the Prec, Sn, Sp and Acc of the model learned from MDD-clustered S_PKA data set, which are constructed by the combined PhosphoBase and Swiss-Prot data sources, are 0.88, 0.90, 0.87 and 0.89, respectively. The average Prec, Sn, Sp and Acc of all the kinase-specific serine models are 0.88, 0.84, 0.88 and 0.86, respectively.
Table 6. The selected models with the highest accuracy.
* means the data set is clustered by MDD
Assessment of Kinase Specificity
In order to assess the specificity of the kinase-specific models, especially kinase-specific serine models, we take a particular group as the positive set and the other groups as the negative sets one by one. The higher specificity the cross-validation, the less incorrect prediction of the phosphorylation sites in other groups. As given in Table 7, the number in the parenthesis besides the kinase name indicates the size of the positive set. For example, the first row gives that there are 232 phosphorylated sites in kinase PKA catalytic serine set. The sensitivity (Sn) of the PKA model is 0.89. The specificity are given in the table, for instance, in the first column the specificity (Sp) of PKC, PKG and CaM-II sets corresponding to the PKA model are 0.51, 0.07 and 0.35, respectively. The Sp values marked in red color indicate they are relatively lower between the kinases PKA, PKC, PKG and CaM-II in basophilic group. Similarly, the Sp values in green color indicate they are relatively lower between the kinases CKI and CKII in acidophilic group. The Sp values in blue color indicate they are relatively lower between the kinases CDC2 and MAPK in praline-directed group. We observe that the specificity corresponding to the kinase-specific data sets in the same kinase group, such as basophilic, acidophilic and praline-directed groups, are relatively lower than the specificity corresponding to the kinase-specific data sets in other groups.
Table 7. The specificity of the kinase-specific serine models.
Discussion and Conclusion
The proposed method is compared to several previously developed phosphorylation prediction tools such as NetPhos 2 , DISPHOS 5 and rBPNN 1 . All the previous tools did not consider the catalytic kinase annotations. Especially, in our investigation we construct the kinase-specific models for phosphorylation sites. We only compare our average accuracy from the best model selected in each kinase-specific model. As given in Table 7, the average accuracy of the models learned from serine, threonine and tyrosine sets are 0.86, 0.86 and 0.80, respectively. The average accuracy for KinasePhos is 0.85. All the accuracies of the serine, threonine and tyrosine models are higer than NetPhos and DISPHOS. When comparing to rBPNN, the average accuracy of KinasePhos is slightly lower.
Table 9. The prediction results of putative phosphorylation sites annotated as “by similarity”, “potential” and “probable” in Swiss-Prot.
* means the data set is clustered by MDD
Furthermore, we would like to experiment our learned models for those putative phosphorylated sites in Swiss-Prot 4 . As given in Table 1, the total number of putative phosphorylated sites annotated as “by similarity”, “potential” or “probable” are 6,343. Especially, we construct the kinase-specific putative phosphorylated sites as kinase-specific test sets, which are then predicted by the learned kinase-specific models. For instance, in the experiment of S_PKA models and putative PKA catalytic serine sets, the positive prediction rate is 0.67. However, the accuracy of the model learned from experimentally validated phosphorylated sites in Com_Pos is 0.89. Almost all the results of the experiments suggest that the kinase annotation of the putative phosphorylation sites in Swiss-Prot might not be as highly accurate as the experimental validated kinase-specific phosphorylated sites in Swiss-Prot and PhosphoBase.
Based on the concept of profile Hidden Markov Models, a predictive tool for protein phosphorylation sites is designed and implemented to facilitate the identification of the phosphorylation sites and the catalytic kinases involved in. After evaluating the learned models, we select the best model in each kinase-specific group and provide a web-based prediction tool for accurately identifying protein phosphorylation sites. Rather than only considering the three phosphorylated residues, the main contribution here is that we successfully develop a kinase-specific phosphorylation site prediction tool.
The prospective works to improve the accuracy of the predictive models are addressed as follows. Firstly, the species-specific phosphorylation sites can be taken into consideration to assess the mechanisms of protein phosphorylation in different organisms. The accuracy of the models learned from a species-specific data set might be improved. Secondly, the structural properties, such as solvent accessibility, of the phosphorylated sites can be considered to possibly reduce those false positive predictions of phosphorylated sites located in buried regions that are potentially to be non-phosphorylated sites. For proteins with known structures, the solvent accessibility of a phosphorylated site can be calculated trivially. However, as to the proteins without structures, the solvent accessibility of a residue should be computationally determined.
1. Berry, E. A.; Dalby, A. R.; Yang, Z. R. Comput Biol Chem 2004, 28(1), 75-85.
2. Blom, N.; Gammeltoft, S.; Brunak, S. J Mol Biol 1999, 294(5), 1351-1362.
3. Blom, N.; Kreegipuu, A.; Brunak, S. Nucleic Acids Res 1998, 26(1), 382-386.
4. Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M. C.; Estreicher, A.; Gasteiger, E.; Martin, M. J.; Michoud, K.; O'Donovan, C.; Phan, I.; Pilbout, S.; Schneider, M. Nucleic Acids Res 2003, 31(1), 365-370.
5. Iakoucheva, L. M.; Radivojac, P.; Brown, C. J.; O'Connor, T. R.; Sikes, J. G.; Obradovic, Z.; Dunker, A. K. Nucleic Acids Res 2004, 32(3), 1037-1049.
6. Hjerrild, M.; Stensballe, A.; Rasmussen, T. E.; Kofoed, C. B.; Blom, N.; Sicheritz-Ponten, T.; Larsen, M. R.; Brunak, S.; Jensen, O. N.; Gammeltoft, S. J Proteome Res 2004, 3(3), 426-433.
7. Burge, C.; Karlin, S. J Mol Biol 1997, 268(1), 78-94.
8. Kreegipuu, A.; Blom, N.; Brunak, S.; Jarv, J. FEBS Lett 1998, 430(1-2), 45-50.
9. Bairoch, A.; Apweiler, R. Nucleic Acids Res 1998, 26(1), 38-42.
10. Crooks, G. E.; Hon, G.; Chandonia, J. M.; Brenner, S. E. Genome Res 2004, 14(6), 1188-1190.
11. Schneider, T. D.; Stephens, R. M. Nucleic Acids Res 1990, 18(20), 6097-6100.
12. Eddy, S. R. Bioinformatics 1998, 14(9), 755-763.
Table S1. The amino acids group used in MDD.
Table S2. The model evaluation of the kinase-specific data sets in Com_Pos/Com_Neg sets.
Table S3. The accuracy of the models learned by MDD-clustered data sets.
Table S4. The accuracy of the models learned by MDD-clustered data sets with kinase annotations.
* indicates the data set are clustered by MDD
Table S5. The accuracy of the models learned by MDD-clustered data sets with kinase annotations in Com_Pos data set.
* indicates the data set is clustered by MDD
BID Lab, Department of Life Science, National Central University , Taiwan.