
Incorporating Hidden Markov Model for Identifying Protein Kinase-specific Phosphorylation Sites

-----------------1.Introduction---------------

-------------------2.Methods------------------

-------------------3.Statistics------------------

-----------4.Database Comparison----------

------------------5.References-----------------

 

ABSTRACT

Protein phosphorylation, an important mechanism of post-translational modification, affects essential cellular processes such as metabolism, cell signaling, differentiation and membrane transport. Proteins are phosphorylated by a variety of protein kinases. In this investigation, we develop a novel tool to computationally predict catalytic kinase-specific phosphorylation sites. Known phosphorylation sites from public-domain data sources are categorized by their annotated protein kinases. Based on the concept of the profile Hidden Markov Model (HMM), computational models are learned from the kinase-specific groups of phosphorylation sites. After evaluating the learned models, we select the model with the highest accuracy in each kinase-specific group and provide a web-based prediction tool for identifying protein phosphorylation sites. The main contribution is a kinase-specific phosphorylation site prediction tool with both high sensitivity and high specificity.

 

Introduction

 

Protein phosphorylation, performed by a group of enzymes known as kinases and phosphotransferases (Enzyme Commission classification 2.7), is a post-translational modification essential to correct functioning within the cell [1]. The post-translational modification of proteins by phosphorylation is the most abundant type of cellular regulation. It affects a multitude of cellular signaling pathways, including metabolism, growth, differentiation and membrane transport [2]. To ensure signal fidelity, the enzymes must be sufficiently specific and act only on a defined subset of cellular targets. Proteins can be phosphorylated at serine, threonine and tyrosine residues.

Because of the importance of phosphorylation in cellular control, it is desirable to have a computational tool that quickly and efficiently identifies phosphorylation sites in protein sequences, as well as the catalytic kinases involved in the phosphorylation. Such a tool would make the characterization of new protein sequences more efficient [1]. Therefore, in this investigation, we designed and implemented a prediction tool that facilitates the identification of phosphorylation sites and the related catalytic kinases.

PhosphoBase [3] is a database of experimentally verified phosphorylation sites. Its entries supply annotations about the phosphoprotein and the exact positions of its phosphorylation sites. Furthermore, some entries contain kinetic data obtained from enzyme analyses on specific peptides. Swiss-Prot [4] is a comprehensively annotated protein database. Both experimentally validated and putative phosphorylation annotations can be obtained from the post-translational modification annotations in the database.

NetPhos [2] presents an artificial neural network method that predicts phosphorylation sites in independent protein sequences with a sensitivity ranging from 69% to 96%. DIPHOS [5] is a web-based tool for the prediction of protein phosphorylation sites. In that study, position-specific amino acid frequencies and disorder information are used to improve the discrimination between phosphorylation and non-phosphorylation sites. Berry et al. [1] employ back-propagation neural networks (BPNNs), the decision tree algorithm C4.5 and the reduced bio-basis function neural networks (rBPNN) to predict phosphorylation sites. NetPhosK [6] is an artificial neural network algorithm that predicts protein kinase A (PKA) phosphorylation sites with 100% sensitivity and 40% specificity in the authors' experiments.
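The sensitivity and specificity figures quoted above follow the usual confusion-matrix definitions. As a minimal sketch (the function names and the example counts are illustrative assumptions, not values taken from any of the cited tools), they can be computed as:

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true phosphorylation sites that are predicted as sites."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-phosphorylated sites that are predicted as non-sites."""
    total = tn + fp
    return tn / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(sensitivity(tp=90, fn=10))  # 0.9
    print(specificity(tn=40, fp=60))  # 0.4
```

A predictor that labels every candidate residue as phosphorylated reaches 100% sensitivity with 0% specificity, which is why the two measures are reported together.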

Most previous studies on phosphorylation site prediction have concentrated only on substrate specificity. In this investigation, the catalytic kinases of the protein phosphorylation are taken into account. Known phosphorylation sites from public-domain data sources are categorized by their annotated protein kinases. To increase the sensitivity of the models, the sequences in the larger groups of phosphorylated sites can be further clustered and split into subgroups by Maximal Dependence Decomposition (MDD) [7]. Based on the concept of the profile Hidden Markov Model (HMM), computational models are learned from the kinase-specific groups of phosphorylation sites. After evaluating the learned models by k-fold cross-validation or leave-one-out cross-validation, we select the best-performing model in each kinase-specific group and provide a web-based prediction tool to facilitate the identification of protein kinase-specific phosphorylation sites.
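The evaluation step can be illustrated with a short sketch of how a kinase-specific group of sites might be partitioned for k-fold cross-validation. The function name `k_fold_splits` and the deterministic (unshuffled) fold assignment are assumptions made for illustration; the text does not specify how the folds were built.

```python
from typing import List, Tuple


def k_fold_splits(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs for k-fold cross-validation.

    Setting k equal to n_items gives leave-one-out cross-validation.
    """
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    # Assign item i to fold i % k so that fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Five folds over ten hypothetical sites.
    for train, test in k_fold_splits(10, 5):
        print("train:", train, "test:", test)
    # Leave-one-out over a small group of seven hypothetical sites.
    print(len(k_fold_splits(7, 7)))
```

In each round a model would be trained on the `train` indices and scored on the held-out `test` indices. Leave-one-out is the natural choice for small kinase groups, where a larger held-out fold would leave too few training sequences.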

 

Materials and Methods

PhosphoBase [3] consists of 1,083 experimentally verified phosphorylation sites within 436 protein entries. As given in Table 1, the numbers of serine, threonine and tyrosine sites are 713, 164 and 206, respectively. Swiss-Prot [4] (release 45 of October 2004) maintains 163,500 protein entries, of which 3,614 are annotated as phosphorylated. Entries containing residues annotated as 'phosphorylation' in the 'MOD_RES' field are extracted. The numbers of serine, threonine and tyrosine sites are 1,005, 281 and 321, respectively. Sites annotated as “by similarity”, “potential” or “probable” are treated as the test set. In general, the serine, threonine and tyrosine residues that are not annotated as phosphorylated within the experimentally validated phosphorylated proteins are selected as the negative sets, i.e., the non-phosphorylated sites. Therefore, two negative (non-phosphorylated) data sets are extracted from PhosphoBase and Swiss-Prot based on the phosphorylation annotation.
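The construction of positive and negative sets described above can be sketched as follows. The function name `split_sites`, the flanking window length `flank`, the padding character and the example sequence are all hypothetical choices made for illustration; the text here does not state the window size used.

```python
from typing import List, Set, Tuple

PHOSPHO_RESIDUES = {"S", "T", "Y"}  # serine, threonine, tyrosine

Site = Tuple[int, str]  # (1-based position, sequence window centred on the residue)


def split_sites(sequence: str, phospho_positions: Set[int],
                flank: int = 4, pad: str = "-") -> Tuple[List[Site], List[Site]]:
    """Split the S/T/Y residues of one protein into positive and negative sites.

    Residues whose 1-based position is in `phospho_positions` are positives;
    every other S/T/Y residue in the same protein is a negative.
    """
    padded = pad * flank + sequence + pad * flank
    positives: List[Site] = []
    negatives: List[Site] = []
    for i, residue in enumerate(sequence):
        if residue not in PHOSPHO_RESIDUES:
            continue
        # Residue i sits at index i + flank in the padded sequence.
        window = padded[i:i + 2 * flank + 1]
        target = positives if (i + 1) in phospho_positions else negatives
        target.append((i + 1, window))
    return positives, negatives


if __name__ == "__main__":
    # Hypothetical protein with one annotated phosphoserine at position 4.
    pos, neg = split_sites("MKRSASTPY", {4}, flank=2)
    print(pos)  # [(4, 'KRSAS')]
    print(neg)  # [(6, 'SASTP'), (7, 'ASTPY'), (9, 'TPY--')]
```

Drawing negatives only from proteins with experimentally validated sites, as the text describes, limits the pool to proteins that have actually been studied for phosphorylation. Unannotated residues in those proteins may still be undiscovered sites, so the negative sets are an approximation.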

 

Table 1. The data sources of the phosphorylation sites.

| Data source | Number of phosphorylated proteins | Serine (S) sites | Threonine (T) sites | Tyrosine (Y) sites | Total sites |
|---|---|---|---|---|---|
| PhosphoBase | 436 | 713 | 164 | 206 | 1,083 |
| Swiss-Prot (Release 45 of October 2004) | 796 | 1,005 | 281 | 321 | 1,607 |
| Swiss-Prot (Release 45 of October 2004) | *3,614 | 3,578 | 1,331 | 1,434 | 6,343 |

* Entries annotated as “by similarity”, “potential” or “probable”. This data set is used as the test set in the Discussion section.

Table 2. The statistics of the catalytic kinase-specific phosphorylation sites.

Swiss-Prot (Release 45 of October 2004)

| Catalytic protein kinase | Number of substrate sites | Serine | Threonine | Tyrosine |
|---|---|---|---|---|
| Protein kinase C (PKC) | 81 | 67 | 14 | |
| cAMP-dependent protein kinase (PKA) | 106 | 97 | 9 | |
| Casein kinase II (CKII) | 65 | 55 | 10 | |
| Calmodulin-dependent protein kinase II (CaM-II) | 14 | 14 | 0 | |
| cGMP-dependent protein kinase (PKG) | 7 | 6 | 1 | |
| Casein kinase I (CKI) | 14 | 10 | 4 | |
| Cell division cycle protein kinase p34cdc2 | 47 | 30 | 17 | |
| Mitogen-activated protein kinase (MAPK) | 36 | 21 | 15 | |
| Epidermal growth factor receptor (EGFR) | 10 | | | 10 |
| Tyrosine kinase Src | 14 | | | 14 |
| Insulin receptor (INSR) | 11 | | | 11 |
| Total | 405 | 300 | 70 | 35 |

PhosphoBase

Catalytic protein kinases

Number of substrate sites

Serine

Threonine

Tyrosine

Protein kinase C (PKC)

180

150

30

 

cAMP-dependent protein kinase (PKA)

178

167

11

 

Casein kinase II (CKII)

83

70

13

 

Calmodulin-dependent protein kinase II (CaM-II)

35