Calpain, an intracellular
Ca2+-dependent cysteine protease, is known to play a role in a wide range of metabolic
pathways through limited proteolysis of its substrates. However, only a limited number
of these substrates are currently known, with the exact mechanism of substrate recognition
and cleavage by calpain still largely unknown. While previous research has successfully
applied standard machine-learning algorithms to accurately predict substrate cleavage
by other, similar types of proteases, these approaches do not extend well to calpain,
possibly due to its particular mode of proteolytic action and the limited amount of experimental
data. Through the use of Multiple Kernel Learning, a recent extension to the classic
Support Vector Machine framework, we were able to train complex models based on rich,
heterogeneous feature sets, leading to significantly improved prediction quality (a 6%
improvement over the highest AUC score produced by state-of-the-art methods). In addition to producing
a stronger machine-learning model for the prediction of calpain cleavage, we were
able to highlight the importance and role of each feature of substrate sequences in
defining specificity: primary sequence, secondary structure and solvent accessibility.
Most notably, we showed that significant specificity differences exist across calpain
sub-types, contrary to previous assumptions. Prediction accuracy was further
validated using, as an unbiased test set, mutated sequences of calpastatin
(the endogenous inhibitor of calpain) that were modified to no longer block calpain's proteolytic
action. An online implementation of our prediction tool is available at
http://calpain.org.
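
The core idea behind Multiple Kernel Learning, building one kernel per feature type and learning how much weight each should receive, can be illustrated with a minimal sketch. This is not the solver used in this work: it substitutes a simple kernel-target alignment heuristic for a true MKL optimisation, and every function name, the three toy "views" (standing in for primary sequence, secondary structure and solvent accessibility) and the labels are hypothetical values chosen for illustration only.

```python
# Illustrative sketch only: kernel weighting by kernel-target alignment,
# a simple heuristic stand-in for a full Multiple Kernel Learning solver.
# All names and the toy data below are hypothetical.

def linear_kernel(rows):
    """Gram matrix of dot products between all pairs of feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def frobenius(a, b):
    """Frobenius inner product of two equally sized square matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def alignment(kernel, labels):
    """Cosine similarity between a kernel matrix and the ideal kernel y y^T."""
    target = [[yi * yj for yj in labels] for yi in labels]
    denom = (frobenius(kernel, kernel) * frobenius(target, target)) ** 0.5
    return frobenius(kernel, target) / denom if denom else 0.0

def alignment_weights(kernels, labels):
    """Non-negative weights summing to 1, proportional to each alignment."""
    scores = [max(alignment(k, labels), 0.0) for k in kernels]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(kernels)] * len(kernels)  # fall back to uniform
    return [s / total for s in scores]

def combine(kernels, weights):
    """Weighted sum of kernel matrices: K = sum_m w_m * K_m."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

# Toy labels: +1 = cleaved, -1 = not cleaved (hypothetical).
labels = [1, 1, -1, -1]

# One feature matrix per view; rows are samples (hypothetical values).
sequence_view = [[1, 0], [1, 0], [0, 1], [0, 1]]    # separates the classes
structure_view = [[1, 0], [0, 1], [1, 0], [0, 1]]   # unrelated to the labels
accessibility_view = [[2], [1], [-1], [-2]]         # separates the classes

kernels = [linear_kernel(v)
           for v in (sequence_view, structure_view, accessibility_view)]
weights = alignment_weights(kernels, labels)
combined = combine(kernels, weights)

for name, w in zip(("sequence", "structure", "accessibility"), weights):
    print(f"{name}: {w:.3f}")
```

In this toy setting the uninformative view receives zero weight, which mirrors how learned kernel weights can indicate the relative contribution of each feature type. The combined matrix could then be passed to any SVM implementation that accepts a precomputed kernel.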