Protein Function Prediction
We are still far from a complete understanding of the many functions of human proteins, and further still for most other organisms. To help remedy this situation, the Critical Assessment of Functional Annotation (CAFA) initiative was set up so that different teams of scientists can compete in predicting the function of proteins from a wide range of organisms.
Many methods have been developed for this purpose in recent years. They draw on such a wide variety of techniques and underlying data that CAFA produces separate rankings according to different criteria.
For example, some methods make highly accurate predictions for a small number of proteins, whilst others make decent predictions for a large number of them. Similarly, some are best at adding predictions for proteins that already have some functional annotation, whilst others excel for proteins with no previous functional annotation.
Our Method: DomFun
We developed DomFun, a method that predicts the function of a protein from its underlying domain architecture. Instead of considering the protein as a whole, our method dissects it into its constituent domains and makes function predictions for each of these domains individually. These domain-function predictions are then recombined to make protein-level predictions (Figure 1).
Proteins were split into their constituent domains using information from CATH-Gene3D. This comprehensive resource is centred on domain annotation for proteins: it splits protein structures into their domains and classifies these domains into evolutionarily and functionally related families. This allowed us to produce the lists of protein-domain pairs needed to build the tripartite network shown in Figure 1. The protein-function information was obtained from CAFA itself, which takes it from the Gene Ontology.
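As a rough illustration of the data structure involved, the tripartite network can be thought of as two lists of pairs that meet at the protein layer. The sketch below is a minimal Python example under that assumption. The function name `build_tripartite` and all identifiers (`P1`, `D1`, `GO:A`) are made up for illustration and are not taken from DomFun, CATH-Gene3D or CAFA.

```python
from collections import defaultdict

def build_tripartite(protein_domains, protein_functions):
    """Index the two outer layers (domains, functions) by the proteins
    that connect them. Inputs are iterables of (protein, label) pairs."""
    domain_to_proteins = defaultdict(set)
    function_to_proteins = defaultdict(set)
    for protein, domain in protein_domains:
        domain_to_proteins[domain].add(protein)
    for protein, function in protein_functions:
        function_to_proteins[function].add(protein)
    return dict(domain_to_proteins), dict(function_to_proteins)

# Hypothetical toy data: these identifiers are placeholders, not real entries.
protein_domains = [("P1", "D1"), ("P1", "D2"), ("P2", "D1"), ("P3", "D3")]
protein_functions = [("P1", "GO:A"), ("P2", "GO:A"), ("P3", "GO:B")]

domain_to_proteins, function_to_proteins = build_tripartite(
    protein_domains, protein_functions
)

# Proteins linking domain D1 to function GO:A through the middle layer.
shared = domain_to_proteins["D1"] & function_to_proteins["GO:A"]
print(sorted(shared))
```

The number of proteins shared between a domain and a function is the raw evidence from which a domain-function association can later be scored.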
For the domain-level predictions, we used NetAnalyzer, software developed by our group. It is a Ruby gem that analyses multipartite networks to calculate associations between layers. This produced lists of domain-function associations, which were then recombined into scores for the individual proteins using data fusion methods.
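The two steps described here, scoring domain-function associations and then fusing them per protein, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hypergeometric test stands in for the association measure and Fisher's combined probability stands in for the data fusion step. Neither choice is confirmed by the text above, and the code does not reproduce NetAnalyzer's interface. The function names are hypothetical.

```python
from math import comb, exp, factorial, log

def hypergeom_pvalue(shared, n_domain, n_function, n_total):
    """Assumed association measure: probability of seeing at least `shared`
    proteins in common when a domain occurs in n_domain proteins and a
    function annotates n_function proteins, out of n_total proteins."""
    upper = min(n_domain, n_function)
    tail = sum(
        comb(n_function, k) * comb(n_total - n_function, n_domain - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(n_total, n_domain)

def fisher_combine(pvalues):
    """Assumed fusion step: Fisher's combined probability.
    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose upper tail has a closed form for even df."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # this is X / 2
    return exp(-half_stat) * sum(half_stat ** i / factorial(i) for i in range(k))

# Hypothetical protein with two domains, both tested against one function.
p_first = hypergeom_pvalue(shared=2, n_domain=2, n_function=2, n_total=10)
p_second = hypergeom_pvalue(shared=1, n_domain=3, n_function=2, n_total=10)
print(fisher_combine([p_first, p_second]))
```

With a single domain the combined value reduces to that domain's own p-value, so single-domain and multi-domain proteins end up on a comparable scale.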
We compared our scores with those of other CAFA methods using the CAFA scoring system. DomFun performed particularly well on the subset of proteins for which previous knowledge is available, matching the top-performing CAFA methods.
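For readers unfamiliar with CAFA scoring, its main protein-centric metric is Fmax, the maximum F-measure over all score thresholds. The following is a simplified Python sketch of that metric, not the official CAFA evaluation code: it assumes every benchmark protein has at least one true term and it ignores the propagation of terms through the Gene Ontology hierarchy. The function name `fmax` and the toy identifiers are illustrative.

```python
def fmax(predictions, truth, thresholds=None):
    """predictions: {protein: {term: score in [0, 1]}}
    truth: {protein: non-empty set of true terms}"""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at t.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every benchmark protein.
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical toy benchmark with placeholder identifiers.
predictions = {"P1": {"GO:A": 0.9, "GO:B": 0.4}, "P2": {"GO:A": 0.8}}
truth = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:C"}}
print(round(fmax(predictions, truth), 3))
```

Because precision and recall are averaged over different sets of proteins, a method that predicts for only a few proteins can reach high precision yet still be penalised on recall.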
We also developed a novel procedure, which we have named “Pathway Prediction Performance”, to test our method against functional-annotation datasets beyond those used in CAFA. This allowed us to test the performance of our method against not only the Gene Ontology, but also the pathway databases KEGG and Reactome.
Although our methodology currently uses CATH domain annotation exclusively, it can be extended to include other domain annotation resources, such as Pfam and SCOP. It can also be extended to other annotation sources: any source can be used, as long as the tripartite network shown in Figure 1 can be built with it. Future work is needed to investigate how the methodology performs with these other sources of annotation.