DBindR: Prediction of DNA-binding residues in proteins from amino acid
sequences using a random forest model with a hybrid feature
Introduction
Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid over-fitting, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and performs robustly across different parameter values. A novel hybrid feature is presented which incorporates evolutionary information of the amino acid sequence, secondary structure (SS) information and orthogonal binary vector (OBV) information, which reflects the characteristics of the 20 amino acids with respect to two physicochemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the imbalanced datasets by downsizing the majority class.
Results: The RF model achieves 91.41% overall accuracy, with a Matthews correlation coefficient (MCC) of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the best-performing computational approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional structural information. We have demonstrated that the prediction results are useful for understanding protein-DNA interactions.
Input
DBindR is available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm. All CGI scripts for the models are written in Perl 5.8.4, and the interface is designed in HTML. On the DBindR web page, users can copy and paste amino acid sequences in FASTA format and choose the prediction method (either Random Forest or Support Vector Machine). All non-standard characters are excluded from the sequences. Each submitted sequence must be longer than 29 and shorter than 1000 amino acids. An E-mail address is required to receive the results.
The RF and SVM models used for predicting new proteins were constructed from all the data instances in the processed DBP-374 dataset. The RF algorithm is implemented with the randomForest R package (version 4.5-18) (Liaw, 2002), and the SVM algorithm with the e1071 R package (version 1.5-16) (Dimitriadou, et al., 2006).
The program slides a window of length β = 11 along the input sequence, dividing it into segments of amino acids. Each window segment constitutes a sample, and each sample is mapped into a 319-dimensional feature space reflecting the hybrid feature, which combines evolutionary information of the amino acid sequence, the OBVs of the amino acids and the SS information of the amino acids.
DBindR allows predictions for at most 4 protein sequences per run. Users with large batches can send their sequences to us at <js_wu@seu.edu.cn>; we will run the predictions on our local computers and return the results by E-mail.
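The input handling described above (dropping non-standard characters, checking the length limits and sliding a window of length 11) can be illustrated with a minimal Python sketch. This is not the server's Perl code: the function names, the upper-casing of the input and the choice to emit only full windows (so the first and last five residues are not window centres) are assumptions made for illustration, since the treatment of sequence termini is not described here. As a side note, 319 = 11 × 29, which would be consistent with 29 feature values per window position, although that breakdown is not stated in this section.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
WINDOW = 11                                # window length (beta) from the text
MIN_LEN, MAX_LEN = 30, 999                 # "more than 29 and less than 1000"


def prepare(raw: str) -> str:
    """Strip FASTA header lines and non-standard characters, then check length.

    Upper-casing the input is an assumption of this sketch.
    """
    body = "".join(
        line.strip() for line in raw.splitlines()
        if not line.strip().startswith(">")
    )
    seq = "".join(ch for ch in body.upper() if ch in STANDARD_AA)
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError(f"sequence length {len(seq)} is outside 30..999")
    return seq


def windows(seq: str, size: int = WINDOW) -> list[tuple[int, str]]:
    """Return (0-based centre index, segment) for every full window.

    Emitting only full windows is an assumption; padding the termini
    would be an equally plausible choice.
    """
    half = size // 2
    return [(i + half, seq[i:i + size]) for i in range(len(seq) - size + 1)]
```

Each returned segment corresponds to one sample; turning a segment into the 319-dimensional hybrid feature vector would additionally require the evolutionary, OBV and SS encodings, which are outside the scope of this sketch.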
Output
The results are sent by E-mail in a user-friendly format. The web server returns the predicted DNA-binding and non-binding residues along the input sequence and marks them “P” (positive) and “N” (negative). The prediction reliability index (RI) ranges from 0 (lowest) to 10 (highest) for presentation; the higher the RI, the more reliable the prediction.
RI for the RF model is defined as:
Here, F+ is the fraction of the tree votes (FV) for the positive class in each sample. The threshold by which samples are classified according to the FV for the positive class is set to 1/6 in this paper, because the ratio of positive to negative samples for training is 1:5.
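To make the role of the 1/6 threshold concrete, the following Python sketch labels residues from their positive-class vote fractions. It only illustrates the thresholding step and does not compute the RI. The function names are hypothetical, and treating a vote fraction exactly equal to the threshold as negative is an assumption, since the boundary case is not specified here.

```python
def prior_threshold(n_positive: int, n_negative: int) -> float:
    """Fraction of positive samples for a given positive:negative ratio."""
    return n_positive / (n_positive + n_negative)


# A 1:5 training ratio gives 1 / (1 + 5) = 1/6, the value used in the text.
THRESHOLD = prior_threshold(1, 5)


def label_residues(vote_fractions, threshold: float = THRESHOLD) -> str:
    """Mark each residue "P" if its positive vote fraction exceeds the threshold.

    The strict ">" comparison is an assumption of this sketch.
    """
    return "".join("P" if fv > threshold else "N" for fv in vote_fractions)
```

For example, `label_residues([0.0, 0.1, 0.2, 0.9])` labels the first two residues as non-binding and the last two as binding, because only 0.2 and 0.9 exceed 1/6.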
RI for the SVM model is defined as:
Here, D is the output value of the SVM classifier.
An output example
A. Random Forest

B. Support Vector Machine
