Coronaviruses are widely distributed in nature and belong to the family Coronaviridae. They are enveloped RNA viruses with single-stranded, positive-sense genomes. The International Committee on Taxonomy of Viruses (ICTV) classifies coronaviruses into four genera: α, β, γ, and δ. Seven coronaviruses are known to infect humans: human coronavirus (HCoV) 229E, OC43, NL63, and HKU1; the severe acute respiratory syndrome coronaviruses SARS-CoV and SARS-CoV-2; and Middle East respiratory syndrome coronavirus (MERS-CoV). SARS-CoV, MERS-CoV, and SARS-CoV-2 are highly contagious and have caused pandemics or serious epidemics this century.
When an animal-origin coronavirus carrying novel viral antigens crosses into humans, the resulting pandemic can cause severe economic and societal damage. Bats are the natural reservoir of coronaviruses, and these pathogens are transmitted to humans through intermediate hosts such as civets and dromedaries. The intermediate host of SARS-CoV-2 remains unclear, although pangolins are strongly suspected. Coronaviruses can cross the species barrier and infect humans through point mutations and genome recombination. With the rapid development of sequencing technology and substantial investment in disease surveillance, genome data of animal coronaviruses will become available on a large scale. A model that predicts the pandemic risk of animal-origin coronaviruses is therefore needed and would support the prevention and control of infectious diseases by providing early warning.
Deep learning has developed rapidly in recent years and has transformed application fields such as speech recognition, image understanding, and natural language processing. A recurrent neural network (RNN) is a neural network designed to process sequence data and can capture the inherent characteristics of time series. Because genomes are also long chains built from a four-letter alphabet, RNNs can extract features from biological sequences and may be used to predict the human-infection phenotype of coronaviruses. Although deep learning has many applications in biology and medicine, coronavirus genome data should be preprocessed so that the network design is well founded. The spike protein is the most important surface membrane protein of coronaviruses; it is responsible for binding to the host cell membrane receptor and for membrane fusion, and it plays a very important role in cross-species infection. The adaptation of other viral proteins to the internal environment of a new host also affects viral replication. These facts need to be considered when modeling viral infection, and artificial genome data should be used to increase the weight of the spike protein and build a robust model.
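To make the idea of a genome as a chain over a four-letter alphabet concrete, the following minimal sketch shows one common way to turn a nucleotide string into numeric input for a sequence model. The one-hot scheme, the function name `one_hot_encode`, and the handling of ambiguous bases are illustrative assumptions; the text does not state which encoding was actually used.

```python
# Illustrative sketch only: the encoding scheme and names are assumptions,
# not details taken from the described model.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Map each base to a 4-element vector in A, C, G, T order.

    RNA-style U is treated as T. Any other symbol (for example the
    ambiguity code N) becomes an all-zero vector.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper().replace("U", "T"):
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


# A short toy fragment, not a real viral sequence.
example = one_hot_encode("ACGUN")
```

Each position becomes one time step with four channels, which is the shape that recurrent and one-dimensional convolutional layers typically expect.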
We constructed a prediction model, named CCSI-DL. The model combines a bidirectional gated recurrent unit (GRU) with a one-dimensional convolution and takes the coronavirus genome sequence as direct input to predict the pandemic risk associated with human infection. We trained and tested the CCSI-DL model on single- and multi-group coronavirus genome data and achieved good performance: an area under the receiver operating characteristic curve (AUROC) of 1 and an area under the precision-recall curve (AUPR) of 1. Retraining experiments showed that the model has good transfer-learning capability, and artificial negative data with genome recombination in the spike protein coding region were correctly predicted. Moreover, we applied the tool to genome data of SARS-CoV-2 variants from Brazil, the United Kingdom, South Africa, and India and achieved 100% predictive accuracy.
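For readers less familiar with the two reported metrics, the sketch below computes them from labels and predicted scores in plain Python. It is an illustration under stated assumptions, not the evaluation code behind the reported results: AUROC is computed as the pairwise ranking probability, and AUPR is approximated by average precision, which is only one of several common estimators. The function names and the toy labels and scores are hypothetical.

```python
# Illustrative sketch only: function names and toy data are assumptions.

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Tied scores count as half a win (the Mann-Whitney formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive appears.

    Tied scores are ranked in arbitrary (sort-stable) order here.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# Toy example: every positive is scored above every negative,
# so both metrics reach their maximum.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.2, 0.1]
perfect_auroc = auroc(toy_labels, toy_scores)
perfect_ap = average_precision(toy_labels, toy_scores)
```

A value of 1 for both metrics means that every positive genome in the test set received a higher score than every negative one.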
The coronavirus genome is about 27–32 kb long. We divided each long viral genome sequence into ten segments to improve the performance of the prediction model. In contrast to traditional machine learning methods, deep learning models learn features from the whole coronavirus genome and robustly predict the risk of cross-species viral infection. Although the end-to-end model makes feature extraction easy and model construction flexible, the interpretability of its predictions needs further development, which would improve our understanding of the mechanisms of cross-species coronavirus infection.
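The segmentation step can be sketched as follows. The text says only that each genome was divided into ten segments, so the choice of contiguous, non-overlapping, near-equal-length pieces is an assumption made here for illustration, as are the function name `split_into_segments` and the placeholder sequence.

```python
# Illustrative sketch only: the splitting rule and names are assumptions.

def split_into_segments(sequence, n_segments=10):
    """Split a sequence into contiguous pieces whose lengths differ by at most one."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    base, extra = divmod(len(sequence), n_segments)
    segments = []
    start = 0
    for i in range(n_segments):
        # The first `extra` segments each absorb one leftover base.
        length = base + (1 if i < extra else 0)
        segments.append(sequence[start:start + length])
        start += length
    return segments


# Toy 30,000-nt placeholder within the 27-32 kb range, not a real genome.
genome = "ACGT" * 7500
segments = split_into_segments(genome)
```

Shorter segments keep each recurrent pass to a few thousand time steps, which is one plausible reason such a split could help a GRU-based model on 30 kb inputs.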