The risk of developing coronary artery disease (CAD) varies greatly among individuals in the general population. Clinical variables like LDL cholesterol and systolic blood pressure do not always tell the whole story regarding an individual’s risk of developing CAD.
Past research has shown that a patient's coronary artery calcium (CAC) level is a strong predictor of CAD, as well as of lethal cardiac events such as heart attacks. Identifying markers predictive of high CAC levels can therefore help identify patients who are at greater risk and prevent accelerated progression of heart disease, especially at an early age.
How can one identify markers that predict which individuals are at high risk of advanced CAC? With the recent advances in genomics, one possible route is to use genomic information from a pool of patients that includes two subgroups representing the two extremes of the phenotypic distribution in the general population (i.e., no disease vs. advanced disease).
Single nucleotide polymorphisms (SNPs) represent a particularly rich source of genetic variation (about 10 million SNPs are present in the human genome), making them ideal for establishing links between genetic variation and complex diseases. A major challenge in building predictive models of complex diseases is their multifactorial nature, which involves interactions between several genes.
Recently, there has been increasing interest in applying machine learning tools to disease prediction. These methods make it easier to integrate multiple data sources (e.g., clinical, genotypic, and transcriptomic) while exploiting potential linear and non-linear interactions between disease predictors.
To this end, we integrated clinical data and SNP genotype data into machine learning models to identify SNPs that are predictive of advanced CAC levels. We found 56 highly predictive SNPs in a discovery cohort, which were then tested in an independent replication cohort.
The two cohorts, drawn from ClinSeq® and the Framingham Heart Study, were composed of middle-aged Caucasian men because of their higher risk of advanced CAC in comparison with the rest of the population in the United States. The two extremes of the CAC distribution were equally represented in both cohorts (i.e., no CAC vs. extremely high levels of CAC).
Twenty-one of the 56 SNPs identified from the discovery cohort generated optimal predictive performance in both cohorts with two machine-learning-based modeling approaches, namely random forests and neural networks. When we tested these SNPs with patients who had intermediate CAC levels, the predictive performance dropped significantly. Hence, the high performance was specific to advanced CAC.
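For readers curious what this kind of analysis can look like in code, here is a minimal sketch using scikit-learn. It is not the pipeline used in the study. The data are randomly generated stand-ins, the 0/1/2 genotype coding, the two clinical covariates, the importance-based ranking, and the AUC metric are all assumptions made for illustration, and a simple held-out split stands in for an independent replication cohort.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT study data): 200 subjects, 56 SNPs coded as
# minor-allele counts (0, 1, or 2), plus two made-up clinical covariates.
n_subjects, n_snps = 200, 56
snps = rng.integers(0, 3, size=(n_subjects, n_snps))
clinical = rng.normal(size=(n_subjects, 2))

# Hypothetical outcome (1 = advanced CAC, 0 = no CAC), driven here by the
# first five SNPs plus noise so that the two classes are equally represented.
signal = snps[:, :5].sum(axis=1) + rng.normal(size=n_subjects)
y = (signal > np.median(signal)).astype(int)

# Combine clinical and genotype features into one design matrix.
X = np.hstack([clinical, snps])
feature_names = ["clin_0", "clin_1"] + [f"snp_{i}" for i in range(n_snps)]

# The held-out half plays the role of a replication set in this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Rank features by impurity-based importance and keep the top 21.
order = np.argsort(forest.feature_importances_)[::-1]
top_features = [feature_names[i] for i in order[:21]]

# Score the held-out subjects with the area under the ROC curve.
auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
print("top-ranked features:", top_features[:5])
print(f"held-out AUC: {auc:.2f}")
```

With real data, the genotype matrix and outcome labels would come from the cohorts, and the ranking and evaluation choices would need to follow the study's own methods rather than the defaults used here.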
Finally, we used the GeneMANIA database to create a functional interaction network composed of the genes on which the 21 SNPs in the optimal subset were located, as well as additional genes previously reported to interact with these genes. Several genes involved in the production and inhibition of reactive oxygen species (a major driver of CAC and vascular aging) were present in this network.
In summary, our results showed that machine learning tools hold promise for deriving predictive disease models and networks. These tools are likely to play increasing roles in personalized medicine by helping practitioners design optimal treatment strategies and identify potential drug targets using genomic data.
Disclaimer: The views expressed in this blog post are those of the author and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; National Human Genome Research Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.