Genetics and Machine Learning

Note from the editor: This article originally appeared in the DNA Decoder, a quarterly journal published by students for students. The magazine is intended to help students from around the country connect with one another and discuss fascinating genetics-related ideas.


By Mahir Jethanandani


The human genome contains over three billion base pairs of genetic information, enough to fill more than 200 New York City telephone books (each averaging 1000 pages) if written out [1].


Working with data as massive as the human genome requires cutting-edge technology, both to sequence the genome and to analyze what makes these data so fascinating.


The human genome is not just huge but also incredibly complicated: it contains around 20,000 genes and even more regions that regulate how those genes are expressed.


Small differences in these genes and regulatory regions are what ultimately distinguish each of us (and, unfortunately, sometimes result in disease).


These small differences are difficult to identify, especially when they act in combination with one another.


Although the Human Genome Project provided a wealth of knowledge about the genetic material that makes up humans, more than a decade later scientists are still working to uncover the links between genotype and phenotype.




Machine learning is an increasingly popular technology for identifying patterns and relationships in massive datasets. In general, machine learning is a type of artificial intelligence in which computers are trained to improve their performance on a task, or to “learn” on their own, given a starting dataset in which they can discover relevant patterns.


A well-known example of machine learning is IBM’s Watson computer, which beat even the greatest human contestants on Jeopardy! [2] Machine learning has a wide range of applications in today’s society, and one of the most fascinating is finding patterns in personal genetic data.




In the context of personal genomics (the study of an individual’s unique set of DNA), machine learning can help uncover, in a more automated way, how small differences in genes and regulatory regions lead to phenotypic changes (in traits, wellness, and health).


Knowing which genetic variants are commonly shared among people with a trait of interest, such as diabetes or hemophilia, allows computer scientists to use machine learning to pinpoint where in the genome (and perhaps why) these conditions arise.


Entire companies and research departments around the world are using machine learning in the hope of discovering common associations between people’s DNA and their traits or illnesses.


By looking for genetic patterns among people with similar medical conditions, machine learning can help us find the underlying genetic causes of particular diseases.
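As a rough illustration of looking for variants shared among people with a condition, the sketch below compares how often each variant appears in an affected group versus an unaffected group. The variant names, the tiny dataset, and the helper names `carrier_frequency` and `carrier_frequency_gap` are all hypothetical, and a simple frequency gap is a stand-in here, not a machine learning model; real studies use far larger cohorts and proper statistical tests.

```python
# Toy data (made up): each person maps a variant ID to True if they carry it.
affected = [
    {"var_A": True, "var_B": False, "var_C": True},
    {"var_A": True, "var_B": True, "var_C": False},
    {"var_A": True, "var_B": False, "var_C": False},
]
unaffected = [
    {"var_A": False, "var_B": True, "var_C": False},
    {"var_A": False, "var_B": False, "var_C": True},
    {"var_A": True, "var_B": True, "var_C": False},
]


def carrier_frequency(people, variant):
    """Fraction of people in the group who carry the variant."""
    return sum(person[variant] for person in people) / len(people)


def carrier_frequency_gap(cases, controls):
    """Rank variants by how much more common they are among cases."""
    variants = cases[0].keys()
    gaps = {
        v: carrier_frequency(cases, v) - carrier_frequency(controls, v)
        for v in variants
    }
    # Largest positive gap first: most enriched among affected people.
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = carrier_frequency_gap(affected, unaffected)
for variant, gap in ranked:
    print(f"{variant}: {gap:+.2f}")
```

In this made-up dataset `var_A` is carried by every affected person and only one unaffected person, so it ranks first. A variant that stands out like this is only a candidate for follow-up, not proof of a cause.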


Machine learning is responsible for many new discoveries about the human genome. Unsupervised learning, for example, can cluster genes based on their expression in cells and tissues, which helps relate genotypic patterns to phenotypic ones.
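To make the idea of clustering genes by expression concrete, here is a minimal sketch using k-means, one common unsupervised method (the article does not name a specific algorithm, so this choice is an assumption). The gene names and expression values are invented, and the function names are hypothetical.

```python
import random

# Made-up expression levels for six hypothetical genes across three tissues.
expression = {
    "gene_1": [9.0, 8.5, 1.0],
    "gene_2": [8.8, 9.1, 0.7],
    "gene_3": [1.2, 0.9, 7.9],
    "gene_4": [0.8, 1.1, 8.4],
    "gene_5": [9.3, 8.7, 1.3],
    "gene_6": [1.0, 1.4, 8.1],
}


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles):
    """Average expression per tissue across a list of profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def k_means(data, k, iterations=20, seed=0):
    """Group genes into k clusters; no labels are given to the algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(list(data.values()), k)
    labels = {}
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest cluster center.
        labels = {
            gene: min(range(k), key=lambda c: squared_distance(profile, centers[c]))
            for gene, profile in data.items()
        }
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [data[g] for g, label in labels.items() if label == c]
            if members:
                centers[c] = mean_profile(members)
    return labels


labels = k_means(expression, k=2)
for gene, cluster in sorted(labels.items()):
    print(gene, "-> cluster", cluster)
```

The algorithm is never told which genes belong together; genes with similar expression across the tissues simply end up sharing a cluster number.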


Machine learning can also be used to improve sequencing itself. DeepVariant is one such project.




As our knowledge of genetics expands, new problems emerge.


The goal of next-generation sequencing is to minimize the time and resources needed to read a person’s genome.


Machine learning can improve the repetitive task of genome sequencing, especially next-generation sequencing. Current sequencing methods are prone to errors: they can misread sections of DNA, which limits our ability to link genotype to phenotype.
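To illustrate why misread bases matter, and how stacking many reads over the same position helps separate a one-off error from a real variant, here is a toy sketch. The reference, the reads, the 0.2 threshold, and the helper `call_alleles` are all invented for illustration; this is simple counting and is not how DeepVariant or any production variant caller works.

```python
from collections import Counter

reference = "ACGTACGT"

# Made-up reads, assumed to be already aligned to the start of the reference.
# Position 3 (0-based) shows a mix of T and A; one read has a stray C at 2.
reads = [
    "ACGTACGT",
    "ACGAACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACCTACGT",  # the C at position 2 appears in no other read
    "ACGAACGT",
    "ACGTACGT",
    "ACGAACGT",
]


def call_alleles(column, min_fraction=0.2):
    """Keep bases seen in at least min_fraction of the reads at one position."""
    counts = Counter(column)
    return sorted(
        base for base, n in counts.items() if n / len(column) >= min_fraction
    )


# zip(*reads) yields one column of bases per reference position.
calls = [call_alleles(column) for column in zip(*reads)]

# Report positions where the supported bases differ from the reference base.
variants = {
    position: alleles
    for position, (alleles, ref_base) in enumerate(zip(calls, reference))
    if alleles != [ref_base]
}
print(variants)
```

Here the lone C at position 2 falls below the threshold and is discarded, while position 3 is reported because half the reads support A. Real data is far messier, which is where learned models can help.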


In April 2016, the Food and Drug Administration held the PrecisionFDA Truth Challenge, which aimed to reduce the impact of errors in human genome sequencing. [3]


It was there that DeepVariant, Google Research’s solution for next-generation sequencing, was unveiled. DeepVariant went on to win the highest honors for next-generation sequencing technology.


By applying machine learning to sequencing, DeepVariant improved on the Genome Analysis Toolkit (GATK), a prominent genomics tool. [4]


Deep learning frameworks like TensorFlow and PyTorch allow organizations like Verily Life Sciences (the developers of DeepVariant) to improve the speed and accuracy of sequencing without getting bogged down in low-level technical details.


DeepVariant uses deep learning, a subset of machine learning, to improve a computer’s ability to detect patterns in sequencing data.


Such computations are hard to interpret, which makes teaching computer systems to learn “correctly” all the more challenging.




  • Machine learning: a type of artificial intelligence that can be used to detect patterns in data.


  • Unsupervised learning: a type of machine learning that learns from data that has not been explicitly labeled.


  • Genotype: an individual’s unique, heritable genetic material (the term can refer to anything from a single base pair up to the entire genome, the complete set of DNA in a human).


  • Phenotype: an individual’s observable physical characteristic(s) (can be a trait, wellness, or health).


  • Human Genome Project: a global genomics initiative aimed at establishing the first complete human DNA sequence.




DeepVariant and the growing popularity of personal genomics together broaden the uses of machine learning. More importantly, companies are launching “app stores” that let other scientists and genetics enthusiasts study their own genomes in relation to their health and well-being.


Despite numerous obstacles, we as a scientific community are making progress in connecting genotype to phenotype. Many of the patterns that underlie genetic traits have yet to be identified, and machine learning excels at the kind of pattern recognition that can push human ability and knowledge to new heights.




  1. Nova Online: Genome Facts. Last updated 2001.
  2. IBM’s Watson computer takes the Jeopardy! Challenge.
  3. Chin J. Simple Convolutional Neural Network for Genomic Variant Calling with TensorFlow. July 16, 2017; 1-3.
  4. Poplin R, Newburger D, Dijamco J, Nguyen N, Loy D, Gross S, McLean C.Y., DePristo M.A. Creating a universal SNP and small indel variant caller with deep neural networks. bioRxiv. Dec. 14, 2016.

Mahir Jethanandani is a junior at the University of California, Berkeley, studying Computer Science, Statistics, and Economics. He previously worked as a Machine Learning and Bioinformatics Research Intern at the University of California, San Francisco Department of Neurology and Bioinformatics. Mahir also worked as an Engineering Intern at 23andMe, where he learned about personal genomics and how it relates to computer science, machine learning, and bioinformatics. He is the author of “The Immaculate Investor” and “The Balance Sheet of Earth.” Mahir previously worked at Benetech, where he did volunteer work for the United Nations alongside Google. He is from Saratoga, California, and became interested in genetics and bioinformatics following the death of his grandfather.