The Variant Calling Problem

In bioinformatics, particularly in the subfield of oncology in which I work, we're often tasked with the issue of identifying variants in a genomic sequence. What this means is that we have a sample sequence along with a reference sequence, and we want to identify regions where the sample differs from the reference. These regions are called variants, and they can often be clinically relevant, potentially indicating an oncogenic mutation or some other telling clinical marker.

In general, there are two types of variant calling that one may be interested in performing. The first of these is called germline variant calling. In this type, we use a reference genome that is an accepted standard for the species in question. There exists more than one acceptable choice for a reference genome, but one of the current standards for humans is called GRCh38. This stands for Genome Reference Consortium human and 38 is an identifying version number. Note the h is used to distinguish the human genome from other genomes that the GRC puts out such as GRCm (mouse genome).

The second type is called somatic variant calling. This method uses two samples from a single individual and compares the sequence of one with the other. In effect, the patients own genome serves as the reference. This is the method commonly used in oncology, as one sample can be taken from a tumor and another from some non-mutated cell such as the bloodline. This process is also sometimes called a tumor-normal study due to the fact that one sample is taken from the tumor and one from a normal cell.

There are a few additional things to note in the discussion above. The first is that in germline calling we're uncovering variants that are being passed from parent to child via the germ cells, i.e. the sperm and eggs. In somatic calling, we're identifying somatic mutations and variants, those that occur within the body over the course of a lifetime, whether they are due to environmental factors, errors in DNA transcription and translation, or some other factor.

Types of Variants

In order to identify variants, it will be helpful to develop an ontology of some of the specific classes and types of variants we might expect to see as the output of our variant callers.

Indel - This is one of the simplest types of variant to describe conceptually, but also one of the most difficult to identify via current variant calling methods. An indel is either an insertion or deletion of a single base at some point in the DNA sequence. The reason these are difficult to identify is that since it is not known in advance at which point in the DNA the variant occurs, it can throw off the alignment with the reference sequence.

SNP - This is short for single nucleotide polymorphism and it refers to a population-level variant in which a single base differs between the sample and reference sequences. By population-level, we mean that this variant is extremely common across specific populations. For example, an SNP might occur when comparing the genomes of Caucasian and Chinese people within a gene coding for skin tone.

SNV - This stands for single-nucleotide variant and is the same as an SNP apart from the fact that it indicates a novel mutation present in the genomes of a few rather than being widespread across a population.

Structural Variant - A structural variant refers to a larger-scale variant in the genome, typically occupying at least 1000 base pairs. These can exhibit a number of different behaviors. For example, a >1kb region of the DNA can be duplicated and reinserted or it can be deleted. These types of structural variation are commonly referred to as copy number variants (CNVs). A partial sequence can also be reversed (called inversion). This image sums it up well

Haplotype Phasing

Before getting into the nitty gritty of variant calling, it will be helpful to describe a related process called haplotype phasing. A haplotype is the set of genetic information associated with a single chromosome. The human genome is diploid, meaning it consists of pairs of chromosomes for which each has one part of its genetic information inherited from the mother and the other part from the father. The combined genetic information resulting from considering these pairs of chromosomes as single units is called the genotype. However, having access to the haplotype can be crucial because it can provide information crucial to identifying disease-causing variants (either SNPs or SNVs) in the genome. The issue is that current sequencing methods don't give us haplotype information for free as often the reads produced cannot be separated into their individual male and female loci. Therefore, we need to use statistical algorithms to piece together what we can about the haplotype sequences after the fact. This is where haplotype phasing comes into play.

Some Preliminaries and Associated Challenges

To start with, we need to establish some background with regards to population-level frequencies of genetic variation so that we can begin to detect deviations from that standard. What may be surprising is that while the human genome is diploid, it varies from the reference at approximately 0.1% of the bases. That means that between any two individuals, we should expect to see at least 99.9% shared genetic information. It is at the remaining 0.1% of sites that we direct our attention when we're looking to tease out the haplotypes. For reference, the condition of sharing genetic information is often called homology.

The teasing out of haplotypes is complicated by the fact that at times genetic information is passed from parent to child in a process known as recombination. Recombination happens when instead of passing down just one chromosome from its diploid pair, a parent passes down a combination of the two. At the biological level, this process occurs during meiosis. The first phase of meiosis involves an alignment of homologous chromosome pairs between the mother and father. This process involves a stage in which the chromosomal arms temporarily overlap, causing a crossover which may (or may not) result in fusion at that locus at which the point of crossover occurs. This results in a recombination of genetic information which is then passed down to the child. One benefit to this process is that it encourages genetic diversity, creating genetic patterns in the child that appeared in neither parent. Unfortunately, it also makes haplotype phasing that much more difficult.

DNA sequencing gives partial haplotype information. It produces sequencing reads which are partial strings of the sequence up to 1000 base pairs. These reads all come from the same chromosome, but they represent only a small slice of the haplotype because the entire chromosome is on the order of 50 million to 250 million base pairs.

Advantages of Knowing the Haplotype

Hopefully it's clear that having access to the haplotype for any given individual is strictly superior to simply having the genotype. This is because, given access to a person's haplotype, we can simply combine that information in a procedural way to yield their genotype. The same cannot be said of the reverse. What's more, having the haplotype allows us to compute statistics about various biological properties that would otherwise be unavailable to use. For example, if we have haplotypes for a series of individuals in a population, we can check at which loci recombination is most likely to occur, what the frequency of that recombination is, etc. It suffices to say that having access to the haplotype is highly advantageous as a downstream input to variant calling methods, and I'll examine some such methods that make use of this information in upcoming posts.

The DeepVariant Caller

Now that we have an understanding of haplotypes and some of the different ways in which variants might arise, I'd like to introduce one of the most successful models for variant calling created to-date: Google's DeepVariant. This model is one of the simplest to introduce in that it uses virtually no specialized knowledge of genomics in terms of encoding input features. Surprisingly, this has no detrimental effect on its accuracy, as it outperformed virtually every other variant caller on the market at the time of its release. The approach it uses is somewhat novel so I'll introduce that here, and I'll compare it in subsequent posts with the methodology behind other variant callers.

How it Works

DeepVariant is unique in that it doesn't operate on the textual sequence information of the aligned and reference genomic reads. Rather, it creates something called a pileup image from the alignment information. A pileup image shows the sequenced reads from the sample aligned with the reference. Each base is assigned an RGB color, and possible variants within the reads as compared to the reference are highlighted accordingly. Here's an example of what a pileup looks like.
The DeepVariant model is trained on a large dataset of such images in which variants have already been labeled. It can then be applied to new alignments and samples by generating their pileup using a program such as SAMTOOLS. Here's how the authors from Google structured their model.
As you can see from the diagram, their workflow proceeds as follows:
  1. Identifying Candidate Variants and Generating Pileup: Candidate variants are identified according to a simple procedure. The reads are aligned to the reference, and in each case the CIGAR string for the read is generated. Based on how that read compares to the reference at the point which it is aligned, it is classified as either a match, an SNV, an insertion, or a deletion. If the number of reads which differ at that base pass a pre-defined threshold, then that site is identified as a potential variant. The authors also apply some preprocessing to ensure that the reads under consideration are of high enough quality and are aligned properly in order to be considered. After candidate variants have been identified, the pileup image is generated. The notable points here are that candidate variants are colored and areas with excessive coverage are downsampled prior to image generation.
  2. The model is trained on the data, then run during inference stages. They use a CNN, specifically the Inception v2 architecture, which takes in a 299x299 input pileup image and uses an output softmax layer to classify into one of hom-ref, het, or hom-alt. They use SGD to train with a batch size of 32 and RMS decay set to 0.9. They initialize the CNN to using the weights from one of the ImageNet models.
The authors find that their model generalizes well, and can be used with no statistically significant loss in accuracy on reads aligned to a reference different from the one on which DeepVariant was trained. They also stumble upon another surprising result, namely that their model successfully calls variants on the mouse genome, opening up the model's to a wide diversity of species.


I hope this post shed some light on how variant calling works and elucidates the biological underpinnings of the process well enough for newcomers to the field to start diving into some papers. In future posts, I'll expand on some of the other popular variant calling models as well as introduce algorithms for related processes, such as the haplotype phasing discussed earlier.