How to perform variant calling in whole genome sequencing?
29 May 2025
Understanding Variant Calling
Whole genome sequencing (WGS) is a powerful tool in genomics that provides comprehensive insights into the genetic makeup of an organism. One of the critical processes in WGS is variant calling, which involves identifying mutations or variations in the genome. These variations could be single nucleotide polymorphisms (SNPs), insertions, deletions, copy number variations, or structural variants. Accurate variant calling is essential for applications such as disease research, population genetics, and personalized medicine.
Preparing the Data
Before diving into variant calling, it's crucial to ensure the quality of the sequencing data. Start by obtaining high-quality raw sequencing reads, typically in FASTQ format. Quality control (QC) is the first step, involving tools like FastQC to assess the overall quality of the data. Metrics such as read length distribution, quality scores, and adapter contamination are evaluated. Based on the QC results, trimming tools like Trimmomatic or Cutadapt can be used to remove low-quality bases and adapters, ensuring cleaner input data for the subsequent steps.
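To make the quality scores concrete, here is a minimal Python sketch of how Phred scores are decoded from a FASTQ quality line and how a simple trailing-quality trim works. It assumes Phred+33 encoding. The function names and the threshold of 20 are illustrative choices, not taken from any particular tool, and dedicated trimmers use more refined strategies than this.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(sequence, quality_line, min_quality=20):
    """Drop trailing bases whose Phred score is below min_quality (illustrative threshold)."""
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


def mean_quality(quality_line):
    """Average Phred score of a read; 0.0 for an empty read."""
    scores = phred_scores(quality_line)
    return sum(scores) / len(scores) if scores else 0.0


# Toy read: in Phred+33, 'I' encodes Q40 and '#' encodes Q2.
seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
print(seq, qual, mean_quality(qual))
```

A read whose mean quality is still low after trimming, or that becomes very short, would typically be discarded before alignment.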
Aligning the Reads
The next step is aligning the sequencing reads to a reference genome, such as the human assembly GRCh38. Alignment assigns each read a mapped position, which is what makes variant detection possible. Popular aligners include BWA (Burrows-Wheeler Aligner), whose BWA-MEM algorithm is widely used for short-read WGS, and Bowtie2. The aligner's scoring penalties for mismatches and gaps determine how reads that differ from the reference are placed, so they are usually left at well-tested defaults unless there is a specific reason to change them. The result is a SAM file, usually converted to its compressed binary form, BAM, which serves as the basis for variant calling.
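As a rough sketch of what this step looks like in practice, the Python snippet below assembles a BWA-MEM command piped into samtools sort. It only builds and prints the commands; it does not run them. The file names, sample name, and thread count are placeholders, and the reference is assumed to have been indexed beforehand with bwa index.

```python
import shlex


def alignment_pipeline(reference, fastq_r1, fastq_r2, sample, out_bam, threads=4):
    """Build argument lists for `bwa mem ... | samtools sort ...` (paired-end reads)."""
    # Read group header; bwa expects literal "\t" separators in the -R string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    bwa_cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group,
               reference, fastq_r1, fastq_r2]
    # "-" tells samtools sort to read the alignments from standard input.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa_cmd, sort_cmd


# Placeholder inputs, for illustration only.
bwa_cmd, sort_cmd = alignment_pipeline(
    "GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample1", "sample1.sorted.bam"
)
print(shlex.join(bwa_cmd), "|", shlex.join(sort_cmd))
```

Setting a read group at alignment time matters because downstream GATK tools expect sample information in the BAM header.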
Processing Aligned Reads
Once the reads are aligned, further processing is needed to improve variant calling accuracy: sorting the alignments by coordinate, marking duplicates, and recalibrating base quality scores. Picard's MarkDuplicates (also bundled with GATK4) flags PCR duplicates, which are copies of the same original fragment and can inflate the apparent support for a variant. GATK (Genome Analysis Toolkit) is commonly used for base quality score recalibration (BQSR), which uses sets of known variant sites to model systematic errors in the reported base qualities and adjusts them accordingly. These steps make the downstream variant calls more reliable.
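The idea behind duplicate marking can be illustrated with a toy Python example. This is a deliberately simplified sketch, not Picard's actual algorithm: it assumes single-end reads described by hypothetical dictionary fields, groups them by chromosome, start position, and strand, and keeps the read with the highest summed base quality in each group. Real tools also account for read pairs, clipping, and optical duplicates.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    reads: iterable of dicts with the (hypothetical) keys
    name, chrom, pos, strand, qual_sum.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["pos"], read["strand"])].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest summed base quality as the representative.
        best = max(group, key=lambda r: r["qual_sum"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


toy_reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 300},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 350},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "-", "qual_sum": 200},
]
print(sorted(mark_duplicates(toy_reads)))
```

Duplicates are flagged rather than deleted, so the information remains in the BAM file while variant callers can ignore the flagged reads.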
Variant Calling
With the processed BAM files in hand, variant calling can be performed using software designed to identify genetic variants. GATK is one of the most popular frameworks, and its HaplotypeCaller calls SNPs and small indels by locally reassembling reads in regions that show signs of variation. The older UnifiedGenotyper is deprecated and was removed in GATK4. The caller writes a VCF (Variant Call Format) file that lists the identified variants along with their genomic coordinates and quality metrics. FreeBayes and BCFtools (bcftools mpileup followed by bcftools call, which replaced the older SAMtools mpileup workflow) are alternatives, each with its own algorithm and trade-offs. Copy number and structural variants generally require dedicated callers.
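For orientation, a basic GATK4 HaplotypeCaller invocation can be assembled as in the Python sketch below. The function only constructs and prints the command. The file names are placeholders, and the two optional switches (GVCF output for later joint genotyping, and restricting the run to an interval) are shown as examples of common options rather than a recommended configuration.

```python
import shlex


def haplotypecaller_command(reference, bam, output, gvcf=False, intervals=None):
    """Build the argument list for a GATK4 HaplotypeCaller run."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    if gvcf:
        # Emit a per-sample GVCF, typically used for joint genotyping of cohorts.
        cmd += ["-ERC", "GVCF"]
    if intervals:
        # Restrict calling to a region, e.g. a single chromosome.
        cmd += ["-L", intervals]
    return cmd


# Placeholder inputs, for illustration only.
cmd = haplotypecaller_command(
    "GRCh38.fa", "sample1.recal.bam", "sample1.g.vcf.gz", gvcf=True, intervals="chr20"
)
print(shlex.join(cmd))
```

Splitting the run by interval is a common way to parallelize calling across a whole genome.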
Filtering and Annotation
After variant calling, the resulting VCF file needs to be filtered to retain high-confidence variants while discarding those likely to be artifacts. Filtering criteria may include read depth, quality scores, and variant frequency. Tools like GATK’s VariantFiltration can assist in this process. Following filtering, variant annotation is performed to provide functional insights into the identified variants. ANNOVAR and SnpEff are popular tools for annotating variants, offering information on gene associations, predicted impacts, and known disease associations.
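The filtering logic can be sketched in a few lines of Python operating directly on VCF data lines. This is a toy hard filter on the QUAL column and the DP (depth) entry of the INFO column. The thresholds are arbitrary placeholders rather than recommended values, and the function names are made up for this example.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=25;AF=0.5' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag entries carry no value
    return info


def passes_hard_filter(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the (illustrative) QUAL and DP thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]              # column 6: QUAL
    info = parse_info(fields[7])  # column 8: INFO
    if qual == ".":
        return False              # missing quality: treat as failing
    depth = int(info.get("DP", 0))
    return float(qual) >= min_qual and depth >= min_depth


records = [
    "chr1\t1000\t.\tA\tG\t50.0\t.\tDP=25;AF=0.5",
    "chr1\t2000\t.\tC\tT\t12.3\t.\tDP=40",
    "chr2\t3000\t.\tG\tA\t99\t.\tDP=4",
]
kept = [line for line in records if passes_hard_filter(line)]
print(len(kept), "of", len(records), "records kept")
```

In a real pipeline the same kind of rule would be expressed as a filter expression in a tool such as VariantFiltration, which annotates the FILTER column instead of dropping records.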
Validation and Interpretation
Variant calling is a computational process that benefits from validation and careful interpretation. Validation methods can include replicating variant calls using different pipelines or experimental techniques such as Sanger sequencing. Interpretation involves understanding the biological relevance of the variants, which often requires integrating additional data sources such as population frequency databases or functional genomics datasets. The ultimate goal is to derive meaningful conclusions from the variant data, which can inform research and clinical outcomes.
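One simple way to compare calls from two pipelines is to measure how many variants they share. The Python sketch below does this by exact matching on chromosome, position, reference allele, and alternate allele. That is a simplifying assumption: the same variant can be written in more than one way, so real comparisons normally normalize the calls first. The function names are hypothetical.

```python
def variant_key(vcf_line):
    """Identify a variant by (CHROM, POS, REF, ALT) from a VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def concordance(calls_a, calls_b):
    """Summarize the overlap between two call sets given as lists of VCF lines."""
    a = {variant_key(line) for line in calls_a if not line.startswith("#")}
    b = {variant_key(line) for line in calls_b if not line.startswith("#")}
    shared, union = a & b, a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


pipeline_a = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t1000\t.\tA\tG",
    "chr1\t2000\t.\tC\tT",
    "chr2\t3000\t.\tG\tA",
]
pipeline_b = [
    "chr1\t1000\t.\tA\tG",
    "chr2\t3000\t.\tG\tA",
    "chr3\t4000\t.\tT\tC",
]
print(concordance(pipeline_a, pipeline_b))
```

Variants found by only one pipeline are natural candidates for closer inspection or experimental confirmation.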
Conclusion
Variant calling in whole genome sequencing is a multi-step process that combines data preparation, alignment, processing, calling, filtering, and interpretation. Each step is crucial for ensuring accuracy and reliability in identifying genetic variants. With the constant evolution of sequencing technologies and computational methods, researchers have access to increasingly sophisticated tools to enhance variant calling, paving the way for breakthroughs in genomics and personalized medicine.