Tumor Mutational Burden (TMB) and Computational Scoring
1. Introduction
Tumor mutational burden (TMB) is a quantitative genomic biomarker defined as the total number of somatic, non-synonymous mutations per megabase (Mb) of coding sequence interrogated in a tumor specimen. TMB has emerged as a surrogate measure of immunogenicity in neoplastic tissues across multiple species. The underlying rationale is that a higher number of somatic mutations increases the probability of generating neo-antigenic peptides that can be presented by major histocompatibility complex (MHC) molecules and recognised by host cytotoxic T lymphocytes. In veterinary medicine, TMB assessment is being translated from human oncology into trials for canine and feline hemangiosarcoma, melanoma, and mammary carcinoma, where checkpoint blockade immunotherapies are under investigation.
This article provides a technical review of the biological, sequencing, and computational principles that govern TMB determination. Because TMB scores depend heavily on panel design, sequencing depth, variant calling thresholds, and germline filtering, a standardised computational scoring methodology is essential for inter-study comparability. The discussion deliberately omits commercial platform names and focuses on the generic bioinformatic workflow.
2. Biological Basis of Somatic Mutation Accumulation
Somatic mutations arise from DNA replication errors, endogenous chemical damage (e.g., deamination, oxidative lesions), and exogenous mutagens (e.g., ultraviolet light, aflatoxins, tobacco-specific nitrosamines). The DNA repair systems of the host cell, including base excision repair (BER), nucleotide excision repair (NER), and mismatch repair (MMR), normally correct most lesions. When repair capacity is exceeded or when repair genes themselves are inactivated, mutations become fixed in clonal populations.
Mutational signatures, as defined by trinucleotide context, can point to specific mutagenic processes. For example, a predominance of C > T transitions at dipyrimidine sites is associated with ultraviolet exposure in feline and canine cutaneous squamous cell carcinoma. The presence of T > C transitions alongside microsatellite instability (MSI) is suggestive of MMR deficiency. In veterinary species, no large-scale catalog of mutational signatures yet exists, but the same computational framework (e.g., SigProfiler, MutationalPatterns) can be applied to whole-genome sequencing (WGS) or whole-exome sequencing (WES) data.
The total number of mutations that reach fixation in a tumor is determined by the mutation rate per cell division, the number of cell divisions, and whether each mutation is selectively advantageous or neutral. Passenger mutations accumulate neutrally and constitute the bulk of TMB. Driver mutations, although fewer in number, confer a proliferative advantage and are positively selected.
3. Sequencing-Based TMB Measurement
3.1. Sequencing Modalities
TMB can be estimated from three sequencing modalities:
- Whole-genome sequencing (WGS): Interrogates both coding and non-coding regions. The TMB denominator is the entire genome size (roughly 2.4 to 2.5 Gb for the dog and cat reference assemblies). WGS provides the most comprehensive mutation count but is cost-prohibitive for routine diagnostics.
- Whole-exome sequencing (WES): Targets the approximately 30 Mb of coding exons plus flanking splice sites. WES yields a smaller denominator and better coverage of known genes. Exonic mutations are more likely to be non-synonymous and potentially immunogenic.
- Targeted gene panel sequencing: Anchored to a set of cancer-related genes (e.g., 100 to 500 genes). The panel covers a limited genomic footprint (0.5 Mb to 2.0 Mb). TMB is extrapolated by dividing the total number of non-synonymous somatic mutations by the panel’s effective target size. Accuracy depends on panel breadth and representation of mutation-prone regions.
3.2. Library Preparation and Sequencing Chemistry
Sequencing libraries are prepared from DNA extracted from formalin-fixed, paraffin-embedded (FFPE) or fresh frozen tumor tissue. A matched normal sample (blood or adjacent non-tumor tissue) is essential for filtering out germline polymorphisms. For targeted panels, hybrid capture probes are designed to pull down genomic regions of interest. The captured fragments are then sequenced on a high-throughput sequencer using reversible terminator chemistry (sequencing-by-synthesis) or semiconductor-based detection.
Sequencing depth is a critical parameter. A median coverage of at least 200x to 500x is standard for targeted panels to detect mutations at low variant allele frequencies (VAF). For WES, 100x is typically sufficient, while WGS may be performed at 30x to 60x. Inadequate depth increases the risk of false negatives for subclonal mutations and reduces TMB accuracy.
4. Bioinformatics Pipeline for TMB Scoring
A structured computational pipeline is required to convert raw sequencing reads into a numeric TMB value. The pipeline involves pre-processing, alignment, duplicate marking, and variant calling, followed by stringent filtering.
4.1. Pre-processing and Alignment
Raw FASTQ files are assessed for quality using tools such as FastQC. Adapter sequences and low-quality bases are trimmed with Trimmomatic or cutadapt. The resulting reads are aligned to the reference genome (e.g., CanFam3.1 for dog, Felis_catus_9.0 for cat, GRCg6a for chicken) using BWA-MEM or Bowtie2. Aligned reads are sorted and indexed with SAMtools.
Duplicated reads, which arise from PCR amplification, are marked and excluded using Picard MarkDuplicates. This step is mandatory because duplicate reads inflate confidence in a single molecule, biasing allele frequency estimates.
4.2. Somatic Variant Calling
Somatic single nucleotide variants (SNVs) and small insertions/deletions (indels) are called using dedicated somatic callers such as Mutect2 (from GATK4), Strelka2, or SomaticSniper. These tools compare tumor and matched normal BAM files to identify mutations present only in the tumor.
The callers employ Bayesian or Poisson models to estimate the probability that a given position is truly variant. Key parameters include:
- Minimum tumor VAF: Typically 1% to 5%.
- Minimum depth in tumor: 20 to 50 reads.
- Minimum depth in normal: 10 to 20 reads.
- Minimum base quality: Phred score Q20 or Q30.
A set of filters is applied post-calling:
- Germline filter: Variants present in the normal sample at a VAF > 0.1% are removed.
- Strand bias filter: Variants with highly skewed representation on forward versus reverse strands are excluded.
- Clustered event filter: Variants within 5 base pairs of an indel are excluded because they are often alignment or sequencing artefacts.
- Blacklisted regions: Polymorphic sites and low-complexity regions are masked.
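The thresholds and filters listed above can be combined into a single pass/fail check. The following is a minimal sketch, not the logic of any particular caller: the `Variant` record, its field names, the strand-bias cutoff of 0.9, and the representation of indels and blacklisted regions are all assumptions made for illustration. The default thresholds are taken from the ranges quoted in this section.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    tumor_alt_fwd: int  # variant reads on the forward strand (tumor)
    tumor_alt_rev: int  # variant reads on the reverse strand (tumor)
    tumor_depth: int
    normal_alt: int
    normal_depth: int


def passes_filters(v, indels=frozenset(), blacklist=(),
                   min_tumor_depth=20, min_normal_depth=10,
                   min_tumor_vaf=0.05, max_normal_vaf=0.001,
                   max_strand_fraction=0.9, indel_window=5):
    """Return True if the variant survives all post-calling filters."""
    alt_reads = v.tumor_alt_fwd + v.tumor_alt_rev
    # Depth requirements in tumor and matched normal.
    if v.tumor_depth < min_tumor_depth or v.normal_depth < min_normal_depth:
        return False
    # Minimum tumor VAF.
    if alt_reads / v.tumor_depth < min_tumor_vaf:
        return False
    # Germline filter: variant evidence in the normal sample.
    if v.normal_alt / v.normal_depth > max_normal_vaf:
        return False
    # Strand bias: nearly all variant reads on one strand.
    if max(v.tumor_alt_fwd, v.tumor_alt_rev) / alt_reads > max_strand_fraction:
        return False
    # Clustered events: SNVs close to a called indel, given as (chrom, pos).
    is_snv = len(v.ref) == 1 and len(v.alt) == 1
    if is_snv and any(c == v.chrom and abs(p - v.pos) <= indel_window
                      for c, p in indels):
        return False
    # Blacklisted regions as half-open (chrom, start, end) intervals.
    if any(c == v.chrom and start <= v.pos < end for c, start, end in blacklist):
        return False
    return True
```

Applying the checks in a fixed order keeps the function easy to audit; a production pipeline would more likely record which filter failed rather than return a bare boolean.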
4.3. Annotation and Mutation Type Selection
Variant annotation is performed with SnpEff, VEP (Variant Effect Predictor), or ANNOVAR. Each variant is categorised as:
- Non-synonymous: Missense, nonsense, frameshift, or splice-site altering variants.
- Synonymous: Silent variants that do not alter the amino acid sequence.
- Non-coding: Intronic, intergenic, or untranslated region (UTR) variants.
For standard TMB calculation, only non-synonymous (coding) mutations are counted. Synonymous mutations are excluded because they are less likely to generate novel peptides. Some clinical algorithms also exclude known driver mutations (e.g., BRAF V595E in canine urothelial carcinoma) to avoid inflating TMB, although this remains debated.
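A sketch of the mutation-type selection step is shown below. It assumes that the annotator reports Sequence Ontology consequence terms and that multiple terms for one variant are joined with `&`, as in VEP's VCF output; the set of terms counted as non-synonymous mirrors the four classes named above and is a choice that a given laboratory may define differently (for example, by adding in-frame indels).

```python
# Consequence terms treated as non-synonymous in this sketch:
# missense, nonsense, frameshift, and splice-site altering variants.
NON_SYNONYMOUS_TERMS = {
    "missense_variant",
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}
SYNONYMOUS_TERMS = {"synonymous_variant"}


def classify(consequence: str) -> str:
    """Map an annotated consequence string to one of three categories."""
    terms = set(consequence.split("&"))
    if terms & NON_SYNONYMOUS_TERMS:
        return "non-synonymous"
    if terms & SYNONYMOUS_TERMS:
        return "synonymous"
    return "non-coding"


def count_non_synonymous(consequences) -> int:
    """Count the variants that enter the TMB numerator."""
    return sum(1 for c in consequences if classify(c) == "non-synonymous")


calls = ["missense_variant", "synonymous_variant",
         "splice_donor_variant&intron_variant", "intron_variant"]
print(count_non_synonymous(calls))
```

Checking the non-synonymous set first means that a variant carrying both a splice-site term and an intronic term is counted, which matches the intent of including splice-site altering variants.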
4.4. Normalisation and TMB Value Computation
The raw mutation count is normalised by the effective target size:
\[ \text{TMB (mutations per Mb)} = \frac{\text{Number of somatic non-synonymous mutations}}{\text{Total target size (Mb)}} \]
The target size is the total length of all exons or captured regions after excluding:
- Regions with low coverage (< 100x).
- Regions with poor mapping quality.
- Homopolymer and repetitive tracts.
If the raw count is 50 mutations and the target size is 1.0 Mb, TMB = 50 mutations/Mb.
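The normalisation can be sketched in a few lines. The example assumes that per-region mean depth has already been computed and that each region is supplied as a hypothetical `(start, end, mean_depth)` tuple with half-open coordinates; only the low-coverage exclusion is modelled here, so regions removed for mapping quality or repeat content would need to be dropped beforehand.

```python
def effective_target_mb(regions, min_mean_depth=100):
    """Sum the lengths of adequately covered regions, in megabases.

    `regions` is an iterable of (start, end, mean_depth) tuples using
    half-open coordinates, as in BED files.
    """
    covered_bp = sum(end - start
                     for start, end, mean_depth in regions
                     if mean_depth >= min_mean_depth)
    return covered_bp / 1e6


def tmb(mutation_count, target_mb):
    """Somatic non-synonymous mutations per megabase of target."""
    if target_mb <= 0:
        raise ValueError("target size must be positive")
    return mutation_count / target_mb


# Three captured regions; the third falls below 100x and is excluded.
regions = [(0, 600_000, 250), (0, 400_000, 300), (0, 100_000, 40)]
size_mb = effective_target_mb(regions)
print(size_mb, tmb(50, size_mb))
```

Computing the denominator per sample, rather than using the nominal panel size, keeps poorly covered regions from deflating the reported value.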
4.5. Downstream Correction for Artefacts
FFPE-derived DNA frequently contains cytosine deamination artefacts (observed as C > T and G > A transitions) introduced by formalin fixation and storage. These artefacts can artificially elevate TMB estimates. Computational correction strategies include:
- Utilising a dedicated FFPE-aware variant caller (e.g., Mutect2 with FFPE filtering).
- Removing variants at a VAF below a threshold (e.g., 5%) where artefacts are enriched.
- Applying a mutational signature decomposition to subtract the FFPE signature.
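A simple version of the VAF-threshold strategy is sketched below. It is narrower than the bullet above: rather than removing every low-VAF call, it flags only C > T and G > A substitutions below the threshold, which is an assumption of this example and not a published rule. Variants are represented as hypothetical `(ref, alt, vaf)` tuples.

```python
# Substitutions characteristic of cytosine deamination, on either strand.
DEAMINATION_CHANGES = {("C", "T"), ("G", "A")}


def is_putative_ffpe_artefact(ref, alt, vaf, vaf_threshold=0.05):
    """Flag low-VAF C>T / G>A calls as likely FFPE artefacts."""
    return (ref.upper(), alt.upper()) in DEAMINATION_CHANGES and vaf < vaf_threshold


def split_ffpe_artefacts(variants, vaf_threshold=0.05):
    """Partition (ref, alt, vaf) tuples into kept and flagged lists."""
    kept, flagged = [], []
    for ref, alt, vaf in variants:
        if is_putative_ffpe_artefact(ref, alt, vaf, vaf_threshold):
            flagged.append((ref, alt, vaf))
        else:
            kept.append((ref, alt, vaf))
    return kept, flagged


calls = [("C", "T", 0.03), ("C", "T", 0.20), ("A", "G", 0.03), ("G", "A", 0.01)]
kept, flagged = split_ffpe_artefacts(calls)
print(len(kept), len(flagged))
```

Because genuine subclonal C > T mutations also occur at low VAF, a filter of this kind trades sensitivity for specificity, and the flagged calls are worth reporting separately rather than discarding silently.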
5. Computational Scoring Tools
Several open-source and commercial tools provide standardised TMB pipelines. Selected tools are summarised in Table 1.
Table 1. Representative computational tools for TMB scoring.
| Tool | Input | Variant Caller | Normalisation | Output |
|---|---|---|---|---|
| Neopepsee | BAM | Samtools/BCFtools | Exonic target size | TMB score, neoantigen candidates |
| TMBcal | VCF | Custom wrapper | Panel-specific BED file | Mutation count and TMB |
| PureCN | BAM + SNP array | Mutect2 | Adjusted for purity | TMB with purity correction |
| GSvar | BAM | Strelka2 | BED file from panel | TMB, MSI status, mutational signatures |
| SomaticWrapper | FASTQ | Mutect2, Strelka2 | Exome or panel size | Comprehensive variant report |
The choice of tool depends on data type (targeted panel vs. WES), required sensitivity for low-VAF mutations, and whether tumor purity is known.
6. Tumor Purity and Clonal Heterogeneity
Tumor purity (the fraction of neoplastic cells in the biopsy) directly affects TMB accuracy. In a sample with 20% purity, variant-supporting reads are diluted by reads from normal cells: a clonal heterozygous mutation in a copy-neutral diploid region is expected at a VAF of only about 10%. Subclonal mutations fall to even lower VAFs and may be missed entirely.
Several computational methods estimate purity:
- ESTIMATE: Uses expression data (RNA-seq) to infer stromal and immune content.
- ABSOLUTE: Uses copy number alterations from SNP arrays or sequencing.
- PureCN: Jointly models copy number and B-allele frequency.
When purity is low (below 20%), TMB is likely underestimated. Some algorithms adjust TMB upward by dividing the raw count by purity, but this assumes all mutations are clonal. More conservative approaches report both raw and adjusted TMB with a cautionary flag.
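The effect of purity on observed VAF, and the conservative reporting approach described above, can be sketched as follows. The VAF formula assumes a single mutated copy by default and lets the caller set total copy number in tumor and normal cells; the 20% flag cutoff is taken from the text, while the function and field names are hypothetical.

```python
def expected_clonal_vaf(purity, tumor_copies=2, normal_copies=2, mutated_copies=1):
    """Expected VAF of a clonal mutation given purity and copy number."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    mutant_alleles = purity * mutated_copies
    total_alleles = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant_alleles / total_alleles


def report_tmb(raw_tmb, purity, low_purity_cutoff=0.2):
    """Report raw and purity-adjusted TMB together with a cautionary flag.

    The adjustment divides by purity, which assumes all mutations are clonal.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    return {
        "raw_tmb": raw_tmb,
        "adjusted_tmb": raw_tmb / purity,
        "low_purity_flag": purity < low_purity_cutoff,
    }


print(expected_clonal_vaf(0.2))
print(report_tmb(raw_tmb=5.0, purity=0.5))
```

Returning both values leaves the interpretation to the reader of the report, which is useful because the division by purity can substantially overcorrect when many mutations are subclonal.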
7. Panel Selection and Impact on TMB
The size and composition of a targeted gene panel introduce systematic variability. Small panels (0.5 Mb) have wider confidence intervals because they sample a smaller fraction of the exome. Large panels (1.5 to 2.0 Mb) produce more stable estimates.
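The dependence of precision on panel size follows from counting statistics. The sketch below assumes that the mutation count in a panel is Poisson-distributed with mean equal to the true TMB multiplied by the panel size, and it uses a normal approximation for the interval, which is crude for small counts; both are simplifying assumptions for illustration.

```python
from math import sqrt


def tmb_relative_sd(true_tmb, panel_mb):
    """Coefficient of variation of the TMB estimate under a Poisson model."""
    expected_count = true_tmb * panel_mb
    if expected_count <= 0:
        raise ValueError("expected mutation count must be positive")
    return 1.0 / sqrt(expected_count)


def approx_tmb_interval(count, panel_mb, z=1.96):
    """Approximate 95% interval for TMB from an observed mutation count."""
    half_width = z * sqrt(count)
    lower = max(0.0, count - half_width) / panel_mb
    upper = (count + half_width) / panel_mb
    return lower, upper


# Same underlying TMB of 10 mutations/Mb measured on different panel sizes.
for panel_mb in (0.5, 1.0, 2.0):
    print(panel_mb, round(tmb_relative_sd(10, panel_mb), 3))
```

Under this model, quadrupling the panel size halves the relative uncertainty, which is consistent with larger panels giving more stable estimates near a clinical cutoff.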
Panel content must include genes from diverse mutational contexts. Panels enriched for highly mutated genes (e.g., TP53, KIT) may overestimate TMB if the number of hotspot mutations is high. Conversely, panels concentrating on low-mutation-rate housekeeping genes may underestimate TMB.
A recommended practice is to compute the Spearman correlation between panel-derived TMB and WES-derived TMB in a validation cohort of samples sequenced with both methods. A correlation coefficient (rho) greater than 0.85 can be used as a working threshold for panel reliability.
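For readers who want to see the calculation, the following is a dependency-free sketch of the Spearman coefficient, computed as the Pearson correlation of average ranks so that tied TMB values are handled. The paired values in the example are invented for illustration and are not real cohort data.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rank correlation between two paired sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must be paired and contain at least two values")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired measurements (mutations/Mb) for five samples.
panel_tmb = [2.0, 4.5, 8.0, 12.0, 30.0]
wes_tmb = [1.5, 5.0, 7.0, 15.0, 28.0]
print(spearman_rho(panel_tmb, wes_tmb))
```

Because Spearman's rho depends only on rank order, a high value shows that the panel orders samples like WES does but says nothing about systematic offset, so agreement around the chosen TMB-high cutoff is worth checking separately.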
8. Comparison with Microsatellite Instability
Microsatellite instability (MSI) arises from MMR deficiency and leads to hypermutation, particularly at short tandem repeat (STR) loci. TMB and MSI are correlated but not identical. MMR-deficient tumors typically exhibit TMB values above 10 mutations/Mb and show MSI. However, not all high-TMB tumors are MSI-positive; some arise from polymerase epsilon (POLE) or polymerase delta (POLD1) exonuclease domain mutations.
In veterinary species, MSI testing is less established than in human medicine. Computational assessment of MSI from NGS data can be performed using tools such as MSIsensor or MANTIS, which evaluate the length distributions of STRs between tumor and normal samples.
9. Workflow Diagram
The following Mermaid diagram illustrates the end-to-end TMB scoring workflow.
flowchart TD
A[DNA Extraction] --> B[Library Preparation / Hybrid Capture]
B --> C[High-Throughput Sequencing]
C --> D[FASTQ Pre-processing]
D --> E[Alignment to Reference Genome BWA-MEM]
E --> F[Duplicate Marking Picard]
F --> G[Somatic Variant Calling MuTect2 / Strelka2]
G --> H[Variant Filtering Germline, Strand Bias, VAF]
H --> I[Variant Annotation SnpEff / VEP]
I --> J[Count Non-synonymous Somatic Mutations]
J --> K[Normalisation by Target Size Mb]
K --> L{FFPE Artefacts Detected}
L -->|Yes| M[Apply FFPE Correction]
M --> N[Final TMB Reported as muts/Mb]
L -->|No| N
N --> O[Clinical Interpretation]
10. Sources of Preanalytical and Analytical Variability
Several factors contribute to TMB score variation:
- Sample type: FFPE versus fresh frozen. FFPE libraries have more artefacts and lower yield.
- DNA input: Low-input DNA leads to PCR duplication bias.
- Sequencing depth: Inadequate depth reduces sensitivity for subclonal mutations.
- Variant caller: Different callers have different sensitivity and specificity profiles.
- Germline database: A species-specific germline database (e.g., dog 1000 Genomes) improves filtering. Cross-species databases can introduce errors.
- Panel design: As discussed in Section 7.
In multi-institutional studies, harmonised protocols and centralised bioinformatics are strongly recommended. Proficiency testing using reference cell lines (e.g., with known TMB values) allows cross-platform calibration.
11. Veterinary Applications and Translational Considerations
In veterinary oncology, TMB is being evaluated as a predictive biomarker for response to immune checkpoint inhibitors (anti-PD-1, anti-PD-L1). Canine hemangiosarcoma and melanoma exhibit moderate to high TMB, whereas lymphoma and osteosarcoma have lower TMB. Computational scoring must account for species-specific genome annotations and MHC polymorphism.
One key challenge is the absence of large, publicly available WGS catalogs for common veterinary species. Databases such as the Canine Genome Annotation and the Feline Genome Database are expanding, but germline variant databases remain incomplete. This means that false-positive somatic calls due to rare germline polymorphisms can inflate TMB.
Despite these limitations, the computational framework described above is directly transferable. Veterinary bioinformaticians should adopt human-based best practices while customising variant filters for each species.
12. Future Directions
Ongoing research is focused on:
- Developing species-specific mutational signature catalogs.
- Integrating TMB with RNA-based immune microenvironment metrics (e.g., T-cell receptor repertoire, cytotoxic gene expression).
- Standardising TMB thresholds for immunotherapy trial enrolment (e.g., TMB-high defined as above 10 mutations/Mb in dogs).
- Improving circulating tumor DNA (ctDNA) TMB estimation for non-invasive monitoring.
TMB scoring will remain a computationally intensive discipline that requires rigorous validation, quality control, and transparent reporting.