Getting Started: Haplotype quantification

HLA quantification

Preparation

HLA quantification is carried out in three steps. Before starting, some files have to be downloaded. First, the HLA alleles and the associated XML file have to be downloaded from the IPD-IMGT/HLA database using:

wget ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/hla_gen.fasta
wget ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/xml/hla.xml.zip

The XML file has to be unzipped using:

unzip hla.xml.zip

The vg pangenome index should be downloaded with:

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.xg

A reference genome in FASTA format should also be downloaded. For HLA typing, we do not recommend using a genome that contains ALT contigs. Please use a reference genome without ALT contigs, e.g. a primary assembly from either Ensembl or UCSC.
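A quick way to check a reference for ALT contigs is to inspect its FASTA headers. The sketch below assumes UCSC-style naming, where ALT contig names end in "_alt" (for example chr6_GL000250v2_alt); other sources may name them differently, so treat a count of 0 as a hint rather than proof. The function name count_alt_contigs is hypothetical.

```shell
# Hypothetical helper: count FASTA headers whose contig name ends in "_alt".
# Assumes UCSC-style ALT contig naming; adjust the pattern for other sources.
count_alt_contigs() {
    # $1: path to a FASTA file
    # Keep header lines, strip the description after the first space,
    # then count names that end in "_alt". grep -c prints 0 when nothing matches.
    grep '^>' "$1" | awk '{print $1}' | grep -c '_alt$' || true
}
```

For example, count_alt_contigs reference.fasta prints the number of suspected ALT contigs in the file.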

The following three steps are then carried out. Note that each step depends on the outputs of the previous steps. The file and folder names are only examples, chosen to make haplotype quantification easier to follow.

Generation of haplotype variants

The first step in haplotype quantification is the generation of haplotype variants from the HLA alleles and a reference genome, using the following command:

orthanq candidates hla --alleles hla_gen.fasta --genome reference.fasta --xml hla.xml --output candidate_variants

The above command produces one VCF file per HLA locus (A.vcf, B.vcf, C.vcf, DQB1.vcf) in the candidate_variants folder.
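Since the later steps depend on these files, it can help to confirm that they exist before continuing. The following is a minimal sketch; the function name check_candidate_vcfs is hypothetical, and the locus list is taken from the file names given above.

```shell
# Hypothetical helper: verify that the per-locus VCF files are present and non-empty.
check_candidate_vcfs() {
    dir=$1        # folder passed to --output in the candidates step
    missing=0
    for locus in A B C DQB1; do
        if [ ! -s "$dir/$locus.vcf" ]; then
            echo "missing or empty: $dir/$locus.vcf" >&2
            missing=1
        fi
    done
    # Exit status 0 if all files are present, 1 otherwise.
    return "$missing"
}
```

Usage: check_candidate_vcfs candidate_variants && echo "candidates look complete".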

Preprocessing

The second step is the preprocessing of reads, which includes alignment of the reads to the genome and variant calling by Varlociraptor. First, a BWA index is created from the reference genome. Second, the reads are aligned to the genome using this index, and the reads aligned to HLA loci are extracted and converted to FASTQ. Third, the extracted reads are aligned to the previously downloaded prebuilt pangenome index using vg giraffe. Finally, the aligned BAM file is passed on for further processing of its header, and Varlociraptor is used to call the variants. This step has to be carried out for the locus of interest, using the corresponding VCF file produced during candidate variant generation. The following command does this and produces a BCF file, e.g. for locus A:

orthanq preprocess hla --genome reference.fasta --haplotype-variants candidate_variants/A.vcf --output preprocessing/reads_A.bcf --reads reads_1.fq reads_2.fq --vg-index hprc-v1.0-mc-grch38.xg

Note: To reduce runtime, you can provide a pre-built BWA index using the --bwa-index option.

Note 2: If you already have BWA-aligned reads, you can use the --bam-input parameter to further decrease runtime.
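Because preprocessing is run once per locus, the command above can be wrapped in a loop. The sketch below reuses only the options and example file names shown above; the function name preprocess_all_loci and the ORTHANQ override variable are hypothetical additions for illustration, and creating the output folder beforehand is a precaution rather than a documented requirement.

```shell
# Hypothetical helper: run the preprocessing step for every HLA locus.
# Set ORTHANQ=echo to print the commands instead of running them (dry run).
preprocess_all_loci() {
    reads_1=$1    # first FASTQ file
    reads_2=$2    # second FASTQ file
    orthanq_bin=${ORTHANQ:-orthanq}
    mkdir -p preprocessing
    for locus in A B C DQB1; do
        "$orthanq_bin" preprocess hla \
            --genome reference.fasta \
            --haplotype-variants "candidate_variants/$locus.vcf" \
            --output "preprocessing/reads_$locus.bcf" \
            --reads "$reads_1" "$reads_2" \
            --vg-index hprc-v1.0-mc-grch38.xg || return 1
    done
}
```

Usage: preprocess_all_loci reads_1.fq reads_2.fq. The loop stops at the first locus that fails, so that later steps do not run on incomplete outputs.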

Calling

The third and last step is the quantification of HLA haplotypes, e.g. for locus A:

orthanq call hla --haplotype-variants candidate_variants/A.vcf --output quantification/reads_A.csv --prior diploid --haplotype-calls preprocessing/reads_A.bcf --xml hla.xml

The above command quantifies haplotypes under the diploid prior, which assumes that the sample is a normal, healthy sample. In a clinical context, this is known as HLA typing. However, if haplotype quantification is carried out for a tumor sample, the 'uniform' prior has to be chosen instead.
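The choice of prior can be made explicit in a small wrapper. This is only a sketch: the sample type labels "normal" and "tumor", the function names, and the ORTHANQ override variable are hypothetical, while the options and example file names are the ones used in the calling command above.

```shell
# Hypothetical helper: map a sample type to the prior described above.
prior_for_sample() {
    case $1 in
        normal) echo diploid ;;
        tumor)  echo uniform ;;
        *) echo "unknown sample type: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: run the calling step for one locus.
# Set ORTHANQ=echo to print the command instead of running it (dry run).
call_locus() {
    locus=$1          # e.g. A
    sample_type=$2    # normal or tumor
    prior=$(prior_for_sample "$sample_type") || return 1
    "${ORTHANQ:-orthanq}" call hla \
        --haplotype-variants "candidate_variants/$locus.vcf" \
        --output "quantification/reads_$locus.csv" \
        --prior "$prior" \
        --haplotype-calls "preprocessing/reads_$locus.bcf" \
        --xml hla.xml
}
```

Usage: call_locus A normal for HLA typing of a healthy sample, or call_locus A tumor for a tumor sample.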

Virus Variant Quantification

Generation of haplotype variants

The first step in virus variant quantification is generating haplotype variants. This involves using known virus lineages, strains, or variants alongside a reference genome. Use the following command:

orthanq candidates virus --lineages hla_gen.fasta --genome reference.fasta --output out/candidates.vcf

Preprocessing

The second step involves read alignment to the reference genome and variant calling using Varlociraptor:

orthanq preprocess hla --genome reference.fasta --haplotype-variants out/candidates.vcf --output out/preprocessed.bcf --reads reads_1.fq reads_2.fq

Quantification

In the final step, virus variant quantification is performed based on the haplotype calls:

orthanq call hla --haplotype-variants out/candidates.vcf --prior uniform --haplotype-calls out/preprocessed.bcf --output quantification

Interpretation of Results

HLA quantification

HLA quantification writes the following files to the provided output folder:

  • predictions: All predictions with all solutions and posterior estimates (predictions.csv)

  • three_field_solutions: Best 10 solutions created with the prediction results (3_field_solutions.json)

  • two_field_solutions: Best 10 solutions created with the prediction results, at 2-field resolution (2_field_solutions.json)

  • best_solution: Best solution plot containing a barchart with solution, genotype and locus matrices, and a violin plot showing the allele frequency distribution (best_solution.json)

  • lp_solution: LP solution plot containing a barchart with solution, genotype and locus matrices (lp_solution.json)

  • two_field_table: All predictions with all solutions and posterior estimates at 2-field resolution (2-field.csv)

  • G_groups: All predictions with corresponding G groups (G_groups.csv)

Note: HLA types that have no G group have None in the haplotype field.

Note 2: All JSON files can be converted to SVG using vl2svg (https://anaconda.org/conda-forge/vega-lite-cli).
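The conversion can be applied to a whole output folder in one go. The sketch below assumes that vl2svg takes an input JSON path followed by an output SVG path; the function name json_to_svg and the VL2SVG override variable are hypothetical.

```shell
# Hypothetical helper: convert every JSON plot in a folder to SVG.
# Set VL2SVG=echo to print the commands instead of running them (dry run).
json_to_svg() {
    dir=$1    # folder containing the JSON files
    for json in "$dir"/*.json; do
        # If the folder has no JSON files, the pattern stays unexpanded; skip it.
        [ -e "$json" ] || continue
        # Write the SVG next to its JSON source, e.g. best_solution.svg.
        "${VL2SVG:-vl2svg}" "$json" "${json%.json}.svg" || return 1
    done
}
```

Usage: json_to_svg quantification, where quantification stands for whichever output folder was used in the calling step.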

Virus lineage quantification

(coming soon)