Discovery of potential microhaplotype markers in Plasmodium vivax#
The aim of this notebook is to scan the core genome and calculate a number of summary statistics (here: cardinality, heterozygosity, entropy) in partially-overlapping sliding windows, in order to identify potential marker candidates.
This notebook serves two purposes. First, it provides details on the underlying methods used in the main publication: Lineage-informative microhaplotypes for spatio-temporal surveillance of Plasmodium vivax malaria parasites. Second, it provides a template for further microhaplotype candidate selection analyses, for example by changing the selection criteria and/or the statistics calculated for each window.
The notebook consists of four parts: setting up the environment, loading the data, selecting the variants, and calculating summary statistics in sliding windows across the core genome.
We used data from a subset of high-quality samples that are part of the open MalariaGEN Pv4 dataset, which contains genome variation data on nearly two thousand worldwide samples of natural Plasmodium vivax infections. Details on this project, the methods used, and all contributing partners can be found in the key publication: MalariaGEN et al, Wellcome Open Research 2022, 7:136 https://doi.org/10.12688/wellcomeopenres.17795.1.
The dataset can be accessed in a number of ways. Here we used the `malariagen_data` Python package, which allows you to use the data directly from the cloud without having to first download them locally. The Pv4 user guide provides all the information on how to use the package as well as some examples to get started.
All files required for running this notebook can be found in the `supplementary_files` directory of this repository. Pre-computed data and output files can be found in the `precomputed` directory.
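If you are running the notebook from a fresh clone of the repository, you can optionally check that the supplementary files referenced later in this notebook are in place; the snippet below is a small convenience check (not part of the original analysis), and the relative paths may need adjusting if your working directory differs.
# Optional check (not part of the original analysis): verify that the
# supplementary files used later in this notebook can be found locally
import os
expected_files = [
    "../supplementary_files/Pv4_fws.txt",
    "../supplementary_files/Samples_in_Pv1.tsv",
    "../supplementary_files/Pv4_regions.bed",
]
for path in expected_files:
    print(path, "found" if os.path.exists(path) else "MISSING")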
Setup the environment#
This notebook can be run from any computer, and also works on free interactive cloud computing services such as MyBinder or Google Colab. We primarily use common packages that come pre-installed in many computing environments. However, some specialised packages (e.g. `malariagen_data`) might require manual installation.
For example, the line below can be uncommented and run on Google Colab to install `malariagen-data` if needed:
#!pip install -q --no-warn-conflicts malariagen-data==7.14.1
Note: to maintain compatibility with some on-demand cloud environments, we pin specific versions of each package, which might not match the most recent versions in your local Python installation. If you have trouble running the code, check that the versions of your packages match the ones reported below.
# Load all required packages and print their versions to check compatibility
from importlib.metadata import version
import os
import glob
import logging
print(f'Numpy version: {version("numpy")}')
print(f'Pandas version: {version("pandas")}')
print(f'scikit-allel version: {version("scikit-allel")}')
print(f'malariagen-data version: {version("malariagen_data")}')
import numpy as np
import pandas as pd
import allel
from malariagen_data.pv4 import Pv4
# Suppress warning
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
Numpy version: 1.23.5
Pandas version: 2.1.3
scikit-allel version: 1.3.7
malariagen-data version: 7.14.1
Load the data#
As previously stated, this notebook accesses the genetic data from the MalariaGEN Pv4 data resource via the `malariagen_data` Python package, which allows you to use the data directly from the cloud without having to first download them locally. Detailed instructions can be found in the Pv4 user guide, which this section is based on.
# Initialise the environment
pv4 = Pv4()
First, we load the sample metadata, containing details on each sample in the resource such as time and place of collection, quality metrics, etc.
sample_metadata = pv4.sample_metadata()
sample_metadata.head()
|   | Sample | Study | Site | First-level administrative division | Country | Lat | Long | Year | ENA | All samples same individual | Population | % callable | QC pass | Exclusion reason | Is returning traveller |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BBH-1-125 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678989 | BBH-1-125 | AF | 88.52 | True | Analysis_set | False |
| 1 | BBH_1_132 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678991 | BBH_1_132 | AF | 90.20 | True | Analysis_set | False |
| 2 | BBH_1_137 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2679003 | BBH_1_137 | AF | 87.09 | True | Analysis_set | False |
| 3 | BBH_1_153 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678992 | BBH_1_153 | AF | 90.60 | True | Analysis_set | False |
| 4 | BBH_1_162 | X0009-PV-ET-LO | Jimma | Ethiopia: Oromia | Ethiopia | 7.683331 | 36.851318 | 2016 | ERR2678993 | BBH_1_162 | AF | 91.67 | True | Analysis_set | False |
The `sample_metadata` dataframe now needs to be augmented with supplemental information required for sample selection, such as a measure of within-host diversity (\(F_{WS}\)) available from the Pv4 data resource. The \(F_{WS}\) statistic is needed to select, for downstream analysis, samples that are deemed likely to be clonal lineages (and to exclude likely polyclonal infections), which simplifies marker selection.
pv4_fws = pd.read_csv('../supplementary_files/Pv4_fws.txt', sep='\t', comment='t')
sample_metadata = pd.merge(sample_metadata, pv4_fws, on='Sample', how='outer')
pv1_samples = pd.read_csv('../supplementary_files/Samples_in_Pv1.tsv', sep='\t')
sample_metadata = sample_metadata.merge(pv1_samples, on='Sample', how='left')
Second, we load the variant data, which include the details of each individual genetic variant discovered in the complete Pv4 dataset, such as its position in the genome, the type of variation, quality metrics, etc.
variant_dataset = pv4.variant_calls(extended=True)
variant_dataset
<xarray.Dataset> Dimensions: (variants: 4571056, alleles: 7, samples: 1895, ploidy: 2, genotypes: 3, alt_alleles: 6) Coordinates: variant_position (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_chrom (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(1895,), meta=np.ndarray> Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes, alt_alleles Data variables: (12/42) variant_allele (variants, alleles) object dask.array<chunksize=(65536, 1), meta=np.ndarray> variant_filter_pass (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> variant_is_snp (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> variant_numalt (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_CDS (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(65536, 64, 2), meta=np.ndarray> ... ... variant_SNPEFF_IMPACT (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_SNPEFF_TRANSCRIPT_ID (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_SOR (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_VQSLOD (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_VariantType (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_altlen (variants, alt_alleles) int32 dask.array<chunksize=(65536, 6), meta=np.ndarray>
As you can see above, the complete `variant_dataset` contains 4,571,056 variants from 1,895 samples.
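If you want to confirm these numbers programmatically, the dimensions can be read directly from the dataset, for example:
# Optional check of the dataset dimensions (variants and samples)
print(variant_dataset.dims["variants"], variant_dataset.dims["samples"])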
Finally, the Pv4 data resource also provides the coordinate boundaries of the P. vivax core genome and the details on how they were defined. In short, these are genome regions that consistently have high mapping quality and can be reliably used for variant discovery and genotyping. The discovery of potential markers will be limited to those regions only, therefore avoiding sub-telomeric and internal hypervariable regions which, while potentially harbouring more diversity, would also likely be too unstable for inclusion in a subsequent marker panel.
We can now load the coordinates and then modify the dataframe to align with the `variant_dataset`.
pv_regions = pd.read_csv( "../supplementary_files/Pv4_regions.bed",
sep="\t", comment="t", names=["chrom", "chromStart", "chromEnd", "name"])
# The file is 0-based, and needs to be converted to be 1-based for consistency with the variant data
pv_regions[["chromStart", "chromEnd"]] += 1
pv_regions.head()
|   | chrom | chromStart | chromEnd | name |
|---|---|---|---|---|
| 0 | PvP01_01_v1 | 1 | 116542 | Sub |
| 1 | PvP01_01_v1 | 116542 | 677963 | Core |
| 2 | PvP01_01_v1 | 677963 | 679790 | Cen |
| 3 | PvP01_01_v1 | 679790 | 903592 | Core |
| 4 | PvP01_01_v1 | 903592 | 1021665 | Sub |
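As a minimal sketch (not part of the pipeline below, and using a hypothetical helper named `in_core`), the `pv_regions` table could be used to check whether arbitrary positions fall inside a Core region:
# Minimal sketch: check whether positions fall inside a "Core" region of a given
# chromosome, using the pv_regions dataframe loaded above
core_regions = pv_regions[pv_regions["name"] == "Core"]

def in_core(chrom, positions):
    # Boolean mask: True where a position lies within any Core interval on `chrom`
    regions = core_regions[core_regions["chrom"] == chrom]
    mask = np.zeros(len(positions), dtype=bool)
    for start, end in zip(regions["chromStart"], regions["chromEnd"]):
        mask |= (positions >= start) & (positions <= end)
    return mask

# Example usage on a few positions from chromosome 1 (values chosen for illustration only)
print(in_core("PvP01_01_v1", np.array([100_000, 200_000, 678_500, 900_000])))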
Select the variants#
Here we decide which variants to include in the microhaplotype discovery analysis of each sliding window.
Sample selection#
Before selecting variants, we need to decide which samples to include in the downstream analysis, as we will need to calculate the global microhaplotype allele frequencies later in this notebook, among other statistics. These calculations are made easier by choosing a subset of samples from the Pv4 data resource that are unique, high-quality, and monoclonal. In order to select only these samples, we used the following criteria:
- Are clonal (\(F_{WS}\) > 0.95).
- Have low missingness: more than 50% of the positions in the sample's genome have a genotype call (`% callable > 50`).
- Are neither technical nor biological duplicates (`Exclusion reason == Analysis_set`). The `Analysis_set` flag in the metadata allows you to select only the highest-quality sample for downstream analysis in the event of duplicate samples in the Pv4 data resource.
- Are from a limited group of studies, due to data restrictions at the time of analysis (see the `useable_studies` list in the code below).
If you would like to be more/less stringent on your selection criteria, all of these filters can be changed/removed.
# List of MalariaGEN partner studies with an explicit agreement to participate in this analysis
useable_studies = ['1128-PV-MULTI-GSK', '1154-PV-TH-PRICE', '1157-PV-MULTI-PRICE', 'X0001-PV-MULTI-HUPALO2016','X0002-PV-KH-PAROBEK2016']
# Filter the samples based on the criteria above
loc_filtered_samples = (
(sample_metadata["Study"].isin(useable_studies) | sample_metadata.in_pv_10)
& (sample_metadata["Fws"] > 0.95)
& (sample_metadata["% callable"] > 50)
& (sample_metadata["Exclusion reason"] == "Analysis_set")
)
subset_metadata = sample_metadata[loc_filtered_samples]
print(len(subset_metadata))
615
As you can see, after applying these filters we are left with 615 samples that will be used in the downstream marker discovery analysis.
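For example, a quick breakdown of the selected samples by the `Population` label shown in the metadata above can be obtained with:
# Optional: tabulate the selected samples by population label
print(subset_metadata["Population"].value_counts())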
We can now filter the `variant_dataset` to only include these 615 samples, to speed up processing.
variant_dataset_filtered_samples = variant_dataset.isel(samples=loc_filtered_samples)
variant_dataset_filtered_samples
<xarray.Dataset> Dimensions: (variants: 4571056, alleles: 7, samples: 615, ploidy: 2, genotypes: 3, alt_alleles: 6) Coordinates: variant_position (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_chrom (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(615,), meta=np.ndarray> Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes, alt_alleles Data variables: (12/42) variant_allele (variants, alleles) object dask.array<chunksize=(65536, 1), meta=np.ndarray> variant_filter_pass (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> variant_is_snp (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> variant_numalt (variants) int32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_CDS (variants) bool dask.array<chunksize=(65536,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(65536, 31, 2), meta=np.ndarray> ... ... variant_SNPEFF_IMPACT (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_SNPEFF_TRANSCRIPT_ID (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_SOR (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_VQSLOD (variants) float32 dask.array<chunksize=(65536,), meta=np.ndarray> variant_VariantType (variants) object dask.array<chunksize=(65536,), meta=np.ndarray> variant_altlen (variants, alt_alleles) int32 dask.array<chunksize=(65536, 6), meta=np.ndarray>
This gives us `variant_dataset_filtered_samples`, which contains only the selected samples.
Variant selection#
Now that we have selected our samples, we can proceed to variant selection by keeping only the variants that:
- Pass all quality filters (`variant_filter_pass`).
- Are single nucleotide polymorphisms (SNPs) (`variant_is_snp`).
- Are biallelic (`variant_numalt == 1`).
- Are found in coding regions (`variant_CDS`).
All filters can be modified or removed; in this instance we chose to use only high-confidence biallelic SNPs in coding regions to make microhaplotype data analysis as simple as possible. In theory you could include tri-allelic SNPs, insertions/deletions, etc., but as data handling is already challenging for multi-allelic markers, we chose simplicity where possible. Noncoding regions can be less stable, so they were also excluded to future-proof panels for longevity (i.e. reducing the likelihood of variants popping up in assay primer regions over time).
%%time
filters = (
((variant_dataset_filtered_samples["variant_numalt"] == 1).data)
& (variant_dataset_filtered_samples["variant_filter_pass"].data)
& (variant_dataset_filtered_samples["variant_is_snp"].data)
& (variant_dataset_filtered_samples["variant_CDS"].data)
)
# This part might take a while but not too long (~minutes)
variant_dataset_filtered = variant_dataset_filtered_samples.isel(variants=filters.compute())
variant_dataset_filtered
CPU times: user 2.7 s, sys: 304 ms, total: 3 s
Wall time: 12.5 s
<xarray.Dataset> Dimensions: (variants: 440222, alleles: 7, samples: 615, ploidy: 2, genotypes: 3, alt_alleles: 6) Coordinates: variant_position (variants) int32 dask.array<chunksize=(9339,), meta=np.ndarray> variant_chrom (variants) object dask.array<chunksize=(9339,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(615,), meta=np.ndarray> Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes, alt_alleles Data variables: (12/42) variant_allele (variants, alleles) object dask.array<chunksize=(9339, 1), meta=np.ndarray> variant_filter_pass (variants) bool dask.array<chunksize=(9339,), meta=np.ndarray> variant_is_snp (variants) bool dask.array<chunksize=(9339,), meta=np.ndarray> variant_numalt (variants) int32 dask.array<chunksize=(9339,), meta=np.ndarray> variant_CDS (variants) bool dask.array<chunksize=(9339,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(9339, 31, 2), meta=np.ndarray> ... ... variant_SNPEFF_IMPACT (variants) object dask.array<chunksize=(9339,), meta=np.ndarray> variant_SNPEFF_TRANSCRIPT_ID (variants) object dask.array<chunksize=(9339,), meta=np.ndarray> variant_SOR (variants) float32 dask.array<chunksize=(9339,), meta=np.ndarray> variant_VQSLOD (variants) float32 dask.array<chunksize=(9339,), meta=np.ndarray> variant_VariantType (variants) object dask.array<chunksize=(9339,), meta=np.ndarray> variant_altlen (variants, alt_alleles) int32 dask.array<chunksize=(9339, 6), meta=np.ndarray>
This step has created `variant_dataset_filtered`, which consists of 440,222 variants.
Next, we will further select variants with global characteristics that can increase the informativeness of our potential microhaplotype markers. In this case, we would like to enrich for variants that have high global minor allele frequencies (> 0.1), to increase the chances that a variant is not incredibly rare (and therefore likely uninformative) globally. We will also select for variants that have low global missingness (< 0.1), which increases confidence that the variant position is well-represented across diverse populations.
To calculate allele frequencies, we need to process the entire genotype matrix (i.e. every genotype call in every sample), which can take a while. Here we use pre-calculated values to speed up the process, and the code used to generate the files is provided below in case you wish to run it yourself.
Important note: Allele frequency and missingness will need to be re-computed if you use different selection criteria for the variants and/or samples.
%%script false --no-raise-error
# Remove the magic above if you want to re-calculate allele frequency and missingness, e.g. for a different samples/variants selection.
# Note that this process can take several minutes to complete
# Load the genotype data
gt = allel.GenotypeDaskArray(variant_dataset_filtered["call_genotype"].data)
# Calculate allele frequency
ac_pop = gt.count_alleles()
ac_pop_freq = ac_pop.to_frequencies().compute()
# Calculate missingness
freq_missing = gt.count_missing(axis=1).compute() / gt.shape[1]
# Save the results to speed-up future re-use
np.save('../precomputed/ac_pop_freq.npy', ac_pop_freq)
np.save('../precomputed/freq_missing.npy', freq_missing)
# Load pre-calculated frequencies
ac_pop_freq = np.load('../precomputed/ac_pop_freq.npy')
freq_missing = np.load('../precomputed/freq_missing.npy')
We can now finalise the variants selection process and filter based on global frequency and missingness.
pop_freq_filter = (ac_pop_freq.min(axis=1) > 0.1) & (freq_missing < 0.1)
variant_dataset_filtered = variant_dataset_filtered.isel(variants=pop_freq_filter)
variant_dataset_filtered
<xarray.Dataset> Dimensions: (variants: 13084, alleles: 7, samples: 615, ploidy: 2, genotypes: 3, alt_alleles: 6) Coordinates: variant_position (variants) int32 dask.array<chunksize=(247,), meta=np.ndarray> variant_chrom (variants) object dask.array<chunksize=(247,), meta=np.ndarray> sample_id (samples) object dask.array<chunksize=(615,), meta=np.ndarray> Dimensions without coordinates: variants, alleles, samples, ploidy, genotypes, alt_alleles Data variables: (12/42) variant_allele (variants, alleles) object dask.array<chunksize=(247, 1), meta=np.ndarray> variant_filter_pass (variants) bool dask.array<chunksize=(247,), meta=np.ndarray> variant_is_snp (variants) bool dask.array<chunksize=(247,), meta=np.ndarray> variant_numalt (variants) int32 dask.array<chunksize=(247,), meta=np.ndarray> variant_CDS (variants) bool dask.array<chunksize=(247,), meta=np.ndarray> call_genotype (variants, samples, ploidy) int8 dask.array<chunksize=(247, 31, 2), meta=np.ndarray> ... ... variant_SNPEFF_IMPACT (variants) object dask.array<chunksize=(247,), meta=np.ndarray> variant_SNPEFF_TRANSCRIPT_ID (variants) object dask.array<chunksize=(247,), meta=np.ndarray> variant_SOR (variants) float32 dask.array<chunksize=(247,), meta=np.ndarray> variant_VQSLOD (variants) float32 dask.array<chunksize=(247,), meta=np.ndarray> variant_VariantType (variants) object dask.array<chunksize=(247,), meta=np.ndarray> variant_altlen (variants, alt_alleles) int32 dask.array<chunksize=(247, 6), meta=np.ndarray>
We can now see that, based on the above criteria, we have significantly decreased the number of variants in `variant_dataset_filtered` to 13,084 potential variants.
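As an optional sanity check (not part of the original workflow), the selected variants can be tabulated per chromosome:
# Optional: count the selected variants on each chromosome
chrom_counts = pd.Series(variant_dataset_filtered["variant_chrom"].values).value_counts()
print(chrom_counts.sort_index())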
For convenience and reproducibility, we store the chromosome and position of the selected variants in `precomputed/variants_selected.tsv` for future use. This will be useful later, for example to map candidate windows back to variants without having to re-filter the full dataset.
va = variant_dataset_filtered.reset_coords(["variant_chrom","variant_position"])
va[["variant_chrom","variant_position"]].to_pandas().to_csv("../precomputed/variants_selected.tsv", index=False, sep="\t")
Calculate summary statistics in sliding windows across the core genome#
This is the final and principal section of the notebook. After having prepared the data and selected the individual variants to use in the discovery analysis, we slide a 200 base pair window along the core genome and calculate a range of summary statistics on a per-window basis to scan for candidate microhaplotype regions that contain multiple variants.
In this example and following the main publication, we focus on cardinality and diversity. Specifically, for each window the code below calculates:
- The number of filtered variants it contains.
- The number and proportion of samples carrying each unique haplotype.
- Entropy, defined as \(-\sum_{i} p_i \log(p_i)\).
- Heterozygosity, defined as \(1 - \sum_{i} p_i^2\).

where \(p_i\) are the frequencies of each unique haplotype (`gt_freqs` in the code below).
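As a small worked example of these two statistics (using toy frequencies, not values from the data), for four haplotypes with frequencies 0.4, 0.3, 0.2 and 0.1 the entropy is about 1.28 nats and the heterozygosity is 0.70:
# Toy example of the per-window statistics (frequencies chosen for illustration only)
gt_freqs = np.array([0.4, 0.3, 0.2, 0.1])
entropy = -np.sum(gt_freqs * np.log(gt_freqs))  # ~1.28
het = 1.0 - np.sum(gt_freqs ** 2)               # 0.70
print(entropy, het)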
The functions below can be modified and expanded to calculate different or additional statistics.
Note: to minimise space requirements, we only store windows that contain at least one variant and that are unique, i.e. we remove all windows that contain the exact same set of variants as another window.
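To illustrate the windowing itself (with made-up positions, not real data), the sketch below shows how scikit-allel's `allel.windowed_count`, used by the functions that follow, assigns positions to 200 bp windows with a 50 bp step:
# Toy illustration of 200 bp windows with a 50 bp step (positions are made up)
toy_pos = np.array([10, 120, 130, 310, 475])
toy_counts, toy_windows = allel.windowed_count(toy_pos, size=200, step=50, start=1)
for (start, stop), n in zip(toy_windows, toy_counts):
    print(f"window {start}-{stop}: {n} variant(s)")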
Utility functions#
def filter_variants(variant_dataset, field, value):
logging.info(f"Filtering variants: '{field}=={value}'")
filter_values = (variant_dataset[field] == value).data
variant_dataset_filtered = variant_dataset.isel(variants=filter_values)
return variant_dataset_filtered
def variant_positions(positions):
return list(positions)
def unique_allele_counts_in_window(gt):
unique, index, counts = np.unique(gt, axis=1, return_counts=True, return_index=True)
# Find index with the missing or het
alleles_with_missing = []
alleles_with_het = []
for i in range(len(index)):
if -1 in gt[:, index[i]].compute():
alleles_with_missing.append(i)
if gt[:, int(index[i])].is_het().compute().any():
alleles_with_het.append(i)
return counts, alleles_with_missing, alleles_with_het
def calculate_stats(variant_dataset, window_length, step, chrom):
pos = variant_dataset["variants"].data
# Count variants that are in each window
logging.info(f"Counting variants per-windows in {chrom}")
n_variants, windows = allel.windowed_count(
pos, size=window_length, step=step, start=1
)
# Filter out windows that don't contain any variant
logging.info(f"Filtering and de-duplicating windows in {chrom}")
index_with_variants = [i for i, var in enumerate(n_variants) if var != 0]
window_with_variants = [list(windows[i]) for i in index_with_variants]
# Find windows with unique variants - skip any windows that have replicate set of variants
if len(window_with_variants) > 1:
positions, windows, counts = allel.windowed_statistic(
pos, pos, statistic=variant_positions, windows=window_with_variants
)
unique_var, unique_var_index = np.unique(positions, return_index=True)
unique_windows = [list(window_with_variants[i]) for i in unique_var_index]
else:
unique_windows = window_with_variants
logging.info(f"Number of unique windows: {len(unique_windows)}")
    # Count occurrences of each unique allele
logging.info(f"Genotyping windows in {chrom}")
values = allel.GenotypeDaskArray(variant_dataset["call_genotype"].data.astype(int))
allele_counts, windows, counts = allel.windowed_statistic(
pos,
values,
statistic=unique_allele_counts_in_window,
windows=unique_windows,
fill=[0, None, None],
)
n_variants, windows = allel.windowed_count(
pos, windows=unique_windows
)
return n_variants, allele_counts, windows
def evaluate_marker_options(variant_dataset, chrom, window_length=200, step=50):
# Filter variants to chromosome and set index
variant_dataset = filter_variants(variant_dataset, "variant_chrom", chrom)
variant_dataset = variant_dataset.set_index(
variants="variant_position", samples="sample_id"
)
logging.info(f"Starting window scanning for {chrom}")
# Calculate per-window statistics
n_variants, allele_counts, windows = calculate_stats(
variant_dataset, window_length, step, chrom
)
logging.info(f"Window scanning completed for {chrom}")
# Format results
window_start = list(windows[:, 0])
window_end = list(windows[:, 1])
variant_counts = list(n_variants)
unique_allele_counts = list(allele_counts[:, 0])
unique_alleles_with_missing = list(allele_counts[:, 1])
unique_alleles_with_het = list(allele_counts[:, 2])
return (
variant_counts,
unique_allele_counts,
unique_alleles_with_missing,
unique_alleles_with_het,
window_start,
window_end,
)
def run_per_chromosome(chromosomes, variants, results_dir):
for chrom in chromosomes:
logging.info(f"Starting chromosome {chrom}")
# Calculate window stats
(
variant_counts,
unique_allele_counts,
unique_alleles_with_missing,
unique_alleles_with_het,
window_start,
window_end,
) = evaluate_marker_options(variants, chrom)
# Format data
df = pd.DataFrame(
data={
"chrom": chrom,
"window_start": window_start,
"window_end": window_end,
"variant_counts": variant_counts,
"unique_allele_counts": unique_allele_counts,
"unique_alleles_with_missing_index": unique_alleles_with_missing,
"unique_alleles_with_het_index": unique_alleles_with_het,
}
)
# Calculate entropy and heterozygosity
unique_allele_count = []
unique_allele_freqs = []
entropy = []
het = []
df_with_stats = df.copy()
for index, row in df.iterrows():
gt_counts = row.unique_allele_counts
n_alleles = len(gt_counts)
gt_freqs = gt_counts/sum(gt_counts)
unique_allele_freqs.append(list(gt_freqs))
unique_allele_count.append(n_alleles)
entropy.append(-np.sum(gt_freqs * np.log(gt_freqs)))
het.append(1.0 - np.sum(gt_freqs ** 2))
df_with_stats["unique_allele_frequencies"] = unique_allele_freqs
df_with_stats["unique_allele_count"] = unique_allele_count
df_with_stats["entropy"] = entropy
df_with_stats["het"] = het
# Output to csv
logging.info(f"Writing results for chromosome {chrom}")
df_with_stats.to_csv(os.path.join(results_dir,f"{chrom}_windowed_stats.csv"))
return True
Main execution#
Here we start the execution of the functions defined above. Be aware that execution across the entire core genome can require significant time (multiple hours). Results are stored per chromosome and then aggregated into a single file, allowing you to resume computation from the last completed chromosome if needed.
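For example, a minimal sketch (a hypothetical helper, not part of the functions above) of how already-completed chromosomes could be skipped when resuming:
# Hypothetical helper: skip chromosomes whose per-chromosome output file already
# exists in results_dir, so a resumed run only processes the remaining ones
def remaining_chromosomes(chromosomes, results_dir):
    return [c for c in chromosomes
            if not os.path.exists(os.path.join(results_dir, f"{c}_windowed_stats.csv"))]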
After all calculations are completed, the final step is to collate all the chromosome-level files into a single output file, `genome_stats.csv`, which contains each identified window along with all of its summary statistics and can be found in the `precomputed` directory.
%%script false --no-raise-error
# Remove the magic above if you want to re-calculate the windowed statistics, e.g. for a different samples/variants selection.
# Note that this process can take a long time (multiple hours) to complete
results_directory = "../precomputed/sliding_window_results"
# Initialise the logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
logging.info("Start")
logging.info(f'Output directory: {results_directory}')
chrom_list = [f'PvP01_{chr_num+1:02}_v1' for chr_num in range(14)]
logging.info(f'Chromosomes: {chrom_list}')
# Reduce the size and pre-load the variants table to speed up computation
variables_to_remove = list(variant_dataset_filtered.keys()) # Add all variable names to remove list...
variables_to_remove.remove("call_genotype") # ...but we need to keep the genotypes
reduced_variant_set = variant_dataset_filtered.drop_vars(variables_to_remove).load() # Reduce the dataset
run_per_chromosome(chrom_list, reduced_variant_set, results_directory)
# Collate all the chromosome-level files into a single file
logging.info("Collating results")
all_files = glob.glob(os.path.join(results_directory , "*_windowed_stats.csv"))
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=0)
li.append(df)
results_df = pd.concat(li, ignore_index=True)
results_df.to_csv("../precomputed/genome_stats.csv")
logging.info("Completed!")