Introduction

This is the companion repository for the paper Lineage-informative microhaplotypes for spatio-temporal surveillance of Plasmodium vivax malaria parasites.

Here we provide two interactive notebooks that go step-by-step through the selection process and exploration of marker benchmarking, together with a number of accessory files to speed up future execution.

In the first notebook, we scan the P. vivax genome in partially overlapping sliding windows and calculate a number of summary statistics for each window (in this example: cardinality, heterozygosity and entropy). Each window represents a potential microhaplotype marker, provided it satisfies a number of customisable selection criteria (e.g. diversity, number of variants). The selection criteria we use are based on a previous study that used an in silico approach to determine optimal criteria for capturing sufficient data from marker panels to detect identity-by-descent (IBD), i.e. relatedness between parasite lineages (Taylor et al, Genetics 2019). However, this is only one of many potential use cases that could be explored with this framework.
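
To make the window scan concrete, here is a minimal sketch of the idea on toy data. It is not the notebook's code: the function names, the toy positions and genotypes, and the exact definitions of the statistics (number of distinct haplotypes, 1 minus the sum of squared haplotype frequencies, and Shannon entropy in bits) are assumptions chosen for illustration.

```python
from collections import Counter
import math


def sliding_windows(start, end, size, step):
    """Yield half-open (lo, hi) windows; step < size gives partial overlap."""
    pos = start
    while pos + size <= end:
        yield pos, pos + size
        pos += step


def haplotypes_in_window(positions, genotypes, lo, hi):
    """Join each sample's alleles at the variant positions inside [lo, hi)."""
    idx = [i for i, p in enumerate(positions) if lo <= p < hi]
    return ["".join(g[i] for i in idx) for g in genotypes]


def window_stats(haplotypes):
    """Assumed definitions of the three summary statistics."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    freqs = [c / n for c in counts.values()]
    return {
        "cardinality": len(counts),                        # distinct haplotypes
        "heterozygosity": 1.0 - sum(f * f for f in freqs),
        "entropy": -sum(f * math.log2(f) for f in freqs),  # in bits
    }


# Toy data (hypothetical): 5 variant positions, 6 samples, one allele per site.
positions = [10, 25, 40, 70, 95]
genotypes = ["00000", "01000", "01100", "00011", "01000", "00011"]

results = {}
for lo, hi in sliding_windows(0, 100, size=50, step=25):
    haps = haplotypes_in_window(positions, genotypes, lo, hi)
    results[(lo, hi)] = window_stats(haps)
```

Each entry of `results` then holds the statistics for one candidate window, and selection criteria such as a minimum diversity can be applied as simple filters on those values.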

The second notebook analyses all the windows together, explores panel optimisation, and then selects a candidate panel. The selection process is a challenging mathematical optimisation problem, and here we provide two complementary and effective ways to perform the task. It is worth mentioning that, while we show selection methods here, subsequent manual curation is often required because of certain constraints, downstream requirements or other considerations (proximity to other markers, reduction of gaps across the genome, low/high diversity regions, or individual assay/panel performance during experimental validation, etc.). The codebase is also modular and can be extended to use different optimisation algorithms.
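
As an illustration of what panel selection involves, the sketch below uses a simple greedy heuristic: take windows in order of decreasing score and skip any that overlap, or fall within a minimum gap of, a window already chosen on the same chromosome. This is not necessarily either of the two methods in the notebook; the `Window` type, the `greedy_panel` function, the `min_gap` parameter, the chromosome names and the scores are all hypothetical.

```python
from typing import NamedTuple


class Window(NamedTuple):
    chrom: str
    start: int
    end: int
    score: float  # e.g. a diversity statistic from the first notebook


def greedy_panel(windows, panel_size, min_gap=0):
    """Pick up to panel_size windows, best score first, enforcing spacing."""
    selected = []
    for w in sorted(windows, key=lambda w: w.score, reverse=True):
        if len(selected) == panel_size:
            break
        # A clash is an overlap, or a distance below min_gap, on the same chromosome.
        clash = any(
            w.chrom == s.chrom
            and w.start < s.end + min_gap
            and s.start < w.end + min_gap
            for s in selected
        )
        if not clash:
            selected.append(w)
    # Return the panel in genomic order for readability.
    return sorted(selected, key=lambda w: (w.chrom, w.start))


# Toy candidates (hypothetical coordinates and scores).
candidates = [
    Window("chr1", 0, 200, 2.1),
    Window("chr1", 100, 300, 2.4),
    Window("chr1", 1000, 1200, 1.8),
    Window("chr2", 0, 200, 1.5),
    Window("chr2", 150, 350, 1.9),
]

panel = greedy_panel(candidates, panel_size=3, min_gap=100)
```

A greedy pass like this is fast but not guaranteed to be optimal, which is one reason the resulting panel would still benefit from the manual curation described above.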

We used data from a subset of high-quality samples that are part of the open MalariaGEN Pv4 dataset, which contains genome variation data on nearly two thousand samples of natural Plasmodium vivax infections from around the world. Details on this project, the methods used, and all contributing partners can be found in the key publication: MalariaGEN et al, Wellcome Open Research 2022, 7:136 https://doi.org/10.12688/wellcomeopenres.17795.1. The dataset can be accessed in a number of ways; here we used the malariagen_data Python package, which allows the data to be used directly from the cloud without first downloading them locally. The Pv4 user guide provides all the information on how to use the package, as well as some examples to get started.

The notebooks can be run from any computer, including via MyBinder or Google Colab, two free interactive computing services that run in a cloud environment. Note that the first notebook requires navigation through hundreds of thousands of genetic variations in thousands of samples and, while the malariagen_data Python package provides an efficient way to access the data directly on the cloud, the process can still take hours (or days!) depending on the available computing infrastructure. To jump-start the selection process described in the second notebook, we have also provided a number of pre-calculated statistics for ease of use.