About TopLD

Overview

Publicly available tools for rapidly exploring patterns of linkage disequilibrium (LD) between markers (e.g., HaploReg and LDlink) are based on data from the 1000 Genomes Project. As a result, they often miss genetic variants that were not included in 1000 Genomes or were filtered out during quality control (QC). Despite these limitations, both HaploReg and LDlink are widely used and have provided genetic researchers with invaluable information on LD.

Here, we use TOPMed whole-genome sequencing (WGS) data to build a similar LD resource. It provides a much more comprehensive representation of genetic variation and its LD patterns than 1000 Genomes, particularly for rare variants and in the specific populations covered here [1].

Methods

We used TOPMed WGS data from four cohorts: BioMe, MESA, JHS, and WHI. We first used RFMix to infer the local and global ancestry of participants in these cohorts, with reference populations from the 1000 Genomes Project and the Human Genome Diversity Project. We then retained only samples with >90% ancestry from a single population (as estimated by RFMix), resulting in 1,662 individuals of African, 1,015 of East Asian, 14,112 of European, and 252 of South Asian ancestry. We further removed related individuals, using a stringent kinship coefficient threshold of 2^(-5.5) (approximately 0.022). This left 1,335 unrelated individuals of African, 844 of East Asian, 13,160 of European, and 239 of South Asian ancestry for pairwise LD inference. We inferred LD separately within each of the four ancestral groups, for all pairs of variants within 500 kb, and with a minimum r2 threshold of 0.2. Table 1 below summarizes the sample size, number of variants, and number of LD pairs within each population.
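
The ancestry filter and the LD reporting criteria described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the resource: the function names and toy data are hypothetical, r2 is estimated here as the squared Pearson correlation of unphased genotype dosages (the estimator actually used may differ), a distance of exactly 500 kb is assumed to count as within the window, and relatedness pruning is omitted.

```python
WINDOW_BP = 500_000  # maximum distance between paired variants (from the text)
MIN_R2 = 0.2         # minimum r2 for a pair to be reported (from the text)


def assign_group(proportions, min_fraction=0.9):
    """Return the ancestry label whose global proportion exceeds
    min_fraction, or None if no single population does.
    `proportions` is a hypothetical dict such as {"AFR": 0.95, "EUR": 0.05}."""
    group, fraction = max(proportions.items(), key=lambda kv: kv[1])
    return group if fraction > min_fraction else None


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors
    (an assumed estimator of LD r2 for unphased genotypes)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: r2 is undefined, treated as no LD
    return cov * cov / (var_x * var_y)


def ld_pairs(variants, window_bp=WINDOW_BP, min_r2=MIN_R2):
    """Report (pos_i, pos_j, r2) for pairs within window_bp and r2 >= min_r2.
    `variants` is a list of (position, dosages) on one chromosome,
    sorted by position."""
    pairs = []
    for i, (pos_i, geno_i) in enumerate(variants):
        for pos_j, geno_j in variants[i + 1:]:
            if pos_j - pos_i > window_bp:
                break  # sorted input: all later variants are farther away
            r2 = r_squared(geno_i, geno_j)
            if r2 >= min_r2:
                pairs.append((pos_i, pos_j, r2))
    return pairs


# Toy data for six individuals (dosages are counts of the alternate allele).
toy_variants = [
    (100_000, [0, 1, 2, 0, 1, 2]),
    (150_000, [0, 1, 2, 0, 1, 2]),    # identical to the first variant
    (400_000, [0, 2, 1, 1, 2, 0]),    # uncorrelated with the first variant
    (1_000_000, [0, 1, 2, 0, 1, 2]),  # more than 500 kb from all others
]
toy_pairs = ld_pairs(toy_variants)
print(toy_pairs)
```

With the toy data, only the pair at positions 100,000 and 150,000 is reported: the third variant is within the window but uncorrelated, and the fourth is perfectly correlated with the first two but lies outside the 500 kb window. In the actual resource this computation is carried out separately within each ancestry group.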

Table 1. SNV Coverage

Population | Sample size | #pairs with r2 >= 0.2 | #pairs with r2 >= 0.5 | #pairs with r2 >= 0.8 | #pairs with MAF < 0.01 | #unique variants from pairs with r2 >= 0.2 | #unique variants from pairs with r2 >= 0.5 | #unique variants from pairs with r2 >= 0.8 | #unique variants with MAF > 0 | #unique variants with MAF < 0.01
EUR | 13,160 | 1,998,836,584 | 1,112,674,102 | 611,189,251 | 1,297,284,942 | 145,623,581 | 136,917,934 | 115,978,103 | 149,493,274 | 140,520,707
AFR | 1,335 | 1,732,359,823 | 847,569,859 | 452,114,792 | 977,235,648 | 61,463,807 | 59,379,979 | 53,083,215 | 61,694,124 | 45,697,424
SAS | 239 | 1,041,164,192 | 495,121,502 | 328,077,193 | 365,703,405 | 22,880,916 | 21,636,362 | 20,327,975 | 22,948,921 | 13,270,991
EAS | 844 | 1,084,682,858 | 610,360,822 | 359,218,986 | 445,325,740 | 36,314,892 | 34,809,748 | 31,539,066 | 36,493,361 | 28,372,201

Table 2. SV Coverage

population #SV #SVs-in-LD*
EUR 79,004 26,239
AFR 44,859 22,347
SAS 16,511 12,298
EAS 20,789 10,028

*: SVs with at least one LD tag with r2 > 0.8.

Citation

[1] Le Huang*, Jon Rosen*, Quan Sun* et al. TOP-LD: A tool to explore linkage disequilibrium with TOPMed whole-genome sequence data. The American Journal of Human Genetics (2022).

[2] Taliun, D., Harris, D.N., Kessler, M.D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).