Summary - TopLD

Overview

Publicly available tools for rapidly exploring patterns of linkage disequilibrium (LD) between markers (e.g., HaploReg and LDlink) are based on data from the 1000 Genomes Project. As a result, they often miss genetic variants that were not included in 1000 Genomes or were filtered out during quality control (QC). Despite these limitations, both HaploReg and LDlink are widely used tools that have given genetic researchers invaluable information on LD.

Here, we use TOPMed whole-genome sequencing (WGS) data to build a similar LD resource. Compared with 1000 Genomes, this resource provides a much more comprehensive representation of genetic variants and their LD patterns, particularly for rare variants and in the populations covered here [1].

Methods

We used TOPMed WGS data from four cohorts: BioMe, MESA, JHS, and WHI. We first used RFMix to infer the local and global ancestry of participants in these cohorts, with reference populations from the 1000 Genomes Project and the Human Genome Diversity Project. We then retained only samples with >90% ancestry from a single population (as estimated by RFMix), resulting in 1,662 individuals of African, 1,015 of East Asian, 14,112 of European, and 252 of South Asian ancestry. We further removed related individuals using a stringent kinship coefficient threshold of 2^(-5.5). This left 1,335 unrelated individuals of African, 844 of East Asian, 13,160 of European, and 239 of South Asian ancestry for pairwise LD inference. We inferred LD separately within each of the four ancestral groups for all pairs of variants within 500 kb, retaining pairs with r2 >= 0.2. Table 1 below summarizes the sample size, number of variants, and number of LD pairs within each population.
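The two sample-filtering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually run: the function names, the input layout (a dictionary of global ancestry proportions per sample and a dictionary of pairwise kinship coefficients), and the greedy strategy for dropping relatives are all assumptions. The text states only the >90% ancestry cutoff and the 2^(-5.5) kinship threshold.

```python
from typing import Dict, List, Optional, Tuple

ANCESTRY_THRESHOLD = 0.90      # from the text: >90% ancestry from one population
KINSHIP_THRESHOLD = 2 ** -5.5  # from the text; roughly 0.0221


def assign_population(proportions: Dict[str, float],
                      threshold: float = ANCESTRY_THRESHOLD) -> Optional[str]:
    """Return the population whose global ancestry proportion exceeds
    the threshold, or None if no single population does."""
    population, fraction = max(proportions.items(), key=lambda kv: kv[1])
    return population if fraction > threshold else None


def prune_related(samples: List[str],
                  kinship: Dict[Tuple[str, str], float],
                  threshold: float = KINSHIP_THRESHOLD) -> List[str]:
    """Drop samples until no remaining pair has kinship above the threshold.

    Assumption: a simple greedy rule that repeatedly removes the sample
    with the most related partners. The source does not say which
    pruning strategy was used.
    """
    related = {s: set() for s in samples}
    for (a, b), phi in kinship.items():
        if a in related and b in related and phi > threshold:
            related[a].add(b)
            related[b].add(a)

    while related:
        worst = max(related, key=lambda s: len(related[s]))
        if not related[worst]:
            break  # nobody left has a related partner
        for other in related.pop(worst):
            related[other].discard(worst)

    return [s for s in samples if s in related]
```

For example, with hypothetical samples `s1` to `s4` where `s2` is closely related to both `s1` and `s3`, the greedy rule removes only `s2` and keeps the other three.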

Table 1. SNV Coverage
Population	Sample size	#pairs with r2 >= 0.2	#pairs with r2 >= 0.5	#pairs with r2 >= 0.8	#pairs with MAF < 0.01	#unique variants from pairs with r2 >= 0.2	#unique variants from pairs with r2 >= 0.5	#unique variants from pairs with r2 >= 0.8	#unique variants with MAF > 0	#unique variants with MAF < 0.01
EUR	13,160	1,998,836,584	1,112,674,102	611,189,251	1,297,284,942	145,623,581	136,917,934	115,978,103	149,493,274	140,520,707
AFR	1,335	1,732,359,823	847,569,859	452,114,792	977,235,648	61,463,807	59,379,979	53,083,215	61,694,124	45,697,424
SAS	239	1,041,164,192	495,121,502	328,077,193	365,703,405	22,880,916	21,636,362	20,327,975	22,948,921	13,270,991
EAS	844	1,084,682,858	610,360,822	359,218,986	445,325,740	36,314,892	34,809,748	31,539,066	36,493,361	28,372,201
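To make the r2 thresholds in the table above concrete, here is a small sketch of pairwise LD computation within a distance window. It assumes that r2 is the squared Pearson correlation between 0/1/2 genotype dosages and that "within 500 kb" means a position difference of at most 500,000 bp. Both are assumptions for illustration, since the text does not say which estimator or software was used, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

MAX_DISTANCE_BP = 500_000  # from the text: pairs within 500 kb
MIN_R2 = 0.2               # from the text: minimum r2 retained


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_pairs(positions: Sequence[int],
             genotypes: Sequence[Sequence[float]],
             max_distance: int = MAX_DISTANCE_BP,
             min_r2: float = MIN_R2) -> List[Tuple[int, int, float]]:
    """Return (pos_i, pos_j, r2) for variant pairs on one chromosome.

    `positions` must be sorted in ascending order; `genotypes[i]` holds
    the per-sample dosages of the variant at `positions[i]`.
    """
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if positions[j] - positions[i] > max_distance:
                break  # sorted positions: later variants are farther away
            r2 = r_squared(genotypes[i], genotypes[j])
            if r2 >= min_r2:
                pairs.append((positions[i], positions[j], r2))
    return pairs


def count_at_thresholds(pairs: Sequence[Tuple[int, int, float]],
                        thresholds: Sequence[float] = (0.2, 0.5, 0.8)
                        ) -> Dict[float, int]:
    """Count retained pairs at each r2 cutoff, as in the table columns."""
    return {t: sum(1 for _, _, r2 in pairs if r2 >= t) for t in thresholds}
```

The nested loop is quadratic within each window, which is fine for a toy example but would not scale to the hundreds of millions of pairs reported here.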

Table 2. SV Coverage
Population	#SVs	#SVs-in-LD*
EUR	79,004	26,239
AFR	44,859	22,347
SAS	16,511	12,298
EAS	20,789	10,028

Citation

[1] Le Huang*, Jon Rosen*, Quan Sun* et al. TOP-LD: A tool to explore linkage disequilibrium with TOPMed whole-genome sequence data. The American Journal of Human Genetics (2022).

[2] Taliun, D., Harris, D.N., Kessler, M.D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
