Yet two decades after the publication of the draft human genome, these and other challenging DNA features remain as stubborn gaps in our chromosomal atlas.
Beth Sullivan, a centromere researcher at Duke University in Durham, North Carolina, recalls a conversation in 2014 with Karen Miga, a genomics researcher at the University of California, Santa Cruz.
Their goal is to produce, for each chromosome, an end-to-end genome map that stretches from one telomere (the repetitive sequence elements that cap chromosomal ends) to the other.
“This wasn’t just doing it for the sake of doing it,” says Miga.
Since then, scientists working as part of the Genome Reference Consortium (GRC) have been fleshing out the assembly, manually checking it and using sequencing analysis to identify segments with errors and information gaps.
“That’s a large portion of the yet-to-be-closed gaps,” says Adam Phillippy, a bioinformatician at the US National Human Genome Research Institute in Bethesda, Maryland, and T2T co-chair.
“We’re talking about 15–20% error rates in the early PacBio reads,” says Phillippy.
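To put those error rates in perspective, the short sketch below converts a per-base error rate into an expected number of errors per read and into a Phred-style quality score (Q = -10 log10 of the error probability). The 10,000-base read length in the usage note is an illustrative assumption, not a figure from the article, and the function names are hypothetical.

```python
import math


def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate


def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality score for a per-base error probability."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)
```

Under these assumptions, `expected_errors(10_000, 0.15)` gives about 1,500 wrong bases in a single read, and `phred_quality(0.15)` is a little above Q8, far below the Q20 (1% error) level.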
“Within the past three or four years, we could now get read lengths of over 100 kilobases,” says Phillippy.
Set up in early 2019, the consortium aims to produce high-quality, end-to-end assemblies for every human chromosome.
In one (ref. 2), computational biologist Matthew Loose at the University of Nottingham, UK, and his colleagues described the first human genome assembled entirely from Oxford Nanopore data.
But Loose and his colleagues covered around 90% of GRCh38 with 99.8% accuracy using nanopore data alone, while also closing a dozen major gaps in the reference genome.
In the second study (ref. 3), Miga and her team reassembled the centromere of the human Y chromosome, the genome’s smallest.
“We could actually traverse all the way across the centromere,” says Miga.
“With these, we were able to essentially have a backbone representation of those chromosomes from telomere to telomere, but at lower accuracy,” says Phillippy.
Glennis Logsdon, a postdoc in the lab of genome scientist Evan Eichler at the University of Washington in Seattle and first author on the chromosome 8 work, says that the different sequencing technologies have distinctive quirks.
“We were only going to glue two sequences together if they’re basically 100% identical over 7,000 bases of their length,” says Phillippy.
The initial work (ref. 4) with chromosome X also benefited from previous knowledge of that chromosome’s centromere, which has been well studied at the structural level.
“We used a variety of molecular techniques to make sure that the size of the assembly of the α-satellite array from the sequencing information was correct,” says Sullivan.
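The stitching rule Phillippy describes above, joining two sequences only when they agree over a long stretch, can be illustrated with a minimal sketch. This is not the consortium's actual assembler: it assumes strict character-for-character identity (the quote says "basically 100%"), looks only at a suffix of one sequence against a prefix of the next, and uses a hypothetical function name and a brute-force search that would be too slow for real data.

```python
from typing import Optional


def merge_if_identical(left: str, right: str, min_overlap: int = 7000) -> Optional[str]:
    """Join two sequences only if a suffix of `left` exactly matches
    a prefix of `right` over at least `min_overlap` bases.

    Returns the merged sequence, or None if no such overlap exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink toward the threshold.
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None
```

With a toy threshold, `merge_if_identical("ACGTACGTTT", "CGTTTGGA", min_overlap=5)` joins the two strings through their shared `CGTTT`, while raising `min_overlap` to 6 makes the function decline to merge. A high threshold trades some contiguity for confidence that the join is not an artefact of a repeat.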
“Some centromeres we now can assemble completely from high-fidelity reads — no extra help is needed,” says Pevzner, although he adds that well-calibrated algorithms that can work with such data are also required.
“It took us a year or more to do each of the chromosome X and 8 projects,” says Phillippy, “but we were then able to essentially finish all the remaining chromosomes in a two-month span.”
Now the end is in sight.
“We’ve green-lit all of the centromeric arrays except for the one on chromosome 9,” says Miga.
This centromere, she says, is massive — spanning 27 million bases — and has posed a special challenge in terms of validation.
Logsdon and others have been using nanopore sequencing to find patterns of DNA chemical modification that can influence chromosomal function.
“Most of the centromere is methylated, but there’s this dip in methylation that seems to be found in all centromeres,” she says.
Challenging as it has been to build, a single end-to-end genome offers researchers limited value without other genomes from diverse individuals against which to compare it.
To boost its utility, in late 2020, the T2T began working more closely with a parallel effort, the Human Pangenome Reference Consortium (HPRC).
The HPRC was launched in 2019 with the goal of replacing GRCh38 with a reference genome that better captures the scope of human diversity, based on whole-genome data from at least 350 individuals.
But first, researchers must work out how to apply the T2T process to a diploid genome.
Determining which sequences reside on which chromosome copy requires scientists to identify enough unique genetic landmarks to confidently assemble distinct contigs for each copy, a tough feat in ultra-repetitive regions such as the centromere.
“We’ll need much more accurate or longer reads to span the full centromere region for a diploid genome,” says Morishita.
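To see why landmarks matter for a diploid genome, consider a toy sketch of read assignment. It assumes each chromosome copy is summarized by the bases it carries at a few distinguishing positions, and that a read is described by the bases it shows at whichever of those positions it covers. The data layout, names and simple majority vote are all illustrative assumptions, not a description of any real phasing tool.

```python
from typing import Dict


def assign_copy(read_alleles: Dict[int, str],
                copy_a: Dict[int, str],
                copy_b: Dict[int, str]) -> str:
    """Assign a read to chromosome copy "A" or "B" by counting the
    landmark positions at which it matches each copy."""
    votes_a = sum(1 for pos, base in read_alleles.items() if copy_a.get(pos) == base)
    votes_b = sum(1 for pos, base in read_alleles.items() if copy_b.get(pos) == base)
    if votes_a > votes_b:
        return "A"
    if votes_b > votes_a:
        return "B"
    # No landmarks covered, or conflicting evidence: the read cannot be placed.
    return "unassigned"
```

A read that covers no distinguishing position comes back as `"unassigned"`, which is the situation in long repetitive arrays: short reads rarely reach a landmark, so longer or more accurate reads are needed to tell the two copies apart.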
“If my child was sick and I knew that I could get 100% of the genome with long-read, I would want to pay that difference,” says Miga.
International Human Genome Sequencing Consortium.