Description
This container track flags regions of the genome that often cause problems or
confusion in genomic analyses. There are currently three subtracks: Anshul Kundaje's
ENCODE Blacklist, GRC (Genome Reference Consortium) Exclusions, and the UCSC
Unusual Regions track.
The hg19 genome has a track with the same name, but with many more
subtracks, as the GeT-RM and Genome-in-a-Bottle artifact variants do not exist yet
for hg38, to our knowledge. If you are missing a track here that you know from
hg19 and have an idea how to add it to hg38, do not hesitate to contact us.
The Problematic Regions track contains the following subtracks:
-
The UCSC Unusual Regions subtrack contains annotations collected at UCSC,
put together from other tracks, our experiences and support email list
requests over the years. For example, it contains the most well-known gene
clusters (IGH, IGL, PAR1/2, TCRA, TCRB, etc.) and annotations for the GRC
fixed sequences, alternate haplotypes, unplaced
contigs, pseudo-autosomal regions, and mitochondria. These loci can yield alignments with
low-quality mapping scores and discordant read pairs, especially for short-read sequencing data.
This data set was manually curated, based on the Genome Browser's
assembly description, the FAQs about assembly, and the
NCBI RefSeq "other" annotations
track data.
-
The ENCODE Blacklist subtrack contains a comprehensive set of regions which are troublesome
for high-throughput Next-Generation Sequencing (NGS) aligners. These regions tend to have a very
high ratio of multi-mapping to unique mapping reads and high variance in mappability due to
repetitive elements such as satellite, centromeric and telomeric repeats.
-
The GRC Exclusions subtrack contains a set of regions that have been flagged by the GRC as
containing false duplications or contamination sequences. The GRC has now removed these sequences from
the files that it uses to generate the reference assembly; however, removing the sequences from the
GRCh38/hg38 assembly would trigger the next major release of the human assembly. To
help users recognize these regions and avoid them in their analyses, the GRC has produced a masking
file to be used as a companion to GRCh38, and the BED file is available from the
GenBank FTP site.
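As an illustration of how such regions might be avoided in an analysis, the following minimal Python sketch drops intervals that overlap any region from an exclusion BED file. The function names (parse_bed, overlaps, drop_excluded) and the example coordinates are hypothetical and not part of any UCSC or GRC tool; the sketch only assumes standard BED conventions (0-based, half-open intervals in the first three columns).

```python
from typing import Iterable, List, Tuple

# (chrom, start, end) using BED conventions: 0-based start, exclusive end.
Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse the first three columns of BED lines, skipping blanks and header lines."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def drop_excluded(calls: List[Interval], excluded: List[Interval]) -> List[Interval]:
    """Keep only the intervals that do not overlap any excluded region.

    This is a simple quadratic scan, adequate for small exclusion lists;
    an interval tree or sorted sweep would scale better.
    """
    return [c for c in calls if not any(overlaps(c, e) for e in excluded)]
```

In practice the exclusion list would be read from a downloaded BED file, for example with parse_bed(open(path)), where path is wherever the file was saved locally.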
The Highly Reproducible Regions track highlights regions and variants
from eight samples that can be used to assess variant detection pipelines. The
"Highly Reproducible Regions" subtrack comprises the intersection of the reproducible
regions across all eight samples, while the "Variants" subtracks contain the reproducible
variants from each assayed sample. Both tracks contain data from the following samples:
- a Chinese Quartet, samples CQ-5, CQ-6, CQ-7, CQ-8
- a HapMap Trio, samples NA10385, NA12248, NA12249
- a Genome in a Bottle sample, NA12878
Please refer to the Pan et al. reference for more information on how
these regions were defined.
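To make the idea of an intersection across samples concrete, here is a minimal Python sketch that intersects per-sample interval lists on a single chromosome. It is only an illustration of interval intersection in general, not the procedure used by Pan et al.; the function names and the toy coordinates are hypothetical, and each input list is assumed to be sorted and non-overlapping.

```python
from functools import reduce
from typing import List, Tuple

# (start, end) on one chromosome, 0-based start, exclusive end.
Span = Tuple[int, int]


def intersect_two(a: List[Span], b: List[Span]) -> List[Span]:
    """Intersect two sorted, non-overlapping interval lists with a linear sweep."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_all(samples: List[List[Span]]) -> List[Span]:
    """Intersect the interval lists of all samples (at least one sample required)."""
    return reduce(intersect_two, samples)


# Toy example with three hypothetical samples.
shared = intersect_all([[(0, 10), (20, 30)], [(5, 25)], [(8, 22)]])
print(shared)  # [(8, 10), (20, 22)]
```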
Display Conventions and Configuration
Each track contains a set of regions of varying length with no special configuration options.
The UCSC Unusual Regions track has a mouse-over description; all other tracks have at most
a name field, which can be shown in pack mode. The tracks are usually kept in dense mode.
The Hide empty subtracks control hides subtracks with no data in the browser window.
Changing the browser window by zooming or scrolling may result in the display of a different
selection of tracks.
Data access
The raw data can be explored interactively with the Table Browser
or the Data Integrator.
For automated download and analysis, the genome annotation is stored in bigBed files that
can be downloaded from our download server. Individual regions or the whole genome annotation
can be obtained using our tool bigBedToBed, which can be compiled from the source code or
downloaded as a precompiled binary for your system. Instructions for downloading source code
and binaries can be found here. The tool can also be used to obtain only features within a
given range, e.g.
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb -chrom=chr21 -start=0 -end=100000000 stdout
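For scripted use, the same call can be wrapped in a few lines of Python. The sketch below assumes that the bigBedToBed binary is installed and on the PATH; the helper names bigbed_to_bed_cmd and fetch_regions are hypothetical, and only the -chrom, -start and -end options shown in the command above are used.

```python
import subprocess
from typing import List, Optional


def bigbed_to_bed_cmd(url: str,
                      chrom: Optional[str] = None,
                      start: Optional[int] = None,
                      end: Optional[int] = None,
                      out: str = "stdout") -> List[str]:
    """Build the bigBedToBed argument list, mirroring the command line shown above."""
    cmd = ["bigBedToBed", url]
    if chrom is not None:
        cmd.append(f"-chrom={chrom}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    cmd.append(out)
    return cmd


def fetch_regions(url: str, chrom: str, start: int, end: int) -> List[List[str]]:
    """Run bigBedToBed and return each BED line as a list of tab-separated fields.

    Requires network access and the bigBedToBed binary; raises
    subprocess.CalledProcessError if the tool exits with an error.
    """
    result = subprocess.run(bigbed_to_bed_cmd(url, chrom, start, end),
                            check=True, capture_output=True, text=True)
    return [line.split("\t") for line in result.stdout.splitlines() if line]


# Example call (not executed here because it needs the binary and a network connection):
# rows = fetch_regions("http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb",
#                      "chr21", 0, 100000000)
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.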
Methods
Files were downloaded from the respective databases and converted to bigBed format.
The procedure is documented in our
hg38 makeDoc file.
Credits
Thanks to Anna Benet-Pagès, Max Haeussler, Angie Hinrichs, Daniel Schmelter, and Jairo
Navarro at the UCSC Genome Browser for planning, building, and testing these tracks. The
underlying data comes from the
ENCODE Blacklist and some parts were copied manually from the HGNC and NCBI
RefSeq tracks.
References
Amemiya HM, Kundaje A, Boyle AP.
The ENCODE Blacklist: Identification of Problematic Regions of the Genome.
Sci Rep. 2019 Jun 27;9(1):9354.
PMID: 31249361; PMC: PMC6597582
Pan B, Ren L, Onuchic V, Guan M, Kusko R, Bruinsma S, Trigg L, Scherer A, Ning B, Zhang C et al.
Assessing reproducibility of inherited variants detected with short-read whole genome
sequencing.
Genome Biol. 2022 Jan 3;23(1):2.
PMID: 34980216; PMC: PMC8722114