Description
This track represents a comprehensive set of human transcription factor binding sites based on
ChIP-seq experiments generated by production groups in the ENCODE Consortium from the inception
of the project in September 2007, through the March 2012 internal data freeze.
The track represents peak calls (regions of enrichment) that were generated by the ENCODE
Analysis Working Group (AWG) based on a uniform processing pipeline developed for the ENCODE
Integrative Analysis effort and published in a set of coordinated papers in September 2012.
Peak calls from that effort (based on datasets from the January 2011 ENCODE data freeze) are
available at the ENCODE Analysis Data Hub.
This track is an update that includes newer data and slightly modified peak-calling
methods.
This track contains 690 ChIP-seq datasets representing 161 unique regulatory factors (generic
and sequence-specific factors).
The datasets span 91 human cell types, some under various treatment conditions.
These datasets were generated by the five ENCODE TFBS ChIP-seq production groups: Broad,
Stanford/Yale/UC-Davis/Harvard, HudsonAlpha Institute, University of Texas-Austin and University
of Washington, and University of Chicago. The University of Chicago ChIP-seq experiments were performed
with an alternative epitope-tagged ChIP-seq methodology. The primary and lab-processed data
(along with methods descriptions, credits and references) on which this track is based are
available in the following ENCODE tracks: HAIB TFBS, SYDH TFBS, UChicago TFBS, UTA TFBS,
UW CTCF Binding. These tracks are accessible from the ENC TF Binding Super-track.
Display and File Conventions and Configuration
The display for this track shows site location with the point-source of the peak marked with a
colored vertical bar and the level of enrichment at the site indicated by the darkness of the item.
The display can be filtered to higher-valued items using the
Score range: configuration item.
The score values were computed at UCSC based on signal values assigned by the ENCODE
uniform analysis pipeline.
The input signal values were multiplied by a normalization factor calculated as the ratio
of the maximum score value (1000) to the signal value at one standard deviation above the mean,
with values exceeding 1000 capped at 1000. This distributes signal values up to
mean + 1 standard deviation across the score range and assigns the maximum score to all values above that.
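The score computation described above can be sketched as follows. This is an illustrative reconstruction, not the UCSC code: the function name, the use of the population standard deviation, and rounding to integers are all assumptions.

```python
from statistics import mean, pstdev

MAX_SCORE = 1000

def scale_scores(signal_values):
    """Map signal values to browser scores in the range 0-1000."""
    # Reference point: the signal value one standard deviation above the mean.
    # Using the population standard deviation here is an assumption.
    reference = mean(signal_values) + pstdev(signal_values)
    factor = MAX_SCORE / reference
    # Values beyond the reference point are capped at the maximum score.
    # Rounding to an integer is an assumption (BED scores are integers).
    return [min(MAX_SCORE, round(value * factor)) for value in signal_values]

# Invented signal values; the outlier (50.0) is capped at 1000.
print(scale_scores([2.0, 4.0, 6.0, 8.0, 50.0]))
```

Capping at the reference point keeps a few very strong peaks from compressing all other items into the pale end of the shading range.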
This track is a composite annotation track containing multiple subtracks, one for each cell type.
The display mode and filtering of each subtrack can be individually controlled.
For more information about track configuration, see
Configuring Multi-View Tracks.
Metadata for a particular subtrack can be found by clicking the down arrow in
the list of subtracks. The UCSC Accession listed in the metadata can be used with the
File Search tool to
retrieve primary data files underlying datasets of interest, by selecting
UCSC Accession from the "ENCODE terms" drop down menu option.
In the subtrack selection list, the ENCODE tier (priority) is listed for each cell type.
Tier 1 and Tier 2 represent categories with cell types designated for intensive study by
the ENCODE investigators.
After the January 2011 data freeze, an additional set of cell types was promoted from
Tier 3 to Tier 2 to broaden the list of intensively studied cell types.
These cell types are listed as Tier 2* in the subtrack list here (and are
described as 'newly promoted to tier 2: not in 2011 analysis' on the
ENCODE Common Cell Types page).
Download files for this track are in ENCODE NarrowPeak format.
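As a reading aid for the download files, here is a minimal sketch of parsing one line of the ENCODE narrowPeak (BED6+4) format and locating the point-source of the peak. The class and function names are hypothetical, and the example line is invented for illustration.

```python
from typing import NamedTuple, Optional

class NarrowPeak(NamedTuple):
    chrom: str
    start: int           # 0-based start of the peak region
    end: int             # end of the peak region (exclusive)
    name: str
    score: int           # display score, 0-1000
    strand: str          # "+", "-", or "." if unassigned
    signal_value: float  # enrichment measure for the region
    p_value: float       # -log10 p-value, or -1 if not assigned
    q_value: float       # -log10 q-value, or -1 if not assigned
    peak: int            # point-source offset from start, or -1 if not called

def parse_narrowpeak_line(line: str) -> NarrowPeak:
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))

def summit_position(p: NarrowPeak) -> Optional[int]:
    """Genomic coordinate of the point-source, or None if no summit was called."""
    return None if p.peak < 0 else p.start + p.peak

# Invented example line, not taken from the track data.
example = "chr1\t100\t300\tpeak1\t850\t.\t42.5\t-1\t3.2\t120"
peak = parse_narrowpeak_line(example)
print(peak.chrom, summit_position(peak))
```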
Methods
All ChIP-seq experiments were performed at least in duplicate, and were scored against
an appropriate control designated by the production groups (either input DNA or DNA obtained
from a control immunoprecipitation).
Short Read Mapping
For each dataset, mapped reads in the form of BAM files were downloaded from the
ENCODE UCSC DCC.
These BAM files were generated by the ENCODE data production labs (using different mappers
and mapping parameters), but all used a standardized version of the GRCh37 (hg19) reference
human genome sequence with the following modifications:
- Mitochondrial sequence was included.
- Alternate sequences were excluded.
- Random contigs were excluded.
- The female version of the genome was represented by the autosomes and chrX, whereas the
male genome was represented by the autosomes, chrX, and chrY with the PAR regions masked.
To standardize the mapping protocol, custom unique-mappability tracks were used to
retain only uniquely mapping reads, i.e. reads that map to exactly one location in the genome.
Positional and PCR duplicates were also filtered out.
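The filtering logic can be illustrated with a small sketch. It assumes reads are reduced to (chromosome, start, strand) tuples, that a positional duplicate is a read sharing all three values with an earlier read, and that unique mappability is available as a lookup function. The function and predicate names are hypothetical and do not reflect the actual pipeline code.

```python
def filter_reads(reads, is_uniquely_mappable):
    """Keep one read per (chrom, start, strand) at uniquely mappable positions."""
    seen = set()
    kept = []
    for chrom, start, strand in reads:
        # Drop reads whose position is not uniquely mappable.
        if not is_uniquely_mappable(chrom, start):
            continue
        key = (chrom, start, strand)
        # Drop positional duplicates: keep only the first read at each key.
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept

# Toy data: chr2 stands in for a region that is not uniquely mappable.
reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "-"), ("chr2", 50, "+")]
print(filter_reads(reads, lambda chrom, start: chrom != "chr2"))
```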
Quality Control
A number of quality metrics for individual replicates listed on the
ENCODE portal Quality Metrics page, including measures of library complexity
and signal enrichment, were calculated and are available for
review (Landt et al., 2012; Kundaje et al., 2013a).
The Integrated Quality Flag from this quality assessment was used to assign the
quality metadata term
for each dataset (e.g., Good vs. Caution).
Datasets that did not pass the minimum quality control thresholds are not included in this track.
Peak Calling
Since every ENCODE dataset is represented by at least two biological replicate experiments,
a novel measure of consistency and reproducibility of peak calling results between replicates,
known as the Irreproducible Discovery Rate (IDR), was used to determine an optimal number
of reproducible peaks (Li et al., 2011; Kundaje et al., 2013b). Code and detailed step-by-step
instructions to call peaks using the IDR method are
available.
In brief, the SPP peak caller (Kharchenko et al., 2008) was used with a relaxed peak calling
threshold (FDR = 0.9) to obtain a large number of peaks (maximum of 300K) that span true signal
as well as noise (false identifications). The IDR method analyzes a pair of replicates, and
considers peaks that are present in both replicates to belong to one of two populations: a
reproducible signal group or an irreproducible noise group. Peaks from the reproducible group
are expected to show relatively higher ranks (ranked based on signal scores) and stronger
rank-consistency across the replicates, relative to peaks in the irreproducible group.
Based on these assumptions, a two-component probabilistic copula-mixture model is used to fit
the bivariate peak rank distributions from the pairs of replicates. The method adaptively learns
the degree of peak-rank consistency in the signal component and the proportion of peaks belonging
to each component. The model can then be used to infer an IDR score for every peak that is found
in both replicates. The IDR score of a peak represents the expected probability that the peak
belongs to the noise component, and is based on its ranks in the two replicates.
Hence, low IDR scores represent high-confidence peaks. An IDR score threshold of 0.02 (2%) was
used to obtain an optimal peak rank threshold on the replicate peak sets (cross-replicate
threshold).
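The step from IDR scores to a peak rank threshold can be sketched as below. The sketch does not implement the copula-mixture model itself; it assumes per-peak IDR scores have already been produced by the IDR code, and it assumes the comparison is inclusive (IDR at or below 0.02). The function name is hypothetical.

```python
IDR_THRESHOLD = 0.02  # 2%, as used for the March 2012 freeze

def rank_threshold(idr_scores, threshold=IDR_THRESHOLD):
    """Number of replicate-matched peaks whose IDR is at or below the threshold.

    This count serves as the peak rank threshold: the top-ranked peaks up to
    this number are treated as reproducible.
    """
    return sum(1 for idr in idr_scores if idr <= threshold)

# Invented IDR scores for five peaks found in both replicates.
print(rank_threshold([0.001, 0.01, 0.02, 0.05, 0.4]))
```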
If a dataset had more than two replicates, all pairs of replicates were analyzed using the IDR
method. The maximum peak rank threshold across all pairwise analyses was used as the final
cross-replicate peak rank threshold. Reads from replicate datasets were then pooled and SPP
was once again used to call peaks on the pooled data with a relaxed FDR of 0.9.
Pooled-data peaks were then ranked by signal score. The cross-replicate rank threshold
learned from the replicates was used to threshold the ranked set of pooled-data peaks.
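A minimal sketch of these two steps, assuming pooled peaks are represented as (name, signal score) pairs and that each pairwise IDR analysis has already produced a rank threshold. The function names and example values are hypothetical.

```python
def cross_replicate_threshold(pairwise_thresholds):
    """Final cross-replicate threshold: the maximum over all replicate pairs."""
    return max(pairwise_thresholds)

def threshold_pooled_peaks(pooled_peaks, n_keep):
    """Rank pooled-data peaks by signal score and keep the top n_keep."""
    ranked = sorted(pooled_peaks, key=lambda peak: peak[1], reverse=True)
    return ranked[:n_keep]

# Invented example: three replicate pairs and four pooled-data peaks.
n_keep = cross_replicate_threshold([2, 3, 1])
pooled = [("a", 5.0), ("b", 9.0), ("c", 1.0), ("d", 7.0)]
print(threshold_pooled_peaks(pooled, n_keep))
```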
Any thresholds based on reproducibility of peak calling between biological replicates are bounded
by the quality and enrichment of the worst replicate. Valuable signal is lost in cases for which
a dataset has one replicate that is significantly worse in data quality than another replicate.
A rescue pipeline was used for such cases in order to balance data quality between a set of
replicates. Mapped reads were pooled across all replicates of a dataset, and then randomly
sampled (without replacement) to generate two pseudo-replicates with equal numbers of reads.
This sampling strategy tends to transfer signal from stronger replicates to the weaker
replicates, thereby balancing cross-replicate data quality and sequencing depth.
These pseudo-replicates were then processed using the IDR method in order to learn a rescue
threshold.
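The pseudo-replicate construction can be sketched as a shuffle followed by an even split, which is one way to sample without replacement. The function name, the fixed seed, and the choice to drop one read when the pooled count is odd are assumptions made for illustration.

```python
import random

def make_pseudo_replicates(replicates, seed=0):
    """Pool reads from all replicates and split them into two equal random halves."""
    pooled = [read for replicate in replicates for read in replicate]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(pooled)
    half = len(pooled) // 2
    # If the pooled count is odd, the last read is dropped so both halves match.
    return pooled[:half], pooled[half:2 * half]

# Toy data: integers stand in for mapped reads from a weak and a strong replicate.
weak, strong = list(range(5)), list(range(5, 12))
pseudo1, pseudo2 = make_pseudo_replicates([weak, strong])
print(len(pseudo1), len(pseudo2))
```

Because each pseudo-replicate draws from the pooled reads, both halves are expected to contain a similar mix of reads from the stronger and weaker replicates.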
For datasets with comparable replicates (based on independent measures of data quality), the
rescue threshold and cross-replicate thresholds were found to be very similar.
However, for datasets with replicates of differing data quality, the rescue thresholds were
often higher than the cross-replicate thresholds, and were able to capture true peaks that
showed statistically significant and visually compelling ChIP-seq signal in one replicate
but not in the other.
Ultimately, for each dataset, the better of the cross-replicate and rescue thresholds was used
to obtain a final consolidated optimal set of peaks.
All peak sets were then screened against a specially curated empirical blacklist
of regions in the human genome (wgEncodeDacMapabilityConsensusExcludable.bed.gz)
and peaks overlapping the blacklisted regions were discarded (Kundaje et al., 2013b).
Briefly, these artifact regions typically show the following characteristics:
- Unstructured and extreme artifactual high signal in sequenced input-DNA and control datasets,
as well as open chromatin datasets irrespective of cell type identity.
- An extreme ratio of multi-mapping to unique mapping reads from sequencing experiments.
- Overlap with pathological repeat regions such as centromeric, telomeric and satellite
repeats that often have few unique mappable locations interspersed in repeats.
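The blacklist screening step amounts to an interval-overlap filter, sketched below. The sketch assumes 0-based, half-open (chrom, start, end) intervals as in BED files and assumes that any overlap of at least one base discards the peak. The function names and coordinates are invented for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def remove_blacklisted(peaks, blacklist):
    """Discard every peak that overlaps any blacklisted region."""
    return [p for p in peaks if not any(overlaps(p, region) for region in blacklist)]

# Invented coordinates: the first peak overlaps the blacklisted region.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300)]
print(remove_blacklisted(peaks, blacklist))
```

The linear scan over the blacklist is adequate for a short list of regions; a sorted or indexed lookup would be a natural replacement for larger inputs.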
Differences from the January 2011 freeze pipeline
The January 2011 uniform processing was performed as part of the
ENCODE Integrative Analysis
reported in coordinated publications in September 2012.
The results from this effort are available from the ENCODE Analysis Hub at the EBI.
- For the March 2012 freeze, only the SPP peak caller was used.
SPP and PeakSeq were used for the January 2011 freeze.
- For March 2012, an extra step was performed in the read mapping phase to remove all
positional duplicates. This was done to avoid low library complexity issues.
In January 2011, positional duplicates were retained.
- For March 2012, an IDR threshold of 2% was used for comparing and thresholding the true
replicates and the pooled pseudo-replicates.
In January 2011, the IDR threshold was set to 1% for the true replicates and 0.25% for the
pooled pseudo-replicates. These thresholds were determined to be too stringent.
Credits
The processed data for this track were generated by Anshul Kundaje on
behalf of the ENCODE Analysis Working Group. Credits for the primary data underlying this
track are included in track description pages listed in the Description section above.
Contact:
Anshul Kundaje
References
ENCODE Project Consortium.
A user's guide to the encyclopedia of DNA elements (ENCODE).
PLoS Biol. 2011 Apr;9(4):e1001046. PMID: 21526222; PMCID: PMC3079585
ENCODE Project Consortium.
An integrated encyclopedia of DNA elements in the human genome.
Nature. 2012 Sep 6;489(7414):57-74. PMID: 22955616; PMCID: PMC3439153
Kharchenko PV, Tolstorukov MY, Park PJ.
Design and analysis of ChIP-seq experiments for DNA-binding proteins.
Nat Biotechnol. 2008 Dec;26(12):1351-9. PMID: 19029915; PMCID: PMC2597701
Kundaje A, Jung L, Kharchenko PV, Sidow A, Batzoglou S, Park PJ.
Assessment of ChIP-seq data quality using strand cross-correlation analysis. (submitted), 2012a.
Kundaje A, Li Q, Brown JB, Rozowsky J, Harmanci A, Wilder SP, Batzoglou S, Dunham I, Gerstein M, Birney E, et al.
Reproducibility measures for automatic threshold selection and quality control in ChIP-seq datasets. (submitted), 2012b.
Li QH, Brown JB, Huang HY, Bickel PJ.
Measuring reproducibility of high-throughput experiments.
Ann. Appl. Stat. 2011; 5(3):1752-1779.
Data Release Policy
While primary ENCODE data are subject to a restriction period as described in the
ENCODE data release policy,
this restriction does not apply to the integrative analysis results.
The data in this track are freely available.