ENCODE Transcription Factors Track Settings
 
ENCODE Transcription Factor ChIP-seq Peaks and Signal based on Uniform processing pipeline

Factor
ATF3
BATF
BCL11A
BCL3
BCLAF1
BDP1
BHLHE40
BRCA1
BRF1
BRF2
CCNT2
CEBPB
CHD2
CTBP2
CTCF
CTCFL
E2F1
E2F4
E2F6
EBF1
EGR1
ELF1
ELK4
EP300
ESR1
ESRRA
ETS1
FOS
FOSL1
FOSL2
FOXA1
FOXA2
GABPA
GATA1
GATA2
GATA3
GTF2B
GTF2F1
GTF3C2
HDAC2
HDAC8
HMGN3
HNF4A
HNF4G
HSF1
IRF1
IRF3
IRF4
JUN
JUNB
JUND
MAFF
MAFK
MAX
MEF2A
MEF2C
MXI1
MYC
NANOG
NFE2
NFKB1
NFYA
NFYB
NR2C2
NR3C1
NRF1
PAX5
PBX3
POLR2A
POLR3A
POLR3G
POU2F2
POU5F1
PPARGC1A
PRDM1
RAD21
RDBP
REST
RFX5
RXRA
SETDB1
SIN3A
SIRT6
SIX5
SMARCA4
SMARCB1
SMARCC1
SMARCC2
SMC3
SP1
SP2
SPI1
SREBF1
SRF
STAT1
STAT2
STAT3
SUZ12
TAF1
TAF7
TAL1
TBP
TCF12
TCF7L2
TFAP2A
TFAP2C
THAP1
TRIM28
USF1
USF2
WRNIP1
YY1
ZBTB33
ZBTB7A
ZEB1
ZNF143
ZNF263
ZNF274
ZZZ3

Cell Line
GM12878
H1-hESC
K562
HeLa-S3
HepG2
HUVEC
A549
AG04449
AG04450
AG09309
AG09319
AG10803
AoAF
BJ
Caco-2
ECC-1
Fibrobl
Gliobla
GM06990
GM10847
GM12864
GM12865
GM12872
GM12873
GM12874
GM12875
GM12891
GM12892
GM15510
GM18505
GM18526
GM18951
GM19099
GM19193
HA-sp
HBMEC
HCFaa
HCPEpiC
HCT-116
HEEpiC
HEK293
HMEC
HMF
HPAF
HPF
HRE
HRPEpiC
HSMM
HSMMtube
HTB-11
MCF10A-Er-Src
MCF-7
NB4
NH-A
NHDF-Ad
NHEK
NHLF
NT2-D1
Osteobl
PANC-1
PBDE
PFSK-1
ProgFib
Raji
SAEC
SH-SY5Y
SK-N-SH RA
T-47D
T-REx-HEK293
U2OS
U87
WERI-Rb-1

View    Tier  Cell Line  Factor  Lab          Treatment  Method   Track Name
Peaks   1     GM12878    CTCF    Broad        -          SPP      GM12878 SPP TFBS Peaks of CTCF from Broad (40006 peaks)
Peaks   1     GM12878    CTCF    UT-A         -          SPP      GM12878 SPP TFBS Peaks of CTCF from UT-A (42492 peaks)
Peaks   1     GM12878    CTCF    Stanford     -          SPP      GM12878 SPP TFBS Peaks of CTCF from Stanford (42808 peaks)
Peaks   1     GM12878    CTCF    UW           -          SPP      GM12878 SPP TFBS Peaks of CTCF from UW (35011 peaks)
Peaks   1     GM12878    CTCF    Broad        -          PeakSeq  GM12878 PeakSeq TFBS Peaks of CTCF from Broad
Peaks   1     GM12878    CTCF    UT-A         -          PeakSeq  GM12878 PeakSeq TFBS Peaks of CTCF from UT-A
Peaks   1     GM12878    CTCF    Stanford     -          PeakSeq  GM12878 PeakSeq TFBS Peaks of CTCF from Stanford
Peaks   1     GM12878    CTCF    UW           -          PeakSeq  GM12878 PeakSeq TFBS Peaks of CTCF from UW
Signal  1     GM12878    CTCF    Broad        -          Wiggler  GM12878 TFBS Signal of CTCF from Broad
Signal  1     GM12878    CTCF    UT-A         -          Wiggler  GM12878 TFBS Signal of CTCF from UT-A
Signal  1     GM12878    CTCF    Stanford     -          Wiggler  GM12878 TFBS Signal of CTCF from Stanford
Signal  1     GM12878    CTCF    UW           -          Wiggler  GM12878 TFBS Signal of CTCF from UW
Peaks   1     GM12878    GABPA   HudsonAlpha  -          PeakSeq  GM12878 PeakSeq TFBS Peaks of GABP from HudsonAlpha
Signal  1     GM12878    GABPA   HudsonAlpha  -          Wiggler  GM12878 TFBS Signal of GABP from HudsonAlpha
Peaks   1     GM12878    GABPA   HudsonAlpha  -          SPP      GM12878 SPP TFBS Peaks of GABP from HudsonAlpha (5095 peaks)
Peaks   1     GM12878    NFKB1   Stanford     -          PeakSeq  GM12878 PeakSeq TFBS Peaks of NFKB from Stanford
Signal  1     GM12878    NFKB1   Stanford     -          Wiggler  GM12878 TFBS Signal of NFKB from Stanford
Peaks   1     GM12878    NFKB1   Stanford     -          SPP      GM12878 SPP TFBS Peaks of NFKB from Stanford (10073 peaks)
Peaks   1     GM12878    POLR2A  HudsonAlpha  -          PeakSeq  GM12878 PeakSeq TFBS Peaks of POL2 from HudsonAlpha
Peaks   1     GM12878    POLR2A  UT-A         -          PeakSeq  GM12878 PeakSeq TFBS Peaks of POL2 from UT-A
Peaks   1     GM12878    POLR2A  Stanford     -          PeakSeq  GM12878 PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     GM12878    POLR2A  Stanford     -          PeakSeq  GM12878 PeakSeq TFBS Peaks of POL2 from Stanford
Signal  1     GM12878    POLR2A  HudsonAlpha  -          Wiggler  GM12878 TFBS Signal of POL2 from HudsonAlpha
Signal  1     GM12878    POLR2A  UT-A         -          Wiggler  GM12878 TFBS Signal of POL2 from UT-A
Signal  1     GM12878    POLR2A  Stanford     -          Wiggler  GM12878 TFBS Signal of POL2 from Stanford
Signal  1     GM12878    POLR2A  Stanford     -          Wiggler  GM12878 TFBS Signal of POL2 from Stanford
Peaks   1     GM12878    POLR2A  HudsonAlpha  -          PeakSeq  GM12878 PeakSeq TFBS Peaks of POL24H8 from HudsonAlpha
Signal  1     GM12878    POLR2A  HudsonAlpha  -          Wiggler  GM12878 TFBS Signal of POL24H8 from HudsonAlpha
Peaks   1     GM12878    POLR2A  HudsonAlpha  -          SPP      GM12878 SPP TFBS Peaks of POL24H8 from HudsonAlpha (20091 peaks)
Peaks   1     GM12878    POLR2A  HudsonAlpha  -          SPP      GM12878 SPP TFBS Peaks of POL2 from HudsonAlpha (34699 peaks)
Peaks   1     GM12878    POLR2A  UT-A         -          SPP      GM12878 SPP TFBS Peaks of POL2 from UT-A (12781 peaks)
Peaks   1     GM12878    POLR2A  Stanford     -          SPP      GM12878 SPP TFBS Peaks of POL2 from Stanford (21446 peaks)
Peaks   1     GM12878    POLR2A  Stanford     -          SPP      GM12878 SPP TFBS Peaks of POL2 from Stanford (9040 peaks)
Peaks   1     K562       CTCF    Broad        -          SPP      K562 SPP TFBS Peaks of CTCF from Broad (45094 peaks)
Peaks   1     K562       CTCF    UT-A         -          SPP      K562 SPP TFBS Peaks of CTCF from UT-A (46846 peaks)
Peaks   1     K562       CTCF    UW           -          SPP      K562 SPP TFBS Peaks of CTCF from UW (39601 peaks)
Peaks   1     K562       CTCF    Broad        -          PeakSeq  K562 PeakSeq TFBS Peaks of CTCF from Broad
Peaks   1     K562       CTCF    UT-A         -          PeakSeq  K562 PeakSeq TFBS Peaks of CTCF from UT-A
Peaks   1     K562       CTCF    UW           -          PeakSeq  K562 PeakSeq TFBS Peaks of CTCF from UW
Signal  1     K562       CTCF    Broad        -          Wiggler  K562 TFBS Signal of CTCF from Broad
Signal  1     K562       CTCF    UT-A         -          Wiggler  K562 TFBS Signal of CTCF from UT-A
Signal  1     K562       CTCF    UW           -          Wiggler  K562 TFBS Signal of CTCF from UW
Peaks   1     K562       CTCFL   HudsonAlpha  -          SPP      K562 SPP TFBS Peaks of CTCFL from HudsonAlpha (12906 peaks)
Peaks   1     K562       CTCFL   HudsonAlpha  -          PeakSeq  K562 PeakSeq TFBS Peaks of CTCFL from HudsonAlpha
Signal  1     K562       CTCFL   HudsonAlpha  -          Wiggler  K562 TFBS Signal of CTCFL from HudsonAlpha
Peaks   1     K562       GABPA   HudsonAlpha  -          PeakSeq  K562 PeakSeq TFBS Peaks of GABP from HudsonAlpha
Signal  1     K562       GABPA   HudsonAlpha  -          Wiggler  K562 TFBS Signal of GABP from HudsonAlpha
Peaks   1     K562       GABPA   HudsonAlpha  -          SPP      K562 SPP TFBS Peaks of GABP from HudsonAlpha (13202 peaks)
Peaks   1     K562       POLR2A  HudsonAlpha  -          PeakSeq  K562 PeakSeq TFBS Peaks of POL2 from HudsonAlpha
Peaks   1     K562       POLR2A  UT-A         -          PeakSeq  K562 PeakSeq TFBS Peaks of POL2 from UT-A
Peaks   1     K562       POLR2A  Stanford     IFNa30     PeakSeq  K562 with IFNa30 PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     K562       POLR2A  Stanford     IFNa6h     PeakSeq  K562 with IFNa6h PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     K562       POLR2A  Stanford     IFNg30     PeakSeq  K562 with IFNg30 PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     K562       POLR2A  Stanford     IFNg6h     PeakSeq  K562 with IFNg6h PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     K562       POLR2A  Stanford     -          PeakSeq  K562 PeakSeq TFBS Peaks of POL2 from Stanford
Peaks   1     K562       POLR2A  Stanford     -          PeakSeq  K562 PeakSeq TFBS Peaks of POL2 from Stanford
Signal  1     K562       POLR2A  HudsonAlpha  -          Wiggler  K562 TFBS Signal of POL2 from HudsonAlpha
Signal  1     K562       POLR2A  UT-A         -          Wiggler  K562 TFBS Signal of POL2 from UT-A
Signal  1     K562       POLR2A  Stanford     IFNa30     Wiggler  K562 with IFNa30 TFBS Signal of POL2 from Stanford
Signal  1     K562       POLR2A  Stanford     IFNa6h     Wiggler  K562 with IFNa6h TFBS Signal of POL2 from Stanford
Signal  1     K562       POLR2A  Stanford     IFNg30     Wiggler  K562 with IFNg30 TFBS Signal of POL2 from Stanford
Signal  1     K562       POLR2A  Stanford     IFNg6h     Wiggler  K562 with IFNg6h TFBS Signal of POL2 from Stanford
Signal  1     K562       POLR2A  Stanford     -          Wiggler  K562 TFBS Signal of POL2 from Stanford
Signal  1     K562       POLR2A  Stanford     -          Wiggler  K562 TFBS Signal of POL2 from Stanford
Peaks   1     K562       POLR2A  HudsonAlpha  -          PeakSeq  K562 PeakSeq TFBS Peaks of POL24H8 from HudsonAlpha
Signal  1     K562       POLR2A  HudsonAlpha  -          Wiggler  K562 TFBS Signal of POL24H8 from HudsonAlpha
Peaks   1     K562       POLR2A  Broad        -          PeakSeq  K562 PeakSeq TFBS Peaks of POL2B from Broad
Signal  1     K562       POLR2A  Broad        -          Wiggler  K562 TFBS Signal of POL2B from Broad
Peaks   1     K562       POLR2A  Broad        -          SPP      K562 SPP TFBS Peaks of POL2B from Broad (8681 peaks)
Peaks   1     K562       POLR2A  HudsonAlpha  -          SPP      K562 SPP TFBS Peaks of POL24H8 from HudsonAlpha (20507 peaks)
Peaks   1     K562       POLR2A  HudsonAlpha  -          SPP      K562 SPP TFBS Peaks of POL2 from HudsonAlpha (26826 peaks)
Peaks   1     K562       POLR2A  UT-A         -          SPP      K562 SPP TFBS Peaks of POL2 from UT-A (22489 peaks)
Peaks   1     K562       POLR2A  Stanford     IFNa30     SPP      K562 with IFNa30 SPP TFBS Peaks of POL2 from Stanford (13678 peaks)
Peaks   1     K562       POLR2A  Stanford     IFNa6h     SPP      K562 with IFNa6h SPP TFBS Peaks of POL2 from Stanford (13379 peaks)
Peaks   1     K562       POLR2A  Stanford     IFNg30     SPP      K562 with IFNg30 SPP TFBS Peaks of POL2 from Stanford (16921 peaks)
Peaks   1     K562       POLR2A  Stanford     IFNg6h     SPP      K562 with IFNg6h SPP TFBS Peaks of POL2 from Stanford (16979 peaks)
Peaks   1     K562       POLR2A  Stanford     -          SPP      K562 SPP TFBS Peaks of POL2 from Stanford (7994 peaks)
Peaks   1     K562       POLR2A  Stanford     -          SPP      K562 SPP TFBS Peaks of POL2 from Stanford (18052 peaks)
    
Assembly: Human Feb. 2009 (GRCh37/hg19)


Note: ENCODE Project

Description

This set of data tracks represents a comprehensive set of human transcription factor binding sites based on ChIP-seq experiments generated by all production groups in the ENCODE Consortium. The data tracks represent peak calls (regions of enrichment) and signals that were generated by the ENCODE Analysis Working Group (AWG) based on a uniform processing pipeline. The datasets are based on the January 2011 internal data freeze. These datasets were used in all downstream analysis pipelines by members of the ENCODE Consortium and are one of the primary sources of data referenced in the ENCODE Integrative analysis paper (ENCODE Project Consortium, 2012).

Data Statistics

These tracks comprise 457 ChIP-seq datasets covering 119 unique regulatory factors (generic and sequence-specific). The datasets span 77 human cell types, some under various treatment conditions. They were generated by the following production groups: Broad, Stanford/Yale/UC-Davis/Harvard, HudsonAlpha, UChicago, UT-Austin and UWashington.

Methods

All ChIP-seq experiments were performed at least in duplicate, and were scored against an appropriate control designated by the production groups (either input DNA or DNA obtained from a control immunoprecipitation). Submitted data was generally expected to meet an initial standard for inter-replicate consistency developed by the ENCODE Consortium to ensure an acceptable level of reproducibility; four fifths of the top 40% of the targets identified from one replicate (using an acceptable scoring method) should overlap the list of targets from the other replicate, or target lists scored using all available reads from each replicate should share more than 75% of targets in common. In addition, a number of quality metrics for individual replicates, including measures of library complexity and signal enrichment, were calculated, and these are available for review (Kundaje et al., 2012a, Kundaje et al., 2012b). As sequencing has become more economical, minimum standards for the number of reads required for submission of data have been established and upgraded over the course of the ENCODE project. Datasets used in the analyses presented here complied with the minimum depth requirements at the time of submission (ranging from 6M to 20M uniquely mapped reads).
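
The first inter-replicate consistency criterion above can be sketched as follows. This is only an illustration: the function names are hypothetical, targets are matched by identifier rather than by genomic overlap, and the "acceptable scoring method" is reduced to a single numeric score per target.

```python
def top_fraction(targets, fraction):
    """Return the highest-scoring `fraction` of a {target_id: score} mapping."""
    ranked = sorted(targets, key=targets.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])


def replicates_consistent(rep1, rep2, top=0.40, min_overlap=0.80):
    """Check that at least `min_overlap` of the top `top` fraction of rep1
    targets also appear anywhere in the rep2 target list."""
    best = top_fraction(rep1, top)
    shared = best & set(rep2)
    return len(shared) / len(best) >= min_overlap


# Invented scores, for illustration only.
rep1 = {"t1": 9.0, "t2": 8.0, "t3": 7.0, "t4": 6.0, "t5": 5.0,
        "t6": 4.0, "t7": 3.0, "t8": 2.0, "t9": 1.5, "t10": 1.0}
rep2 = {"t1": 8.5, "t2": 7.0, "t3": 6.5, "t4": 2.0, "t11": 1.0}
consistent = replicates_consistent(rep1, rep2)
```

In real data the comparison would be made on overlapping genomic intervals, and the check would typically be applied in both directions.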

A detailed description of the precise standards and considerations for evaluating the quality of ChIP-seq data and antibodies used for ChIP-seq is available (ENCODE Project Consortium, 2012).

We built a scoring pipeline for the uniform processing of all TF ChIP-seq experiments generated by the ENCODE Consortium. This pipeline was implemented on the EBI cluster, but can readily be ported to other computers. Uniform signal was generated by processing the aligned reads using the align2rawsignal "Wiggler" software (see http://code.google.com/p/align2rawsignal for details and settings). The method accounts for the depth of sequencing, the mappability of the genome (based on read length and ambiguous bases) and different fragment length shifts for the different datasets being combined. It also differentiates between positions that showed zero signal simply because they are unmappable and positions that are mappable but have no reads.

Reads from all ENCODE ChIP-seq experiments and matching controls were mapped to a standardized version of the GRCh37 (hg19) reference human genome sequence with the following modifications:

  • Mitochondrial sequence was included.
  • Alternate sequences were excluded.
  • Random contigs were excluded.
  • The female version of the genome was represented by the autosomes and chrX, whereas the male genome was represented by the autosomes, chrX, and chrY with the PAR regions masked.
Reads from experiments in cell lines labeled male or unknown were mapped to the male genome described above, while reads from experiments in cell lines labeled female were mapped to the female version of the genome.

All TF ChIP-seq datasets were processed using the standardized pipeline (Fig. S1b). Mapped reads in the form of BAM files were downloaded from the UCSC ENCODE portal (ENCODE Project Consortium, 2011). Multi-mapping reads were discarded. The SPP (Kharchenko et al., 2008) and PeakSeq (Rozowsky et al., 2009) peak calling methods were used to identify peaks (regions of enrichment) by comparing each ChIP-seq experiment to a corresponding input DNA control experiment. For the SPP peak caller, peaks were ranked using the peak signal score, which is a function of ChIP signal enrichment in each peak over background signal from a corresponding input DNA experiment, corrected for mirror correlation of positive- and negative-strand tag densities within the peak region. PeakSeq peaks were ranked using the estimated false discovery rate (q-value) for each peak, which is computed from the enrichment of ChIP-seq reads in a peak relative to the normalized counts of matching control reads using a binomial test.
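
As a rough illustration of ranking peaks by a binomial-test q-value, the sketch below is not the PeakSeq implementation. It assumes a null hypothesis in which a read in the peak is equally likely to come from the ChIP sample or the normalized control (p = 0.5), and it assumes the Benjamini-Hochberg procedure for converting p-values to q-values; neither choice is stated in the text above. All names and read counts are hypothetical.

```python
from math import comb


def binomial_tail(chip, control, p=0.5):
    """P(X >= chip) for X ~ Binomial(chip + control, p)."""
    n = chip + control
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(chip, n + 1))


def benjamini_hochberg(pvalues):
    """Convert p-values to q-values with the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# (ChIP reads, normalized control reads) per candidate peak; invented numbers.
counts = [(30, 5), (12, 10), (20, 4)]
pvals = [binomial_tail(chip, ctrl) for chip, ctrl in counts]
qvals = benjamini_hochberg(pvals)
# Peak indices ordered from most to least significant.
ranking = sorted(range(len(counts)), key=lambda i: qvals[i])
```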

Since every ENCODE dataset is represented by at least two biological replicate experiments, we used a measure of consistency of peak calling results between replicates, known as the irreproducible discovery rate (IDR), in order to determine an optimal number of reproducible peaks (Li et al., 2011, Kundaje et al., 2012b). Peak calling was performed independently on each replicate of a ChIP-seq dataset. We used relaxed peak calling thresholds (FDR = 0.9 for SPP and FDR = 0.05 for PeakSeq) in order to obtain a large number of peaks that span true signal as well as noise (false identifications).

The IDR method analyzes a pair of replicates, and considers peaks that are present in both replicates to belong to one of two populations: a reproducible signal group or an irreproducible noise group. Peaks from the reproducible group are expected to show relatively higher ranks and stronger rank-consistency across the replicates, relative to peaks in the irreproducible group. Based on these assumptions, a two-component probabilistic copula-mixture model is used to fit the bivariate peak rank distributions from the pairs of replicates. The method adaptively learns the degree of peak-rank consistency in the signal component and the proportion of peaks belonging to each component. The model can then be used to infer an IDR score for every peak that is found in both replicates. The IDR score of a peak represents the expected probability that the peak belongs to the noise component, and is based on its ranks in the two replicates. Hence, low IDR scores represent high-confidence peaks.

We used an IDR score threshold to obtain an optimal peak rank threshold on the replicate peak sets (cross-replicate threshold). For SPP-based peak calls, the IDR threshold used was 1%, and for PeakSeq, we used 5%. If a dataset had more than two replicates, all pairs of replicates were analyzed using the IDR method. We used the maximum peak rank threshold across all pairwise analyses as the final cross-replicate peak rank threshold. We then pooled reads from replicate datasets and used SPP/PeakSeq to call peaks on the pooled data with a relaxed FDR of 0.9/0.05. Pooled-data peaks were once again ranked by signal-score (for the SPP peak caller) or q-value (for the PeakSeq peak caller). The cross-replicate rank threshold learned from the replicates was used to threshold the ranked set of pooled-data peaks.
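
The thresholding steps described above (not the copula-mixture model itself) can be sketched as follows. The sketch assumes that IDR scores have already been computed for each pair of replicates, and that the peak rank threshold for a pair is the number of peaks whose IDR score is at or below the cutoff; the function names and all values are hypothetical.

```python
def rank_threshold(idr_scores, idr_cutoff):
    """Number of replicate-matched peaks whose IDR score passes the cutoff."""
    return sum(1 for score in idr_scores if score <= idr_cutoff)


def threshold_pooled_peaks(pooled_peaks, pairwise_idr, idr_cutoff):
    """Keep the top-N pooled peaks, where N is the largest pairwise threshold.

    pooled_peaks: list of (peak_name, ranking_score), higher score = better.
    pairwise_idr: one list of IDR scores per pair of replicates.
    """
    n_keep = max(rank_threshold(scores, idr_cutoff) for scores in pairwise_idr)
    ranked = sorted(pooled_peaks, key=lambda peak: peak[1], reverse=True)
    return ranked[:n_keep]


# Invented IDR scores for two replicate pairs and invented pooled peaks.
pairwise_idr = [
    [0.001, 0.004, 0.02, 0.3],
    [0.002, 0.008, 0.009, 0.5, 0.7],
]
pooled_peaks = [
    ("chr1:100-300", 50.0),
    ("chr1:900-1100", 12.0),
    ("chr2:40-260", 31.0),
    ("chr3:10-200", 7.5),
    ("chr4:5-180", 20.0),
]
final_peaks = threshold_pooled_peaks(pooled_peaks, pairwise_idr, idr_cutoff=0.01)
```

For PeakSeq, where peaks are ranked by q-value, the ranking score would have to be oriented so that larger values mean stronger peaks (for example, by negating the q-value).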

Any thresholds based on reproducibility of peak calling between biological replicates are bounded by the quality and enrichment of the worst replicate. Valuable signal is lost in cases for which a dataset has one replicate that is significantly worse in data quality than another replicate. Hence, we developed a rescue pipeline for such cases. In order to balance data quality between a set of replicates, we pooled mapped reads across all replicates of a dataset, and then randomly sampled (without replacement) two pseudo-replicates with equal numbers of reads. This sampling strategy tends to transfer signal from stronger replicates to the weaker replicates, thereby balancing cross-replicate data quality and sequencing depth. We then processed these pseudo-replicates using the IDR method in order to learn a rescue threshold. We found that for datasets with comparable replicates (based on independent measures of data quality), the rescue threshold and cross-replicate thresholds were very similar. However, for datasets with replicates of differing data quality, the rescue thresholds were often higher than the cross-replicate thresholds, and were able to capture true peaks that showed statistically significant and visually compelling ChIP-seq signal in one replicate but not in the other. Ultimately, for each dataset, we used the best of the cross-replicate and rescue thresholds to obtain a final consolidated optimal set of peaks.
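
The pseudo-replicate construction described above can be sketched as follows. Reads are represented by simple identifiers, and the function name and seed parameter are hypothetical. Shuffling the pooled reads and splitting them in half is one way to draw two equal-sized samples without replacement; it is not necessarily how the pipeline did it.

```python
import random


def make_pseudo_replicates(replicates, seed=0):
    """Pool reads from all replicates and split them at random into two
    disjoint pseudo-replicates with equal numbers of reads."""
    pooled = [read for rep in replicates for read in rep]
    rng = random.Random(seed)
    rng.shuffle(pooled)
    half = len(pooled) // 2  # with an odd total, one read is left out
    return pooled[:half], pooled[half:2 * half]


# A deep replicate and a shallow one; invented read identifiers.
rep_a = [f"a{i}" for i in range(7)]
rep_b = [f"b{i}" for i in range(4)]
pseudo1, pseudo2 = make_pseudo_replicates([rep_a, rep_b])
```

Each pseudo-replicate would then go through the same IDR analysis as the true replicates to learn the rescue threshold.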

All peak sets were then screened against a specially curated empirical blacklist of regions in the human genome (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDacMapabilityConsensusExcludable.bed.gz), and peaks overlapping the blacklisted regions were discarded (Kundaje et al., 2012b). Briefly, these artifact regions typically show the following characteristics:

  • Unstructured and extreme artifactual high signal in sequenced input-DNA and control datasets as well as open chromatin datasets irrespective of cell type identity.
  • An extreme ratio of multi-mapping to unique mapping reads from sequencing experiments.
  • Overlap with pathological repeat regions such as centromeric, telomeric and satellite repeats that often have few unique mappable locations interspersed in repeats.
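
The blacklist screening step can be sketched as a simple interval-overlap filter. The sketch assumes that peaks and blacklist regions are given as (chromosome, start, end) tuples in BED-style 0-based, half-open coordinates; the function names and the coordinates in the example are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def drop_blacklisted(peaks, blacklist):
    """Return the peaks that do not overlap any blacklisted region."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        (chrom, start, end)
        for chrom, start, end in peaks
        if not any(overlaps(start, end, b_start, b_end)
                   for b_start, b_end in by_chrom.get(chrom, []))
    ]


# Invented coordinates, for illustration only.
blacklist = [("chr1", 1000, 2000), ("chr2", 500, 800)]
peaks = [
    ("chr1", 100, 400),    # kept: no overlap
    ("chr1", 1900, 2100),  # dropped: overlaps chr1:1000-2000
    ("chr2", 800, 900),    # kept: starts exactly where the region ends
    ("chr3", 500, 800),    # kept: same coordinates, different chromosome
]
kept = drop_blacklisted(peaks, blacklist)
```

For genome-scale peak sets, a sorted-interval or interval-tree lookup would be more efficient than this linear scan.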

References

ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011 Apr;9(4):e1001046.

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012 Sep 6;489(7414):57-74.

Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008 Dec;26(12):1351-9.

Kundaje A, Jung L, Kharchenko PV, Sidow A, Batzoglou S, Park PJ. Assessment of ChIP-seq data quality using strand cross-correlation analysis (submitted), 2012a.

Kundaje A, Li Q, Brown JB, Rozowsky J, Harmanci A, Wilder SP, Batzoglou S, Dunham I, Gerstein M, Birney E, et al. Reproducibility measures for automatic threshold selection and quality control in ChIP-seq datasets (submitted), 2012b.

Li QH, Brown JB, Huang HY, Bickel PJ. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 2011; 5(3):1752-1779.

Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009 Jan;27(1):66-75.

Data Release Policy

Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column on the track configuration page and the download page. The full data release policy for ENCODE is available here.

There is no restriction on the use of these specific tracks.

Contact

Richard Myers