GENCODE VM18 Transcript Annotation ENSMUST00000079812.7 (Notch2)
 


GENCODE id: ENSMUST00000079812.7 (transcript), ENSMUSG00000027878.11 (gene)
Protein id: ENSMUSP00000078741.6
HAVANA manual id: OTTMUST00000132998.1 (transcript), OTTMUSG00000053240.1 (gene)
Position: chr3:98013538-98150361 (transcript), chr3:98013538-98150361 (gene)
Strand: +
Biotype: protein_coding (transcript), protein_coding (gene)
Annotation Level: manual (2)
Annotation Method: manual & automatic (transcript), manual & automatic (gene)
Transcription Support Level: tsl1
HGNC gene symbol: Notch2
CCDS: CCDS51013.1
GeneCards: Notch2
APPRIS: PRINCIPAL:1 (transcript), Notch2 (gene)
Tags: CCDS, basic, appris_principal_1
Sequences
Predicted mRNA, Predicted protein
Annotation Remarks
Protein Data Bank
PubMed
8440332, 7926761, 7918097
7698746, 7609614, 7615640
8645602, 1295745, 8917536
8898100, 9111338, 9108364
9187150, 9242490, 9291577
9570965, 9690472, 9690473
9739106, 9811581, 9858718
9882480, 10194420, 10393120
10383933, 10476967, 10508694
10551863, 10615124, 10704869
10704873, 10825208, 10842072
10878608, 10952889, 10952903
11044610, 11124811, 11171333
11262239, 11401408, 11414760
11466531, 11578869, 11641270
11739954, 11731257, 11284963
11713346, 11861489, 11823422
12167404, 10932180, 12242716
12242712, 12244553, 12421720
12397111, 12441287, 12617809
12670869, 12730124, 12753746
12753744, 12846471, 12866128
12904583, 12960298, 14514757
12971992, 12947105, 12907456
14610273, 14651921, 14732396
14702043, 14757642, 14701881
12787792, 10958687, 11545715
11118901, 14672936, 15019995
14960276, 15068794, 15037319
15155583, 15226394, 15076712
12520002, 15187027, 15465493
15465494, 10512203, 15499562
15525534, 14607778, 15578515
15064243, 15659488, 15689374
15821257, 15882997, 15866159
15917835, 16141072, 16000382
16120638, 16245338, 16169548
16397869, 15897231, 16524929
16518823, 16607638, 16458080
15840001, 16914494, 17015435
16840533, 12496659, 17080428
17273555, 17194759, 17229764
17359302, 10868931, 17531978
17476283, 17658279, 17261636
17572911, 14612407, 17915208
17030682, 17919533, 18037398
18155189, 18180309, 18330927
18365007, 18156632, 17303760
17196193, 18458347, 18410734
18708576, 18635610, 18507503
18710934, 18724371, 18838555
18694942, 17579038, 19020777
18927153, 18957219, 19217325
19211676, 19185577, 19223466
19369401, 17699740, 17366661
18272846, 18801836, 19573812
19403103, 19530136, 21267068
19855135, 19805515, 19723505
19897741, 20059953, 19914235
19915064, 19914171, 20174635
20156974, 20335360, 20351182
18093989, 16602821, 20624967
20727876, 21042537, 21150899
20440271, 20971948, 20870902
20856205, 21257969, 21302255
20007915, 21490058, 21285514
21562564, 21297003, 19797679
21703454, 21693515, 21677750
21328691, 21420948, 19551907
21731673, 20558824, 21646720
21633169, 21791528, 21779394
21852398, 21909092, 21402740
21786021, 21311046, 22018469
21795747, 21991352, 22190634
21909072, 22109558, 22173065
21750033, 22156581, 22274697
22293205, 21084383, 17573339
22457635, 21252157, 18299578
22560297, 22615412, 22366192
22675211, 22675208, 20299358
20378824, 20975697, 22503540
22652674, 23132245, 23079659
23079658, 23319657, 23362349
23362348, 23243020, 23382219
23264614, 23277114, 23314057
23555292, 23144817, 22396647
23359070, 23676271, 23460609
23454750, 23406467, 23604318
23806616, 22505716, 23665443
23386589, 23884446, 23467742
23954203, 22865640, 23884415
24154525, 24337118, 23825310
23918982, 24227653, 24353058
24508387, 23913046, 24449835
24711412, 24145721, 24741061
24982181, 24552588, 25184679
24673559, 25446530, 24398584
24509876, 25505333, 24859004
25715395, 25574842, 25813538
25977369, 25835502, 26062937
26417101, 26669487, 26453897
25501905, 25776744, 25992862
26627824, 26542370, 26527653
26108693, 27073171, 26912775
18297083, 26697723, 26119937
27351100, 27364009, 27102824
26940862, 27122169, 27564454
27510977, 27633993, 27621062
26450968, 26450967, 27358050
28455378, 28455376, 28360133
27576369, 28348230, 26188077
28319044, 25691540, 25957400
28323963, 28592489, 28656980
28747678, 24563458, 26293507
27084580, 27631609, 28036337
28402857, 29149593, 29113990
28676438, 29033846, 29326173
29037852, 29031500, 27141929
27776108, 29139178
Entrez Gene
18129
RefSeq
RNA          Protein
NM_010928.2  NP_035058.2
UniProt
Data set  Accession  Name
TrEMBL    G5E8J0     G5E8J0
Supporting Evidence (ENSMUST00000079812.7)
Source             Sequence
CCDS               CCDS51013.1
EMBL               AI787996.1
EMBL               AK157200.1
EMBL               AV012751.2
EMBL               AW913630.1
EMBL               BB294145.1
EMBL               BC059256
EMBL               BC059256.1
EMBL               BE993512.1
EMBL               BX513654.1
EMBL               BY754743.1
EMBL               CF742927.1
EMBL               CJ053541.1
EMBL               CN525512.1
EMBL               D32210.1
EMBL               X68279
RefSeq_dna         NM_010928
RefSeq_dna         NM_010928.2
Uniprot/Varsplic   O35516-2


Data last updated at UCSC: 2018-08-03

Description

The GENCODE Genes track (version M18, July 2018) shows high-quality manual annotations merged with evidence-based automated annotations across the entire mouse genome, generated by the GENCODE project. The GENCODE gene set presents a full merge between the HAVANA manual annotation process and the Ensembl automatic annotation pipeline. Priority is given to the manually curated HAVANA annotation; predicted Ensembl annotations are used where there are no corresponding manual annotations. The M18 annotation was carried out on genome assembly GRCm38 (mm10).

The Ensembl human and mouse data sets are the same gene annotations as GENCODE for the corresponding release.

Display Conventions and Configuration

This track is a multi-view composite track that contains differing data sets (views). Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide.

Views available on this track are:
Genes
The gene annotations in this view are divided into three subtracks:
  • GENCODE Basic set is a subset of the Comprehensive set. The selection criteria are described in the methods section.
  • GENCODE Comprehensive set contains all GENCODE coding and non-coding transcript annotations, including polymorphic pseudogenes. This includes both manual and automatic annotations. This is a super-set of the Basic set.
  • GENCODE Pseudogenes include all pseudogene annotations except polymorphic pseudogenes.
2-way
  • GENCODE 2-way Pseudogenes contains pseudogenes predicted by both the Yale PseudoPipe and UCSC RetroFinder pipelines. The set was derived by looking for 50 base pairs of overlap between pseudogenes derived from both sets based on their chromosomal coordinates. When multiple PseudoPipe predictions map to a single RetroFinder prediction, only one match is kept for the 2-way consensus set.
PolyA
  • GENCODE PolyA contains polyA signals and sites manually annotated on the genome based on transcribed evidence (ESTs and cDNAs) of the 3' ends of transcripts containing at least 3 A's not matching the genome.
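
As an illustration of the 2-way consensus step described above, the following is a minimal sketch rather than the pipeline's actual code. The `Interval` type, the function names, and the half-open coordinate convention are hypothetical. Two points are assumptions not stated in the text: the 50 bp figure is read as a minimum overlap, and when several PseudoPipe predictions hit one RetroFinder prediction the one with the largest overlap is kept. Strand is ignored for simplicity.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A predicted pseudogene location (hypothetical record layout)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if they are disjoint)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def two_way_consensus(
    pseudopipe: List[Interval],
    retrofinder: List[Interval],
    min_overlap: int = 50,
) -> List[Tuple[Interval, Interval]]:
    """Pair each RetroFinder prediction with at most one PseudoPipe prediction."""
    pairs = []
    for retro in retrofinder:
        hits = [p for p in pseudopipe if overlap_bp(p, retro) >= min_overlap]
        if hits:
            # Assumption: the text only says one match is kept, not which one;
            # this sketch keeps the hit with the largest overlap.
            best = max(hits, key=lambda p: overlap_bp(p, retro))
            pairs.append((best, retro))
    return pairs
```

A quadratic scan is used here only for clarity; a real implementation over genome-wide prediction sets would more likely sort the intervals or use an interval index.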

Filtering is available for the items in the GENCODE Basic, Comprehensive and Pseudogene tracks using the following criteria:

  • Transcript class: filter by the basic biological function of a transcript annotation
    • All - don't filter by transcript class
    • coding - display protein coding transcripts, including polymorphic pseudogenes
    • nonCoding - display non-protein coding transcripts
    • pseudo - display pseudogene transcript annotations
    • problem - display problem transcripts (Biotypes of retained_intron, TEC, or disrupted_domain)
  • Transcript Annotation Method: filter by the method used to create the annotation
    • All - don't filter by annotation method
    • manual - display manually created annotations, including those that are also created automatically
    • automatic - display automatically created annotations, including those that are also created manually
    • manual_only - display manually created annotations that were not annotated by the automatic method
    • automatic_only - display automatically created annotations that were not annotated by the manual method
  • Transcript Biotype: filter transcripts by Biotype
  • Support Level: filter transcripts by transcription support level
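
The difference between the `manual`/`automatic` filters and their `_only` counterparts is easy to misread, so here is a small sketch of the semantics as described in the list above. The `Transcript` record and the function names are hypothetical and are not part of the Genome Browser; the example record uses the values shown for ENSMUST00000079812.7 on this page.

```python
from typing import NamedTuple

# Biotypes that the "problem" transcript class covers, per the list above.
PROBLEM_BIOTYPES = {"retained_intron", "TEC", "disrupted_domain"}


class Transcript(NamedTuple):
    """Hypothetical minimal record for one transcript annotation."""
    transcript_id: str
    biotype: str
    manual: bool     # annotated by the manual method
    automatic: bool  # annotated by the automatic method


def passes_method_filter(t: Transcript, method: str) -> bool:
    """Apply the Transcript Annotation Method filter to one transcript."""
    if method == "All":
        return True
    if method == "manual":
        return t.manual
    if method == "automatic":
        return t.automatic
    if method == "manual_only":
        return t.manual and not t.automatic
    if method == "automatic_only":
        return t.automatic and not t.manual
    raise ValueError(f"unknown annotation method filter: {method}")


def is_problem(t: Transcript) -> bool:
    """True if the transcript falls in the 'problem' transcript class."""
    return t.biotype in PROBLEM_BIOTYPES


notch2 = Transcript("ENSMUST00000079812.7", "protein_coding", manual=True, automatic=True)
shown_by = [m for m in ("manual", "automatic", "manual_only", "automatic_only")
            if passes_method_filter(notch2, m)]
```

For a transcript annotated as "manual & automatic", `shown_by` contains `manual` and `automatic` but neither of the `_only` filters.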

Coloring for the gene annotations is based on the annotation type:

  • coding
  • non-coding
  • pseudogene
  • problem
  • all 2-way pseudogenes
  • all polyA annotations

Methods

The GENCODE project aims to annotate all evidence-based gene features on the human and mouse reference sequence with high accuracy by integrating computational approaches (including comparative methods), manual annotation and targeted experimental verification. This goal includes identifying all protein-coding loci with associated alternative variants, non-coding loci which have transcript evidence, and pseudogenes. For a detailed description of the methods and references used, see Harrow et al. (2006).

GENCODE Basic Set selection: The GENCODE Basic Set is intended to provide a simplified subset of the GENCODE transcript annotations that will be useful to the majority of users. The goal was to have a high-quality basic set that also covered all loci. Selection of GENCODE annotations for inclusion in the basic set was determined independently for the coding and non-coding transcripts at each gene locus.

  • Criteria for selection of coding transcripts (including polymorphic pseudogenes) at a given locus:
    • All full-length coding transcripts (except problem transcripts or transcripts that are nonsense-mediated decay) were included in the basic set.
    • If there were no transcripts meeting the above criteria, then the partial coding transcript with the largest CDS was included in the basic set (excluding problem transcripts).
  • Criteria for selection of non-coding transcripts at a given locus:
    • All full-length non-coding transcripts (except problem transcripts) with a well characterized Biotype (see below) were included in the basic set.
    • If there were no transcripts meeting the above criteria, then the largest non-coding transcript was included in the basic set (excluding problem transcripts).
  • If no transcripts were included by either of the above criteria, the longest problem transcript was included.

Non-coding transcript categorization: Non-coding transcripts are categorized using their Biotype and the following criteria:

  • well characterized: antisense, Mt_rRNA, Mt_tRNA, miRNA, rRNA, snRNA, snoRNA
  • poorly characterized: 3prime_overlapping_ncrna, lincRNA, misc_RNA, non_coding, processed_transcript, sense_intronic, sense_overlapping
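
The Basic set selection criteria and the biotype categorization above can be sketched roughly as follows. This is an illustrative reading of the prose, not the GENCODE implementation. The `Transcript` fields and `select_basic` are hypothetical names, "largest" is taken to mean longest, and the sketch assumes that each transcript has already been classified as coding or non-coding and flagged as full-length, problem, or nonsense-mediated decay.

```python
from typing import List, NamedTuple

WELL_CHARACTERIZED = {
    "antisense", "Mt_rRNA", "Mt_tRNA", "miRNA", "rRNA", "snRNA", "snoRNA",
}


class Transcript(NamedTuple):
    """Hypothetical record for one transcript at a gene locus."""
    name: str
    coding: bool       # coding (including polymorphic pseudogene) vs non-coding
    biotype: str
    full_length: bool
    problem: bool      # retained_intron, TEC, or disrupted_domain
    nmd: bool          # nonsense-mediated decay
    length: int        # transcript length in bases
    cds_length: int = 0


def select_basic(locus: List[Transcript]) -> List[Transcript]:
    """Pick the Basic set transcripts for one locus."""
    usable = [t for t in locus if not t.problem]
    coding = [t for t in usable if t.coding]
    noncoding = [t for t in usable if not t.coding]

    # Coding: all full-length, non-NMD transcripts; otherwise the partial
    # transcript with the largest CDS.
    picked_coding = [t for t in coding if t.full_length and not t.nmd]
    if not picked_coding:
        partial = [t for t in coding if not t.full_length]
        if partial:
            picked_coding = [max(partial, key=lambda t: t.cds_length)]

    # Non-coding: all full-length transcripts with a well characterized
    # biotype; otherwise the largest non-coding transcript.
    picked_nc = [t for t in noncoding
                 if t.full_length and t.biotype in WELL_CHARACTERIZED]
    if not picked_nc and noncoding:
        picked_nc = [max(noncoding, key=lambda t: t.length)]

    basic = picked_coding + picked_nc
    if not basic:
        # Last resort: the longest problem transcript.
        problems = [t for t in locus if t.problem]
        if problems:
            basic = [max(problems, key=lambda t: t.length)]
    return basic
```

The coding and non-coding branches are evaluated independently, matching the statement that selection is determined separately for the two groups at each locus.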

Transcription Support Level (TSL): It is important that users understand how to assess transcript annotations that they see in GENCODE. While some transcript models have a high level of support through the full length of their exon structure, there are also transcripts that are poorly supported and that should be considered speculative. The Transcription Support Level (TSL) is a method to highlight the well-supported and poorly-supported transcript models for users. The method relies on the primary data that can support full-length transcript structure: mRNA and EST alignments supplied by UCSC and Ensembl.

The mRNA and EST alignments are compared to the GENCODE transcripts, and each transcript is scored according to how well the alignments match over its full length. The GENCODE TSL provides a consistent method of evaluating the level of support for a GENCODE transcript annotation actually being expressed in mouse. Mouse transcript sequences from the International Nucleotide Sequence Database Collaboration (GenBank, ENA, and DDBJ) are used as the evidence for this analysis. Exonerate RNA alignments from Ensembl, together with BLAT RNA and EST alignments from the UCSC Genome Browser Database, are used in the analysis. Erroneous transcripts and libraries identified in lists maintained by the Ensembl, UCSC, HAVANA and RefSeq groups are flagged as suspect. GENCODE annotations for protein-coding and non-protein-coding transcripts are compared with the evidence alignments.

Annotations in the MHC region and other immunological genes are not evaluated, as automatic alignments tend to be very problematic. Methods for evaluating single-exon genes are still being developed and they are not included in the current analysis. Multi-exon GENCODE annotations are evaluated using the criteria that all introns are supported by an evidence alignment and the evidence alignment does not indicate that there are unannotated exons. Small insertions and deletions in evidence alignments are assumed to be due to polymorphisms and not considered as differing from the annotations. All intron boundaries must match exactly. The transcript start and end locations are allowed to differ.
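
The multi-exon evaluation rule in the paragraph above can be approximated by comparing intron chains. The sketch below is a simplification under stated assumptions, not the GENCODE code: exons are half-open `(start, end)` pairs, alignment blocks separated only by small insertions or deletions are assumed to have been merged beforehand, and an alignment counts as support only if its intron chain is identical to the annotation's. That single equality test covers exact intron boundaries, free transcript start and end positions, and the absence of unannotated exons. The function names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]    # (start, end), half-open coordinates
Intron = Tuple[int, int]  # (end of upstream exon, start of downstream exon)


def intron_chain(exons: List[Exon]) -> List[Intron]:
    """Return the ordered list of introns implied by a set of exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def alignment_supports(annotation: List[Exon], alignment: List[Exon]) -> bool:
    """True if the alignment supports the full annotated intron structure."""
    annotated = intron_chain(annotation)
    if not annotated:
        # Single-exon annotations are outside the scope of the analysis.
        raise ValueError("single-exon transcripts are not evaluated")
    # Identical intron chains: every boundary matches exactly and the
    # alignment implies no extra exons. Outer exon ends are never compared,
    # so transcript start and end positions are free to differ.
    return intron_chain(alignment) == annotated
```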

The following categories are assigned to each of the evaluated annotations:

  • tsl1 - all splice junctions of the transcript are supported by at least one non-suspect mRNA
  • tsl2 - the best supporting mRNA is flagged as suspect or the support is from multiple ESTs
  • tsl3 - the only support is from a single EST
  • tsl4 - the best supporting EST is flagged as suspect
  • tsl5 - no single transcript supports the model structure
  • tslNA - the transcript was not analyzed for one of the following reasons:
    • pseudogene annotation, including transcribed pseudogenes
    • immunoglobulin gene transcript
    • T-cell receptor transcript
    • single-exon transcript (will be included in a future version)
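
One possible reading of the category list above, written as a decision procedure. This is a hedged sketch: the `Evidence` record and `assign_tsl` are hypothetical names, the input is assumed to contain only those evidence sequences whose alignments already support the transcript structure, and the order in which the categories are tested is an assumption where the list leaves it open (for example, a suspect mRNA together with a single good EST is scored tsl2 here).

```python
from typing import List, NamedTuple


class Evidence(NamedTuple):
    """Hypothetical record for one supporting evidence sequence."""
    kind: str      # "mRNA" or "EST"
    suspect: bool  # flagged as suspect in the maintained lists


def assign_tsl(supporting: List[Evidence]) -> str:
    """Map the supporting evidence for one transcript to a TSL category."""
    good_mrna = any(e.kind == "mRNA" and not e.suspect for e in supporting)
    any_mrna = any(e.kind == "mRNA" for e in supporting)
    good_ests = sum(1 for e in supporting if e.kind == "EST" and not e.suspect)
    any_est = any(e.kind == "EST" for e in supporting)

    if good_mrna:
        return "tsl1"  # at least one non-suspect mRNA
    if any_mrna or good_ests >= 2:
        return "tsl2"  # best mRNA is suspect, or multiple ESTs
    if good_ests == 1:
        return "tsl3"  # a single EST
    if any_est:
        return "tsl4"  # best EST is suspect
    return "tsl5"      # nothing supports the model structure
```

Transcripts in the tslNA groups (pseudogenes, immunoglobulin and T-cell receptor transcripts, single-exon transcripts) would be filtered out before a function like this is called.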

APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. It provides value to the annotations of the human, mouse, zebrafish, rat, and pig genomes. APPRIS has selected a single CDS variant for each gene as the 'PRINCIPAL' isoform. Principal isoforms are tagged with the numbers 1 to 5, with 1 being the most reliable.

  • PRINCIPAL:1 - Transcript(s) expected to code for the main functional isoform based solely on the core modules in APPRIS.
  • PRINCIPAL:2 - Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
  • PRINCIPAL:3 - Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
  • PRINCIPAL:4 - Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
  • PRINCIPAL:5 - Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.

For genes in which the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the "candidate" variants not chosen as principal are labeled in the following way:

  • ALTERNATIVE:1 - Candidate transcript model(s) that are conserved in at least three tested species.
  • ALTERNATIVE:2 - Candidate transcript model(s) that appear to be conserved in fewer than three tested species.

Non-candidate transcripts are not tagged and are considered as "Minor" transcripts. Further information and additional web services can be found at the APPRIS website.
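
The tie-breaking rules for PRINCIPAL:3 to PRINCIPAL:5 above can be sketched as a small function. This is an illustration of the prose only, not APPRIS code: the `Candidate` record, its `length` field (standing in for isoform length), and `break_tie` are hypothetical, and cases the text does not describe (for example, exactly one candidate carrying a CCDS identifier) return `None` rather than guessing.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    """Hypothetical record for one candidate CDS variant of a gene."""
    transcript_id: str
    ccds_id: Optional[str]  # e.g. "CCDS51013.1", or None if not in CCDS
    length: int             # isoform length used for "longest"


def ccds_number(ccds_id: str) -> int:
    """Extract the numeric part of a CCDS identifier, ignoring the version."""
    match = re.fullmatch(r"CCDS(\d+)(?:\.\d+)?", ccds_id)
    if match is None:
        raise ValueError(f"not a CCDS identifier: {ccds_id}")
    return int(match.group(1))


def break_tie(candidates: List[Candidate]) -> Optional[Tuple[str, Candidate]]:
    """Choose a principal variant when the core modules cannot decide."""
    if not candidates:
        return None
    with_ccds = [c for c in candidates if c.ccds_id is not None]
    if not with_ccds:
        # PRINCIPAL:5: no candidate is in CCDS, so take the longest isoform.
        return "PRINCIPAL:5", max(candidates, key=lambda c: c.length)
    numbers = sorted({ccds_number(c.ccds_id) for c in with_ccds})
    if len(numbers) < 2:
        return None  # not covered by the rules quoted above
    if all(b - a == 1 for a, b in zip(numbers, numbers[1:])):
        # PRINCIPAL:4: distinct but consecutive CCDS ids, take the longest.
        return "PRINCIPAL:4", max(with_ccds, key=lambda c: c.length)
    # PRINCIPAL:3: distinct CCDS ids, take the lowest (earliest annotated).
    return "PRINCIPAL:3", min(with_ccds, key=lambda c: ccds_number(c.ccds_id))
```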

Downloads

GENCODE GFF3 and GTF files are available from the GENCODE release M18 site.

Release Notes

GENCODE version M18 corresponds to Ensembl 93.

See also: The GENCODE Project

Credits

This GENCODE release is the result of a collaborative effort among the following laboratories: (contact: GENCODE at the Sanger Institute)

Lab/Institution: Contributors
GENCODE Principal Investigator, EMBL European Bioinformatics Institute, Cambridge, UK: Paul Flicek
GENCODE Co-Principal Investigator, EMBL European Bioinformatics Institute, Cambridge, UK: Adam Frankish
GENCODE Co-Principal Investigator, Wellcome Trust Sanger Institute (WTSI), Cambridge, UK: Bronwen Aken
Kings College, London, UK: Tim Hubbard
HAVANA manual annotation group, EMBL European Bioinformatics Institute, Cambridge, UK: Timothy Cutts, Jyoti Choudhary, Ed Griffiths, Ewan Birney, Jose Manuel Gonzalez, Stephen Fitzgerald, Andrew Berry, Alexandra Bignell, Claire Davidson, Gloria Despacio-Reyes, Mike Kay, Deepa Manthravadi, Gaurab Mukherjee, Gemma Barson, Matt Hardy, Angela Macharia
Ensembl, EMBL European Bioinformatics Institute, Cambridge, UK: Carlos Garcia, Fergal Martin, Osagie Izuogu
Centre de Regulació Genòmica (CRG), Barcelona, Spain: Roderic Guigó, Julien Lagarde, Barbara Uszczyńska
UC Santa Cruz Genomics Institute, University of California Santa Cruz (UCSC), USA: David Haussler, Mark Diekhans, Benedict Paten, Joel Armstrong, Ian Fiddes
Computer Science and Artificial Intelligence Lab, Broad Institute of MIT and Harvard, USA: Manolis Kellis, Irwin Jungreis
Computational Biology and Bioinformatics, Yale University (Yale), USA: Mark Gerstein, Ekta Khurana, Cristina Sisu, Baikang Pei, Yan Zhang, Mihali Felipe
Center for Integrative Genomics, University of Lausanne, Switzerland: Alexandre Reymond, Cedric Howald, Anne-Maud Ferreira, Jacqueline Chrast
Structural Computational Biology Group, Centro Nacional de Investigaciones Oncologicas (CNIO), Madrid, Spain: Alfonso Valencia, Michael Tress, José Manuel Rodríguez, Victor de la Torre
Former members of the GENCODE project: Jennifer Harrow, James Gilbert, Electra Tapanari, Stephen Searle, Rachel Harte, Daniel Barrell, Felix Kokocinski, Veronika Boychenko, Toby Hunt, Catherine Snow, Gary Saunders, Sarah Grubb, Thomas Derrien, Andrea Tanzer, Gang Fang, Mihali Felipe, Joanne Howes, Reena Halai, Pablo Roman-Garcia, Michael Brent, Randall Brown, Jeltje van Baren

References

Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012 Sep;22(9):1760-74. PMID: 22955987; PMC: PMC3431492

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7 Suppl 1:S4.1-9. PMID: 16925838; PMC: PMC1810553

A full list of GENCODE publications is available at The GENCODE Project web site.

Data Release Policy

GENCODE data are available for use without restrictions.