Schema for Publications - Publications: Sequences in Scientific Articles

Database: hg19
Primary Table: pubsBlatPsl
Row Count: 733,957
Data last updated: 2012-04-27
Format description: publications track sequence matches as PSL plus two additional fields
On download server: MariaDB table dump directory

field	example	SQL type	description
`matches`	20	`int(10) unsigned`	Number of bases that match that aren't repeats
`misMatches`	0	`int(10) unsigned`	Number of bases that don't match
`repMatches`	0	`int(10) unsigned`	Number of bases that match but are part of repeats
`nCount`	0	`int(10) unsigned`	Number of 'N' bases
`qNumInsert`	0	`int(10) unsigned`	Number of inserts in query
`qBaseInsert`	0	`int(10) unsigned`	Number of bases inserted in query
`tNumInsert`	0	`int(10) unsigned`	Number of inserts in target
`tBaseInsert`	0	`int(10) unsigned`	Number of bases inserted in target
`strand`	+	`char(2)`	+ or - for query strand. second +/- for genomic strand
`qName`	100000186400000000	`varchar(255)`	sequence ID: 10 digits articleId, 3 digits file Id, 5 digits serial number
`qSize`	20	`int(10) unsigned`	Query sequence size
`qStart`	0	`int(10) unsigned`	Alignment start position in query
`qEnd`	20	`int(10) unsigned`	Alignment end position in query
`tName`	chr11	`varchar(255)`	Target sequence name
`tSize`	135006516	`int(10) unsigned`	Target sequence size
`tStart`	639817	`int(10) unsigned`	Alignment start position in target
`tEnd`	639837	`int(10) unsigned`	Alignment end position in target
`blockCount`	1	`int(10) unsigned`	Number of blocks in alignment
`blockSizes`	20,	`longblob`	Size of each block
`qStarts`	0,	`longblob`	Start of each block in query.
`tStarts`	639817,	`longblob`	Start of each block in target.
`tSeqType`	cg	`varchar(255)`	types of matching sequence db: g=genome, c=cdna, p=protein (comma-sep)
`articleId`	1000001864	`bigint(20)`	articleId of article in hgFixed.pubsBingArticle
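
The `qName` layout described in the table can be illustrated with a small decoder. This is a minimal sketch, not code from the Genome Browser: the names `SeqId` and `split_qname` are hypothetical. Because the example value has 18 digits, the sketch treats everything after the first 13 digits as the serial number rather than hard-coding its width.

```python
from typing import NamedTuple


class SeqId(NamedTuple):
    """Parts of a pubsBlatPsl qName (hypothetical helper type)."""
    article_id: int
    file_id: int
    serial: int


def split_qname(qname: str) -> SeqId:
    """Split a qName into articleId (10 digits), file ID (3 digits) and serial."""
    if not qname.isdigit() or len(qname) <= 13:
        raise ValueError(f"unexpected qName format: {qname!r}")
    # Assumption: all digits after the 13th belong to the serial number.
    return SeqId(int(qname[:10]), int(qname[10:13]), int(qname[13:]))


seq = split_qname("100000186400000003")
print(seq.article_id, seq.file_id, seq.serial)
```

The decoded `article_id` should agree with the row's `articleId` column, which gives a cheap consistency check when loading the table.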

Connected Tables and Joining Fields


	hg19.pubsBlat.name (via pubsBlatPsl.articleId)
	hgFixed.pubsArticle.articleId (via pubsBlatPsl.articleId)
	hgFixed.pubsMarkerAnnot.articleId (via pubsBlatPsl.articleId)
	hgFixed.pubsSequenceAnnot.articleId (via pubsBlatPsl.articleId)
	hg19.pubsBlat.seqIds (via pubsBlatPsl.qName)

Sample Rows

matches	strand	qName	qSize	qEnd	tName	tSize	tStart	tEnd	blockCount	blockSizes	qStarts	tStarts	tSeqType	articleId
20	+	100000186400000000	20	20	chr11	135006516	639817	639837	1	20,	0,	639817,	cg	1000001864
21	-	100000186400000003	21	21	chr11	135006516	636105	636126	1	21,	0,	636105,	g	1000001864
30	+	100000186400000002	30	30	chr11	135006516	635578	635608	1	30,	0,	635578,	g	1000001864
20	+	100000186400000010	20	20	chrX	155270560	43514242	43514262	1	20,	0,	43514242,	g	1000001864
19	-	100000186400000013	19	19	chrX	155270560	43603827	43603846	1	19,	0,	43603827,	cg	1000001864
23	-	100000186400000006	23	23	chr5	180915260	1394100	1394123	1	23,	0,	1394100,	c	1000001864
25	-	100000186400000005	25	25	chr11	135006516	636937	636962	1	25,	0,	636937,	g	1000001864
24	+	100000186400000007	24	24	chr5	180915260	1393640	1393664	1	24,	0,	1393640,	cg	1000001864
20	-	100000186400000009	20	20	chr22	51304566	19951300	19951320	1	20,	0,	19951300,	cg	1000001864
20	-	100000186400000011	20	20	chrX	155270560	43514546	43514566	1	20,	0,	43514546,	g	1000001864
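
The `blockSizes`, `qStarts` and `tStarts` columns are comma-terminated lists, as the sample rows show. The sketch below turns them into target intervals. The function names are hypothetical, and it assumes a single-character `strand` value (as in all sample rows), so that `tStarts` are forward-strand target coordinates.

```python
from typing import List, Tuple


def parse_int_list(field: str) -> List[int]:
    """Parse a comma-terminated list such as '20,' into [20]."""
    return [int(item) for item in field.split(",") if item]


def target_blocks(block_sizes: str, t_starts: str) -> List[Tuple[int, int]]:
    """Return (start, end) target intervals, 0-based start and exclusive end."""
    sizes = parse_int_list(block_sizes)
    starts = parse_int_list(t_starts)
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and tStarts differ in length")
    return [(start, start + size) for start, size in zip(starts, sizes)]


# First sample row: one 20 bp block on chr11.
print(target_blocks("20,", "639817,"))
```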

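Note: all start coordinates in our database are 0-based, not 1-based. End coordinates are exclusive, so tEnd - tStart gives the length of the aligned region (for example, 639837 - 639817 = 20 in the first sample row).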
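
To show what the 0-based convention means in practice, here is a minimal sketch that converts a table interval into the 1-based position string normally typed into the browser. The function name `to_browser_position` is hypothetical.

```python
def to_browser_position(chrom: str, start0: int, end: int) -> str:
    """Convert a 0-based, end-exclusive interval to a 1-based 'chrom:start-end' string."""
    if start0 < 0 or end <= start0:
        raise ValueError("expected 0 <= start < end")
    # Only the start shifts by one; the exclusive end equals the 1-based inclusive end.
    return f"{chrom}:{start0 + 1}-{end}"


# First sample row: tStart=639817, tEnd=639837.
print(to_browser_position("chr11", 639817, 639837))  # chr11:639818-639837
```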

Publications (pubs) Track Description


Description

This track is based on text-mining of full-text biomedical articles and includes two types of subtracks:

	Sequences found in publications, grouped by article and searched in genomes with BLAT
	Identifiers in publications that directly relate to chromosome locations (e.g., gene symbols, SNP identifiers, etc.)

Both sources of information are linked to the respective articles. Background information on how permission to full-text data was obtained can be found on the project website.

Display Convention and Configuration

The sequence subtrack indicates the location of sequences in publications mapped back to the genome, annotated with the first author and the year of the publication. All matches of one article are grouped ("chained") together. Article titles are shown when you move the mouse cursor over the features. Thicker parts of the features (exons) represent matching sequences, connected by thin lines to matches from the same article within 30 kbp.

The subtrack "individual sequence matches" activates automatically when the user clicks a sequence match and follows the link "Show sequence matches individually" from the details page. Mouse-overs show flanking text around the sequence, and clicking features links to BLAT alignments.

All other subtracks (i.e. bands, genes, SNPs) show the number of matching articles as the feature description. Clicking on them shows the sentences and sections in articles where the identifiers were found.

The track configuration includes a keyword and year filter. Keywords are space-separated and are searched in the article's title, author list, and abstract.

Data

The track is based on text from biomedical research articles, obtained as part of the UCSC Genocoding Project. The current dataset consists of about 600,000 files (main text and supplementary files) from PubMed Central (Open-Access set) and around 6 million text files (main text) from Elsevier (as part of the Sciverse Apps program).
Methods

All file types (including XML, raw ASCII, PDFs and various Microsoft Office formats (Excel, Word, PowerPoint)) were converted to text. The results were processed to find groups of words that look like DNA/RNA sequences or words that look like protein sequences. These were then mapped with BLAT to the human genome and these model organisms: mouse (mm9), rat (rn4), zebrafish (danRer6), Drosophila melanogaster (dm3), X. tropicalis (xenTro2), Medaka (oryLat2), C. intestinalis (ci2), C. elegans (ce6) and yeast (sacCer2).

The pipeline roughly proceeds through these steps:

	For sequences, the best match across all genomes is used if it is longer than 17 bp and matches at 90% identity.
	Two sets of BLAT parameters are tried: the default ones for sequences longer than 25 bp, and very sensitive ones (stepSize=5) for shorter sequences.
	Sequences are mapped to genomic DNA. Those that do not match are mapped to RefSeq cDNAs.
	Hits from the same article that are closer than 30 kbp are joined into one feature (shown as exon-blocks on the browser). All parts of a joined feature have to match at least 25 bp.
	Non-unique hits are kept in the joined feature with the most members.
	Joined features with identical members in two different genomes are kept in both genomes.

Note that due to the 90% identity filter, some sequences do not match anywhere in the genome. Examples include primers with added restriction sites, mutation primers, or any other sequence that joins or mixes two pieces of genomic DNA not part of RefSeq. Also note that some gene symbols correspond to English words, which can lead to many false positives.

Credits

Software and processing by Maximilian Haeussler. UCSC Track visualisation by Larry Meyer and Hiram Clawson. Elsevier support by Max Berenstein, Raphael Sidi, Judd Dunham, Scott Robbins and colleagues. Original version written at the Bergman Lab, University of Manchester, UK. Testing by Mary Mangan, OpenHelix Inc, and Greg Roe, UCSC.
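
The length/identity filter and the 30 kbp joining rule described under Methods can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the function names are hypothetical, "length" is taken to mean matches plus mismatches, the 90% threshold is treated as inclusive, and hits are joined purely by distance on the same chromosome.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (chrom, start, end), 0-based start, exclusive end

MAX_GAP = 30_000  # "closer than 30 kbp"


def passes_filter(matches: int, mismatches: int,
                  min_len: int = 17, min_identity: float = 0.90) -> bool:
    """Keep a hit longer than min_len bp with at least min_identity identity."""
    aligned = matches + mismatches  # assumption: gaps and repeats are ignored
    return aligned > min_len and matches / aligned >= min_identity


def join_hits(hits: List[Hit], max_gap: int = MAX_GAP) -> List[List[Hit]]:
    """Group one article's hits whose gap to the current group is below max_gap."""
    groups: List[List[Hit]] = []
    group_end = 0
    for chrom, start, end in sorted(hits):
        if groups and groups[-1][0][0] == chrom and start - group_end < max_gap:
            groups[-1].append((chrom, start, end))
            group_end = max(group_end, end)
        else:
            groups.append([(chrom, start, end)])
            group_end = end
    return groups


# chrX and chr5 hits of article 1000001864 from the sample rows.
hits = [("chrX", 43514242, 43514262), ("chrX", 43603827, 43603846),
        ("chrX", 43514546, 43514566), ("chr5", 1394100, 1394123),
        ("chr5", 1393640, 1393664)]
for group in join_hits(hits):
    print(group)
```

With these sample hits, the two chr5 matches and the two nearby chrX matches each form one joined feature, while the chrX match at 43603827 lies about 89 kbp away and stays on its own.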
Feedback

Please send ideas, comments or feedback on this track to max@soe.ucsc.edu. We are very interested in getting access to more articles from publishers for this dataset; see the project website.

References

Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31. PMID: 18271954; PMC: PMC2374703

Haeussler M, Gerner M, Bergman CM. Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics. 2011 Apr 1;27(7):980-6. PMID: 21325301; PMC: PMC3065681

Van Noorden R. Trouble at the text mine. Nature. 2012 Mar 7;483(7388):134-5.
