Schema for Publications - Publications: Sequences in Scientific Articles
  Database: hg19    Primary Table: pubsBlatPsl    Row Count: 733,957   Data last updated: 2012-04-27
Format description: publications track sequence matches as PSL plus two additional fields
On download server: MariaDB table dump directory
field        example             SQL type          description
matches      20                  int(10) unsigned  Number of bases that match that aren't repeats
misMatches   0                   int(10) unsigned  Number of bases that don't match
repMatches   0                   int(10) unsigned  Number of bases that match but are part of repeats
nCount       0                   int(10) unsigned  Number of 'N' bases
qNumInsert   0                   int(10) unsigned  Number of inserts in query
qBaseInsert  0                   int(10) unsigned  Number of bases inserted in query
tNumInsert   0                   int(10) unsigned  Number of inserts in target
tBaseInsert  0                   int(10) unsigned  Number of bases inserted in target
strand       +                   char(2)           + or - for query strand. second +/- for genomic strand
qName        100000186400000000  varchar(255)      sequence ID: 10 digits articleId, 3 digits file Id, 4 digits serial number
qSize        20                  int(10) unsigned  Query sequence size
qStart       0                   int(10) unsigned  Alignment start position in query
qEnd         20                  int(10) unsigned  Alignment end position in query
tName        chr11               varchar(255)      Target sequence name
tSize        135006516           int(10) unsigned  Target sequence size
tStart       639817              int(10) unsigned  Alignment start position in target
tEnd         639837              int(10) unsigned  Alignment end position in target
blockCount   1                   int(10) unsigned  Number of blocks in alignment
blockSizes   20,                 longblob          Size of each block
qStarts      0,                  longblob          Start of each block in query.
tStarts      639817,             longblob          Start of each block in target.
tSeqType     cg                  varchar(255)      types of matching sequence db: g=genome, c=cdna, p=protein (comma-sep)
articleId    1000001864          bigint(20)        articleId of article in hgFixed.pubsBingArticle
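
To make the field list above concrete, here is a minimal Python sketch that turns one row into a dictionary keyed by field name. It assumes the row is a single tab-separated line with the columns in exactly the order listed above (the layout of the downloadable table dump is not confirmed on this page), and the helper names parse_int_list and parse_row are hypothetical. The example line is assembled from the "example" column of the schema.

```python
# Field names in the order given by the schema table.
FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
    "tSeqType", "articleId",
]
STRING_FIELDS = {"strand", "qName", "tName", "tSeqType"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_int_list(text):
    """Parse a comma-terminated list such as '20,' into [20]."""
    return [int(item) for item in text.split(",") if item]


def parse_row(line):
    """Split one tab-separated row (assumed layout) into typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    row = {}
    for name, value in zip(FIELDS, values):
        if name in STRING_FIELDS:
            row[name] = value          # kept as text, including the long qName
        elif name in LIST_FIELDS:
            row[name] = parse_int_list(value)
        else:
            row[name] = int(value)
    return row


# Values taken from the "example" column of the schema table.
example_line = "\t".join([
    "20", "0", "0", "0", "0", "0", "0", "0", "+",
    "100000186400000000", "20", "0", "20",
    "chr11", "135006516", "639817", "639837",
    "1", "20,", "0,", "639817,", "cg", "1000001864",
])
row = parse_row(example_line)
print(row["tName"], row["tStart"], row["tEnd"], row["blockSizes"])
```

qName is kept as a string because it is declared as varchar(255); only the numeric columns and the comma-terminated block lists are converted.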

Connected Tables and Joining Fields
      hg19.pubsBlat.name (via pubsBlatPsl.articleId)
      hgFixed.pubsArticle.articleId (via pubsBlatPsl.articleId)
      hgFixed.pubsMarkerAnnot.articleId (via pubsBlatPsl.articleId)
      hgFixed.pubsSequenceAnnot.articleId (via pubsBlatPsl.articleId)
      hg19.pubsBlat.seqIds (via pubsBlatPsl.qName)
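
As an illustration of the articleId joining field listed above, the following Python sketch builds two toy tables in an in-memory SQLite database and joins them. It is not a query against the real UCSC server: the tables hold only a few columns, the `title` column and its placeholder value are hypothetical, and on the real server the two tables live in different databases (hg19 and hgFixed).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pubsBlatPsl (
        qName TEXT, tName TEXT, tStart INTEGER, tEnd INTEGER, articleId INTEGER
    );
    -- 'title' is a hypothetical column used only for this illustration.
    CREATE TABLE pubsArticle (articleId INTEGER PRIMARY KEY, title TEXT);
""")

# Two matches taken from the sample rows of pubsBlatPsl.
conn.executemany(
    "INSERT INTO pubsBlatPsl VALUES (?, ?, ?, ?, ?)",
    [
        ("100000186400000000", "chr11", 639817, 639837, 1000001864),
        ("100000186400000003", "chr11", 636105, 636126, 1000001864),
    ],
)
conn.execute(
    "INSERT INTO pubsArticle VALUES (?, ?)",
    (1000001864, "placeholder title"),
)

# Join each sequence match to its article through articleId.
rows = conn.execute("""
    SELECT p.tName, p.tStart, p.tEnd, a.title
    FROM pubsBlatPsl AS p
    JOIN pubsArticle AS a ON a.articleId = p.articleId
    ORDER BY p.tName, p.tStart
""").fetchall()

for match in rows:
    print(match)
conn.close()
```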

Sample Rows
 
matches  misMatches  repMatches  nCount  qNumInsert  qBaseInsert  tNumInsert  tBaseInsert  strand  qName               qSize  qStart  qEnd  tName  tSize      tStart    tEnd      blockCount  blockSizes  qStarts  tStarts    tSeqType  articleId
20       0           0           0       0           0            0           0            +       100000186400000000  20     0       20    chr11  135006516  639817    639837    1           20,         0,       639817,    cg        1000001864
21       0           0           0       0           0            0           0            -       100000186400000003  21     0       21    chr11  135006516  636105    636126    1           21,         0,       636105,    g         1000001864
30       0           0           0       0           0            0           0            +       100000186400000002  30     0       30    chr11  135006516  635578    635608    1           30,         0,       635578,    g         1000001864
20       0           0           0       0           0            0           0            +       100000186400000010  20     0       20    chrX   155270560  43514242  43514262  1           20,         0,       43514242,  g         1000001864
19       0           0           0       0           0            0           0            -       100000186400000013  19     0       19    chrX   155270560  43603827  43603846  1           19,         0,       43603827,  cg        1000001864
23       0           0           0       0           0            0           0            -       100000186400000006  23     0       23    chr5   180915260  1394100   1394123   1           23,         0,       1394100,   c         1000001864
25       0           0           0       0           0            0           0            -       100000186400000005  25     0       25    chr11  135006516  636937    636962    1           25,         0,       636937,    g         1000001864
24       0           0           0       0           0            0           0            +       100000186400000007  24     0       24    chr5   180915260  1393640   1393664   1           24,         0,       1393640,   cg        1000001864
20       0           0           0       0           0            0           0            -       100000186400000009  20     0       20    chr22  51304566   19951300  19951320  1           20,         0,       19951300,  cg        1000001864
20       0           0           0       0           0            0           0            -       100000186400000011  20     0       20    chrX   155270560  43514546  43514566  1           20,         0,       43514546,  g         1000001864

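Note: all start coordinates in our database are 0-based, not 1-based. End coordinates are not shifted, so each feature is a half-open interval whose length is end minus start (for example, tEnd 639837 minus tStart 639817 gives 20 bases).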
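
A short Python sketch of what the 0-based convention means in practice: to get the 1-based, inclusive position string that is typically typed into a browser, add 1 to the start and leave the end unchanged. The function name to_browser_position is hypothetical, and the sketch assumes that stored end coordinates are exclusive (half-open intervals), which is consistent with the sample rows, where tEnd minus tStart equals the block size.

```python
def to_browser_position(chrom, start0, end):
    """Convert a 0-based, half-open interval to a 1-based, inclusive string."""
    if not 0 <= start0 < end:
        raise ValueError("expected 0 <= start < end")
    return f"{chrom}:{start0 + 1}-{end}"


# First sample row: tName=chr11, tStart=639817, tEnd=639837.
position = to_browser_position("chr11", 639817, 639837)
length = 639837 - 639817  # half-open intervals: length is end minus start
print(position, length)
```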

Publications (pubs) Track Description
 

Description

This track is based on text-mining of full-text biomedical articles and includes two types of subtracks:

  • Sequences found in publications, grouped by article and searched in genomes with BLAT
  • Identifiers in publications that directly relate to chromosome locations (e.g., gene symbols and SNP identifiers)

Both sources of information are linked to the respective articles. Background information on how permission to use the full-text data was obtained can be found on the project website.

Display Convention and Configuration

The sequence subtrack indicates the location of sequences in publications mapped back to the genome, annotated with the first author and the year of the publication. All matches of one article are grouped ("chained") together. Article titles are shown when you move the mouse cursor over the features. Thicker parts of the features (exons) represent matching sequences, connected by thin lines to matches from the same article within 30 kbp.

The subtrack "individual sequence matches" activates automatically when the user clicks a sequence match and follows the link "Show sequence matches individually" from the details page. Mouse-overs show flanking text around the sequence, and clicking features links to BLAT alignments.

All other subtracks (i.e. bands, genes, SNPs) show the number of matching articles as the feature description. Clicking on them shows the sentences and sections in articles where the identifiers were found.

The track configuration includes a keyword and year filter. Keywords are space-separated and are searched in the article's title, author list, and abstract.

Data

The track is based on text from biomedical research articles, obtained as part of the UCSC Genocoding Project.

The current dataset consists of about 600,000 files (main text and supplementary files) from PubMed Central (Open-Access set) and around 6 million text files (main text) from Elsevier (as part of the Sciverse Apps program).

Methods

All file types (including XML, raw ASCII, PDF, and Microsoft Office formats such as Excel, Word, and PowerPoint) were converted to text. The results were processed to find groups of words that look like DNA/RNA sequences or words that look like protein sequences. These were then mapped with BLAT to the human genome and the following model organisms: mouse (mm9), rat (rn4), zebrafish (danRer6), Drosophila melanogaster (dm3), X. tropicalis (xenTro2), Medaka (oryLat2), C. intestinalis (ci2), C. elegans (ce6) and yeast (sacCer2). The pipeline roughly proceeds through these steps:

  • For sequences, the best match across all genomes is used if it is longer than 17 bp and matches at 90% identity or higher. Two sets of BLAT parameters are used: the defaults for sequences longer than 25 bp, and very sensitive ones (stepSize=5) for shorter sequences.
  • Sequences are mapped to genomic DNA. Those that do not match are mapped to RefSeq cDNAs.
  • Hits from the same article that are closer than 30 kbp are joined into one feature (shown as exon-blocks on the browser).
  • All parts of a joined feature have to match at least 25 bp.
  • Non-unique hits are kept in the joined feature with the most members.
  • Joined features with identical members in two different genomes are kept in both genomes.
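
Two of the steps above, the length and identity filter and the joining of nearby hits from one article, can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the pipeline's actual code: identity is taken as matches / (matches + misMatches), length as the span on the target, and distance as the gap between the end of the joined feature so far and the start of the next hit. The 25 bp rule, non-unique hits, and cross-genome handling are left out because the text does not specify them precisely enough. The names Hit, passes_filter and join_hits are hypothetical; the input data are the sample rows of this table.

```python
from collections import defaultdict
from typing import List, NamedTuple

MIN_LENGTH = 18      # "longer than 17 bp"
MIN_IDENTITY = 0.90  # "90% identity", read here as a lower bound (assumption)
MAX_GAP = 30_000     # hits "closer than 30 kbp" are joined


class Hit(NamedTuple):
    article_id: int
    chrom: str
    start: int  # 0-based
    end: int    # exclusive
    matches: int
    mis_matches: int


def identity(hit: Hit) -> float:
    # Assumed definition: matching bases over aligned bases.
    aligned = hit.matches + hit.mis_matches
    return hit.matches / aligned if aligned else 0.0


def passes_filter(hit: Hit) -> bool:
    return hit.end - hit.start >= MIN_LENGTH and identity(hit) >= MIN_IDENTITY


def join_hits(hits: List[Hit]) -> List[List[Hit]]:
    """Group filtered hits per article and chromosome, splitting at large gaps."""
    groups = defaultdict(list)
    for hit in hits:
        if passes_filter(hit):
            groups[(hit.article_id, hit.chrom)].append(hit)

    features = []
    for key in sorted(groups):
        current: List[Hit] = []
        current_end = 0
        for hit in sorted(groups[key], key=lambda h: h.start):
            if current and hit.start - current_end >= MAX_GAP:
                features.append(current)  # gap too large: close the feature
                current = []
                current_end = 0
            current.append(hit)
            current_end = max(current_end, hit.end)
        features.append(current)
    return features


ARTICLE = 1000001864
hits = [
    Hit(ARTICLE, "chr11", 639817, 639837, 20, 0),
    Hit(ARTICLE, "chr11", 636105, 636126, 21, 0),
    Hit(ARTICLE, "chr11", 635578, 635608, 30, 0),
    Hit(ARTICLE, "chrX", 43514242, 43514262, 20, 0),
    Hit(ARTICLE, "chrX", 43603827, 43603846, 19, 0),
    Hit(ARTICLE, "chr5", 1394100, 1394123, 23, 0),
    Hit(ARTICLE, "chr11", 636937, 636962, 25, 0),
    Hit(ARTICLE, "chr5", 1393640, 1393664, 24, 0),
    Hit(ARTICLE, "chr22", 19951300, 19951320, 20, 0),
    Hit(ARTICLE, "chrX", 43514546, 43514566, 20, 0),
]
features = join_hits(hits)
for feature in features:
    print(feature[0].chrom, feature[0].start, feature[-1].end, len(feature))
```

Under these assumptions, the two chrX hits near 43.51 Mb end up in one feature, while the chrX hit at 43603827 stays separate because it lies about 89 kbp away.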

Note that due to the 90% identity filter, some sequences do not match anywhere in the genome. Examples include primers with added restriction sites, mutation primers, or any other sequence that joins or mixes two pieces of genomic DNA not part of RefSeq. Also note that some gene symbols are also English words, which can lead to many false positives.
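
The first Methods step, finding groups of words that look like DNA/RNA sequences, is not spelled out on this page. The Python sketch below shows one simple way such a detector could work, purely as an assumption: consecutive words made only of the letters A, C, G, T and U are glued together, and runs of at least 18 letters (matching the "longer than 17 bp" threshold) are kept. The function name find_dna_candidates and the punctuation handling are hypothetical.

```python
import re

MIN_LENGTH = 18  # the Methods text keeps matches "longer than 17 bp"
NUCLEOTIDE_WORD = re.compile(r"^[ACGTU]+$", re.IGNORECASE)


def find_dna_candidates(text, min_length=MIN_LENGTH):
    """Join consecutive nucleotide-only words and keep sufficiently long runs."""
    candidates = []
    run = []
    for word in text.split():
        token = word.strip(".,;:()[]'\"")  # drop surrounding punctuation
        if token and NUCLEOTIDE_WORD.match(token):
            run.append(token.upper())
            continue
        if run:  # a non-nucleotide word ends the current run
            candidates.append("".join(run))
            run = []
    if run:
        candidates.append("".join(run))
    return [seq for seq in candidates if len(seq) >= min_length]


text = "Primers were GATTACAGAT TACAGATTAC (forward) and a cat."
found = find_dna_candidates(text)
print(found)
```

Short English words such as "a" or "cat" also consist only of nucleotide letters, so they form runs of their own (here "ACAT") or can attach to a neighbouring sequence; the length threshold removes the former but not the latter.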

Credits

Software and processing by Maximilian Haeussler. UCSC Track visualisation by Larry Meyer and Hiram Clawson. Elsevier support by Max Berenstein, Raphael Sidi, Judd Dunham, Scott Robbins and colleagues. Original version written at the Bergman Lab, University of Manchester, UK. Testing by Mary Mangan, OpenHelix Inc, and Greg Roe, UCSC.

Feedback

Please send ideas, comments or feedback on this track to max@soe.ucsc.edu. We are very interested in getting access to more articles from publishers for this dataset; see the project website.

References

Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium. Text-mining assisted regulatory annotation. Genome Biol. 2008;9(2):R31. PMID: 18271954; PMC: PMC2374703

Haeussler M, Gerner M, Bergman CM. Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics. 2011 Apr 1;27(7):980-6. PMID: 21325301; PMC: PMC3065681

Van Noorden R. Trouble at the text mine. Nature. 2012 Mar 7;483(7388):134-5.