General

What is ToTeM?

Tool for Text Mining and Visualization (ToTeM) is a web application that lets investigators explore MEDLINE abstracts using the full text searching capabilities of Lucene/Solr. Users can take advantage of many features, including:
  • Count by Fields (faceting) -- View "Counts By" a particular field
  • Extended BioQueries -- Search on a biological concept and get all its gene aliases
  • Associated Diseases and pharmacological data -- Search on a gene and get the associated diseases and any drugs
  • Network Visualization -- View author/author networks or MeSH terms
  • Visualize Biological Concept trends -- View trends based on your BioQuery searches

What is Lucene/Solr?

Solr (pronounced 'Solar') is a text search engine powered by Lucene, an open source information retrieval software library. It is a fast and efficient way to do full text searching without a relational database. For more information, visit http://lucene.apache.org/solr/.

What makes ToTeM different than MEDLINE?

MEDLINE® contains biomedical literature from various sources around the world. ToTeM mines that data for specific keywords using proximity or pair-wise searches that could not be done with a traditional relational database. Because MEDLINE is so large and its abstracts are unstructured, Solr/Lucene lets users run queries that would be difficult in a relational database. Solr's built-in capabilities also provide word stemming, synonym matching and proximity searching, which a regular structured SQL query cannot do, and results come back in seconds.

Word stemming is the ability for Solr to search for different forms of the same word beyond just a wildcard search. For example, a query for the word "run" may return matches for running, runner, and ran.

Synonym matching will allow users to enter a gene, but also return hits for the gene's protein id or RefSeq accession id.

Finally, proximity matching allows users to search for words or phrases that are near each other, and to specify how close together the words must be. This cannot be done with a plain SQL query.

Querying

How do I get started?

To get started, click on the 'ToTeM Query' tab at the top and navigate to the 'Basic Query'. Type your favorite gene, author or keyword in the appropriate field. For a general search, type a word or phrase in the 'keywords' field. For best results, narrow down your search by selecting a publication year range. You may also want to use our REST API calls, shown in the "Web Services" tab above. To extend your search using a biological concept (gene symbol or identifier), select the 'BioQuery' section and use the checkboxes to choose any of the extended query sections. Click submit and let ToTeM show you the results.

Are inputs case sensitive?

Yes and no. Inputs going through Solr (e.g. text searching) are case insensitive, but extended queries through BioQuery can be case sensitive since it is using bioDBnet.

What is the "Count By Field" option?

The "Count By Field" (i.e. faceting) option is a way for you to group your data by specific fields, e.g. authors, MeSH terms, publication year or journal title. Solr has faceting capabilities that can quickly return counts based on your facet field. Links are also provided to narrow down your searches. Only certain fields can be faceted.

How are MEDLINE queries constructed?

Lucene/Solr has its own query syntax, which differs from a traditional SQL query. ToTeM constructs the queries using an "OR" clause for each of the fields specified. For security reasons, ToTeM limits what can be submitted as a query, including the special characters that are usually used for wildcard and proximity searches. Also, because of the way the server is set up, some inputs cannot be interpreted by the server. For example, some fields contain special characters such as an ampersand (&), and searching for these characters will cause the query to fail.

What do we mean by "fully indexed by field"?

All fields are indexed to quickly search text. Our application is biology-centric, which means we are more interested in returning biologically relevant articles with good precision and recall. We do not want to return results for "BRAF" that only match in the author field, because that would not give accurate results on the term (unless you wanted the author name Braf). This is a naive example, but it shows that such results would not be biologically relevant to a scientist's search. Instead, we provide a full text index over certain fields, meaning that you can search on these fields without explicitly specifying the field. You can search for BRAF in the abstract, article title, MeSH terms, gene or chemical fields without specifying the field name.
The following fields require users to explicitly specify the field, using a form such as affiliation=ABCC or q=affiliation:ABCC:
  • authors
  • affiliation
  • affiliation_dept
  • journaltitle
  • grantAgency
  • grantID
  • grantCountry
Quotes for phrases should surround the query and not the field. E.g. affiliation:"Advanced Biomedical Computing Center"
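
The quoting rule above can be sketched in a few lines of Python. This is only an illustration: the helper name field_clause and the EXPLICIT_FIELDS set are hypothetical names introduced here, not part of ToTeM.

```python
# Fields that, per the list above, must be named explicitly in a query.
EXPLICIT_FIELDS = {
    "authors", "affiliation", "affiliation_dept", "journaltitle",
    "grantAgency", "grantID", "grantCountry",
}


def field_clause(field: str, value: str) -> str:
    """Build a FIELD:VALUE clause; quotes wrap the value, never the field."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def needs_explicit_field(field: str) -> bool:
    """True if the field is not searched unless it is named in the query."""
    return field in EXPLICIT_FIELDS


print(field_clause("affiliation", "ABCC"))
print(field_clause("affiliation", "Advanced Biomedical Computing Center"))
```

Field names are compared exactly as written, since the field identifiers are case sensitive.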

What is a BioQuery and how is it used?

BioQuery is ToTeM's exclusive feature that expands gene searches using biologically relevant and associated terms. Starting from gene names or other common identifiers, users can select associated terms to limit or expand their search. These terms are applied as additional filters on the main query. For example, a user searching for "John Doe" in the Query who wants to limit the results to the gene PTEN could add PTEN to the normal query, and only articles matching PTEN would be returned. With BioQuery, matches on PTEN's synonyms are returned as well. If other categories, such as disease or pharmacological data, are selected, there is an option to filter further by them. BioQuery is used to narrow searches by gene product, drug, or even disease. Using bioDBnet, each target query in the user's selection is cross-referenced against the following sources:
BioQuery Selected    Source
Drugs                DrugBank Drug, PharmGKB Drug
Disease              GAD Disease, CTD Disease
Interactions         ConsensusDBPath, bioGRID, DIP
GO_Ontology          GO - Biological Process, GO - Cellular Component, GO - Molecular Function
mirna                mirTarbase, mir2disease
Genes                Gene Symbol
Variants             avsnp147

How do I use the Filter search Text Box on the Results Page?


To use the search box on the results page, you can type in ANY term or phrase.
  • Type any search term in the box
  • Certain fields require users to specify the field (see 'What do we mean by "fully indexed by field"?' above). To search for a specific author, type authors:XXX. If you are using a last and first name or initials, put quotes around the author name, e.g. authors:"Vuong H"
  • To find articles that do not contain a specific term, simply type: -cancer
To apply the filter, simply press the Enter key or use the mouse to click outside of the text box.

How do I negate a term?

Query negation (i.e. excluding a term from the query) is supported by adding a minus sign "-" in front of the term. If you are negating a term in a field that is not fully indexed, specify the field by typing -FIELD:SEARCHTERM. This is required for the author, journal title, affiliation and grant fields.
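
As a small illustration of the negation syntax, the following Python sketch builds the text you would type into the filter box. The function name negate is hypothetical, and combining a negated field with a quoted phrase is an assumption based on the two rules described in this section.

```python
from typing import Optional


def negate(term: str, field: Optional[str] = None) -> str:
    """Return a negated filter term such as -cancer or -FIELD:SEARCHTERM."""
    if " " in term:
        # Assumption: phrases are quoted the same way as in non-negated queries.
        term = f'"{term}"'
    if field is None:
        return f"-{term}"
    return f"-{field}:{term}"


print(negate("cancer"))
print(negate("Vuong H", field="authors"))
```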

Why is the API syntax inconsistent?

Some of our examples show different ways of using our API. Sometimes q=affiliation:ABCC is used, sometimes affiliation=ABCC, and other times affiliation[]=ABCC. These are all acceptable. If you have multiple searches to perform, it is always better to use q[]=SEARCH1&q[]=SEARCH2. For a field that you have to specify explicitly, q[]=affiliation:SEARCHTERM is the same as affiliation[]=SEARCHTERM.
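
The two equivalent forms can be generated programmatically. The Python sketch below only builds the URLs and does not contact the server; the function names q_form and field_form are hypothetical, and leaving the brackets and colon unencoded is a readability choice rather than a documented requirement.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"


def q_form(field, terms):
    """Build the q[]=FIELD:TERM form for one or more terms."""
    pairs = [("q[]", f"{field}:{term}") for term in terms]
    return BASE_URL + "?" + urlencode(pairs, safe="[]:")


def field_form(field, terms):
    """Build the FIELD[]=TERM form for the same terms."""
    pairs = [(f"{field}[]", term) for term in terms]
    return BASE_URL + "?" + urlencode(pairs, safe="[]")


print(q_form("affiliation", ["ABCC"]))
print(field_form("affiliation", ["ABCC"]))
```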

Why isn't there an API for ToTeM's BioQuery?

BioQuery is a unique feature within ToTeM, and we reserve it for our web-based users to reduce robots and restrict programmatic use. Contact us if you would like to have a permanent link set up.

Are there restrictions on special characters?

Yes, unfortunately to guard against security threats, we only allow certain characters. Some characters (such as accents) may not be displayed correctly and searches may fail.

What are acceptable BioQuery input types?

Currently, the only acceptable BioQuery inputs are gene/protein identifiers (including Ensembl, UniProt Accession and Entry Names), drug names, and GO IDs. We are working on other extended searches. All inputs must be of the same type, i.e. you cannot mix drug names with gene symbols in the same search.

Results

How are results returned?

Currently, results from the ToTeM search page are returned on the web page as a table. This option is ideal for exploratory analysis. If you would like to download results or use our REST API calls, see the "Web Services - API" section, which describes how to return other formats for the MEDLINE results only. BioQuery results are only available through the online web submission form and can only be downloaded through the web.

Why do "Count By Fields" queries return slower than normal queries?

Solr caches the results, so the initial query may take longer than subsequent queries. Because Solr is performing counts on each field (i.e. the field that the user requested the counts for), it may take longer than a normal query. We also return all results, and therefore it takes longer for the web page to display them.

What do the brackets in the Results mean?

The brackets mean that the field was translated from another language. Since MEDLINE is a globally used resource, it includes journal articles from other countries that have been translated from the language of the original journal into English.

Is there a limit to the number of hits returned?

Currently, on our web interface, we do not limit the number of hits returned, as the web page handles pagination. For our REST API, we limit the number returned to 1000 per request. By default, the number of rows returned is 10, so to get the maximum you must specify "&rows=1000" and a start position "&start=0". Querying, writing results to the page, the amount of available memory on the server and I/O all affect the time it takes to return results. See "REST API - Specify the number of rows returned" for more information on how to retrieve more than 1000.

Why is there a significant difference between the query time reported on the page and the time I see results?

Rendering the webpage takes significantly longer than querying MEDLINE. The lag time between the query and the webpage is due to extra functions that the webpage is performing (such as highlighting and adding links), especially when performing linking for faceted queries.

Why does it say that BioQuery Results failed?

BioQuery relies on bioDBnet, a database converter tool which can convert disparate identifiers between all organisms. Since many genes are the same between species, but have different capitalizations, the query is case sensitive. Although we try to mitigate as much as possible, all lower case genes may sometimes fail at this stage, but still may have results in Solr, since Solr is case insensitive.

Can I use the API to download the BioQuery Results?

Due to the computational nature of BioQueries, BioQuery results can only be downloaded from the web interface. You may use bioDBnet if you wish to do batch searches on any concept.

Why aren't all my sessions showing up on the menu?

ToTeM only tracks activity of the user when they submit a query through the web sessions, not API calls. If you try to directly call a page using any parameters on the URL, the page will not be recorded.

What are the different output formats?

The ToTeM web service page allows different output types for your final results. They are standardized formats, and each is listed below with a short description:
  • JSON - JSON (JavaScript Object Notation) is used for many web applications
  • CSV - Comma separated list of fields. Fields with commas in the results will be surrounded by quotes
  • APA - APA (American Psychological Association) style is the most used in social sciences.
  • MLA - MLA (Modern Language Association) is the most commonly used style within the liberal arts and humanities.
  • BibTeX - reference management software for formatting lists of references. It is used together with the LaTeX document preparation system.
  • NBIB (Endnote) - this is a file extension associated with the Citation manager service developed by the U.S. National Library of Medicine.
You should also specify the number of rows returned as only 10 are returned by default.
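
Because CSV fields that contain commas are surrounded by quotes, the output should be read with a CSV parser rather than split on commas. The Python sketch below uses invented sample text, and it assumes the first row is a header of field names, which this page does not confirm.

```python
import csv
import io

# Hypothetical sample in the shape described above; not real ToTeM output.
sample = (
    "id,authors,title\n"
    '12345,"Doe J, Roe R","A hypothetical title, with a comma"\n'
)

# csv handles the quoted fields, so embedded commas stay inside one value.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["id"], "|", row["authors"], "|", row["title"])
```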

Visualization

What are the current visualization options?

Currently, we offer Cytoscape visualization of grouped queries, such as Authors and MeSH terms, and of BioQueries. Using Cytoscape.js, you can explore these data with the filter and search options provided on the website. You can also choose which type of graph is used to display your data. The different graph types are explained on a separate linked page. You can also create a snapshot of your visualization and save it to your desktop for use in publications. We currently only retain the network data for 7 days. From the web form, submit a query:
Figure 1. ToTeM web form submission page.


The visualization option is only available through the web. On the results page, select "Visualization Options" in the side menu.
Figure 2. Results page from a BioQuery Submission with "BRAF" and all related lookup terms selected.

In the drop down menu, select the network categorization that you wish to view in Cytoscape.
Figure 3. Closeup view of the Visualization options. For this tutorial, we selected "BioQuery Results" to visualize.

Click "Go Viz!" and your visualization will pop up in another tab or window.
Figure 4. Cytoscape visualization of BioQuery Results for BRAF and all related search terms.

There are navigation options on the right hand menu which allow you to change the graph type, search for a node, filter the nodes and save your file. The save option allows you to save the graph displayed in the window. To save to your desktop, right-click and select "Save Image As". A menu should appear allowing you to save to your local computer.

To View Trends

Sparkline is used to visualize trends between co-occurring concepts through each decade. From Figure 2 above, click on the button "View By Trends By Year". By default, there should be trend lines for Cell Lines and Body Sites by decade for your query term.
Figure 5. Trendlines for BioQuery BRAF results

If your initial query was a BioQuery, you will also have the categories listed in the drop down menu. You can change the trendlines according to your selected category to view the trends through the decades.

Can I upload my own file for network visualization?

Not currently. ToTeM takes JSON as input, but it must be in the specific format used by Cytoscape.js, and we do not currently offer an option to convert other files to that format. You can use the "Custom facet" option to view the results for a set of data in MEDLINE, but the nodes and edges are built from the counts of each term and its relation to your initial query. To view the custom faceting results in Cytoscape, choose the Visualization Options in the left hand menu on the results page, select BioQuery Results in the drop down menu and click "Go Viz!" On the Cytoscape visualization page,

Web Services - API

General Information

We have three ways of accessing ToTeM.
  1. BEGINNER: Use ToTeM web interface to specify return formats
  2. INTERMEDIATE: Use ToTeM URL to specify a query term and return the interactive web page
  3. ADVANCED: Use ToTeM's Query Builder to help build complex queries so you can directly access from any application.
Any of the queries described in this section can be used as an API call by adding "&wt=json" or "&wt=csv" to the URL. None of these methods returns BioQuery results. To return BioQuery results, you must use the web interface to submit your query.

For specific output types for regular queries, see "REST API - Return a different format" below.

Basic text query

The basic query for text search is in the format:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF
where "BRAF" is your text query. When a database field (e.g. authors, title, gene) is not specified, ToTeM searches all of the fully indexed fields.

Search on a specific field

By default, the basic search will search on all indexed text. If you are looking for text in a specific field, you can specify that by adding the field name to the query like:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?chemical=Indoles
where "chemical" is the field identifier and "Indoles" is the text you wish to search on. In the equivalent q= form, the two are separated by a colon (:), i.e. q=chemical:Indoles. The field identifiers for MEDLINE are:
  • id - PMID identifier
  • authors - Authors' Name
  • journaltitle - Journal Title
  • abstract - The abstract of the article
  • title - Title of the Journal Article or Chapter
  • volume_data - Volume, Issue and Page (if available)
  • MeshTerms - MeSH terms
  • chemical - Chemicals (if available)
  • gene - Genes (if available)
  • pubyear - Year of publication
  • grantCountry - Country of grant
  • grantID - The grant funding IDs (if available)
  • grantAgency - The grant funding agency (if available)

Field identifiers are case sensitive.
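
Since a field identifier with the wrong capitalization will not match, it can help to check it before building the URL. The following Python sketch is one way to do that; the function name field_search_url is hypothetical and the check is done on the client side only.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"

# Field identifiers copied from the list above, with their exact capitalization.
MEDLINE_FIELDS = (
    "id", "authors", "journaltitle", "abstract", "title", "volume_data",
    "MeshTerms", "chemical", "gene", "pubyear",
    "grantCountry", "grantID", "grantAgency",
)


def field_search_url(field, text):
    """Build a FIELD=TEXT search URL, rejecting unknown or mis-cased fields."""
    if field not in MEDLINE_FIELDS:
        raise ValueError(f"unknown field identifier (check capitalization): {field!r}")
    return BASE_URL + "?" + urlencode({field: text})


print(field_search_url("chemical", "Indoles"))
```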

Multi word search

The basic syntax for a text search for two or more words is to add a '+' sign between them:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF+PTEN
or
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q[]=BRAF&q[]=PTEN
You may specify as many as you wish, up to your browser's URL length limit. By default, ToTeM returns matches on any of the terms you provided. If you wish to perform a query that must include all of your terms, add "&q_op=AND":
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF+PTEN&q_op=AND
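
The multi-word URLs above can be assembled with Python's standard library, which encodes the space between terms as '+'. This is a sketch that only builds the URL; the function name keyword_url and its require_all argument are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"


def keyword_url(terms, require_all=False):
    """Join search terms into one q= parameter; add q_op=AND to require all of them."""
    params = {"q": " ".join(terms)}
    if require_all:
        params["q_op"] = "AND"
    return BASE_URL + "?" + urlencode(params)


print(keyword_url(["BRAF", "PTEN"]))
print(keyword_url(["BRAF", "PTEN"], require_all=True))
```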

Specify a year range

You can specify a numeric range in your query by using the syntax:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=pubyear:[1900 TO 2001]
where the date range is 1900-2001.
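
When the range query is built from code, the spaces around TO need to be URL-encoded. The Python sketch below encodes them as '+' and leaves the colon and brackets readable; the function name year_range_url is hypothetical, and the sketch only builds the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"


def year_range_url(first_year, last_year):
    """Build a pubyear:[FIRST TO LAST] range query URL."""
    if first_year > last_year:
        raise ValueError("first_year must not be later than last_year")
    clause = f"pubyear:[{first_year} TO {last_year}]"
    # Keep ':' and the brackets unescaped; spaces become '+'.
    return BASE_URL + "?" + urlencode({"q": clause}, safe=":[]")


print(year_range_url(1900, 2001))
```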

Faceted Query

A faceted query returns the specified query with counts based on a specified field. For example, if you wanted to search on the term "BRAF" but return the counts by author, you would use the form:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&facet=true&facet.field=authors_facet
where you specify:
  • text query: q=BRAF
  • turn on faceting: facet=true
  • specify facet field: facet.field=authors_facet

Faceted searches take longer since they are searching and returning all results within the text, whereas other searches return in batches.
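
The three faceting parameters can be combined as in the Python sketch below. Only authors_facet appears on this page, so any other facet field name passed to this hypothetical facet_url helper would be an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"


def facet_url(query, facet_field="authors_facet"):
    """Build a faceted query URL: text query, facet switch, and facet field."""
    params = {"q": query, "facet": "true", "facet.field": facet_field}
    return BASE_URL + "?" + urlencode(params)


print(facet_url("BRAF"))
```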

REST API - Return a different format

You can return a list of results by using the API. This is how you specify a return format other than json.
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&wt=csv
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&wt=json
The accepted options are json, csv, mla, apa, bibtex and nbib; otherwise a JSON object is returned by default. This option only returns the first 1000 rows. You can also specify which fields to return by using the 'fl' parameter:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&wt=csv&fl=id,authors
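
A minimal Python sketch of building such a URL follows. It checks the wt value against the accepted options listed above; the function name export_url is hypothetical, and the sketch builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"
FORMATS = ("json", "csv", "mla", "apa", "bibtex", "nbib")


def export_url(query, wt="json", fields=None):
    """Build an API URL with a return format and an optional field list."""
    if wt not in FORMATS:
        raise ValueError(f"wt must be one of {FORMATS}")
    params = {"q": query, "wt": wt}
    if fields:
        params["fl"] = ",".join(fields)
    # Keep the commas in the field list readable.
    return BASE_URL + "?" + urlencode(params, safe=",")


print(export_url("BRAF", wt="csv", fields=["id", "authors"]))
```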

REST API - Limit the fields

To limit the database fields returned, you need to specify the field list :
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&fl=authors,title
where field list is specified by fl=authors,title
If you wish to specify the fields returned in your API results, the acceptable field names are listed below, each followed by a brief description of the field. All other inputs will be ignored. Field names are case sensitive.
  • id - PMID identifier
  • authors - Authors' Name
  • affiliation - Author affiliation (if available)
  • journaltitle - Journal Title
  • title - Title of the Journal Article or Chapter
  • volume_data - Volume, Issue and Page (if available)
  • MeshTerms - MeSH terms
  • chemical - Chemicals (if available)
  • pubyear - Year of publication
  • grantCountry - Country of grant
  • grantID - The grant funding IDs (if available)
  • grantAgency - The grant funding agency (if available)

REST API - Specify the number of rows returned

To limit the rows returned for the API, you need to specify the rows. By default, only the first 10 rows are returned. To specify more :
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&rows=100&wt=csv
Caution: Requesting all rows at once may cause the website to crash, so we limit the number of rows returned per request to 1000. If you need more than 1000, modify your query to return 1000 at a time by specifying the start row and the number of rows returned:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF&rows=1000&start=1000&wt=csv
This will return the NEXT set of 1000 rows.
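
Retrieving a large result set therefore means stepping the start parameter in blocks of at most 1000. The Python sketch below generates the sequence of URLs; the function name page_urls is hypothetical, and total is assumed to be a hit count you already know (for example from the web interface), since this page does not describe how the API reports it.

```python
from urllib.parse import urlencode

BASE_URL = "https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php"
MAX_ROWS = 1000  # per-request limit stated above


def page_urls(query, total, page_size=MAX_ROWS, wt="csv"):
    """Yield one URL per page of results, stepping start by page_size."""
    if not 1 <= page_size <= MAX_ROWS:
        raise ValueError(f"page_size must be between 1 and {MAX_ROWS}")
    for start in range(0, total, page_size):
        params = {"q": query, "rows": page_size, "start": start, "wt": wt}
        yield BASE_URL + "?" + urlencode(params)


for url in page_urls("BRAF", total=2500):
    print(url)
```

Each URL can then be fetched in turn, ideally with a pause between requests to avoid overloading the server.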

REST API - Retrieve a previous submission

Each query is given a unique identifier and is retained for 1 week after submission. To retrieve a specific identifier:
https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?totem_id=TOTEM_IDENTIFIER
Replace "TOTEM_IDENTIFIER" with your unique identifier.

Perl Script

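# Fetch ToTeM results for the text query "BRAF" and print them to STDOUT.
use strict;
use warnings;
use LWP::UserAgent;

# Query URL; change the q= value to run a different search.
my $url = 'https://bioinfo-abcc.ncifcrf.gov/totem/results_template3.php?q=BRAF';
my $contact = ''; # Please set your email address here to help us debug in case of problems.
my $agent = LWP::UserAgent->new(agent => "libwww-perl $contact");
# Follow redirects for GET requests.
push @{$agent->requests_redirectable}, 'GET';
my $response = $agent->get($url);

# If the server asks us to wait (Retry-After header), sleep and try again.
while (my $wait = $response->header('Retry-After')) {
  print STDERR "Waiting ($wait)... ";
  sleep $wait;
  $response = $agent->get($response->base);
}

# Print the body on success; otherwise report the HTTP status and stop.
if ($response->is_success) {
  print $response->content;
} else {
  die 'Failed, got ' . $response->status_line . ' for ' . $response->request->uri . "\n";
}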