How Tos

Here are some examples of how to use CyPRUS:

  1. Click on "Search and Visualize Features" from the menu tab
  2. Enter a list of protein names, UniProt accessions, UniProt entry names, or protein names with amino acid positions in the "Inputs" field. Entries are case-sensitive and can be comma-separated or entered one per line. For an amino acid position with a mutation, use the format <Protein_Name>:<Original_Residue><Position><Mutated_Residue> (for example, MTOR:A8S).
  3. Select an organism from the select menu. The protein identifiers must match the selected organism.
  4. Click the "Submit" button
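The input format described in step 2 can be illustrated with a small parser. This is only a sketch of how such input might be split and validated, not CyPRUS's actual code; the names `Query` and `parse_inputs` are hypothetical, and the assumption that residues are single uppercase letters is taken from the MTOR:A8S example.

```python
import re
from typing import List, NamedTuple, Optional


class Query(NamedTuple):
    """One entry from the "Inputs" field (hypothetical structure)."""
    protein: str
    original: Optional[str] = None
    position: Optional[int] = None
    mutated: Optional[str] = None


# <Protein_Name>:<Original_Residue><Position><Mutated_Residue>, e.g. MTOR:A8S
# Assumption: residues are single uppercase letters (input is case-sensitive).
_MUTATION = re.compile(r"^([^:\s]+):([A-Z])(\d+)([A-Z])$")


def parse_inputs(text: str) -> List[Query]:
    """Split comma-separated or multi-line input into queries."""
    queries = []
    for token in re.split(r"[,\n]+", text):
        token = token.strip()
        if not token:
            continue
        match = _MUTATION.match(token)
        if match:
            name, original, position, mutated = match.groups()
            queries.append(Query(name, original, int(position), mutated))
        else:
            # Plain protein name, UniProt accession, or UniProt entry name.
            queries.append(Query(token))
    return queries
```

An entry that does not match the mutation pattern is kept as a plain identifier, so a mixed list such as "MTOR:A8S, TP53" yields one mutation query and one identifier query.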

Once submitted, an Excel-like table and an image of the first protein in the table are generated on a new page.

[Figure: how to retrieve protein features for gene symbols]

[Figure: result table for protein]

Click on an isoform identifier to view an isoform and its features.

[Figure: result table for gene symbols isoform]

Disrupted features are highlighted in red and marked as "disrupted" in the "Status" column.

[Figure: result table for gene symbols isoform table]

Click on the "Variants" category to view the impacts of the variants. Variant information is obtained from the UniProt feature API (https://www.ebi.ac.uk/uniprot/services/restful/features).

[Figure: variant impact view]

  1. Click on "Customize Features" from the menu tab
  2. Enter a protein name in the "Protein Name" field and select an organism from the select menu. If the "Show Isoforms" radio button is set to "Yes", a drop-down menu lists the isoform IDs that match the entered protein and the selected organism; "canonical" is selected by default. Then enter the list of protein features to display in the viewer, each in <coordinate>:<feature_name> format.
  3. Click on the "Submit" button

[Figure: visualize custom protein features]

Once submitted, a protein feature viewer is generated at the bottom of the page.

[Figure: protein feature viewer]

Users can view their custom features under the "Custom Features" category.
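As an illustration of the <coordinate>:<feature_name> format used for custom features, the sketch below parses one feature per line. The exact coordinate syntax is not specified here, so the code assumes a coordinate is either a single position ("8") or a range ("35-198"); the function name `parse_custom_features` is hypothetical.

```python
from typing import List, Tuple


def parse_custom_features(text: str) -> List[Tuple[int, int, str]]:
    """Parse lines of <coordinate>:<feature_name> into (start, end, name).

    Assumption: <coordinate> is "N" or "N-M" with 1-based positions,
    and each feature is entered on its own line.
    """
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        coordinate, sep, name = line.partition(":")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"expected <coordinate>:<feature_name>, got {line!r}")
        start_text, dash, end_text = coordinate.partition("-")
        start = int(start_text)
        end = int(end_text) if dash else start  # single position: start == end
        if start < 1 or end < start:
            raise ValueError(f"invalid coordinate in {line!r}")
        features.append((start, end, name))
    return features
```

Rejecting malformed lines early, rather than silently skipping them, makes it easier to tell the user which entry needs to be corrected.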

Supported sequence annotations

This tool extracts sequence features for gene(s) or gene position(s) from human UniProt entries. For the "Search and Visualize Features" option, the input is a list of gene symbols or gene positions, either one per line or in CSV format. Users can select the feature type(s) they want to extract. If no feature type is selected, the tool returns all sequence features for that gene, or all features affected by that position. Gene positions take the form <Gene_symbol>:<Original_Residue><Position><Mutated_Residue> (for example, MTOR:A8S); for these inputs the tool returns all, or only the selected, features affected by the mutation(s).
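The selection logic described above (all features, or only selected types, that a position falls within) can be sketched as a simple interval check. This is an illustration under assumptions, not the tool's implementation: the dictionary keys, the function name `affected_features`, and the example features are all made up for the sketch and are not real UniProt annotations.

```python
from typing import Dict, Iterable, List, Optional

# Illustrative features only; coordinates and descriptions are invented.
EXAMPLE_FEATURES = [
    {"type": "domain", "begin": 1, "end": 20, "description": "example domain"},
    {"type": "mod_res", "begin": 8, "end": 8, "description": "example modified residue"},
    {"type": "helix", "begin": 30, "end": 40, "description": "example helix"},
]


def affected_features(
    features: Iterable[Dict],
    position: int,
    feature_types: Optional[Iterable[str]] = None,
) -> List[Dict]:
    """Return features whose range contains `position`.

    If `feature_types` is empty or None, every feature type is returned,
    mirroring the "no feature type selected" behaviour described above.
    """
    wanted = set(feature_types) if feature_types else None
    return [
        feature
        for feature in features
        if feature["begin"] <= position <= feature["end"]
        and (wanted is None or feature["type"] in wanted)
    ]
```

For an input such as MTOR:A8S, the position passed in would be 8; the residue letters do not affect which features overlap the position in this sketch.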

Currently, the following features are supported by this tool (the list follows http://www.uniprot.org/help/sequence_annotation):

Molecule processing

  • chain - Extent of a polypeptide chain in the mature protein
  • peptide - Extent of an active peptide in the mature protein
  • signal - Sequence targeting proteins to the secretory pathway or periplasmic space
  • transit - Extent of a transit peptide for organelle targeting
  • init_met - Cleavage of the initiator methionine
  • propep - Part of a protein that is cleaved during maturation or activation

Regions

  • region - Region of interest in the sequence
  • domain - Position and type of each modular protein domain
  • repeat - Positions of repeated sequence motifs or repeated domains
  • zn_fing - Position(s) and type(s) of zinc fingers within the protein
  • motif - Short (up to 20 amino acids) sequence motif of biological interest
  • compbias - Region of compositional bias in the protein
  • topo_dom - Location of non-membrane regions of membrane-spanning proteins
  • np_bind - Nucleotide phosphate binding region
  • transmem - Extent of a membrane-spanning region
  • dna_bind - Position and type of a DNA-binding domain
  • ca_bind - Position(s) of calcium binding region(s) within the protein
  • coiled - Positions of regions of coiled coil within the protein
  • lipid - Covalently attached lipid group(s)
  • intramem - Extent of a region located in a membrane without crossing it

Amino acid modifications

  • mod_res - Modified residues excluding lipids, glycans and protein cross-links
  • carbohyd - Covalently attached glycan group(s)
  • non_std - Occurrence of non-standard amino acids (selenocysteine and pyrrolysine) in the protein sequence
  • disulfide - Cysteine residues participating in disulfide bonds
  • crosslnk - Residues participating in covalent linkage(s) between proteins

Natural variants

  • variant - Description of a natural variant of the protein

Natural variants - substitution

  • substitution - When original sequence length is > 1 and variant sequence length is > 1

Natural variants - insertion

  • insertion - When original sequence length is == 1 or 0 and variant sequence length is > 1

Natural variants - deletion

  • deletion - When original sequence length is > 1 or 0 and variant sequence length is == 0
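The three length rules above can be read as a small classifier. The sketch below applies them literally as written; the function name `classify_variant` and the fallback label "variant" for cases that none of the three rules cover (for example, a single-residue change) are assumptions made for illustration.

```python
def classify_variant(original: str, variation: str) -> str:
    """Classify a natural variant by the lengths of its sequences.

    Literal reading of the rules listed above:
      substitution: original length > 1 and variant length > 1
      insertion:    original length is 1 or 0 and variant length > 1
      deletion:     original length is > 1 or 0 and variant length is 0
    Anything else falls back to the generic "variant" label (assumption).
    """
    n_original, n_variant = len(original), len(variation)
    if n_original > 1 and n_variant > 1:
        return "substitution"
    if n_original in (0, 1) and n_variant > 1:
        return "insertion"
    if (n_original > 1 or n_original == 0) and n_variant == 0:
        return "deletion"
    return "variant"
```

The three conditions are mutually exclusive, so the order in which they are tested does not change the result.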

Experimental information

  • conflict - Description of sequence discrepancies of unknown origin
  • mutag - Site which has been experimentally altered by mutagenesis
  • unsure - Regions of uncertainty in the sequence
  • non-cons - Indicates that two residues in a sequence are not consecutive
  • non-ter - The sequence is incomplete. Indicates that a residue is not the terminal residue of the complete protein

Secondary structure

  • helix - Helical regions within the experimentally determined protein structure
  • turn - Turns within the experimentally determined protein structure
  • strand - Beta strand regions within the experimentally determined protein structure

Sites

  • site - Any interesting single amino acid site on the sequence
  • act_site - Amino acid(s) directly involved in the activity of an enzyme
  • binding - Binding site for any chemical group (co-enzyme, prosthetic group, etc.)
  • metal - Binding site for a metal ion

Variants

  • user_variant - User inputted variants

User Inputted Feature

  • user_feature - User inputted feature

UniProt version and statistics

DATABASE  VERSION  RELEASE_DATE  UPDATE_DATE  COUNT
uniprot   202102   14-SEP-21     19-SEP-21    751465

Data source and process

We download UniProt XML files from ftp://ftp.uniprot.org/pub/databases/uniprot/ and load the data into internal MongoDB collections. The entire process is automated: scheduled cron jobs update the database whenever a new release is available on the UniProt site.
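The loading step is not described in detail, so the following is only a minimal sketch of how feature records might be streamed out of a UniProt XML file before being stored. It uses the Python standard library and the UniProt XML namespace; the function name `iter_feature_docs`, the output field names, and the sample entry (accession X00001) are made up for illustration, and the MongoDB insertion itself is left out.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Optional

NS = "{http://uniprot.org/uniprot}"  # UniProt XML namespace

# Made-up entry for illustration; not a real UniProt record.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <feature type="chain" description="Example protein">
      <location><begin position="1"/><end position="120"/></location>
    </feature>
    <feature type="modified residue" description="Phosphoserine">
      <location><position position="8"/></location>
    </feature>
  </entry>
</uniprot>"""


def _position(element: Optional[ET.Element]) -> Optional[int]:
    """Read the integer 'position' attribute, if present."""
    value = element.get("position") if element is not None else None
    return int(value) if value is not None else None


def iter_feature_docs(xml_source) -> Iterator[Dict]:
    """Yield one dictionary per <feature>, entry by entry."""
    for _, entry in ET.iterparse(xml_source, events=("end",)):
        if entry.tag != NS + "entry":
            continue
        accession = entry.findtext(NS + "accession")
        for feature in entry.iter(NS + "feature"):
            location = feature.find(NS + "location")
            if location is None:
                continue
            single = location.find(NS + "position")
            if single is not None:  # single-residue feature
                begin = end = _position(single)
            else:                   # ranged feature
                begin = _position(location.find(NS + "begin"))
                end = _position(location.find(NS + "end"))
            yield {
                "accession": accession,
                "type": feature.get("type"),
                "description": feature.get("description"),
                "begin": begin,
                "end": end,
            }
        entry.clear()  # free memory before moving to the next entry


docs = list(iter_feature_docs(io.BytesIO(SAMPLE)))
```

Streaming with `iterparse` and clearing each entry after use keeps memory use bounded, which matters for the full UniProt files; each yielded dictionary could then be written to a database collection.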

[Figure: data processing flow chart]

Isoform feature coordinate calculation

First, features that overlap with a deletion event are removed from the graphical viewer. The coordinates of the remaining features are then recalculated according to the splice events.

For example, the canonical form of CASC4 has a topological domain (Cytoplasmic) at 1-14 and a transmembrane region (Helical; Signal-anchor for type II membrane protein) at 15-35. Both features are disrupted in isoform Q6P4E1-5 because of a 22-amino-acid deletion.

An intact coiled-coil region is shifted from 35-198 to 13-176 (35 - 22 = 13, 198 - 22 = 176), and a substitution (20 aa in the original sequence, 23 aa in the substitute sequence) is shifted from 414-433 to 392-414 (414 - 22 = 392, 433 - 22 + (23 - 20) = 414).
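The arithmetic in the CASC4 example can be reproduced with a short sketch. It is an illustration of the offset calculation, not the tool's code: the names `SpliceEvent`, `map_feature`, and `map_event` are hypothetical, and the deletion is placed at positions 1-22 purely as an assumption, since only its length (22 aa) is stated.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpliceEvent:
    start: int       # first canonical residue affected (1-based, inclusive)
    end: int         # last canonical residue affected (inclusive)
    new_length: int  # residues present in the isoform instead (0 = deletion)

    @property
    def delta(self) -> int:
        """Change in sequence length caused by this event."""
        return self.new_length - (self.end - self.start + 1)


def map_feature(start: int, end: int,
                events: List[SpliceEvent]) -> Optional[Tuple[int, int]]:
    """Map a canonical feature to isoform coordinates.

    Returns None when a deletion overlaps the feature (it is disrupted).
    """
    offset = 0
    for event in events:
        if event.end < start:
            offset += event.delta  # event lies entirely upstream
        elif event.start <= end and event.new_length == 0:
            return None            # overlapped by a deletion
    return start + offset, end + offset


def map_event(event: SpliceEvent, events: List[SpliceEvent]) -> Tuple[int, int]:
    """Isoform coordinates of a substitution event itself."""
    offset = sum(e.delta for e in events if e.end < event.start)
    new_start = event.start + offset
    return new_start, new_start + event.new_length - 1


# Deletion position 1-22 is hypothetical; substitution is 20 aa -> 23 aa.
deletion = SpliceEvent(start=1, end=22, new_length=0)
substitution = SpliceEvent(start=414, end=433, new_length=23)
events = [deletion, substitution]
```

With these events, `map_feature(35, 198, events)` gives the shifted coiled-coil coordinates and `map_event(substitution, events)` gives the shifted substitution, while the features at 1-14 and 15-35 map to `None` because the assumed deletion overlaps them.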

When visualizing isoform features, deletions and insertions are accounted for in the isoform length, and substitutions are shown as blue rectangles.

[Figure: isoform features]

Human orthologs

From the HomoloGene database, we obtained 13 organisms that share common HomoloGene identifiers. We then use bioDBnet (biological DataBase network) to convert identifiers from one organism into the homolog identifiers of another organism.

Users can remove or sort the organisms based on their requirements.

The 13 species are:

  1. Chimpanzee
  2. Rhesus Monkey
  3. Mouse
  4. Rat
  5. Dog
  6. Cow
  7. Chicken
  8. Yeast
  9. Mosquito
  10. Fruit Fly
  11. Zebrafish
  12. Roundworm
  13. Thale cress

[Figure: orthologs search]