Parsing rules for Sequence Repository


Posted by ProlineAdmin on Thu 9 Feb - 9:29

The Sequence Repository module is needed to retrieve protein sequences from fasta files and to calculate related information such as coverage.
For efficiency, it is best installed on the same computer that runs your Mascot Server, since this module accesses and reads the fasta files.

The parsing_rules file is configured using regular expressions (Java syntax). To build a valid regular expression, the RegEx site is very helpful.
Here are three examples of parsing rules.


  1. Case one: Uniprot file. Suppose your fasta files are named like uniprot_<someTextWithOUT'_'>_2017_02.fasta. In these files, entries look like
    >sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
    and the protein name to extract is 6PGD2_YEAST.
  2. Case two: Uniprot file 2. Suppose your fasta files are named like UP_<someTextWithOUT'_'>_20170225.fasta. In these files, entries look like
    >sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=GND2 PE=1 SV=1
    and the protein name to extract is P53319.
  3. Case three: TAIR file. Suppose your fasta files are named like TAIR10_<anyText>.fasta. In these files, entries look like
    >AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein | chr1:19047960-19049967 FORWARD LENGTH=392
    and the protein name to extract is AT1G51380.1.


The parsing rules should be:

parsing-rules = [{
   name="uniprot1",
   fasta-name=["uniprot"],                          // every file whose name starts with 'uniprot' is handled by this rule
   fasta-version="uniprot_[^_]*_(.*).fasta",        // the Uniprot version is extracted between the second '_' and the .fasta extension: 2017_02 in Case 1
   protein-accession =">\\w{2}\\|[^\\|]*\\|(\\S+)"  // extracts the last part of the entry identifier as accession: 6PGD2_YEAST in Case 1
},
{
   name="uniprot2",
   fasta-name=["UP_"],                              // every file whose name starts with 'UP_' is handled by this rule
   fasta-version="UP_[^_]*_(.*).fasta",             // the Uniprot version is extracted between the second '_' and the .fasta extension: 20170225 in Case 2
   protein-accession =">\\w{2}\\|([^\\|]+)\\|"      // extracts the second part of the entry identifier as accession: P53319 in Case 2
},
{
   name="TAIR",
   fasta-name=["TAIR"],                             // every file whose name starts with 'TAIR' is handled by this rule
   fasta-version="TAIR([^_]*)_.*.fasta",            // the TAIR version is extracted after the word 'TAIR' and before the first '_': 10 in Case 3
   protein-accession =">(\\S+)"                     // the protein accession is extracted from the start of the entry up to the first space
}]
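
Before restarting the module with new rules, it can help to try the regular expressions on a sample entry and a sample file name. The following is only a minimal sketch, not part of Proline: the class ParsingRuleCheck, the helper firstGroup and the three file names are hypothetical, and it assumes that the first capture group of the first match is the value that gets extracted. The uniprot1 version pattern is written here in the same shape as the uniprot2 one. Since the rules file and Java source code both use doubled backslashes inside quoted strings, the patterns can be pasted as they are.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParsingRuleCheck {

    // Returns the first capture group of the first match, or null when the regex does not match.
    // Assumption: the extracted value is capture group 1.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String uniprotEntry = ">sp|P53319|6PGD2_YEAST 6-phosphogluconate dehydrogenase, "
                + "decarboxylating 2 OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) "
                + "GN=GND2 PE=1 SV=1";
        String tairEntry = ">AT1G51380.1 | Symbols:  | DEA(D/H)-box RNA helicase family protein "
                + "| chr1:19047960-19049967 FORWARD LENGTH=392";

        // protein-accession patterns of the three rules
        System.out.println(firstGroup(">\\w{2}\\|[^\\|]*\\|(\\S+)", uniprotEntry)); // 6PGD2_YEAST
        System.out.println(firstGroup(">\\w{2}\\|([^\\|]+)\\|", uniprotEntry));     // P53319
        System.out.println(firstGroup(">(\\S+)", tairEntry));                       // AT1G51380.1

        // fasta-version patterns, tried on hypothetical file names
        System.out.println(firstGroup("uniprot_[^_]*_(.*).fasta", "uniprot_yeast_2017_02.fasta")); // 2017_02
        System.out.println(firstGroup("UP_[^_]*_(.*).fasta", "UP_yeast_20170225.fasta"));          // 20170225
        System.out.println(firstGroup("TAIR([^_]*)_.*.fasta", "TAIR10_pep.fasta"));                // 10
    }
}
```

If a call prints null, the pattern does not match the entry or file name at all; if it prints an unexpected value, check which part of the pattern is enclosed in the parentheses.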
