Draft gff3 standard for representation of CRISPR Arrays

Link to beta version of Perl code for conversion to gff3 (CRISPRFileToGFF.pl)

What it does:

Converts the output of CRT, PILER-CR, CRISPRFinder and CRISPRCasFinder (JSON) to gff3 format. CRISPRDetect already produces gff as one of its outputs, so it needs no conversion.

The gff3 file can then be used for display, for processing, or as input to other software, e.g. CRISPRStudio.

Motivation:

The basic structure of a CRISPR array is a number of direct repeats (DR) separated by unique spacers. Functional arrays will be associated with Cas genes on the same genome, often nearby.

Early software for CRISPR detection grew from programs that described repeats. More recently, software has tried to capture more information, e.g. the CRISPR system type or the strand. This has been extended to evolutionary information, e.g. mutations in the DR or sequence comparisons. Examples are MinCED, CRT, CRISPRFinder and CRISPRDetect.

A number of different programs are available to predict CRISPR arrays and systems in genomes, and their output formats differ. A common standard would facilitate downstream analysis of these arrays, e.g. prediction of targets or genome comparisons. Databases of CRISPRs are also available, but these use different formats as well.

Here we define a draft standard that will facilitate interchange of data, and provide a conversion program. The standard is extensible and uses gff3, an approach introduced with CRISPRDetect.

Examples of use:

[CRISPRConvert]$ perl CRISPRFileToGFF.pl -in sample_crt.txt -out crt_out.gff

[CRISPRConvert]$ perl CRISPRFileToGFF.pl -in sample_ccf.json -out ccf_out.gff

[CRISPRConvert]$ perl CRISPRFileToGFF.pl -in sample_cf.txt -out cf_out.gff

[CRISPRConvert]$ perl CRISPRFileToGFF.pl -in sample_pilercr.txt -out test.gff

Notes on the different input formats:

For CRISPRCasFinder, all the information is available, so it is written directly to the GFF.

For CRT, PILER-CR and CRISPRFinder, strand and score are not available, so score is set to NA and strand to "+", since the data are extracted from the original sequence provided.

For PILER-CR, the consensus repeat is given and so are the coordinates of the individual repeats, but the individual repeat sequences are not. The start-end coordinates of the repeats are therefore taken as given, and the consensus is used as the sequence of each repeat.

For CRISPRFinder, the consensus repeat is given, but neither the individual repeat sequences nor their start-end coordinates are given. The start-end coordinates of the repeats are therefore calculated from the start-end coordinates of the entire array and of the spacers, and the consensus is used as the sequence of each repeat.
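The coordinate calculation for CRISPRFinder input can be sketched as follows. This is a minimal Python illustration, not the code in CRISPRFileToGFF.pl. It assumes 1-based inclusive coordinates and an array that starts and ends with a repeat, and the function name repeats_from_spacers is hypothetical.

```python
def repeats_from_spacers(array_start, array_end, spacers):
    """Infer repeat coordinates from array and spacer coordinates.

    Assumptions (not stated in the source): coordinates are 1-based and
    inclusive, and the array begins and ends with a repeat.
    `spacers` is a list of (start, end) tuples.
    """
    repeats = []
    cursor = array_start  # first base of the next repeat
    for spacer_start, spacer_end in sorted(spacers):
        # The repeat fills the gap up to the base before the spacer.
        repeats.append((cursor, spacer_start - 1))
        cursor = spacer_end + 1
    # The final repeat runs from the last spacer to the end of the array.
    repeats.append((cursor, array_end))
    return repeats


# One array spanning 953775-953888 with a single spacer at 953803-953860.
print(repeats_from_spacers(953775, 953888, [(953803, 953860)]))
```

With one spacer this yields two repeats of 28 bases each, one on either side of the spacer.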

For CRT, individual repeat sequences are given along with their start-end coordinates, but the consensus repeat is not given. The consensus is therefore taken to be the most common repeat sequence (found by tallying the sequences in a hash map).
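The tally for CRT input can be sketched in a few lines. This is an illustrative Python version rather than the Perl implementation, and the function name consensus_repeat is hypothetical. How ties are resolved is an assumption: here the sequence seen first wins.

```python
from collections import Counter


def consensus_repeat(repeat_sequences):
    """Return the most common repeat sequence in an array.

    Ties are broken in favour of the sequence encountered first; this
    tie rule is an assumption, not taken from the source.
    """
    if not repeat_sequences:
        raise ValueError("no repeat sequences supplied")
    tally = Counter(repeat_sequences)  # hash map: sequence -> count
    return tally.most_common(1)[0][0]


# Three repeats, one of which carries a point mutation.
print(consensus_repeat(["GTTCAC", "GTTCAT", "GTTCAC"]))
```

A tally of whole sequences does not build a per-position consensus, so a heavily mutated array may be better served by a column-wise majority vote.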

Output example:

from CRISPRCasFinder

CH482383 CRISPRCasFinder repeat_region 953775 953888 114 + . ID=CRISPR1_953775_953888;Name=CRISPR1_953775_953888;Note=CCGCTGCGCCGACGTTTCATGGCGGATA;Dbxref=SO:0001459;Ontology_term=CRISPR;Array_quality_score=1
CH482383 CRISPRCasFinder direct_repeat 953775 953802 28 + . ID=CRISPR1_REPEAT1_953775_953802;Name=CRISPR1_REPEAT1_953775_953802;Parent=CRISPR1_953775_953888;Note=CCGCTGCGCCGACGTTTCATGGCGGATA;Dbxref=SO:0001459;Ontology_term=CRISPR;Array_quality_score=1
CH482383 CRISPRCasFinder binding_site 953803 953860 58 + . ID=CRISPR1_SPACER1_953803_953860;Name=CRISPR1_SPACER1_953803_953860;Parent=CRISPR1_953775_953888;Note=TGCCTCCCGCGGCCCCGGAGTCGGCCCGTAGGGCGAATAACGCCACCGGCGTTATCCG;Dbxref=SO:0001459;Ontology_term=CRISPR;Array_quality_score=1
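For downstream processing, a line like those above can be split into its nine columns and its attribute pairs. The sketch below is a hedged Python example, not part of the converter. It splits on any whitespace so that it works on the lines as displayed here, whereas gff3 files are normally tab-delimited. The name parse_gff3_line is hypothetical, and the sixth column is kept as raw text because in the example above it equals the feature length.

```python
def parse_gff3_line(line):
    """Split one feature line into named columns and an attribute dict."""
    columns = line.strip().split(None, 8)  # at most nine fields
    if len(columns) != 9:
        raise ValueError("expected 9 columns, found %d" % len(columns))
    seqid, source, ftype, start, end, column6, strand, phase, attrs = columns
    # Attributes are key=value pairs separated by semicolons.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "column6": column6,  # kept as text; holds the length in the example
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


record = parse_gff3_line(
    "CH482383 CRISPRCasFinder direct_repeat 953775 953802 28 + . "
    "ID=CRISPR1_REPEAT1_953775_953802;Name=CRISPR1_REPEAT1_953775_953802;"
    "Parent=CRISPR1_953775_953888;Note=CCGCTGCGCCGACGTTTCATGGCGGATA;"
    "Dbxref=SO:0001459;Ontology_term=CRISPR;Array_quality_score=1"
)
print(record["type"], record["attributes"]["Parent"])
```

The Parent attribute links each direct_repeat and binding_site (spacer) feature back to the ID of its repeat_region, so features can be grouped by array after parsing.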

Native output without conversion from CRISPRDetect (cd_out.gff)

© 2019, University of Otago, Dunedin, New Zealand

Dr Chris Brown and Dr Te-yuan Chyou, Biochemistry and Genetics Otago