CRT format, CRISPRDetect has a similar format

Program source: The CRT (CRISPR Recognition Tool) application can be downloaded from here.

Parameters used: When running CRT, use only the parameters listed below; other options may change the output format, and our program may then fail to identify the information correctly. To avoid format changes, provide an output filename instead of using the '-screen' option.

Sample command: "java -cp CRT1.2-CLI.jar crt ecoli.fna ecoli.out"

Allowed options:
        -minNR        minimum number of repeats a CRISPR must contain; default 3
        -minRL        minimum length of a CRISPR's repeated region;  default 19
        -maxRL        maximum length of a CRISPR's repeated region;  default 38
        -minSL        minimum length of a CRISPR's non-repeated region (or spacer region);  default 19
        -maxSL        maximum length of a CRISPR's non-repeated region (or spacer region);  default 48
        -searchWL    length of search window used to discover CRISPRs; (range: 6-9)
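
As a minimal sketch of how a wrapper script could guard against format-changing options, the helper below builds the CRT command line and rejects anything outside the list above. The function name `build_crt_command` is hypothetical, and placing the options before the input and output filenames is an assumption about CRT's argument order.

```python
# Options listed above as safe to use with CRT; anything else may alter the output format.
ALLOWED_CRT_OPTIONS = {"-minNR", "-minRL", "-maxRL", "-minSL", "-maxSL", "-searchWL"}


def build_crt_command(fasta_in, report_out, options=None, jar="CRT1.2-CLI.jar"):
    """Return the CRT argument list, rejecting options outside the allowed set.

    Hypothetical helper: options are assumed to come before the file names.
    """
    command = ["java", "-cp", jar, "crt"]
    for name, value in (options or {}).items():
        if name not in ALLOWED_CRT_OPTIONS:
            raise ValueError(f"option {name} may change the CRT output format")
        command += [name, str(value)]
    # An explicit output file is given instead of '-screen'.
    return command + [fasta_in, report_out]


print(" ".join(build_crt_command("ecoli.fna", "ecoli.out", {"-minNR": 4})))
```

The resulting list can be passed to `subprocess.run` once Java and the CRT jar are available locally.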

Sample CRT file

PILER-CR format:

To predict CRISPR arrays in your sequence using PILER-CR, click on this link.

Program source: The PILER-CR application can be downloaded from here.

Parameters used: When running PILER-CR, use only the parameters listed below; other options may change the output format, and our program may then fail to identify the information correctly. The "-noinfo" option is required.

Sample command: "pilercr -noinfo -quiet -in ecoli.fna -out ecoli.out"

Allowed options:

		Basic options:
		   -in          Sequence file to analyze (FASTA format).
		   -out         Report file name (plain text).
		   -seq         Save consensus sequences to this FASTA file.
		   -trimseqs              Eliminate similar seqs from -seq file.
		   -noinfo                Don't write help to report file.
		   -quiet                 Don't write progress messages to stderr.

		Criteria for CRISPR detection, defaults in parentheses:
		   -minarray           Must be at least N repeats in array (3).
		   -mincons            Minimum conservation (0.9).
									
				[At least N repeats must have identity >= F with the consensus sequence. Value is in range 0 .. 1.0.
				 It is recommended to use a value < 1.0 because using 1.0 may suppress true arrays due to boundary misidentification.]
				 
		   -minrepeat          Minimum repeat length (16).
		   -maxrepeat          Maximum repeat length (64).
		   -minspacer          Minimum spacer length (8).
		   -maxspacer          Maximum spacer length (64).
		   -minrepeatratio     Minimum repeat ratio (0.9).
		   -minspacerratio     Minimum spacer ratio (0.75).
									
				['Ratios' are defined as minlength / maxlength, thus a value close to 1.0 requires lengths to be 
				  similar, 1.0 means identical lengths. Spacer lengths sometimes vary significantly, so the default
				  ratio is smaller. As with -mincons, using 1.0 is not recommended.]

		Parameters for creating local alignments:
		   -minhitlength       Minimum alignment length (16).
		   -minid              Minimum identity (0.94).
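
The 'ratio' criteria described above can be illustrated with a few lines of Python. This is only a sketch of the minlength / maxlength definition, not PILER-CR's own code; the spacer lengths are made-up example values.

```python
def length_ratio(lengths):
    """Return min(lengths) / max(lengths); 1.0 means all lengths are identical."""
    return min(lengths) / max(lengths)


# Hypothetical spacer lengths from one predicted array.
spacer_lengths = [30, 32, 33, 36]
ratio = length_ratio(spacer_lengths)

# Compare against the default -minspacerratio of 0.75.
print(round(ratio, 3), ratio >= 0.75)
```

Here the shortest spacer is 30 bases and the longest 36, so the ratio is 30/36, which passes the default spacer threshold but would fail the stricter default repeat threshold of 0.9.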

Sample PILER-CR file

CRISPRFinder format, CRISPRCasFinder also supported

Program source: CRISPRFinder is a web application that can be found here.

CRISPRFinder input section: The CRISPRFinder input section is shown in the image below. You can click here to go to the page.

How to obtain CRISPRFinder output :
		
		To perform a CRISPR prediction and obtain the output file, follow these steps:
		A. Upload or paste your genomic sequence in the corresponding text box and press submit.
		B. The next page shows a table with headers 'Confirmed CRISPRs' and 'Questionable CRISPRs' along with links to the corresponding files. 
			Clicking on a link will take you to the corresponding CRISPR arrays visualization.
		C. Clicking the button named "Crispr Properties" opens the output file you need. Save the file; you can then use the file or its 
			content as input in CRISPRTarget.
		
		[Note: You can concatenate all the predicted CRISPR Arrays (individual output files) in one file and upload/paste in CRISPRTarget.]
        
        More detailed description can be found here.

Sample CRISPRFinder file

Upload just the spacers in FASTA or multiFASTA format :

>identifier_1_1
GGGTTGGGGGTTTTA
>identifier_1_12
AACGGCGTTGGGGGTTATT
>identifier_2_1
GCCCAGGTTGGGGGTTCGTT
.
.

Note: If this option is used, the spacer sequence cannot be extended into the adjacent DR. However, the target will be extended to show the flanking handles.

B. Remove redundant spacers.

This option is useful if your input has multiple spacers from a number of related species (e.g. all E. coli spacers). Identical spacers will be removed and listed in a file. However, please note that reverse complements of the spacers are not checked.
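
The behaviour described above can be sketched as follows. This is an illustration of exact-duplicate removal, not CRISPRTarget's implementation; the function name, the upper-casing of sequences before comparison, and the example identifiers are assumptions.

```python
def remove_redundant_spacers(records):
    """Split (identifier, sequence) pairs into unique and removed lists.

    Only exact sequence identity is tested; reverse complements are not
    compared, mirroring the limitation noted in the text.
    """
    first_seen = {}
    unique, removed = [], []
    for identifier, sequence in records:
        key = sequence.upper()  # assumption: comparison ignores case
        if key in first_seen:
            # Record which earlier spacer this one duplicates.
            removed.append((identifier, first_seen[key]))
        else:
            first_seen[key] = identifier
            unique.append((identifier, sequence))
    return unique, removed


records = [
    ("strainA_1_1", "GGGTTGGGGGTTTTA"),
    ("strainB_1_4", "GGGTTGGGGGTTTTA"),  # identical to strainA_1_1
    ("strainB_1_5", "TAAAACCCCCAACCC"),  # reverse complement, kept
]
unique, removed = remove_redundant_spacers(records)
print(unique)
print(removed)
```

The third record is the reverse complement of the first and is retained, which is why a spacer from an array predicted on the opposite strand can still appear twice in the results.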

C. Upload the FASTA sequence file which was used for generating the CRISPR.

Uploading the source sequence of the CRISPR array is optional unless you want a handle region longer than the relevant direct repeat(s). Because most CRISPR-finding tools (except CRISPRFinder) provide both the direct repeats and the spacers, the handle regions can be extracted from the adjacent direct repeats. This approach limits the maximum handle length to the length of the adjacent direct repeat.

2. Select Target databases

Note: Hold down Control key and click on any database from the list to select/unselect multiple or no databases.

Selected databases are provided in CRISPRTarget. The database updates are provided in the news.

In general, databases will be updated after each release of GenBank (even months) or RefSeq (odd months). The nt and env databases come from the BLAST databases and will be updated monthly.

GenBank (BLAST Nucleotide) databases: these are held locally and updated bimonthly with GenBank or RefSeq releases.

Databases are updated with full releases of GenBank and RefSeq. (Note: files (e.g. RefSeq plasmid.1.1.genomic.fna.gz or GenBank gbphg1.seq.gz) are downloaded, converted to FASTA if needed, concatenated and converted to BLAST databases, and BLAST+ is run locally.)

a) The nr/nt collection (~43 billion bases, GenBank). This database contains "All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)." (no longer available due to size).
b) env_nt. This contains "Sequences from environmental samples, such as uncultured bacterial samples isolated from soil or marine samples. The largest single source is Sargasso Sea project. This does not overlap with nucleotide nr". This is part of the whole genome shotgun (wgs) but these sequences have no taxonomic classification other than metagenome.
c) Phage division (phg)

RefSeq databases:

a) RefSeq-Plasmid.
b) RefSeq-Viral.
c) RefSeq-Archaea
d) RefSeq-Bacteria (no longer available due to size).

CAMERA database: We included viral parts of the CAMERA databases. 913,9883 gene sequences, 1 Billion bases (Files: CAM_PROJ_ReclaimedWaterVirues.read.fa, CAM_PROJ_MarineVirome.read.fa, CAM_P_0000909.read.fa, CAM_P_0000792.read.fa).

ACLAME database: 125,190 sequences, 96 million bases (V0.04, 8/2009, last version).

IMG/VR database: IMG_VR: IMGVR_all_nucleotides.fna.gz. The current version is v4, Sept 2022 (or 6.1): IMG_VR_2022-09-20_6.1 - IMG/VR v4 - high-confidence genomes only (~80Gb). The first version was provided from 7/2018; the legacy version in the test directory is 43 Gb, Oct 2020, v3 (or 5.1).

This includes IMGVR viral contigs with IDS like IMGVR_UViG_2579779064_000006|2579779064|2579849396|1-17041, some genomes from RefSeq with IDs like NC_027986.1, and other entries Gammaproteobacteria_gi_553770258, and UGV-GENOME-3293712.

To interpret the results you will need the information file downloaded from here (requires JGI registration): IMG_VR: IMGVR_all_Sequence_information.tsv

HUVirDB Database downloaded from here : HuVirDB Assemblies opengut.ucsf.edu/HuVirDB-1.0.fasta.gz Cite Soto-Perez et al., 2019, Cell Host & Microbe 26, 325–335 https://doi.org/10.1016/j.chom.2019.08.008

User database:

If you are interested in analyzing SRA sequences, you need to download the relevant sequences from the SRA databases and convert them to FASTA format. The FASTA-formatted sequences can then be used as a "User Database". For more information refer here.
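
As one possible way to do the conversion, the sketch below turns FASTQ records into FASTA. It assumes the SRA reads were downloaded as plain four-line-per-record FASTQ; the function name and the read names in the demonstration are hypothetical.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy reads from a four-line-per-record FASTQ stream to FASTA.

    Returns the number of records written.
    """
    count = 0
    for header in iter(fastq_handle.readline, ""):
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("not a four-line FASTQ record: " + header.strip())
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
        count += 1
    return count


# Small in-memory demonstration with made-up reads.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
out = io.StringIO()
print(fastq_to_fasta(demo, out))
print(out.getvalue(), end="")
```

For real data, open the FASTQ and FASTA files with `open()` and pass the file objects in place of the `io.StringIO` buffers.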

3. Select BLAST parameters

The CRISPRTarget BLASTn parameters favour gapless matches but allow a number of mismatches at this screening stage, using a higher gap penalty (10) than the NCBI default (5).

The default values used by NCBI BLASTn for short sequences <30 bases (defaults for long sequences are in brackets) are:

Gap open -5(-5)
Gap extend -2(-2)
Match +1(+1)
Mismatch -3(-3)
Word size 7(11)
Expect (E): 1000 (10)
Filter: No (Yes)

Blastn-short (noticed 8/2018) now uses +2/-3, which is more similar to the +1/-1 used here. A 30-base exact match scores 60 for blastn-short and 30 for CRISPRTarget.

The initial CRISPRTarget defaults are the same except that a gap is penalised more highly (-10), the mismatch penalty is -1 and the E filter is 1.
In addition, there is no filter or masking for low complexity. BLAST calculates the score over the length of the match, and only shows this
match. For example, a spacer of 32 bases that matches a target in 17 of 20 bases would score 17-3=14 (17 matches at +1 and 3 mismatches at -1),
and only those 20 bases would be output. The expected (E) values of the match will be more likely to pass the filter if smaller databases are
used (e.g. the default phg and plasmid).

Changing BLAST parameters: Please note that only certain combinations of parameters produce valid statistics (others will not work).
With +1/-1 scoring, an attempt to use some gap-cost combinations will fail. The following example shows the allowed parameters:

$ blastn -db database -query myseq -gapopen 1 -gapextend 1 -reward 1 -penalty -1

BLAST engine error: Error: Gap existence and extension values 1 and 1 are not supported for substitution scores 1 and -1

3 and 2 are supported existence and extension values

2 and 2 are supported existence and extension values

1 and 2 are supported existence and extension values

0 and 2 are supported existence and extension values

4 and 1 are supported existence and extension values

3 and 1 are supported existence and extension values

2 and 1 are supported existence and extension values

4 and 2 are supported existence and extension values

Any values more stringent than 4 and 2 are supported (e.g. 10, 2)
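
The combinations in the BLAST message above can be checked before a search is launched. The sketch below simply encodes the listed pairs for +1/-1 scoring; the function name is hypothetical, and reading "more stringent than 4 and 2" as "existence at least 4 and extension at least 2" is an assumption.

```python
# (gap existence, gap extension) pairs copied from the BLAST message above,
# valid for reward +1 and penalty -1.
LISTED_GAP_COSTS = {(3, 2), (2, 2), (1, 2), (0, 2), (4, 1), (3, 1), (2, 1), (4, 2)}


def gap_costs_supported(existence, extension):
    """Return True if the gap costs are expected to be accepted for +1/-1."""
    if (existence, extension) in LISTED_GAP_COSTS:
        return True
    # Assumed reading of "more stringent than 4 and 2".
    return existence >= 4 and extension >= 2


print(gap_costs_supported(1, 1))   # the failing combination from the example
print(gap_costs_supported(10, 2))  # the CRISPRTarget default gap penalty
```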

Suggestion: Useful changes might be:

a. Reducing the gap penalty to 4 or 5 if you have reason to believe that gaps are tolerated in your system.

b. Increasing the E to 10 or 100 in the unlikely event you are not getting hits.

c. Increasing the mismatch penalty to -3 to screen out matches containing mismatches.

6. Set the DB size (effective database size).
This is optional. This should be the total size of the databases you search.

BLAST calculates the E (Expect) value based on the size of the database searched.
If a single search is run against multiple databases, the database size need not be specified, as BLAST calculates it internally.

To compare the significance of matches in two or more consecutive searches of different databases,
this value should be set to the sum of the database sizes (e.g. for 270 Mb + 80 Mb = 350 Mb enter '350000000').
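
The arithmetic in the example can be reproduced with a short helper. The function name is hypothetical; the sizes are the two values used in the text.

```python
def effective_db_size(sizes_in_megabases):
    """Sum database sizes given in Mb and return the total in bases."""
    return sum(int(round(mb * 1_000_000)) for mb in sizes_in_megabases)


# 270 Mb + 80 Mb, as in the example above.
print(effective_db_size([270, 80]))
```

The printed value is the number to enter in the DB size field.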

Once the submit button is pressed, CRISPRTarget shows progress with links to intermediate files.

The log will look like the above picture. Typically, a single CRISPR array with a relatively small number of spacers (say 31, as in the above case) takes just a few seconds. However, the total computation time depends on the number of databases selected as well as the total number of spacer sequences.

All the matches that pass the BLAST filter and score cutoff (e.g. 20) are shown. They can be reordered and scores recalculated.

The protospacer target is extended by extracting the user-specified length of 5' or 3' handle sequences from the BLAST database.

CRISPRTarget interactive scoring: All putative spacer/protospacer targets passing the BLAST screen are displayed in an interactive manner. An initial score is calculated by scoring matches (+1) and mismatches (-1) across the whole length of the spacer without gaps. Specific user-defined 'seed' regions can be required to match at either or both ends of the protospacer.

A match to predefined, or novel user defined, PAM sequences can increase the score. In order to penalise self-matches that would match 100% in both spacers and flanking handles (e.g. to the original genomic array sequence), a score can be used that penalises matches (e.g. -1) in the flanking handles. Mismatch penalties can also be used to identify targeting that is facilitated by mismatches in the handles (e.g. type III-A).

Finally, a cutoff score can be applied to display only those matches with the best scores.
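
The scoring scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not CRISPRTarget's code: the function names are hypothetical, the target and flank are written on the same strand as the spacer and handle (so equal letters stand for a base pair), and the sequences are invented.

```python
def pair_score(seq_a, seq_b, match, mismatch):
    """Score two equal-length sequences position by position, without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


def total_score(spacer, target, handle="", flank="", pam_found=False,
                match=1, mismatch=-1, handle_match=0, handle_mismatch=0,
                pam_score=5):
    """Spacer score plus optional handle score and PAM bonus."""
    score = pair_score(spacer, target, match, mismatch)
    score += pair_score(handle, flank, handle_match, handle_mismatch)
    if pam_found:
        score += pam_score
    return score


spacer = "ACGT" * 8   # 32-base example spacer
handle = "GTTTTAGA"   # 8 bases of adjacent repeat

# Self match: spacer and handle both match perfectly; handle matches cost -1 each.
self_hit = total_score(spacer, spacer, handle, handle, handle_match=-1)

# Foreign target: two spacer mismatches, unrelated flank, PAM present.
foreign_target = "C" + spacer[1:10] + "T" + spacer[11:]
foreign_hit = total_score(spacer, foreign_target, handle, "ACCGTTGA",
                          pam_found=True, handle_match=-1)

print(self_hit, foreign_hit)  # 24 30
```

With the handle penalty switched on, the perfect self match (24) ranks below the imperfect foreign match that carries a PAM (30); both still exceed a cutoff of 20.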

A. Spacer orientation to display:

By default the spacer sequence (topmost in any set) is shown in 5' to 3' orientation, and the protospacer sequence (the target sequence that base-pairs with the spacer sequence) is shown in 3' to 5' orientation. However, the user can choose to display the other strand of the spacer sequence, which brings the other strand of the protospacer sequence into the middle. This option is especially useful when the orientation of the CRISPR array is not known or not certain.

B. Order output based on spacer ID:

A spacer ID consists of three elements, the sequence ID, the CRISPR index and the spacer index, separated by underscores (e.g. EF434469_1_13). By default the output is sorted in descending order of the calculated score. Selecting this option instead arranges the output by spacer ID. This can be very useful when visually inspecting the output, as it maintains the order of the spacers within every CRISPR.
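
A minimal sketch of how such spacer IDs can be split and used as a sort key is given below. The function name is hypothetical; splitting from the right is a design choice so that sequence IDs which themselves contain underscores (e.g. RefSeq accessions) are kept intact.

```python
def parse_spacer_id(spacer_id):
    """Split 'sequenceID_crisprIndex_spacerIndex' into its three elements."""
    sequence_id, crispr_index, spacer_index = spacer_id.rsplit("_", 2)
    return sequence_id, int(crispr_index), int(spacer_index)


print(parse_spacer_id("EF434469_1_13"))

# Sorting on the parsed tuple orders spacers numerically within each CRISPR.
ids = ["EF434469_1_13", "EF434469_1_2", "EF434469_2_1"]
print(sorted(ids, key=parse_spacer_id))
```

Converting the indices to integers matters: a plain string sort would place spacer 13 before spacer 2.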

C. Cutoff score:

The cutoff score is used to filter low-scoring matches out of the output. The default value is 20, but the user can set any cutoff, or no cutoff, to show or hide matches.

D. Spacer match score:

The default values for match and mismatch are +1 and -1 respectively. Together with the cutoff score, these values provide a simple way to push matches with gaps down the order or even omit them from the display. The spacer sequence is shown in the image on the right.

E. Scores for the 3' region of protospacer:

All the parameters shown in the above image are for the 3' handle and its adjacent region. Each of the options is described in detail below:

5' crRNA handle length : The default value is 8, but the user can set the handle length to any value from 0 upwards
(e.g. 100). There is no upper limit, but if the source sequence is not available, the length is adjusted automatically. The
handle sequence is taken from the repeat sequence (unless the handle length is greater than the repeat length).

Score for each base match and mismatch : The default value is 0, but the user can change the values to any positive or negative number
(e.g. match: -1, mismatch: +1). If a handle is present, these values can greatly help in identifying self matches, because in a self match every
base of the spacer's handle sequence will base-pair. Penalizing base pairing in the handle region moves self matches down the order
or even filters them out (using the cutoff score).

Select PAM from the list : The PAM (Protospacer Adjacent Motif) is often the key indicator of a true positive CRISPR target match. It can be used to
identify the targets of known CRISPR systems. The PAM types are shown below:
I-A: NGG
I-B: NGG
I-E: CAT,CTT,CCT,CTC
I-C: GAA
I-F: GG

Give PAM : The user can also supply a PAM motif (e.g. CGT). User-supplied PAMs may contain IUPAC codes. The following nucleotide codes
are supported.

IUPAC_code Base
A Adenine
C Cytosine
G Guanine
T/U Thymine (or Uracil)
R A or G
Y C or T
S G or C
W A or T
K G or T
M A or C
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N any base
./- gap
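
To show how a PAM written with these codes can be compared to a flanking sequence, here is a minimal sketch. It is not CRISPRTarget's implementation: the function name is hypothetical, the gap symbols are left out, and the flank is assumed to start at the base adjacent to the protospacer and to read in the same direction as the PAM.

```python
# Bases allowed by each IUPAC code from the table above (gap symbols omitted).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pam_matches(pam, flank):
    """Return True if the first len(pam) bases of flank fit the IUPAC pattern."""
    pam = pam.upper()
    flank = flank.upper().replace("U", "T")
    if len(flank) < len(pam):
        return False
    return all(base in IUPAC[code] for code, base in zip(pam, flank))


print(pam_matches("NGG", "AGGTC"))         # N accepts any base
print(pam_matches("NGG", "AGTTC"))         # third base is not G
print(pam_matches("TTTYRNNN", "TTTCAGGA")) # Y = C or T, R = A or G
```

A code outside the table raises `KeyError`, which makes mistyped motifs visible instead of silently failing to match.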

PAM match score: The default value is +5, but any positive or negative integer can be used. The PAM score can greatly help in reordering the output
and bringing the true positive target matches to the top. A combination of spacer match/mismatch score, handle match/mismatch score and PAM
match score, along with the cutoff score, can greatly improve the outcome, especially when the output consists of several hundred matches.

Seed require complementarity in the leading bases: This is one of the most important features for directly filtering unsuitable matches
out of the output. A BLAST report may contain hits with only a partial match between spacer and protospacer, and often the match does not start at the
first base of the spacer. For many CRISPR systems, however, the leading bases (adjacent to the PAM) must not contain any mismatch, or a mismatch
may be allowed at one position (e.g. the 5th base) but not at the other leading bases (e.g. no mismatch in bases 1 to 4 and 6 to 8). Researchers with such
known models (properties) can apply the condition right at the very beginning. For this example, the input should be given as below:

"Seed require complementarity in the leading 1-8 bases except base 5 of the spacer and protospacer pair."

Note: to exclude multiple bases, give them as a comma-separated list. For the above example, to exclude bases 3 and 5,
give the input as below:

"Seed require complementarity in the leading 1-8 bases except base 3,5 of the spacer and protospacer pair."

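The seed rule can be expressed as a small check. This is a sketch under assumptions rather than the program's own logic: the function name is hypothetical, both sequences are written so that position 1 is the first leading base, and the target is given on the same strand as the spacer so that equal letters stand for a base pair.

```python
def seed_complementary(spacer, target, seed_length=8, except_bases=()):
    """Return True if all leading bases match, ignoring the excepted positions.

    Positions are 1-based, as in the input field.
    """
    skipped = set(except_bases)
    for position in range(1, seed_length + 1):
        if position in skipped:
            continue
        if spacer[position - 1] != target[position - 1]:
            return False
    return True


spacer = "ACGTACGTAC"
target_mm5 = "ACGTTCGTAC"    # mismatch at base 5 only
target_mm3_5 = "ACTTTCGTAC"  # mismatches at bases 3 and 5

print(seed_complementary(spacer, target_mm5, 8, except_bases=(5,)))
print(seed_complementary(spacer, target_mm5, 8))
print(seed_complementary(spacer, target_mm3_5, 8, except_bases=(3, 5)))
```

The first and third calls accept the pair because every mismatch falls on an excepted position; the second call rejects it because base 5 is then required to match.
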
F. Scores for the 5' region of protospacer:

All the parameters shown in the above image are for the 5' handle and its adjacent region. Each of the options is described in detail below:

3' crRNA handle length : The default value is 8, but the user can set the handle length to any value from 0 upwards
(e.g. 100). There is no upper limit, but if the source sequence is not available, the length is adjusted automatically. The
handle sequence is taken from the repeat sequence (unless the handle length is greater than the repeat length).

Score for each base match and mismatch : The default value is 0, but the user can change the values to any positive or negative number
(e.g. match: -1, mismatch: +1). If a handle is present, these values can greatly help in identifying self matches, because in a self match every
base of the spacer's handle sequence will base-pair. Penalizing base pairing in the handle region moves self matches down the order
or even filters them out (using the cutoff score).

Select PAM from the list : The PAM (Protospacer Adjacent Motif) is often the key indicator of a true positive CRISPR target match. It can be used to
identify the targets of known CRISPR systems. The PAM types are shown below:

II-A: WTTCTNN,TTTYRNNN
II-B: CNCCN,CCN

Give PAM : The user can also supply a PAM motif (e.g. CGT). User-supplied PAMs may contain IUPAC codes. The following nucleotide codes
are supported.

IUPAC_code Base
A Adenine
C Cytosine
G Guanine
T/U Thymine (or Uracil)
R A or G
Y C or T
S G or C
W A or T
K G or T
M A or C
B C or G or T
D A or G or T
H A or C or T
V A or C or G
N any base
./- gap

PAM match score: The default value is +5, but any positive or negative integer can be used. The PAM score can greatly help in reordering the output
and bringing the true positive target matches to the top. A combination of spacer match/mismatch score, handle match/mismatch score and PAM
match score, along with the cutoff score, can greatly improve the outcome, especially when the output consists of several hundred matches.

Seed require complementarity in the leading bases: This is one of the most important features for directly filtering unsuitable matches
out of the output. A BLAST report may contain hits with only a partial match between spacer and protospacer, and often the match does not start at the
first base of the spacer. For many CRISPR systems, however, the leading bases (adjacent to the PAM) must not contain any mismatch, or a mismatch
may be allowed at one position (e.g. the 5th base) but not at the other leading bases (e.g. no mismatch in bases 1 to 4 and 6 to 8). Researchers with such
known models (properties) can apply the condition right at the very beginning. For this example, the input should be given as below:

"Seed requires complementarity in the leading 1-8 bases except base 5 of the spacer and protospacer pair."

Note: to exclude multiple bases, give them as a comma-separated list. For the above example, to exclude bases 3 and 5,
give the input as below:

"Seed requires complementarity in the leading 1-8 bases except base 3,5 of the spacer and protospacer pair."