TSSpredator - A TSS prediction and classification tool

TSSpredator is a tool for the comparative detection of transcription start sites (TSS) from RNA-seq data. It can integrate data from different experimental conditions but also from different organisms on the basis of a multiple whole-genome alignment.

So far TSSpredator has been successfully applied to many data sets generated using the so-called dRNA-seq protocol by Sharma et al., 2010 [1]. In addition, we have started using it on RNA-seq data produced with the protocols by Ettwiller [2] and Innocenti [3]. This UserGuide concentrates on dRNA-seq data; section Other protocols gives a short overview of how to use TSSpredator on data produced with those protocols.


Usage

TSSpredator needs so-called wiggle or graph files in order to predict TSS from RNA-seq data. TSSpredator does not provide methods to compute these from raw fastq reads. To generate them, we recommend using the pipeline READemption [4], which is described in section READemption.

Getting started

Preparing the input files

TSSpredator can be run either to analyse TSS data derived from one bacterial organism in order to compare different conditions, or to analyse data from several bacterial organisms in order to compare different strains/species.

Comparison of different strains/species

For this case, a multiple genome alignment of the genomes studied needs to be precomputed. We recommend computing it with progressiveMauve (available from <http://darlinglab.org/mauve/user-guide/progressivemauve.html>). It saves the alignment in the xmfa file format, which is then loaded into TSSpredator.

For each analysed organism, TSSpredator needs the genome file in fastA format. These need to be the identical fastA files that were used for the multiple genome alignment. In addition, TSSpredator is able to use an annotation file in gff or gtf format for each genome in order to classify the TSS found. Please make sure that the identifier in the header of the annotation file is written after ##Type DNA or ##sequence-region, as in the following example:

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA AJSZ01000001.1
AJSZ01000001.1 RefSeq gene 217 633 . - . ID=STSU_00005;

or

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##sequence-region AJSZ01000001.1
AJSZ01000001.1 RefSeq gene 217 633 . - . ID=STSU_00005;

Identifier coherence

One important step before starting TSSpredator is to check the identifiers used in the input data. The genome fastA file(s), the respective annotation file(s) and all wiggle files should have the same identifier in the header. The following example shows what this can look like:

Genome fasta header:
>NC_000915.1 Helicobacter pylori
Wiggle file:
chrom=NC_000915.1
Annotation file:
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_000915.1
NC_000915.1 RefSeq gene 217 633 . - . ID=NC_000915.1:nusB;

The first column of the annotation file must contain the same identifier as the header, or a part of it, i.e. it must be contained in the header identifier. Note, however, that for further analyses, for example with the Integrated Genome Browser (IGB), the fastA IDs have to be identical to the ones in the first column of the annotation file. If the first column is only a part of the header ID, TSSpredator will run without problems, but problems can occur in downstream analyses. The user can address this after TSS prediction, depending on which downstream analyses are planned.

In the following, typical error messages and warnings are shown. If the identifiers of the three file types do not match, TSSpredator stops and prints the exception shown in Figure 1. This can be the case if an annotation file of another species is used or if the header IDs are not identical.

_images/wrongID.png

Figure 1

If an annotation file is missing, prediction of TSS is still done without classification and all TSS will be classified as orphan. See Figure 2.

_images/noGFF.png

Figure 2

In the case of a multi-contig genome, or of a genome containing a plasmid together with a chromosome, TSSpredator first checks all headers. If headers do not match, warnings about the failed header evaluation are printed in the message area (see Figure 3), but TSSpredator does not stop the TSS prediction.

_images/contigs.png

Figure 3
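The identifier rules of this section can be checked before starting a run. The following Python sketch is not part of TSSpredator; the function names are hypothetical, and it assumes that the fastA identifier is the first word after '>', that wiggle headers carry a chrom=... field, and that annotation columns are separated by whitespace or tabs.

```python
def fasta_ids(lines):
    """Identifier of each fastA header: the first word after '>'."""
    return [ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].strip()]


def wiggle_ids(lines):
    """Value of every chrom=... field found in the wiggle header lines."""
    ids = []
    for ln in lines:
        for field in ln.split():
            if field.startswith("chrom="):
                ids.append(field[len("chrom="):])
    return ids


def annotation_ids(lines):
    """Distinct first-column identifiers of the non-comment annotation lines."""
    ids = []
    for ln in lines:
        if ln.strip() and not ln.startswith("#"):
            first = ln.split()[0]
            if first not in ids:
                ids.append(first)
    return ids


def check_identifiers(fasta_lines, wiggle_lines, annotation_lines):
    """Return a list of problems; an empty list means the identifiers are coherent."""
    headers = fasta_ids(fasta_lines)
    problems = []
    for wid in wiggle_ids(wiggle_lines):
        if wid not in headers:
            problems.append(f"wiggle chrom '{wid}' has no matching fastA header")
    for aid in annotation_ids(annotation_lines):
        # The annotation ID may be the full header ID or a part of it.
        if not any(aid in header for header in headers):
            problems.append(f"annotation ID '{aid}' is not contained in any fastA header")
    return problems
```

Calling check_identifiers with the lines of one genome file, one wiggle file and one annotation file returns an empty list when all three agree.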

Overview

There are two ways to use TSSpredator. The most convenient way is to use its graphical user interface (GUI), which is described in section User Interface. Here, all settings and parameters needed for the prediction can be specified. For a detailed description of the parameters see section Settings and Parameters. After setting up the study, the configuration can be saved. Pressing the RUN button starts the prediction procedure. All results are saved in the specified output folder. The most important result file is the Master Table (MasterTable.tsv). For a detailed description of all result files see section Output.

Another way to use TSSpredator is via its command line interface. This is especially useful for automation or integration into an analysis pipeline. For this, TSSpredator has to be started with a single argument, which is the path of a configuration file (e.g. called ‘config.conf’) as saved by TSSpredator’s GUI. For example:

java -Xmx1G -jar TSSpredator.jar config.conf
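For pipeline integration, the command above can be wrapped in a small script. The following Python sketch only assembles and runs the same java call for several configuration files; the function names and the list of configuration files are hypothetical.

```python
import subprocess


def build_command(jar, config, max_heap="1G"):
    """Assemble the java call shown above for one configuration file."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar, config]


def run_all(jar, configs, max_heap="1G"):
    """Run TSSpredator once per configuration file; stop at the first failure."""
    for config in configs:
        subprocess.run(build_command(jar, config, max_heap), check=True)
```

For example, run_all("TSSpredator.jar", ["config.conf"]) executes the same command as the line above.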

Methods

Normalization

Before the comparative analysis we normalize the expression graph data that is used as input. A percentile normalization step is applied to normalize the graphs from the enriched library. For this the 90th percentile (default, see normalization percentile) of all data values is calculated for each graph of a treated library. This value is then used to normalize this graph as well as the respective graph of the untreated library. Thus, the relative differences between each pair of libraries (treated and untreated) are not changed in this normalization step. All graphs are multiplied with the overall lowest value to restore the original data range. To account for different enrichment rates a further normalization step is applied. During this step a prediction of TSS candidates is performed for each strain/condition. These candidates are then used to determine the median enrichment factor for each library pair (default, see enrichment normalization percentile). Using these medians all untreated libraries are then normalized against the library with the strongest enrichment.
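The first normalization step can be illustrated with a short Python sketch. It is not TSSpredator's implementation: the nearest-rank percentile and the restriction to positive values are assumptions, and the function names are hypothetical.

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile (percent in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_normalize(pairs, percent=90):
    """pairs: list of (treated, untreated) graphs, each a list of heights.

    Each pair is divided by the percentile of its treated graph, then all
    graphs are multiplied by the overall lowest percentile value.
    """
    # Assumption: only positive heights enter the percentile calculation.
    factors = [percentile([v for v in treated if v > 0], percent)
               for treated, _ in pairs]
    lowest = min(factors)
    normalized = []
    for (treated, untreated), factor in zip(pairs, factors):
        scale = lowest / factor
        normalized.append(([v * scale for v in treated],
                           [v * scale for v in untreated]))
    return normalized
```

Because both graphs of a pair are scaled by the same value, the ratio between the treated and the untreated library is unchanged at every position.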

The SuperGenome

To be able to assign TSS that have been detected in different genomes to each other, TSSpredator computes a common coordinate system for the genomes. This is done on the basis of a whole-genome alignment. For the generation of whole-genome alignments the software Mauve can be used, for example. It is able to detect genomic rearrangements and builds multiple whole-genome alignments as a set of collinearly aligned blocks. The resulting xmfa file is then read by TSSpredator and the alignment information in the blocks is used to calculate a joint coordinate system for the aligned genomes and mappings between this coordinate system and the original genomic coordinates. In addition to the cross-genome comparison of detected TSS this allows for an alignment of RNA-seq expression graphs, which can then be visualized in a genome browser. If different experimental conditions are compared with the same genome, the SuperGenome construction step is skipped (see Type of Study setting).
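The idea of a joint coordinate system can be illustrated for a single aligned block. The following Python sketch is a strong simplification and not TSSpredator's algorithm: it assumes one forward-strand block, 1-based coordinates and '-' as gap character, and the function name is hypothetical.

```python
def block_mapping(aligned, starts, super_start=1):
    """Map joint (alignment column) coordinates to genome coordinates.

    aligned: {genome name: gapped sequence}, all of equal length
    starts:  {genome name: 1-based genome position of the first base in the block}
    Returns {genome name: {joint position: genome position}}.
    """
    mapping = {}
    for genome, sequence in aligned.items():
        position = starts[genome]
        columns = {}
        for offset, base in enumerate(sequence):
            if base != "-":            # gap columns have no genome position
                columns[super_start + offset] = position
                position += 1
        mapping[genome] = columns
    return mapping
```

Two TSS from different genomes can then be compared by looking up their joint positions; a column that is a gap in one genome has no counterpart there.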

TSS prediction

The initial detection of TSS in the single strains/conditions is based on the localization of positions where a significant number of reads start. Thus, for each position i in the RNA-seq graph corresponding to the treated library, the algorithm calculates e(i)-e(i-1), where e(i) is the expression height at position i. In addition, the factor of height change is calculated, i.e. e(i)/e(i-1). To evaluate whether the reads starting at this position originate from primary transcripts, the enrichment factor is calculated as e_treated(i)/e_untreated(i). For all positions where these values exceed the thresholds, a TSS candidate is annotated. If the TSS candidate reaches the thresholds in at least one strain/condition, the thresholds are decreased for the other strains/conditions. We declare a TSS candidate to be enriched in a strain/condition if the respective enrichment factor reaches the respective threshold. A TSS candidate has to be enriched in at least one strain/condition and is discarded otherwise. If a TSS candidate does not appear to be enriched in a strain/condition but still reaches the other thresholds, it is only indicated as detected. However, a TSS candidate can only be labeled as detected in a condition if its untreated expression value does not exceed its treated expression value by a factor higher than the chosen processing site factor. Otherwise we consider it to be a processing site. TSS candidates that are in close vicinity (TSS cluster distance) are grouped into a cluster and by default only the TSS candidate with the highest expression is kept (see cluster method). The final TSS annotations are then characterized with respect to their occurrence in the different strains/conditions and the strains/conditions in which they appear to be enriched. The TSS are then further classified according to their location relative to annotated genes. For this we use a classification scheme similar to the one previously described [1].

Thus, for each TSS it is decided whether it is the primary or secondary TSS of a gene, an internal TSS, an antisense TSS, or whether it cannot be assigned to one of these classes (orphan). A TSS is classified as primary or secondary if it is located upstream of a gene and not further away than the chosen UTR length. The TSS with the strongest expression considering all conditions is classified as primary. All other TSS that are assigned to the same gene are classified as secondary. Internal TSS are located within an annotated gene on the sense strand, and antisense TSS are located inside a gene or within a chosen maximal distance (antisense UTR length) on the antisense strand. These assignments are indicated by a 1 in the respective column of the MasterTable. Orphan TSS, which are not in the vicinity of an annotated gene, are indicated by zeros in all four columns.
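The three quantities used for the initial detection can be computed as in the following Python sketch. It covers a single strain/condition only and omits the threshold reduction, clustering and classification; the function name, the use of "greater than or equal" comparisons and the handling of zero values are assumptions.

```python
def tss_candidates(treated, untreated, step_height, step_factor, enrichment_factor):
    """Return the indices i at which all three thresholds are reached.

    treated / untreated: expression heights e(i) of one strand, same length.
    """
    hits = []
    for i in range(1, len(treated)):
        previous, current = treated[i - 1], treated[i]
        height = current - previous                      # e(i) - e(i-1)
        # Assumption: a step out of zero background counts as an infinite factor.
        factor = current / previous if previous > 0 else float("inf")
        enrichment = (current / untreated[i]             # e_treated(i) / e_untreated(i)
                      if untreated[i] > 0 else float("inf"))
        if (height >= step_height and factor >= step_factor
                and enrichment >= enrichment_factor):
            hits.append(i)
    return hits
```

With treated = [0, 0, 10, 10, 10, 12, 12], untreated = [0, 0, 2, 2, 2, 12, 12] and illustrative thresholds of 5, 2 and 2, only index 2 qualifies: the step at index 5 is too small and not enriched.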

User Interface

See Figure 4 for the explanation of the GUI of TSSpredator.

Study setup In the study setup area (Figure 4 A) general settings for the study can be made. Most importantly these are the type of the study (comparison of strains (requires alignment file) or conditions), the number of genomes/conditions and replicates, and the path to the output directory. A project name can also be specified.

Parameter area In the parameter area (Figure 4 B) specific parameters of the TSS prediction procedure can be changed. Instead of changing individual parameters it is also possible to select a parameter preset from the drop-down menu.

Genome/Condition related settings For each genome/condition of the study a tab is generated (Figure 4 C), in which settings specific for this genome/condition can be made. This includes the name and ID, and file paths to the genomic sequence (FASTA) and the genome annotation (GFF). In addition, for each replicate a tab is displayed within the respective genome/condition tab, where the RNA-seq wiggle files of the replicate can be entered. If only fastq files are available instead of wiggle files, it is recommended to use the READemption pipeline to generate them (see section READemption).

Message Area In the message area (Figure 4 D) information about a running prediction process is displayed. Thus, it can be easily determined in which step a running prediction procedure currently is. At the end of the procedure a brief summary is shown.

Load/Save Configuration Using the Save or Load button (Figure 4 E) a configuration including all settings can be saved, or a previously saved configuration can be loaded, respectively. Loading a configuration overwrites all current settings.

Run prediction By pressing the RUN button (Figure 4 F) the prediction procedure is started using the current settings and parameters. Information about the running process is displayed in the message area. The running prediction can be canceled using the Cancel button. Note that this might result in incomplete output files.

images/TSSpredatorgui.png

Figure 4

Settings and Parameters

Study Setup

Project Name Enter a name for the study.

Type of Study Choose between Comparison of different conditions and Comparison of different strains/species. For a cross-strain analysis an alignment file has to be provided (see below). In addition, in each genome tab an individual genomic sequence and genome annotation have to be set. When comparing different conditions no alignment file is needed and the genomic sequence and genome annotation of the organism have to be set in the first genome tab only.

Number of Genomes Set the number of different strains/conditions in the study. Press the ‘Set’ button to generate a settings tab for each strain/condition and each replicate of the study.

Number of Replicates Set the number of different replicates for each strain/condition. Press the ‘Set’ button to generate a settings tab for each strain/condition and each replicate of the study.

Output Data Path Select the folder in which all result files will be placed.

Alignment File Select the xmfa alignment file containing the aligned genomes. If the study compares different conditions, this field is inactive.

TSS prediction parameters

In the following, the parameters affecting the TSS prediction procedure are described. Instead of changing the parameters manually, it is also possible to select predefined parameter sets using the parameter presets drop-down menu.

step height This value relates to the minimal number of read starts at a certain genomic position to be considered as a TSS candidate. To account for different sequencing depths this is a relative value based on the 90th percentile of the expression height distribution. A lower value results in a higher sensitivity.

step height reduction When comparing different strains/conditions and the step height threshold is reached in at least one strain/condition, the threshold is reduced for the other strains/conditions by the value set here. A higher value results in a higher sensitivity. Note that this value must be smaller than the step height threshold.

step factor This is the minimal factor by which the TSS height has to exceed the local expression background. This feature makes sure that a TSS candidate has to show a higher expression in regions of locally high expression than would be necessary in regions where no expression background is detected. A lower value results in a higher sensitivity. Set this value to 1 to disable the consideration of the local expression level.

step factor reduction When comparing different strains/conditions and the step factor threshold is reached in at least one strain/condition, the threshold is reduced for the other strains/conditions by the value set here. A higher value results in a higher sensitivity. Note that this value must be smaller than the step factor threshold.

enrichment factor The minimal enrichment factor for a TSS candidate. The threshold has to be exceeded in at least one strain/condition. If the threshold is not exceeded in another condition the TSS candidate is still marked as detected but not as enriched in this strain/condition. A lower value results in a higher sensitivity. Set this value to 0 to disable the consideration of the enrichment factor.

processing site factor The maximal factor by which the untreated library may be higher than the treated library and above which the TSS candidate is considered as a processing site and not annotated as detected. A higher value results in a higher sensitivity.

step length Minimal length of the TSS related expression region (in base pairs). This value depends on the length of the reads that are stacking at the TSS position. In most cases this feature can be disabled by setting it to ‘0’. However, it can be useful if RNA-seq reads have been trimmed extensively before mapping.

base height This value relates to the minimal number of reads in the non-enriched library that start at the TSS position. This feature is disabled by default.
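How the enrichment factor and the processing site factor interact can be sketched as follows. This is an illustration of the rules described above for a candidate that has already reached the step thresholds, not TSSpredator's code; the function name and the treatment of an untreated value of zero as enriched are assumptions.

```python
def tss_status(treated, untreated, enrichment_factor, processing_site_factor):
    """Label a candidate as 'enriched', 'detected' or 'processing site'."""
    # Assumption: no untreated signal at all counts as enriched.
    if untreated <= 0 or treated / untreated >= enrichment_factor:
        return "enriched"
    # Untreated exceeds treated by more than the processing site factor.
    if untreated > treated * processing_site_factor:
        return "processing site"
    return "detected"
```

With an illustrative enrichment factor of 2.0 and processing site factor of 1.5, the value pairs (10, 2), (10, 8) and (4, 10) are labeled enriched, detected and processing site, respectively.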

Normalization Settings

normalization percentile By default a percentile normalization is performed on the RNA-seq data. This value defines the percentile that is used as a normalization factor. Set this value to ‘0’ to disable normalization.

enrichment normalization percentile By default a percentile normalization is performed on the enrichment values. This value defines the percentile that is used as a normalization factor. Set this value to ‘0’ to disable normalization.

Output options

write RNA-seq graphs If this option is enabled, the normalized RNA-seq graphs are written into the output folder. Disable this option, if the normalized graphs are not needed or if they have been written before. Note that writing the graphs will increase the runtime.

TSS Clustering Settings

cluster method TSS candidates in close vicinity are clustered and only one of the candidates is kept. HIGHEST keeps the candidate with the highest expression. FIRST keeps the candidate that is located most upstream.

TSS clustering distance This value determines the maximal distance (in base pairs) between TSS candidates to be clustered together. Set this value to ‘0’ to disable clustering.
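The two cluster methods can be sketched in Python as follows. The grouping rule (a candidate joins a cluster if it lies within the clustering distance of the previous candidate) is an assumption, as are the function name and the strand argument.

```python
def cluster_tss(candidates, distance, method="HIGHEST", strand="+"):
    """Keep one candidate per cluster.

    candidates: list of (position, expression) tuples on one strand.
    """
    ordered = sorted(candidates)
    if distance <= 0:                  # clustering disabled
        return ordered
    clusters = []
    for candidate in ordered:
        if clusters and candidate[0] - clusters[-1][-1][0] <= distance:
            clusters[-1].append(candidate)
        else:
            clusters.append([candidate])
    kept = []
    for cluster in clusters:
        if method == "HIGHEST":
            kept.append(max(cluster, key=lambda c: c[1]))
        else:                          # FIRST: the most upstream candidate
            kept.append(cluster[0] if strand == "+" else cluster[-1])
    return kept
```

Note that the most upstream candidate is the one with the lowest position on the plus strand and the one with the highest position on the minus strand.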

Comparative Settings

allowed cross-genome/condition shift This is the maximal positional difference (bp) for TSS candidates from different strains/conditions to be assigned to each other.

allowed cross-replicate shift This is the maximal positional difference (bp) for TSS candidates from different replicates to be assigned to each other.

matching replicates This is the minimal number of replicates in which a TSS candidate has to be detected. A lower value results in a higher sensitivity.

Classification Settings

UTR length The maximal upstream distance (in base pairs) of a TSS candidate from the start codon of a gene that is allowed to be assigned as a primary or secondary TSS for that gene.

antisense UTR length The maximal distance (in base pairs) upstream of the start or downstream of the end of a gene within which a TSS candidate in antisense orientation to that gene is still assigned as an antisense TSS for that gene. A TSS located inside the coding region on the antisense strand is also annotated as an antisense TSS.
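The role of the two classification settings can be illustrated with the following Python sketch for a single TSS and a single gene. It is a simplified reading of the rules above, not TSSpredator's implementation; the function names are hypothetical, coordinates are assumed to satisfy gene_start <= gene_end, and a TSS exactly on a gene boundary is treated as internal.

```python
def classify_tss(tss_pos, tss_strand, gene_start, gene_end, gene_strand,
                 utr_length, antisense_utr_length):
    """Return 'upstream', 'internal', 'antisense' or 'orphan' for one gene."""
    inside = gene_start <= tss_pos <= gene_end
    if tss_strand == gene_strand:
        if inside:
            return "internal"
        # Distance to the gene start, measured in the direction of transcription.
        upstream = gene_start - tss_pos if gene_strand == "+" else tss_pos - gene_end
        if 0 < upstream <= utr_length:
            return "upstream"          # candidate for primary or secondary
        return "orphan"
    if gene_start - antisense_utr_length <= tss_pos <= gene_end + antisense_utr_length:
        return "antisense"
    return "orphan"


def split_primary_secondary(upstream_tss):
    """upstream_tss: list of (position, expression) assigned to the same gene."""
    primary = max(upstream_tss, key=lambda t: t[1])   # strongest expression
    secondary = [t for t in upstream_tss if t is not primary]
    return primary, secondary
```

For the gene at 217 to 633 on the minus strand from the annotation example, and illustrative settings of 300 and 100, a minus-strand TSS at 650 is an upstream TSS, one at 400 is internal, and a plus-strand TSS at 700 is antisense.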

Genome specific Settings

Name Brief unique name for this strain/condition, which can be freely chosen. As this name is also used in some filenames any special characters (including spaces) should be avoided.

Alignment ID The identifier of this genome in the alignment file. If Mauve was used to align the genomes, the identifiers are simply numbers assigned to the genomes in the order in which they were chosen as input in Mauve.

The first lines of the alignment file should also contain this information:

#FormatVersion Mauve1
#Sequence1File genomeA.fa
#Sequence1Format FastA
#Sequence2File genomeB.fa
#Sequence2Format FastA

In this example ‘genomeA’ has ID 1 and ‘genomeB’ has ID 2. When loading an alignment file (xmfa) TSSpredator tries to set the alignment IDs automatically.
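The alignment IDs can also be read directly from the header lines shown above. The following Python sketch relies only on the #Sequence<N>File lines of that example; the function name is hypothetical.

```python
import re


def alignment_ids(lines):
    """Return {alignment ID: file name} from the header of an xmfa file."""
    ids = {}
    for line in lines:
        match = re.match(r"#Sequence(\d+)File\s+(\S+)", line)
        if match:
            ids[int(match.group(1))] = match.group(2)
    return ids
```

Applied to the five header lines of the example, this returns ID 1 for genomeA.fa and ID 2 for genomeB.fa.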

genome FASTA FASTA file containing the genomic sequence of this genome.

genome annotation GFF/GTF file containing genomic annotations for this genome (as can be downloaded from NCBI).

output ID The specified output ID defines which gene tag in the attributes column (in the provided gff/gtf annotation file) should be used for TSS classification. Examples are locus_tag or gene_id.

Graph Files

enriched plus Select the file containing the RNA-seq expression graph for the plus strand (forward) from the 5’ enrichment library.

enriched minus Select the file containing the RNA-seq expression graph for the minus strand (reverse) from the 5’ enrichment library.

normal plus Select the file containing the RNA-seq expression graph for the plus strand (forward) from the library without 5’ enrichment.

normal minus Select the file containing the RNA-seq expression graph for the minus strand (reverse) from the library without 5’ enrichment.

Output

Master Table (MasterTable.tsv)

This table contains information on positions and class assignments of all automatically annotated TSS. The table consists of the following columns:

SuperPos The position of the TSS in the SuperGenome.

SuperStrand The strand of the TSS in the SuperGenome.

MapCount Number of strains into which the TSS can be mapped. Separate entry lines exist for each strain to which the TSS can be mapped whether the TSS was detected in that strain or not.

detCount The number of strains/conditions in which this TSS was detected in the RNA-seq data.

Condition The identifier of the strain/condition to which the rest of the line relates.

detected Contains a ‘1’ if the TSS was detected in this strain/condition.

enriched Contains a ‘1’ if the TSS is enriched in this strain/condition.

stepHeight The expression height change at the position of the TSS. This relates to the number of reads starting at this position. (e(i) - e(i-1); e(i): expression height at position i)

stepFactor The factor of height change at the position of the TSS. (e(i)/e(i-1); e(i): expression height at position i)

enrichmentFactor The enrichment factor at the position of the TSS.

classCount The number of classes to which this TSS was assigned.

Pos Position of the TSS in that genome.

Strand Strand of the TSS in that genome.

Locus tag The locus tag of the gene to which the classification relates.

Product The product description of this gene.

UTRlength The length of the untranslated region between the TSS and the respective gene (nt). (Only applies to ‘primary’ and ‘secondary’ TSS.)

GeneLength The length of the gene (nt).

Primary Contains a ‘1’ if the TSS was classified as ‘primary’ with respect to the gene stated in ‘locusTag’.

Secondary Contains a ‘1’ if the TSS was classified as ‘secondary’ with respect to the gene stated in ‘locusTag’.

Internal Contains a ‘1’ if the TSS was classified as ‘internal’ with respect to the gene stated in ‘locusTag’.

Antisense Contains a ‘1’ if the TSS was classified as ‘antisense’ with respect to the gene stated in ‘locusTag’.

Automated Contains a ‘1’ if the TSS was detected automatically.

Manual Contains a ‘1’ if the TSS was annotated manually.

Putative sRNA Contains a ‘1’ if the TSS might be related to a novel sRNA. (Not evaluated automatically)

Putative asRNA Contains a ‘1’ if the TSS might be related to an asRNA.

Sequence -50 nt upstream + TSS (51nt) Contains the base of the TSS and the 50 nucleotides upstream of the TSS.
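Since the Master Table is a tab-separated file, it can be filtered with a few lines of Python. The sketch below assumes that the header row uses the column names exactly as listed above (Condition, enriched, Pos, Strand, Primary); the function name and the two example rows are invented for illustration.

```python
import csv
import io


def primary_enriched_tss(handle, condition):
    """Return (position, strand) of all enriched primary TSS of one condition."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (int(row["Pos"]), row["Strand"])
        for row in reader
        if row["Condition"] == condition
        and row["Primary"] == "1"
        and row["enriched"] == "1"
    ]


# Invented two-row table, only to make the sketch runnable.
example = io.StringIO(
    "Condition\tenriched\tPos\tStrand\tPrimary\n"
    "WT\t1\t650\t-\t1\n"
    "WT\t0\t1200\t+\t1\n"
)
print(primary_enriched_tss(example, "WT"))
```

For a real run, the handle would be the opened MasterTable.tsv from the output folder.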

Supplemental Files

*strain*_super.fa Contains the genome sequence of each strain mapped to the coordinate system of the SuperGenome. All 4 files together actually contain the whole-genome alignment. These files can be used in genome browsers that allow the user to load several sequences simultaneously.

*strain*_super.gff  Contains the gene annotations of each strain mapped to the coordinate system of the SuperGenome.

*strain*_super*Type*Strand.gr Contains the xy-graphs of each strain mapped to the coordinate system of the SuperGenome. Type is either ‘FivePrime’ (treated) or ‘Normal’ (untreated). Strand is either ‘Plus’ or ‘Minus’. Note that the files now contain the value 0.0001 instead of 0 as a value of 0 (i.e. no entry line) now indicates a gap. This is necessary for IGB’s thresholding feature (see below).

superTSS.gff  Contains all TSS predicted in the four strains in the coordinate system of the SuperGenome. Also all TSS that were only predicted in one strain are listed. The information in how many strains (and in which) a TSS was detected is given in superClasses.tsv. In the header line all parameter names and values which are used for the run are reported.

TSSstatistics.tsv Contains some general statistics about the TSS prediction results.

Other protocols

Cappable-Seq

Cappable-seq was developed by Ettwiller et al. in 2016 for directly capturing the 5’ end of primary transcripts [2]. Besides the Cappable-seq library, a control library is prepared in which the streptavidin capture step is omitted. TSSpredator can be run with the Cappable-seq library as the enriched library and the control library as the normal (non-enriched) library.

TagRNA-Seq

TagRNA-seq is a modified RNA-seq method based on the differential labelling of 5’ RNA ends, which enables the discrimination of primary from processed 5’ RNA ends. Details of the method can be found in the paper by Innocenti, N. et al. [3]. The 5’ RNA ends are differentially labeled using distinct sequence tags for processed start sites (PSS) and transcription start sites (TSS). Prior to alignment, the reads can therefore be sorted by their tag sequences. To use this data with TSSpredator, the TSS reads should be used as the enriched library and either the PSS reads or the PSS and unassigned reads as the normal (non-enriched) library. Which of the two is better suited as the non-enriched library has not been confirmed yet.

READemption

READemption [4] is a pipeline for the computational evaluation of RNA-Seq data. It was originally developed to process dRNA-Seq reads (as introduced by Sharma et al. [1]) originating from bacterial samples. Meanwhile it has been extended to process data generated in different experimental setups. An elaborate description of all current features of READemption is available at <https://reademption.readthedocs.io/en/latest/#>. In this section we describe how to use READemption to produce wiggle data from raw reads generated with the dRNA-seq protocol.

Comparison of different conditions

As already explained in section installation, TSSpredator is able to analyse genomes consisting of multiple contigs, of a chromosome with one or more plasmids, or of several chromosomes. After preparing the input data as explained, one can create a READemption folder with the following command:

reademption create -f TestAnalysis

After the creation you will get a message asking you to copy your references and reads into the folders TestAnalysis/input/reference_sequences and TestAnalysis/input/reads/, respectively. For the multi-fastA case, the sequences of all contigs, chromosomes or plasmids should be in one fastA file. After all needed files are copied, the mapping can be started with the subcommand align.

Please check the website for more information about the parameters that can be set. In this step all provided reads undergo a processing step, and the processed reads are mapped to the genome (in the multi-fastA case to each entry) using segemehl [5]. For generating wiggle files the subcommand coverage has to be used:

reademption coverage -f TestAnalysis

Using the bam files generated by the subcommand align, one-base coverages are calculated for each sample, separately for the forward and the reverse strand. Positions with zero coverage are not listed in the wiggle files. The coverage step creates three folders: one with raw coverages and two with coverages normalized by different factors. For TSSpredator we use the wiggle files contained in the coverage-tnoar-min-normalized folder.

Comparison of different strains/species

For the analysis of different strains or species, READemption should be run for each strain/species separately. As the normalization should be done over all wiggle files from all strains/species, you have to perform the normalization yourself. We recommend taking the wiggle files from the folder coverage-tnoar-mil-normalized, dividing the coverage column of all files by 1,000,000 and multiplying it by the lowest number of aligned reads of all considered libraries. The lowest number can be found by comparing the file names of the wiggle files. After normalization the wiggle files can be used by TSSpredator as described before.
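The manual normalization across strains/species can be scripted. The following Python sketch rescales the coverage column of one wiggle file; it assumes that data lines consist of a position and a value separated by whitespace and that all other lines are header lines to be kept unchanged. The function name is hypothetical.

```python
def rescale_wiggle(lines, lowest_aligned_reads):
    """Divide every coverage value by 1,000,000 and multiply it by the lowest
    number of aligned reads of all considered libraries."""
    factor = lowest_aligned_reads / 1_000_000
    rescaled = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0].isdigit():
            # Data line: position and coverage value.
            rescaled.append(f"{fields[0]} {float(fields[1]) * factor}")
        else:
            # Track or variableStep header line: keep as is.
            rescaled.append(line.rstrip("\n"))
    return rescaled
```

The same lowest_aligned_reads value has to be used for every wiggle file of every strain/species in the study.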

[4] Konrad U. Förstner, Jörg Vogel, Cynthia M. Sharma, “READemption – A tool for the computational analysis of deep-sequencing-based transcriptome data”, Bioinformatics (2014), Aug 13.

[5] Hoffmann S, Otto C, Kurtz S, Sharma CM, Khaitovich P, Vogel J, Stadler PF, Hackermueller J, “Fast mapping of short sequences with mismatches, insertions and deletions using index structures”, PLoS Comput Biol (2009) vol. 5 (9) pp. e1000502.

HowTo: example analysis with S. aureus

Quick step-by-step guide to use TSSpredator with the example data provided here. This data is the S. aureus data comparing wildtype and a knockout mutant.

  1. Open a Terminal and navigate to the folder with the TSSpredator jar file. Type:

java -jar TSSpredator-1.1beta.jar

You can also try to start TSSpredator by a double-click on the TSSpredator-1.1beta.jar file. You will be asked for the memory to be allocated. Click the big button for an automated selection or choose a value. You can now choose to load the config file called TSSpredator_Saureus.config that is provided and continue with step 8 or you continue with the following steps 2-7. If you want to use the config file you need to adjust the paths in the config file to the correct paths on your computer. This can easily be done with a text editor.

  2. Choose an output folder. In the upper right corner of the TSSpredator window you can choose an output folder, where all result files will be put. You can choose the results folder in example-data/Saureus-2Cond-2Repl/Archive/.

  3. Set the number of replicates. In the upper left corner of the TSSpredator window you can set the number of replicates. For this example data, set this number to 2 and press Set.

  4. You can also give a name to the project in the upper left corner.

  5. Load genomes, annotations. In the lower part of the TSSpredator window you will find a tab for each strain (genome) that was loaded from the alignment. In each tab load the respective files for the genomic sequence (fasta) and the genome annotation (GFF). The files can be found in example-data/Saureus-2Cond-2Repl/Archive/ in the genome_anno folder.

  6. Load the RNA-seq data. In each tab you will also find two tabs for the RNA-seq graph files (one for each replicate). The respective (wiggle) graph files can be found in the example-data/Saureus-2Cond-2Repl/Archive/S_aureus_coverages folder. Each file name from the wildtype is set up as GM_SA_WT[2,3]_[plus,minus]_*_[forward,reverse]_in_NC_009641.1.gr. Each file name from the knockout is set up as GM_SA_rny[1,2]_[plus,minus]_*_[forward,reverse]_in_NC_009641.1.gr and can thus be easily identified. For example: GM_SA_rny1_minus_TEX.fa_div_by_6538401.0_multi_by_5727088.0_forward.wig_GM_SA_rny1_minus_TEX.fa_forward_in_NC_009641.1 denotes the graph file of the rny mutant, first replicate of the normal (non-enriched) library mapped to the forward strand.

  7. Save the configuration. The configuration is now complete using standard parameters. If you want, you can save the configuration for later use by clicking the Save button in the lower left corner.

  8. Start the detection procedure. Start the process by clicking the RUN button in the lower right corner.

  9. View the results. An overview statistic will be shown at the end of the process in the message area (lower part of the TSSpredator window). Detailed results can be found in the output folder. The MasterTable contains detailed information on all predicted TSS. In TSSstatistics.tsv you can find a more detailed overview of TSS classifications.

HowTo: example analysis with Campylobacter data

Quick step-by-step guide to use TSSpredator with the example data provided here. The example data is the multistrain Campylobacter data from Dugar et al., PLOS Genetics 9(5), 2013.

  1. Open a Terminal and navigate to the folder with the TSSpredator jar file. Type:

java -jar TSSpredator-1.1beta.jar

You can also try to start TSSpredator by double-clicking the TSSpredator-1.1beta.jar file. You will be asked how much memory to allocate. Click the big button for an automated selection or choose a value. (For the full example data, 4 strains with 2 replicates each, it is recommended to start the software with at least 1 GB RAM.) You can now either load the provided config file Campy.config and continue with step 8 (Start the detection procedure), or work through the configuration steps below. If you use the config file, you first need to adjust the paths in it to the correct paths on your computer. This can easily be done with a text editor.
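
If you prefer to script the path adjustment instead of using a text editor, a plain search-and-replace is usually enough. The following Python sketch is only an illustration: it treats the config file as plain text and makes no assumption about its internal format, and the function names and both path arguments are hypothetical placeholders that you need to replace with the prefix found in the provided file and the location of the example data on your computer.

```python
from pathlib import Path


def rewrite_paths(config_text, old_root, new_root):
    """Replace every occurrence of old_root with new_root in the config text."""
    return config_text.replace(old_root, new_root)


def update_config(config_file, old_root, new_root):
    """Rewrite the config file in place and return how many occurrences were replaced."""
    path = Path(config_file)
    text = path.read_text()
    count = text.count(old_root)
    path.write_text(rewrite_paths(text, old_root, new_root))
    return count
```

A call such as update_config("Campy.config", "/old/prefix", "/new/prefix") then updates all paths in one go. It is a good idea to keep a copy of the original file first, since the sketch overwrites it.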

  1. Select an alignment file. In the upper right corner of the TSSpredator window you can choose an alignment file. Choose alignment.xmfa, which is in the example-data/Campy-Data-multipleStrains folder. You will be asked if the names and the number of genomes should be read from the file: click Yes.

  2. Choose an output folder. In the upper right corner of the TSSpredator window you can choose an output folder, where all result files will be put. You can choose the results folder in example-data/Campy-Data-multipleStrains.

  3. Set the number of replicates. In the upper left corner of the TSSpredator window you can set the number of replicates. For this example data, set this number to 2 and press Set.

  4. You can also give a name to the Project in the upper left corner.

  5. Load genomes and annotations. In the lower part of the TSSpredator window you will find a tab for each strain (genome) that was loaded from the alignment. In each tab, load the respective files for the genomic sequence (fasta) and the genome annotation (GFF). The files can be found in example-data/Campy-Data-multipleStrains in the fasta and gff folders, respectively.

  6. Load the RNA-seq data. In each tab you will also find two tabs for the RNA-Seq Graph Files (one for each replicate). The respective (wiggle) graph files can be found in example-data/Campy-Data-multipleStrains in the graphs folder. Each file name is set up as StrainName_R[1,2]_[enriched,normal]_RefseqID_[plus,minus].gr and can thus be easily identified. For example, 81-176_R1_enriched_NC_008787_minus.gr denotes the graph file of strain 81-176 with the RefSeq-ID NC_008787, first replicate of the enriched library, mapped to the minus strand.

  7. Save the configuration. The configuration is now complete using standard parameters. If you want, you can save the configuration for later use by clicking the Save button in the lower left corner.

  8. Start the detection procedure. Start the process by clicking the RUN button in the lower right corner.

  9. View the results. An overview statistic will be shown at the end of the process in the message area (lower part of the TSSpredator window).

    Detailed results can be found in the output folder.

    The MasterTable contains detailed information on all predicted TSS. In TSSstatistics.tsv you can find a more detailed overview of TSS classifications.
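
With four strains and two replicates each, it can help to sort the graph files by strain and replicate before loading them in step 6. The following Python sketch groups file names according to the naming scheme StrainName_R[1,2]_[enriched,normal]_RefseqID_[plus,minus].gr described above. It is not part of TSSpredator; the regular expression and the function name group_graph_files are assumptions based on that scheme.

```python
import re
from collections import defaultdict

# Assumed pattern, derived from the naming scheme described in step 6.
CAMPY_NAME = re.compile(
    r"^(?P<strain>.+)_R(?P<replicate>\d+)_(?P<library>enriched|normal)"
    r"_(?P<refseq>.+)_(?P<strand>plus|minus)\.gr$"
)


def group_graph_files(names):
    """Group graph file names by (strain, replicate).

    Each group maps (library, strand) to the matching file name.
    """
    grouped = defaultdict(dict)
    for name in names:
        match = CAMPY_NAME.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        key = (match["strain"], int(match["replicate"]))
        grouped[key][(match["library"], match["strand"])] = name
    return dict(grouped)
```

For instance, the file 81-176_R1_enriched_NC_008787_minus.gr ends up in the group ("81-176", 1) under the key ("enriched", "minus"), which corresponds to the first replicate tab of strain 81-176.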

Note

This project is under active development.

1

Cynthia M Sharma, Steve Hoffmann, Fabien Darfeuille, Jérémy Reignier, Sven Findeiß, Alexandra Sittka, Sandrine Chabas, Kristin Reiche, Jörg Hackermüller, Richard Reinhardt, et al. “The primary transcriptome of the major human pathogen Helicobacter pylori”. Nature, 464(7286):250, 2010.

2

Laurence Ettwiller, John Buswell, Erbay Yigit, and Ira Schildkraut. “A novel enrichment strategy reveals unprecedented number of novel transcription start sites at single base resolution in a model prokaryote and the gut microbiome”. BMC Genomics, 17(1):1-14, 2016.

3

Nicolas Innocenti, Monica Golumbeanu, Aymeric Fouquier d’Hérouel, Caroline Lacoux, Rémy A Bonnin, Sean P Kennedy, Françoise Wessner, Pascale Serror, Philippe Bouloc, Francis Repoila, et al. “Whole-genome mapping of 5′ RNA ends in bacteria by tagged sequencing: a comprehensive view in Enterococcus faecalis”. RNA, 21(5):1018-1030, 2015.