Training: Command Line

In silico serotyping

Install SeroBA (Epping et al. 2018) as per the instructions at https://github.com/sanger-pathogens/seroba#installation, then obtain the database by cloning the repository with git clone https://github.com/sanger-pathogens/seroba.git.

Files required to run serotyping using SeroBA:

  1. paired-end fastq files
  2. database
  3. sample list (only for running on multiple samples)

Run in silico serotyping on a single sample:

seroba runSerotyping <full path to the database> <read 1> <read 2> <output folder prefix>

Run in silico serotyping on multiple samples:

  1. Create a list of sample names, one per line, and save it as samplelist (e.g. the sample name for 24371_8#283_1.fastq.gz is 24371_8#283).
  2. Run SeroBA on each sample in the list:
     for f in $(cat samplelist); do seroba runSerotyping <path to the database> "${f}_1.fastq.gz" "${f}_2.fastq.gz" "${f}"; done
  3. Summarise the results of all samples in the current folder:
     seroba summary ./
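
The sample list in step 1 can be generated from the read files rather than typed by hand. The following is a minimal sketch, assuming the reads follow the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming shown above; the function name make_samplelist is hypothetical and not part of SeroBA.

```shell
# Print one sample name per line for every complete read pair in a folder.
# Assumption: reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
make_samplelist() {
    local reads_dir="$1"
    local r1 sample
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$r1" ] || continue
        sample="$(basename "$r1" _1.fastq.gz)"
        if [ -e "$reads_dir/${sample}_2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            # Samples without a second read file are reported and left out.
            printf 'warning: no read 2 for %s, skipping\n' "$sample" >&2
        fi
    done
}
```

For example, make_samplelist . > samplelist writes the list for the current folder.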

Output:
summary.tsv
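
A quick way to review the results is to count how many samples received each serotype. This sketch assumes summary.tsv is tab-separated with the sample name in column 1 and the serotype in column 2, and that it has no header row; check the file from your SeroBA version before relying on it. The function name tally_serotypes is hypothetical.

```shell
# Print "<count><TAB><serotype>", most frequent serotype first.
# Assumption: tab-separated input, serotype in column 2, no header row.
tally_serotypes() {
    awk -F '\t' '{ count[$2]++ }
        END { for (s in count) print count[s] "\t" s }' "$1" |
        sort -k1,1nr -k2,2
}
```

For example, tally_serotypes summary.tsv.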


GPSC assignment

Install popPUNK (Lees et al. 2018) as per the instructions at https://poppunk.readthedocs.io/en/latest/installation.html. Download the GPS reference database “GPS_query.tar.bz2” (Database link) and the GPSC designations “gpsc_definitive.csv” (CSV link) from this page.

Files required to run GPSC assignment using popPUNK:

  1. queries.txt: a list of paths to assemblies you wish to assign GPSCs to
  2. GPS_query: GPS reference database, obtained by uncompressing GPS_query.tar.bz2
  3. gpsc_definitive.csv: Published GPSC designations for the references
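
The queries.txt file can be built from a folder of assemblies. This is a minimal sketch, assuming one assembly per file with a common extension (fasta by default); the function name make_queries and the default extension are illustrative assumptions, not part of popPUNK.

```shell
# Print the path of every assembly in a folder, one per line.
# Arguments: <assemblies folder> [file extension, default: fasta]
make_queries() {
    local assemblies_dir="$1"
    local ext="${2:-fasta}"
    local f
    for f in "$assemblies_dir"/*."$ext"; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$f" ] || continue
        printf '%s\n' "$f"
    done
}
```

For example, make_queries /path/to/assemblies fa > queries.txt. The database archive can be uncompressed with tar -xjf GPS_query.tar.bz2.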

The output directory name is set using --output.
The number of threads can be changed using --threads.

Run GPSC assignment:

poppunk --assign-query --ref-db GPS_query --distances GPS_query/GPS_query.dists --model-dir GPS_query --q-files queries.txt --output GPSC_assignment --threads 8 --full-db --external-clustering gpsc_definitive.csv

Outputs:
_clusters.csv: popPUNK clusters with dataset specific nomenclature
_external_clusters.csv: GPSC v2 scheme designations

Novel clusters: Isolates in novel clusters are assigned NA in _external_clusters.csv because their cluster was not seen in the v2 dataset used to designate the GPSCs. The popPUNK _clusters.csv file can be used to determine whether NA isolates belong to the same cluster.
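
The check described above can be sketched as follows. The column layout is an assumption: both files are taken to be comma-separated with a header row, the isolate name in column 1 and the cluster (GPSC or popPUNK cluster) in column 2. The function names are hypothetical.

```shell
# List isolates assigned NA in the external clusters file.
# Assumption: CSV with a header, isolate in column 1, GPSC in column 2.
novel_isolates() {
    awk -F ',' 'NR > 1 && $2 == "NA" { print $1 }' "$1"
}

# For each NA isolate, print "<isolate><TAB><popPUNK cluster>" so that
# isolates sharing a popPUNK cluster can be spotted.
# Arguments: <_external_clusters.csv> <_clusters.csv>
novel_with_poppunk_cluster() {
    awk -F ',' '
        NR == FNR { if (FNR > 1 && $2 == "NA") novel[$1] = 1; next }
        FNR > 1 && ($1 in novel) { print $1 "\t" $2 }
    ' "$1" "$2"
}
```

NA isolates printed with the same popPUNK cluster value belong to the same novel cluster.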

Please email globalpneumoseq@gmail.com to have novel clusters added to the database and a GPSC name assigned. Before emailing, check for low-level contamination, which may bias accessory distances.

Merged clusters: Previously unsampled diversity may represent the missing variation linking two clusters. When query isolates link two clusters in this way, the GPSCs are merged. For example, if GPSC23 and GPSC362 merged, the GPSC would then be reported as GPSC23, with a merge history of GPSC23;362.
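
Isolates affected by a merge can be picked out of the results with a short filter. This sketch assumes that a merged assignment appears in column 2 of _external_clusters.csv as a semicolon-separated list (as in the merge history above) and that the file is comma-separated with a header row; verify both against your own output. The function name merged_isolates is hypothetical.

```shell
# Print "<isolate><TAB><assignment>" for rows whose GPSC field
# contains more than one identifier.
# Assumption: merged clusters are written as a semicolon-separated list.
merged_isolates() {
    awk -F ',' 'NR > 1 && index($2, ";") > 0 { print $1 "\t" $2 }' "$1"
}
```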
