Training: Command Line

In silico serotyping

Install SeroBA (Epping et al 2018) as per instructions at and git clone the database from the following link

Files required to run serotyping using SeroBA:

  1. paired-end fastq files
  2. database
  3. sample list (only for running on multiple samples)

Run in silico serotyping on a single sample:

seroba runSerotyping <full path to the database> <read 1> <read 2> <output folder prefix>

Run in silico serotyping on multiple samples:

  1. Create a list of sample names, one per line, and save it as samplelist (e.g. the sample name for 24371_8#283_1.fastq.gz is 24371_8#283)
  2. Run SeroBA on each sample in the list: for f in $(cat samplelist); do seroba runSerotyping <path to the database> "${f}_1.fastq.gz" "${f}_2.fastq.gz" "${f}"; done
  3. Combine the per-sample results found in the current directory: seroba summary ./
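
Step 1 can be scripted rather than typed by hand. The following is a minimal sketch, assuming all reads sit in one directory and follow the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming shown above; make_samplelist is a hypothetical helper name, not part of SeroBA.

```shell
# make_samplelist: hypothetical helper (not a SeroBA command).
# Prints one sample name per line for every read pair found in a directory.
# Assumes reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
make_samplelist() {
    reads_dir="$1"
    for r1 in "$reads_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # no match: glob stays literal
        sample=$(basename "$r1" _1.fastq.gz)     # strip the read-1 suffix
        if [ -e "$reads_dir/${sample}_2.fastq.gz" ]; then
            printf '%s\n' "$sample"
        else
            # report unpaired samples on stderr and leave them out of the list
            printf 'missing read 2 for %s\n' "$sample" >&2
        fi
    done
}
```

For example, make_samplelist . > samplelist writes the list for the current directory; samples without a matching read 2 file are reported and skipped.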


These instructions are available to download here:

GPSC assignment

Install PopPUNK 2.4 as per instructions at PopPUNK documentation and download the GPS reference database and the GPS designation.

GPS reference database (n=42,163):

GPS designation (933 GPSCs):

Files required to run GPSC assignment using PopPUNK 2.4:

  1. A 2-column tab-delimited file containing sample name and path to the corresponding assembly (no header)
  2. GPS reference database <GPS_v6>
  3. GPS designation <GPS_v6_external_clusters.csv>
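
The 2-column file in item 1 can be generated from a directory of assemblies. This is a sketch under stated assumptions: assemblies are named <sample>.fa (or another extension given as the second argument), and make_query_list is a hypothetical helper name, not part of PopPUNK.

```shell
# make_query_list: hypothetical helper (not a PopPUNK command).
# Prints "<sample name><TAB><path to assembly>" for each assembly, no header.
# Assumes the sample name is the file name without its extension.
make_query_list() {
    asm_dir="$1"
    ext="${2:-fa}"                               # assumed default extension
    for asm in "$asm_dir"/*."$ext"; do
        [ -e "$asm" ] || continue                # no match: glob stays literal
        sample=$(basename "$asm" ".$ext")
        printf '%s\t%s\n' "$sample" "$asm"
    done
}
```

For example, make_query_list /full/path/to/assemblies fa > qfile.txt produces a file that can be passed to --query. Giving the directory as a full path keeps the paths in the file valid wherever the command is run from.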

The output directory name is set with --output.
The number of threads can be changed with --threads.

Run GPSC assignment:

poppunk_assign --db GPS_v6 --distances GPS_v6/GPS_v6.dists --query <2-column path to assembly> --output <GPSC_assignment> --external-clustering GPS_v6_external_clusters.csv

_clusters.csv: PopPUNK clusters with dataset-specific nomenclature
_external_clusters.csv: GPSC v6 scheme designations

Novel clusters are assigned NA in the _external_clusters.csv because they were not defined in the v6 dataset used to designate the GPSCs. First check for low-level contamination, which can bias accessory distances. Then please email: to have novel clusters added to the database and a GPSC cluster name assigned.
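
To pick out the samples that need a contamination check, the NA rows can be filtered from the output. This is a minimal sketch, assuming the _external_clusters.csv file is comma-separated with a header row, the sample name in column 1 and the GPSC in column 2; check the layout of your own output first. list_novel is a hypothetical helper name.

```shell
# list_novel: hypothetical helper (not a PopPUNK command).
# Prints the names of samples whose GPSC is NA.
# Assumes: comma-separated input, one header row, sample in column 1,
# GPSC in column 2.
list_novel() {
    awk -F',' 'NR > 1 && $2 == "NA" { print $1 }' "$1"
}
```

For example, list_novel <GPSC_assignment>/<GPSC_assignment>_external_clusters.csv > novel_samples.txt, where the path follows the --output name used above and should be adjusted to wherever the file was actually written.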

Merged clusters: New samples may represent previously unsampled diversity, supplying the missing variation that links two existing clusters. When this happens, the GPSCs are merged. For example, if GPSC23 and GPSC362 merge, the GPSC is reported as GPSC23, with a merge history of GPSC23;362.

The instructions for PopPUNK v2.4 are available to download

The instructions for PopPUNK v1 are available to download