Example #1

In short, a complete workflow from Illumina reads to genome assembly, through annotation and comparative genomics.

This example uses just the Illumina data (https://www.ebi.ac.uk/ena/browser/view/ERR3335404) generated for the study “Whole genome sequence of Mycobacterium ulcerans CSURP7741, a French Guyana clinical isolate” (https://www.ebi.ac.uk/ena/browser/view/PRJEB30628), “P7741” as a short identity code. Mycobacterium ulcerans is an environmental non-tuberculous mycobacterium responsible for Buruli ulcer.

One reference genome paper is “Complete genome sequence of the frog pathogen Mycobacterium ulcerans ecovar Liflandii” and it’s sequence https://www.ncbi.nlm.nih.gov/nuccore/NC_020133. Mycobacterium ulcerans liflandii 128FXT, complete sequence has NCBI Reference Sequence: NC_020133.1 and is refered to as just “Liflandii” (sequence stored as ./genomes/Liflandii.fasta). Other reference genomes of Mycobacterium ulcerans are utilized in the workflow (Agy99, SGL03 and Shinsuense) and one of Mycobacterium tuberculosis (H37Rv). This work is based upon the excellent presentations and analysis demonstrated by Vincent Appiah (vincentappiah, link) and Daren Ginete (link). The work has been reproduced, adapted and extended in order that it works on a Mac laptop with updated versions of analysis software available in February 2024 and using the Quarto package for presentation and re-usability.

Conda env, data download, QC reports

download data and QC of raw Illumina with fastqc

# use clean install of conda on machine in private space 
source ~/anaconda3/etc/profile.d/conda.sh
conda -V
# all those envs explained on Set Up page
conda env list
conda activate bacterial-genomics-tutorial-sw7
# uncomment appropriately to re-run
# download data to data 
# chmod +x download_data.sh
# ./download_data.sh
# run fastqc
# mkdir QC_RAW_READS
# fastqc data/*.fastq.gz -o QC_RAW_READS

# show the raw data
ls -lh data/ | perl -nae 'print "@F[4..8]\n"'

conda deactivate

conda 24.1.2
# conda environments:
#
base                  *  /Users/sw/anaconda3
bacterial-genomics-tutorial-sw1     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw1
bacterial-genomics-tutorial-sw2     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw2
bacterial-genomics-tutorial-sw3     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw3
bacterial-genomics-tutorial-sw4     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw4
bacterial-genomics-tutorial-sw5     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw5
bacterial-genomics-tutorial-sw7     /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw7
example                  /Users/sw/anaconda3/envs/example
for_drep                 /Users/sw/anaconda3/envs/for_drep
just_quast               /Users/sw/anaconda3/envs/just_quast

    
40M Feb 23 16:02 P7741_R1.fastq.gz
45M Feb 23 16:02 P7741_R2.fastq.gz

Results files are QC html reports.

./QC_RAW_READS/P7741_R1_fastqc.html

./QC_RAW_READS/P7741_R2_fastqc.html

Trimming and post trim QC reports

Using Trimmomatic: A flexible read trimming tool for Illumina NGS data from Björn Usadel lab (hallöchen!). Alternative is Sickle (A windowed adaptive trimming tool for FASTQ files using quality)

# use clean install of conda on machine in private space 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

# uncomment appropriately to re-run

# mkdir trimmed_reads
read1=data/P7741_R1.fastq.gz
read2=data/P7741_R2.fastq.gz
OutputForwardPaired=trimmed_reads/P7741_R1.fastq.gz    
OutputForwardUnpaired=trimmed_reads/P7741_unpaired_R1.fastq.gz
OutputReversePaired=trimmed_reads/P7741_R2.fastq.gz
OutputReverseUnpaired=trimmed_reads/P7741_unpaired_R2.fastq.gz
threads=4

# trimmomatic PE -threads $threads $read1 $read2 \
# $OutputForwardPaired $OutputForwardUnpaired \
# $OutputReversePaired $OutputReverseUnpaired \
# ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true HEADCROP:3 TRAILING:10 MINLEN:25

# mkdir QC_TRIMMED_READS
# fastqc trimmed_reads/* -o QC_TRIMMED_READS

ls -1 QC_TRIMMED_READS

conda deactivate

P7741_R1_fastqc.html
P7741_R1_fastqc.zip
P7741_R2_fastqc.html
P7741_R2_fastqc.zip
P7741_unpaired_R1_fastqc.html
P7741_unpaired_R1_fastqc.zip
P7741_unpaired_R2_fastqc.html
P7741_unpaired_R2_fastqc.zip

Results files are QC html reports. vincentappiah’s results are different around 16:29. In video he speaks of sickle trimming, but the code uses trimmomatic.

./QC_TRIMMED_READS/P7741_R1_fastqc.html

./QC_TRIMMED_READS/P7741_R2_fastqc.html

./QC_TRIMMED_READS/P7741_unpaired_R1_fastqc.html

./QC_TRIMMED_READS/P7741_unpaired_R2_fastqc.html

Assembly

using SPAdes - St. Petersburg genome assembler (SPAdes)

# Real problems setting up spades with conda - notes on Set Up page
# Solution was to get the spades Mac binaries
#
# It runs but takes quite some time on my mini-server
# Therefore delete directory
# P7741_SPADES_OUT
# and re-run if needed
# ~/SPAdes-3.15.5-Darwin/bin/spades.py --careful -o P7741_SPADES_OUT -1 trimmed_reads/P7741_R1.fastq.gz -2 trimmed_reads/P7741_R2.fastq.gz

echo "number of contigs"
grep '>' P7741_SPADES_OUT/contigs.fasta | wc -l
echo ""
echo "number of scaffolds"
grep '>' P7741_SPADES_OUT/scaffolds.fasta | wc -l
echo ""

echo "directory listing"
ls -1 P7741_SPADES_OUT

number of contigs
    3243

number of scaffolds
    3232

directory listing
K21
K33
K55
K77
assembly_graph.fastg
assembly_graph_after_simplification.gfa
assembly_graph_with_scaffolds.gfa
before_rr.fasta
contigs.fasta
contigs.paths
corrected
dataset.info
input_dataset.yaml
misc
mismatch_corrector
params.txt
pipeline_state
run_spades.sh
run_spades.yaml
scaffolds.fasta
scaffolds.paths
spades.log
tmp
warnings.log

Polishing

with

bwa (alignment via Burrows-Wheeler transformation)
samtools (for manipulating high-throughput sequencing data)
pilon (automatically improve draft assemblies and find variation among strains including large event detection)

Use script to do 4 rounds (not guaranteed to be optimal) of polishing. Runs but takes quite some time on my mini-server but it goes.

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

# To clean re-run delete dir polishing_process
# re-run script
# ./polish.sh

# list out the files developed 
find polishing_process

conda deactivate

polishing_process
polishing_process/pilon_stage4.changes
polishing_process/pilon_stage1.fasta.ann
polishing_process/pilon_stage3.fasta.amb
polishing_process/pilon_stage3.fasta.bwt
polishing_process/stage1.pilon
polishing_process/pilon_stage1.fasta
polishing_process/pilon_stage3.fasta
polishing_process/raw_assembly.fasta.ann
polishing_process/mapping3.sorted.bam
polishing_process/stage3.pilon
polishing_process/pilon_stage4.fasta
polishing_process/mapping4.sorted.bam
polishing_process/pilon_stage2.fasta.amb
polishing_process/stage4.pilon
polishing_process/raw_assembly.fasta
polishing_process/pilon_stage2.fasta.bwt
polishing_process/raw_assembly.fasta.pac
polishing_process/raw_assembly.fasta.sa
polishing_process/pilon_stage2.fasta
polishing_process/pilon_stage1.fasta.pac
polishing_process/stage2.pilon
polishing_process/mapping1.sorted.bam
polishing_process/pilon_stage1.fasta.sa
polishing_process/pilon_stage3.fasta.ann
polishing_process/pilon_stage1.fasta.amb
polishing_process/mapping1.sorted.bam.bai
polishing_process/mapping2.sorted.bam
polishing_process/pilon_stage1.fasta.bwt
polishing_process/pilon_stage3.changes
polishing_process/raw_assembly.fasta.bwt
polishing_process/pilon_stage2.fasta.pac
polishing_process/pilon_stage2.changes
polishing_process/raw_assembly.fasta.amb
polishing_process/pilon_stage2.fasta.ann
polishing_process/pilon_stage3.fasta.sa
polishing_process/pilon_stage2.fasta.sa
polishing_process/mapping2.sorted.bam.bai
polishing_process/pilon_stage3.fasta.pac
polishing_process/mapping3.sorted.bam.bai
polishing_process/mapping4.sorted.bam.bai
polishing_process/pilon_stage1.changes

QC of Assembly

using quast with results linked in below

# conda
source ~/anaconda3/etc/profile.d/conda.sh

# env just with the quast software
conda activate just_quast

# genomes dir taken from vincentappiah repo
# quast throws an error with default script
# -t 1 solves 
# other potential issue may exist as the install of quast was not completed...
# see SetUp page

# uncomment to run again
# mkdir QC_ASSEMBLY
# quast.py -t 1 -o QC_ASSEMBLY -R genomes/Liflandii.fasta P7741_SPADES_OUT/scaffolds.fasta P7741.polished.fasta

# Results notes
# N50 quality didn't change with polishing, GC% didn't change,
# Slight change in misassemblies 

# list out the top dir 
ls -1 QC_ASSEMBLY

conda deactivate

aligned_stats
basic_stats
contigs_reports
genome_stats
icarus.html
icarus_viewers
quast.log
report.html
report.pdf
report.tex
report.tsv
report.txt
transposed_report.tex
transposed_report.tsv
transposed_report.txt

Quality Assessment Tool for Genome Assemblies (QUAST)

QC_ASSEMBLY/report.html

Icarus QUAST Contig Browser

QC_ASSEMBLY/icarus.html

QC_ASSEMBLY/icarus_viewers/contig_size_viewer.html

QC_ASSEMBLY/icarus_viewers/alignment_viewer.html

Reorder contigs

using ragtag

apart from the paper listed below, there is a more recent publication

Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

# genomes/Agy99.fasta is NC_008611.1 Mycobacterium ulcerans Agy99, complete sequence
# note this is a different genome to genomes/Liflandii.fasta

ragtag.py -c

# uncomment to re-run
## ragtag.py scaffold genomes/Agy99.fasta P7741.polished.fasta -o P7741_reordered

# Extract the reordered contig with a custom (modified!) python script
# The scripts accept name of the ragtag file containing the reordered contigs 
# and accession number for the reference genome

python extract_reordered.py P7741_reordered/ragtag.scaffold.fasta NC_008611.1

# at this stage vincentappiah gets length of 5291728 
# and a gc of 64.98
# potentially the difference is due to newer version of ragtag
# i.e. v1.0.2 vs RagTag v2.1.0

conda deactivate


Alonge, Michael, et al. "RaGOO: fast and accurate reference-guided scaffolding of draft genomes."
Genome biology 20.1 (2019): 1-17.
https://doi.org/10.1186/s13059-019-1829-6
    
GC Percent, opt=ignore: 65.073454
GC Percent, opt=remove: 65.701833
GC Percent, opt=weighted: 65.551660
Sequence Length: 5012494 bp
draft genome sequence extracted

Multi-Locus Sequence Typing (MLST) and Antibiotic Resistance

using https://github.com/tseemann/mlst and https://github.com/tseemann/abricate

The results show the expected identity (7 alleles of mycobacteria) and there is an antibiotic resistance gene.

But need to use cgMLST, wgMLST and more…

Need to look at “chewBBACA is a comprehensive pipeline including a set of functions for the creation and validation of whole genome and core genome MultiLocus Sequence Typing (wg/cgMLST) schemas..”

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

echo "Alleles"
echo ""
mlst --csv P7741.reordered.fasta > mlst.csv

echo ""
echo "Antibiotic resistance"
echo ""
abricate P7741.reordered.fasta > amr.summary.tab
cat amr.summary.tab

conda deactivate

Alleles

[11:44:55] This is mlst 2.23.0 running on darwin with Perl 5.030003
[11:44:55] Checking mlst dependencies:
[11:44:55] Found 'blastn' => /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw7/bin/blastn
[11:44:55] Found 'any2fasta' => /Users/sw/anaconda3/envs/bacterial-genomics-tutorial-sw7/bin/any2fasta
[11:44:55] Found blastn: 2.15.0+ (002015)
[11:44:55] Excluding 3 schemes: ecoli abaumannii vcholerae_2
[11:44:57] Found exact allele match mycobacteria_2.S7-10
[11:44:57] Found exact allele match mycobacteria_2.L16-450
[11:44:57] Found exact allele match mycobacteria_2.S12-394
[11:44:57] Found exact allele match mycobacteria_2.L19-11
[11:44:57] Found exact allele match mycobacteria_2.S19-13
[11:44:57] Found exact allele match mycobacteria_2.L35-10
[11:44:57] Found exact allele match mycobacteria_2.S14Z-10
[11:44:57] mlst also supports --json output for the modern bioinformatician.
[11:44:57] Done.

Antibiotic resistance

Using nucl database ncbi:  5386 sequences -  2023-Nov-4
Processing: P7741.reordered.fasta
Found 1 genes in P7741.reordered.fasta
Tip: you can use the --summary option to combine reports in a presence/absence matrix.
Done.
#FILE   SEQUENCE    START   END STRAND  GENE    COVERAGE    COVERAGE_MAP    GAPS    %COVERAGE   %IDENTITY   DATABASE    ACCESSION   PRODUCT RESISTANCE
P7741.reordered.fasta   P7741   1110389 1110934 -   aac(2')-Ic  1-546/546   ========/====== 2/2 99.82   82.27   ncbi    NG_047229.1 aminoglycoside N-acetyltransferase AAC(2')-Ic   GENTAMICIN;TOBRAMCYIN

Rapid prokaryotic genome annotation

using https://github.com/tseemann/prokka also from Torsten Seemann

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

cpus=4
# uncomment to re-run
# ran is around 10 minutes with 4 cores
#prokka --cpus $cpus --kingdom Bacteria --locustag P7741 --outdir P7741_annotation --prefix P7741 --addgenes P7741.reordered.fasta

# stats
echo "Summary"
cat P7741_annotation/P7741.txt 
echo ""

# get some counts of genome features
# vincentappiah numbers differ for reason speculated over in above sections
echo "Another bespoke summary"
python get_annot_stats.py P7741_annotation P7741
echo ""

# show first 3 protein sequences
echo "First 3 protein sequences"
head -13 P7741_annotation/P7741.faa
echo ""

# show some 3 genes/CDS
echo "First 3 genes / CDS"
head -7 P7741_annotation/P7741.tsv
echo ""

# show some pseudogenes
echo "Some pseudogenes"
./get_pseudo.pl P7741_annotation/P7741.faa > P7741_annotation/P7741.pseudo.txt
head -10 P7741_annotation/P7741.pseudo.txt
echo "...."

conda deactivate

Summary
organism: Genus species strain 
contigs: 1
bases: 5012494
CDS: 4805
gene: 4852
rRNA: 3
tRNA: 43
tmRNA: 1

Another bespoke summary
Counting Annotated Features.............................
CDS:4805
gene:4852
tRNA:43
tmRNA:1
rRNA:3
IS110 family transposase IS117:1
IS30 family transposase ISMsm8:1
Insertion_Sequences:2
Pseudogenes:141

First 3 protein sequences
>P7741_00001 Anti-sigma-M factor RsmA
MHITVSTPPPVIPLSHDQVLDLLQRAPDYGPFADPSRRASCLNGLGYPASTPILGAQPVD
INARPGVLLVLAGDAPADLAVYAVALNCSAADTGLLASTTLPRLPDS
>P7741_00002 Thioredoxin reductase
MTTDSSADATIHDVIVIGSGPAGYTAALYTARAQLAPLVFEGTSFGGALMTTTEVENYPG
FREGITGPELMDEMREQALRFGADLRMEDVQSVSLDGPIKSVVTSDGETHRARAVILAMG
AAARYLHVPGEQELLGRGVSSCATCDGFFFRDQDIAVIGGGDSAMEEATFLTRFARSVTL
VHRRDEFRASKIMINRAQANDKIRILTNKIVLAVDGETTVTGLQLRDTVTGEETTLPVTG
VFVAIGHEPRSSLVRDAVDVDPDGYVLVNGRTTGTSLEGVFAAGDLVDRTYRQAVTAAGS
GCAAAIDAERWLAEHEATGDADSTDTLIGAQQ
>P7741_00003 Thioredoxin
MTDSEKSSATIEVSDASFSTEVLSSNKPVLVDFWATWCGPCKMVAPVLEEIATERADNLT
VAKLDVDANPETARNFQVVSIPTMILFKDGEPVKRIVGAKGKAALLRELSDAVPNLG

First 3 genes / CDS
locus_tag   ftype   length_bp   gene    EC_number   COG product
P7741_00001 gene    324 rsmA_1          
P7741_00001 CDS 324 rsmA_1          Anti-sigma-M factor RsmA
P7741_00002 gene    999 trxB_1          
P7741_00002 CDS 999 trxB_1  1.8.1.9 COG0492 Thioredoxin reductase
P7741_00003 gene    354 trxA            
P7741_00003 CDS 354 trxA        COG0526 Thioredoxin

Some pseudogenes
Found 4805 genes.
Found 141 potential psuedo-genes
P7741_00021 & P7741_00022 => HTH-type transcriptional activator TipA
P7741_00062 & P7741_00063 => putative protein
P7741_00063 & P7741_00064 => putative protein
P7741_00161 & P7741_00162 => ESX-1 secretion-associated protein EspK
P7741_00272 & P7741_00273 => putative protein
P7741_00273 & P7741_00274 => putative protein
P7741_00276 & P7741_00277 => putative monooxygenase
P7741_00279 & P7741_00280 => Putative citrate synthase 2
P7741_00281 & P7741_00282 => Sphingomyelinase
P7741_00293 & P7741_00294 => putative protein
....

Comparative Analysis

Average Nucleotide Identity

dRep is a python program for rapidly comparing large numbers of genomes
Comments in dendrogram.sh says it needs https://github.com/ParBLiSS/FastANI and that leads to https://drep.readthedocs.io/en/latest/installation.html
FastANI is developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI).
Script dendrogram.sh calls dRep: it is installed in conda bacterial-genomics-tutorial-sw7 but not possible to add fastani to this build do to errors (with perl versions if I remember correctly), so build a particular environment for dRep (see Set Up page).

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
# particular environment for dRep
conda activate for_drep

# dRep check_dependencies
# uncomment to re-run
# ./dendogram.sh

conda deactivate

Some of the outputs from the PDFs generated are given as screen-shot images here.
Cluster of P7741/Liflandii/Shinsuense is different between methods ANI methods but agreement with the distinction of SGL03 and Agy99 that are approaching 100% Average Nucleotide Identity.

fastANI cluster: just the *Mycobacterium ulcerans* genomes

MASH cluster: H37Rv (green label - hard to read in image) is *Mycobacterium tuberculosis* and is clearly very different, an outlier with respect to ANI, to the *Mycobacterium ulcerans* genomes.

Genome ring structures using BRIG

using https://github.com/happykhan/BRIG

BRIG is a standalone interactive application with interface to run BLAST and to render the output as ring structure visualizations.
get BRIG-0.95-dist: start with enough RAM, java -Xmx1500M -jar BRIG.jar
Needs Genbank formats as generated by prokka (e.g. P7741.gbk).
- This required re-running prokka for reference genomes as get_genome_gffs.sh previously deleted the analysis folder after taking gff
- Now I keep the complete folder for future usage
The forms in BRIG need acclimatization, but once understood, the tool works is great
- the path to BLAST binary folder needs to be set in “Configure BRIG options…”
  - in this case, pairwise genome blastn jobs are performed
  - compute job progress can be seen in the specified “Output folder”
    - e.g. WGS_Bacteria/BRIG/Mycobacterium_ulcerans_5g
    - folder I used for the 5 genomes of Mycobacterium ulcerans
- it’s useful to set-up a symbolic link at same level as BRIG install to gkb directory
- extensive image rendering options are available to customize the graphic

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
conda activate bacterial-genomics-tutorial-sw7

# find gbks and copy to folder 
# mkdir gbk
# find . -name "*.gbk" | perl -ne 'chomp ; print "cp $_ gbk/\n"' -
# could redirect to a shell script, but prefer to develop commands and cut/paste into terminal for short batch processes 
cp ./H37Rv/H37Rv.gbk gbk/
cp ./SGL03/SGL03.gbk gbk/
cp ./Shinsuense/Shinsuense.gbk gbk/
cp ./P7741_annotation/P7741.gbk gbk/
cp ./Agy99/Agy99.gbk gbk/
cp ./Liflandii/Liflandii.gbk gbk/
  
conda deactivate

Mycobacterium ulcerans: 5 genomes compared - P7741 was assembled in the workflow. Click on image gets the SVG version.

Mycobacterium ulcerans liflandii vs Mycobacterium tuberculosis (var. H37Rv). Click on image gets the SVG version.

Generate Genome Feature Files

using https://github.com/tseemann/prokka to run 5 reference genomes Agy99, Liflandii, SGL03, Shinsuense, H37Rv

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
# 
conda activate bacterial-genomics-tutorial-sw7

# takes just over an hour on laptop to 
# generate gffs for 6 bacterial genomes
# ./get_genome_gffs.sh 

ls -lh gffs | awk '{print $5 " " $6 $7 " " $8" " $9}' 

conda deactivate

   
7.2M Feb28 18:55 Agy99.gff
5.5M Feb28 19:28 H37Rv.gff
8.0M Feb28 19:04 Liflandii.gff
6.4M Feb28 19:28 P7741.gff
7.2M Feb28 19:12 SGL03.gff
7.6M Feb28 19:21 Shinsuense.gff

Generate Pangenome

using

roary and roary_plots.py https://sanger-pathogens.github.io/Roary/
- Roary is a high speed stand alone pan genome pipeline, which takes annotated assemblies in GFF3 format (produced by Prokka (Seemann, 2014)) and calculates the pan genome.
FastTree https://microbesonline.org/fasttree/
- FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences.

# conda 
source ~/anaconda3/etc/profile.d/conda.sh
# 
conda activate bacterial-genomics-tutorial-sw7

# uncomment to re-run
# ./get_pangenome.sh 

# Pangenome analysis using roary and fasttree
# i have 4 cores
threads=4

# # Uncomment appropriately to re-run
# roary -f pangenome -p $threads -e -n -v --mafft gffs/*.gff
# # Generate alignment file
# FastTree -nt -gtr pangenome/core_gene_alignment.aln > pangenome/mytree.newick


# # Plot phylogenetic tree and presence/absense in svg format
# python roary_plots.py --labels --format svg pangenome/mytree.newick pangenome/gene_presence_absence.csv
# # same as png
# python roary_plots.py --labels              pangenome/mytree.newick pangenome/gene_presence_absence.csv
# # move all pangenome plots to img folder
# mv pangenome_*.{svg,png} img

# # generates gene_count_summary.png that needs to be renamed and moved 
# # 3 closely related genomes
# python gene_count_summary.py P7741 Agy99 Liflandii pangenome/gene_presence_absence.csv
# mv gene_count_summary.png  img/P7741_Agy99_Liflandii.png
# # 2 closely related genomes and Mtycobacterium tuberculosis
# python gene_count_summary.py P7741 H37Rv Liflandii pangenome/gene_presence_absence.csv
# mv gene_count_summary.png  img/P7741_H37Rv_Liflandii.png

echo "Genomes used with Roary:"
ls -1 gffs/*.gff
echo ""

echo "pangenome/summary_statistics.txt"
cat pangenome/summary_statistics.txt

conda deactivate

Genomes used with Roary:
gffs/Agy99.gff
gffs/H37Rv.gff
gffs/Liflandii.gff
gffs/P7741.gff
gffs/SGL03.gff
gffs/Shinsuense.gff

pangenome/summary_statistics.txt
Core genes  (99% <= strains <= 100%)    91
Soft core genes (95% <= strains < 99%)  0
Shell genes (15% <= strains < 95%)  11844
Cloud genes (0% <= strains < 15%)   0
Total genes (0% <= strains <= 100%) 11935

Image above: A sample of genes_presence_absense.csv where number of present = 5 genomes. Click on image to get the full CSV file - 11936 lines.

Image above: Pangenome Tree and Matrix of the Mycobacterium tuberculosis genome versus the Mycobacterium ulcerans genomes. Of the 11935 only 91 are considered core genes across all 6 genomes (a score of between >= 99% similarity). A click on the image gets the scalable SVG version.

Genes unique to the set P7741, Agy99 and Liflandii (1395 = 390 + 2 + 120 + 141 + 3 + 6 + 733). Within the set, for example Agy99 has 120 unique genes. Of the 1395 unique genes within the set, Agy99 and P7741 share only 5, whilst P7741 and Liflandii share 144. Seems like P7741 and Liflandii are phylogenetically closer than to each other than to Agy99.

Genes unique to the set P7741, H37Rv, Liflandii (5169 = 390 + 0 + 3904 + 141 + 1 + 0 + 733). Mycobacterium tuberculosis (var. H37Rv) is clearly, and as expected, relatively distant to the Mycobacterium ulcerans genomes.

Bug note

# With Quarto, the strange bug means that if an {r} executable block is included
# then the preceding {zsh} or {bash} blocks are executed, otherwise not!
# Simply include this and all is executed as desired.