Salmonella diverged from Escherichia around 100 million years ago and subsequently evolved into two species, S. enterica and S. bongori. S. enterica further diversified into seven or more subspecies and more than 2000 serovars. Most of the serovars belong to S. enterica subsp. enterica, which includes generalists with a broad host range and specialists with host specificity. Generalists such as S. typhimurium and S. enteritidis cause self-limiting non-typhoidal Salmonella (NTS) diseases, while specialists exemplified by S. typhi and S. paratyphi A cause invasive infections. Besides host adaptability, Salmonella serovars also vary extensively in other phenotypes such as invasion, persistence and antibiotic resistance.
To facilitate understanding of how genome evolution influences the molecular and biological phenotypes of Salmonella strains, the ancient orthologous chromosomes of the Salmonella genus, species, subspecies and other major phylogenetic nodes were traced. The evolutionary trajectories were delineated and annotated. A database, dbESG, was initiated to illustrate Salmonella genome evolution dynamically.
Evolutionary annotation and comparison of Salmonella ancient and extant representative genomes.
Software tools, pipelines and codes for genome analysis.
The genomic, comparative genomic and phylogenomic datasets of Salmonella.
Databases, webservers and other related resources.
References:
Hu Y, et al. Evolution of Salmonella Chromosomes and Its Influence on Chromosomal Topology, Interaction and Gene Expression. Submitted.
A series of programs and scripts have been developed to facilitate bacterial comparative genomic analysis.
Fig 1. A pipeline and related tools for bacterial comparative genomic analysis
BactCG is designed to analyze the core genome of a group of bacterial strains. Typically, pairwise alignment is performed repeatedly between the genes (or proteins) of every pair of bacterial strains, and mutual best alignment pairs are identified to generate the core gene set. The number of strain-to-strain comparisons therefore grows on the order of n^2 for n strains. BactCG instead takes one representative strain as the reference and performs mutual alignment between the genes (or proteins) of each other strain and those of the reference strain to reveal the core gene set. The computation order of BactCG thus decreases to n.
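The reference-based idea can be illustrated with a minimal Python sketch. This is not the BactCG source code (which is written in Go); the function names and the (query, subject, score) input format are assumptions made for illustration only.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring subject per query.

    hits: iterable of (query_id, subject_id, score) tuples (assumed format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_pairs(ref_vs_strain, strain_vs_ref):
    """Keep reference/strain gene pairs that are each other's best hit."""
    forward = best_hits(ref_vs_strain)
    backward = best_hits(strain_vs_ref)
    return {ref: gene for ref, gene in forward.items() if backward.get(gene) == ref}


def core_reference_genes(pairs_per_strain):
    """Reference genes that have a mutual best hit in every non-reference strain."""
    gene_sets = [set(pairs) for pairs in pairs_per_strain.values()]
    return set.intersection(*gene_sets) if gene_sets else set()


def comparisons_needed(n_strains):
    """Strain-to-strain comparisons: all pairs versus reference-based."""
    all_pairs = n_strains * (n_strains - 1) // 2
    reference_based = n_strains - 1
    return all_pairs, reference_based
```

For 10 strains, `comparisons_needed(10)` gives 45 all-pairs comparisons against 9 reference-based ones, which is the n^2 versus n difference described above.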
BactCG is developed in the Go programming language. The current version can only be compiled and run on Mac or Linux systems. The homology cutoff is customizable and involves two parameters, minimal length coverage and minimal sequence identity, which are set to 0.9 and 0.8 by default, respectively.
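As a rough illustration of how such a two-parameter cutoff could be applied to BLAST tabular output (`-outfmt 6`), consider the Python sketch below. It is an assumption-laden example rather than the BactCG implementation: in particular, defining coverage as the alignment length divided by the longer of the two sequences is a choice made here for illustration, and the sequence lengths must be supplied separately because the default tabular format does not report them.

```python
def parse_hit(line):
    """Parse one line of default BLAST tabular output (-outfmt 6)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "query": fields[0],
        "subject": fields[1],
        "pident": float(fields[2]),   # percent identity, 0-100
        "length": int(fields[3]),     # alignment length
        "bitscore": float(fields[11]),
    }


def passes_cutoff(pident, aln_len, query_len, subject_len,
                  min_coverage=0.9, min_identity=0.8):
    """Apply the coverage and identity thresholds to one alignment.

    Coverage is taken relative to the longer sequence (an assumption).
    """
    coverage = aln_len / max(query_len, subject_len)
    identity = pident / 100.0
    return coverage >= min_coverage and identity >= min_identity
```

A hit with 85% identity aligned over 95 residues of two 100-residue proteins passes the default cutoff, whereas the same hit aligned over only 80 residues does not.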
Operating system: Linux or Mac
Software requirements: Golang compilation environment (required when compiling from source code; not required when using the compiled program directly). Download and install the standalone version of NCBI BLAST (>= version 2.3.30) from the link: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ .
Compilation: The compiled executable program does not need to be installed and can be run directly on a Mac system. After downloading and decompressing the source package (BactCG.tar.gz), pre-install and configure the Golang compilation environment, then compile the source code files one by one to get the executable files and save them to the bin subfolder in the BactCG directory. Commands for compilation:
$ cd bin
$ go build ../codes/[module].go
$ cp ./CG ../
Download the genome-derived proteome sequences for each bacterial strain to be analyzed, and put the sequences of each strain in a single file named xxx.faa, where xxx normally represents the strain identifier. Put the proteome sequence files in the test subfolder under the BactCG directory after removing the example files. Designate one strain as the reference, and then run the program in a terminal with a single command.
Taking the strains whose proteome sequences are stored in the test subfolder as an example, where LT2 is set as the reference strain:
$ cd BactCG
$ ./CG test LT2 0.8 0.9
The running progress of the program is shown in the terminal. Once finished, a new subdirectory named result is generated in BactCG, containing 7 subfolders. The final core gene set is saved in the file CG.tab.txt in the cg_result subfolder.
The source code and executable files are stored in the codes and bin subdirectories, respectively. The executable files are for Mac systems. For Linux systems, the source code needs to be re-compiled. BactCG is accessible here.
The ancient orthologous genomes of bacteria were inferred semi-manually with a two-step Backbone-Patching approach. In the Backbone step, the most anciently diverged clades were identified according to a phylogenomic tree with strains covering the major branches of the genus, species or subspecies to be studied, and one representative strain was selected randomly from each clade. Orthologous fragments were identified with Mauve version 2.4.0 and an iterative Maximum Homologous Block (MHB) algorithm, and combined to generate the backbone of the ancient orthologous genome. Two patching sub-steps followed. First, genomes of other representative strains were aligned between the two clades, and the orthologous fragments were retrieved and compared to the backbone. The sub-fragments not covered by the backbone genome were patched in manually, and the backbone ancient orthologous genome was updated iteratively. Second, the genomes of closely related outgroup strains, or the genome of the nearest ancestor, were aligned against the representative strain of each clade, and the orthologous fragments were extracted to further patch the ancient orthologous genome.
BactAG is developed in the Go programming language. The current version can only be compiled and run on Mac or Linux systems.
Operating system: Linux or Mac
Software requirements: Golang compilation environment (required when compiling from source code; not required when using the compiled program directly). Download and install the standalone version of Mauve (>= version 2.4.0) from the link: https://darlinglab.org/mauve/user-guide/installing.html .
Compilation: The compiled executable program does not need to be installed and can be run directly on a Mac system. After downloading and decompressing the source package (BactAG.zip), pre-install and configure the Golang compilation environment, then compile the source code files one by one to get the executable files and save them to the bin subfolder in the BactAG directory. Commands for compilation:
$ cd bin
$ go build ../codes/[module].go
$ cp ../codes/*.pl ./
$ cp -r AG_Inference.py bin test
Download the genome sequences for each bacterial strain to be analyzed, and put the sequences of each strain in a single file named xxx.fasta, where xxx normally represents the strain identifier. Put the genome sequence files in the test subfolder under the BactAG directory after removing the example files. Designate one strain as the reference (for example RKS2986), and then run the program in a terminal.
Taking the strains whose genome sequences are stored in the test/ subfolder as an example, where RKS2986 is set as the reference strain and nC_AG is set as the ancient neighbor node:
$ cd BactAG/test
$ python AG_Inference.py backbone -n nC_AG.fasta -bo ./result/backboneOutput -b ./bin -p /root/softawre/miniconda3/bin/progressiveMauve -s RKS2986.fasta -f ./
$ python AG_Inference.py patching -n nC_AG.fasta -bo ./result/backboneOutput -o ./result/AGoutput -b ./bin -p /root/softawre/miniconda3/bin/progressiveMauve -s RKS2986.fasta -f ./
The running progress of the program is shown in the terminal. Once finished, a new subdirectory named result is generated in BactAG, containing 2 subfolders. The final ancient genome is saved in the file AG_ORTH.fasta in the AGoutput subfolder.
The source code and executable files are stored in the codes and bin subdirectories, respectively. The executable files are for Mac systems. For Linux systems, the source code needs to be re-compiled. BactAG is accessible here.
BactPG is designed to analyze the pan-genome of a group of bacterial strains. Typically, pairwise alignment is performed repeatedly between the genes (or proteins) of every pair of bacterial strains, and mutual best alignment pairs are identified to generate the pan-gene set. BactPG analyzes each combination of the strains. In each combination, BactPG takes one representative strain as the reference and performs mutual alignment between its genes (or proteins) and those of the other strains. The gene sets of all combinations are then merged to form the pan-genome.
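The merging step can be pictured as grouping genes that are linked by mutual best alignment pairs. The Python sketch below uses a union-find structure for this; it is only one possible reading of the merge, since the text does not specify how BactPG combines the per-combination gene sets, and the function name and input format are hypothetical.

```python
def merge_families(genes, pairs):
    """Group genes into families connected by mutual best alignment pairs.

    genes: iterable of all gene identifiers across strains.
    pairs: iterable of (gene_a, gene_b) mutual best pairs; both members
           must appear in genes.
    Genes without any pair end up as single-member families.
    """
    genes = list(genes)
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the family root, compressing the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for gene_a, gene_b in pairs:
        root_a, root_b = find(gene_a), find(gene_b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for gene in genes:
        families.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in families.values())
```

Under this reading, core families are those with a member in every strain, and the pan-genome is the full list of families, including strain-specific singletons.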
BactPG is developed in the Go programming language. The current version can be compiled and run on Mac, Linux or Windows systems. The homology cutoff is customizable and involves two parameters, minimal length coverage and minimal sequence identity, which are set to 70% and 0.7 by default, respectively.
Operating system: Mac, Linux or Windows
Software requirements: Golang compilation environment (required when compiling from source code; not required when using the compiled program directly). Download and install the standalone version of NCBI BLAST (>= version 2.3.30) from the link: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ .
Compilation: The compiled executable program does not need to be installed and can be run directly. After downloading and decompressing the source package (BactPG.zip), pre-install and configure the Golang compilation environment, then compile the source code (BactPG.go) to get the executable file. Commands for compilation:
$ cd BactPG
$ go build ../codes/BactPG.go
Download the genome-derived proteome sequences for each bacterial strain to be analyzed, and put the sequences of each strain in a single file named xxx.fasta, where xxx normally represents the strain identifier. Put the proteome sequence files in the example_seq subfolder under the BactPG directory.
Taking the strains whose proteome sequences are stored in the example_seq subfolder as an example:
$ cd BactPG
$ ./BactPG ./example_seq [absolute path of makeblastdb] [absolute path of blastp] 70 0.7
The running progress of the program is shown in the terminal. Once finished, the final pan-gene set is saved in the file PG.txt.
The source code and executable files are stored in the codes and BactPG subdirectories, respectively. The executable files are for Windows systems. For Linux or Mac systems, the source code needs to be re-compiled. BactPG is accessible here.
BactPGA is developed to facilitate automatic annotation of individual ancient or extant genomes according to pan-genome annotation results. Once sequenced and assembled, the target genome can be annotated for its genes with RASTtk or PGAG. BactPGA mainly classifies the genes into pan-genome families. BactPGA can also be used to annotate the results of 1DGR or other comparative genomic analyses.
Operating system: Linux/Mac
Software requirements: Golang compilation environment (required when compiling with source code; not required when using compiled program directly); BactCG, which can be downloaded via the link: http://61.160.194.165:3080/ESG/tools/BactCG/ ; downloading and installing the standalone version of NCBI BLAST (>= version 2.3.30) from the link: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/ .
Compilation: The compiled executable program does not need to be installed and can be run directly on a Mac system. After downloading and decompressing the source package (BactPGA.tar.gz), pre-install and configure the Golang compilation environment, then compile the source code files one by one to get the executable files and save them to the bin subfolder in the BactPGA directory. Commands for compilation:
$ cd bin
$ go build ../codes/[module].go
To run BactPGA correctly, the following files should be prepared in advance and transferred into the results subfolder: (1) the gbk file of the target genome; (2) the tab-separated pan-gene set file; (3) the seq subfolder with all the proteome FASTA files (.faa) of the pan-genome strains and the proteome FASTA file (.faa) of the target strain.
Retrieve the genome annotation file (nA_AG.gbk) of the target strain, the pan-genome annotation data (26_PG.txt), and the seq subfolder with the proteomes of the pan-genome strains and the target strain (nA_AG.faa) from the test subfolder, and transfer them into the results subfolder. The following commands are run:
$ cd results
$ rm -R *
$ rm *
$ rm -R ../BactCG1.0/result/
$ rm -R ../BactCG1.0/seq/
$ cp -r ../test/seq ../BactCG1.0/
$ cp ../test/26_PG.txt ./
$ cp ../test/nA_AG.gbk ./
$ ../BactCG1.0/CG ../BactCG1.0/seq nA_AG 0.7 0.7
$ cp ./26_PG.txt ../BactCG1.0/result/out_mutbest_filt/
$ cp ../bin/PGA ../BactCG1.0/result/out_mutbest_filt/
$ ../bin/gbParse ./nA_AG.gbk >nA_AG_PGAG.tab.txt
$ cp ./nA_AG_PGAG.tab.txt ../BactCG1.0/result/out_mutbest_filt/
$ ../BactCG1.0/result/out_mutbest_filt/PGA ../BactCG1.0/result/out_mutbest_filt/26_PG.txt ../BactCG1.0/result/out_mutbest_filt/nA_AG_PGAG.tab.txt PGAG nA_AG >../BactCG1.0/nA.AG_PGAG_PGA.tab.txt
$ cp ../BactCG1.0/nA.AG_PGAG_PGA.tab.txt ./
The file nA.AG_PGAG_PGA.tab.txt in the results subfolder of BactPGA is the final result.
The source code, executable files and testing data are stored in the codes, bin and test subdirectories, respectively. The executable files are for Mac systems. For Linux/Win systems, the source code needs to be re-compiled. BactPGA is accessible here.
Bact1DGR is developed to represent individual bacterial genomes as blocks annotated with their evolutionary origins. Both the phylogenetic information of the target strain and the ancient genomes of the nodes along its evolutionary trajectory are used. The representation scheme facilitates understanding of the sequence evolution of bacterial genomes and intuitive comparison of multiple bacterial genomes. The procedure involves four steps: (1) locating the terminal phylogenetic branch where the target strain falls, tracing all the nodes along the phylogenetic route of the branch, and delineating the evolutionary trajectory of the target strain; (2) aligning the genome of the target strain against that of the oldest ancestor, identifying the orthologous fragments and labeling the homologous genome blocks of the target strain; (3) aligning the genome of the target strain against that of the second oldest ancestor, identifying the orthologous fragments and labeling the homologous genome blocks of the target strain that have not yet been labeled; (4) repeating step 3 for each successively younger ancestor until the genome of the most recent ancestor has been compared and labeled, and finally identifying the strain-specific sequence blocks.
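The iterative labeling in steps 2 to 4 can be sketched in Python as interval bookkeeping on the target genome. This is a simplified illustration, not the Bact1DGR scripts: it assumes the orthologous fragments are already available as half-open (start, end) coordinates on the target genome, it treats whatever remains unlabeled after the last ancestor as strain-specific, and the function names and the "strain_specific" label are hypothetical.

```python
def subtract(interval, covered):
    """Return the parts of interval not overlapped by covered.

    interval: (start, end), half-open.
    covered: list of non-overlapping (start, end) intervals sorted by start.
    """
    start, end = interval
    free = []
    for c_start, c_end in covered:
        if c_end <= start or c_start >= end:
            continue  # no overlap with the remaining part of the interval
        if c_start > start:
            free.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        free.append((start, end))
    return free


def label_blocks(genome_length, ancestor_hits, specific_label="strain_specific"):
    """Label target-genome blocks with the oldest ancestor that covers them.

    ancestor_hits: list of (ancestor_name, fragments), ordered from the
    oldest to the most recent ancestor; fragments are (start, end) tuples.
    """
    labeled = []  # (start, end, origin) tuples, kept sorted by start
    for origin, fragments in ancestor_hits:
        for fragment in sorted(fragments):
            covered = [(s, e) for s, e, _ in labeled]
            for s, e in subtract(fragment, covered):
                labeled.append((s, e, origin))
            labeled.sort()
    covered = [(s, e) for s, e, _ in labeled]
    for s, e in subtract((0, genome_length), covered):
        labeled.append((s, e, specific_label))
    return sorted(labeled)
```

Because older ancestors are processed first, a region shared with several ancestors keeps the label of the oldest one, and younger ancestors only claim regions that are still unlabeled.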
Scripts facilitating the implementation of Bact1DGR are developed in the Go and Perl programming languages. The Bact1DGR results can also be used for comparative genomic analysis. For genome comparison, the most recent common ancestor (MRCA, the differentiating node) of the strains to be compared should be determined first; only the blocks with an origin later than the MRCA, including the strain-specific blocks, contain the differential information.
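The MRCA-based selection can be expressed as a simple filter. The Python sketch below assumes that blocks are (start, end, origin) tuples, that the evolutionary trajectory is given as a list of node names ordered from oldest to most recent, and that strain-specific blocks carry a dedicated label; these representations are illustrative assumptions, not the Bact1DGR output format.

```python
def differential_blocks(blocks, trajectory, mrca, specific_label="strain_specific"):
    """Keep blocks whose origin is later than the MRCA, plus strain-specific ones.

    blocks: list of (start, end, origin) tuples.
    trajectory: node names ordered from the oldest to the most recent ancestor.
    mrca: name of the most recent common ancestor of the compared strains.
    """
    rank = {node: index for index, node in enumerate(trajectory)}
    cutoff = rank[mrca]
    return [
        block for block in blocks
        if block[2] == specific_label or rank[block[2]] > cutoff
    ]
```

Blocks inherited from the MRCA or from older nodes are shared by descent between the compared strains, so they are dropped before the comparison.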
Operating system: Linux/Mac/Win
Software requirements: Golang compilation environment (required when compiling with source code; not required when using compiled program directly); Perl; Downloading and installing the standalone version of Mauve (>= version 2.4.0) from the link: https://darlinglab.org/mauve/user-guide/installing.html .
Compilation: The compiled executable program does not need to be installed and can be run directly on a Mac system. After downloading and decompressing the source package (Bact1DGR.tar.gz), pre-install and configure the Golang compilation environment, then compile the source code files one by one to get the executable files and save them to the bin subfolder in the Bact1DGR directory. Commands for compilation:
$ cd bin
$ go build ../codes/[module].go
Put the genome sequence of the target strain (e.g., Diarizonae_AG) in the test subfolder. Also put the genomes of the ancestors in the same subfolder, and make sure the evolutionary route of the target strain is known (e.g., S.genus.AG, S.enterica.ancient, nA_AG, nB2_AG, Diarizonae_AG). Use and modify the following pipeline (the installation paths of the software tools, such as that of progressiveMauve, should be adjusted according to your own system; the file names should be modified according to your own tasks):
$ cd test
$ /Applications/Mauve.app/Contents/MacOS/progressiveMauve --output=DiarizonaeAG.vs.GenusAG.xmfa --output-guide-tree=DiarizonaeAG.vs.GenusAG.guide_tree --backbone-output=DiarizonaeAG.vs.GenusAG.backbone Diarizonae_AG.fasta S.genus.AG.fasta
$ ./progBackbonePrep DiarizonaeAG.vs.GenusAG.backbone >DiarizonaeAG.vs.GenusAG.backbone.txt
$ ./homBlkReorder DiarizonaeAG.vs.GenusAG.backbone.txt >DiarizonaeAG.vs.GenusAG.homblk.txt
$ ./orthoParsing DiarizonaeAG.vs.GenusAG.homblk.txt >DiarizonaeAG.vs.GenusAG.orthBlk.txt
$ ./1dgrExt1 DiarizonaeAG.vs.GenusAG.orthBlk.txt Diarizonae_AG Genus_AG >DiarizonaeAG.GenusAG.1dgr.txt
$ /Applications/Mauve.app/Contents/MacOS/progressiveMauve --output=DiarizonaeAG.vs.EntSpAG.xmfa --output-guide-tree=DiarizonaeAG.vs.EntSpAG.guide_tree --backbone-output=DiarizonaeAG.vs.EntSpAG.backbone Diarizonae_AG.fasta S.enterica.ancient.fasta
$ ./progBackbonePrep DiarizonaeAG.vs.EntSpAG.backbone >DiarizonaeAG.vs.EntSpAG.backbone.txt
$ ./homBlkReorder DiarizonaeAG.vs.EntSpAG.backbone.txt >DiarizonaeAG.vs.EntSpAG.homblk.txt
$ ./orthoParsing DiarizonaeAG.vs.EntSpAG.homblk.txt >DiarizonaeAG.vs.EntSpAG.orthBlk.txt
$ ./1dgrExt1 DiarizonaeAG.vs.EntSpAG.orthBlk.txt Diarizonae_AG Ent_sp_AG >DiarizonaeAG.EntSpAG.1dgr.txt
$ perl 1dgrMerge.pl DiarizonaeAG.GenusAG.1dgr.txt DiarizonaeAG.EntSpAG.1dgr.txt Diarizonae_AG >DiarizonaeAG.merged_1.1dgr.txt
$ /Applications/Mauve.app/Contents/MacOS/progressiveMauve --output=DiarizonaeAG.vs.nAAG.xmfa --output-guide-tree=DiarizonaeAG.vs.nAAG.guide_tree --backbone-output=DiarizonaeAG.vs.nAAG.backbone Diarizonae_AG.fasta nA_AG.fasta
$ ./progBackbonePrep DiarizonaeAG.vs.nAAG.backbone >DiarizonaeAG.vs.nAAG.backbone.txt
$ ./homBlkReorder DiarizonaeAG.vs.nAAG.backbone.txt >DiarizonaeAG.vs.nAAG.homblk.txt
$ ./orthoParsing DiarizonaeAG.vs.nAAG.homblk.txt >DiarizonaeAG.vs.nAAG.orthBlk.txt
$ ./1dgrExt1 DiarizonaeAG.vs.nAAG.orthBlk.txt Diarizonae_AG nA_AG >DiarizonaeAG.nAAG.1dgr.txt
$ perl 1dgrMerge.pl DiarizonaeAG.merged_1.1dgr.txt DiarizonaeAG.nAAG.1dgr.txt Diarizonae_AG >DiarizonaeAG.merged_2.1dgr.txt
$ /Applications/Mauve.app/Contents/MacOS/progressiveMauve --output=DiarizonaeAG.vs.nB2AG.xmfa --output-guide-tree=DiarizonaeAG.vs.nB2AG.guide_tree --backbone-output=DiarizonaeAG.vs.nB2AG.backbone Diarizonae_AG.fasta nB2_AG.fasta
$ ./progBackbonePrep DiarizonaeAG.vs.nB2AG.backbone >DiarizonaeAG.vs.nB2AG.backbone.txt
$ ./homBlkReorder DiarizonaeAG.vs.nB2AG.backbone.txt >DiarizonaeAG.vs.nB2AG.homblk.txt
$ ./orthoParsing DiarizonaeAG.vs.nB2AG.homblk.txt >DiarizonaeAG.vs.nB2AG.orthBlk.txt
$ ./1dgrExt1 DiarizonaeAG.vs.nB2AG.orthBlk.txt Diarizonae_AG nB2_AG >DiarizonaeAG.nB2AG.1dgr.txt
$ perl 1dgrMerge.pl DiarizonaeAG.merged_2.1dgr.txt DiarizonaeAG.nB2AG.1dgr.txt Diarizonae_AG >DiarizonaeAG.merged.1dgr.txt
$ perl 1dgrExtF.pl DiarizonaeAG.merged.1dgr.txt 4606690 Diarizonae_AG DiarizonaeAG >Diarizonae_AG.1DGR.txt
The script 1dgrExtF.pl at the final step takes the following parameters:
$ perl 1dgrExtF.pl <1DGR_MERGED_FILE> <Full_Length_of_Target_Genome> <Target_Strain_Name> <1DGR_BLOCK_PREFIX>
The final 1DGR result of the Diarizonae_AG genome is saved in the test subfolder, in a file whose name can be chosen by the user (Diarizonae_AG.1DGR.txt in the example).
The source code, executable files and testing data are stored in the codes, bin and test subdirectories, respectively. The executable files are for Mac systems. For Linux/Win systems, the source code needs to be re-compiled. Bact1DGR is accessible here.
Subsp. | Strain | Serotype | Assembly level | Size (bp) | Genes |
II (Salamae) | RKS2986; SARC4 | 42:f,g,t:-- | Complete | 4,861,844 | 4,756 |
IIIb (Diarizonae) | RKS2978; SARC7 | 50:k:z | Complete | 5,065,792 | 5,047 |
IV (Houtenae) | RKS3027; SARC10 | 16:z4,z32:-- | Complete | 4,567,406 | 4,530 |
VI (Indica) | RKS3057; SARC14 | 11:b:e,n,x | Complete | 4,726,531 | 4,738 |
VII | RKS3013; SARC15 | 1,40:g,z51:- | Complete | 4,467,812 | 4,448 |
Here is the list of Salmonella strains used for core and pan genome analysis.
The core gene families of Salmonella are shown in the document.
The pan-genome families of Salmonella are shown in the document.
The raw read counts for Salmonella gene families are shown in the matrix, where the pan-gene accession is used to designate each family.
The reference genomes used for Salmonella 3C data analysis, with coordinates adjusted according to those of S. typhimurium 14028S, can be downloaded here.
The raw Salmonella chromosomal contact matrices with a bin size of 5 kb are attached here.