AEROPATH Target Database

Pseudomonas Genome data

The reference version of the Pseudomonas genome used by the Aeropath Target Database is the PAO1 strain, as provided by the Pseudomonas Genome Database. As well as the DNA and protein sequences, this resource makes available data on synonyms, alternative identifiers, genomic coordinates, relevant PubMed records, subcellular location, Gene Ontology (GO) terms and family annotations from TIGRFAMs, COG and CDD, all of which have been incorporated into the Aeropath Target Database. The Pseudomonas Genome Database provides regular data updates, which are also incorporated.


Our approach to assessing druggability depends crucially on the availability of large-scale chemogenomic resources. The EBI’s ChEMBL database comprises published bioactivity data abstracted from almost 45,000 papers in the medicinal chemistry literature, covering a period of 30 years. We maximize the available information by updating our predictions to keep pace with the latest releases of ChEMBL.

ChEMBL contains a broad range of assay endpoints, but for this analysis we are concerned primarily with assays describing binding affinity, or its surrogates. As such, only assays whose endpoint was 'Ki', 'Kd' or 'IC50' were selected. As we are most concerned with compounds that bind potently, we also filter out compounds whose binding affinity is weaker than 10 μM (i.e. a measured value >10 μM). However, as this carries the risk of excluding small but highly efficient binders, we “rescue” compounds whose binding affinity is >10 μM but whose Ligand Efficiency (free energy of binding per heavy atom) is greater than 0.3. The Quantitative Estimate of Druglikeness (QED) is used to score all compounds in ChEMBL for their oral druglikeness. The QED scores of the compounds associated with each protein target are then aggregated by taking their mean, giving a target-level score.
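The filtering and aggregation steps described above can be sketched roughly as follows. This is an illustrative sketch only, not the database's actual pipeline: the record layout, the function names, the temperature of 298.15 K and the choice of kcal/mol per heavy atom as the Ligand Efficiency unit are all assumptions, as is applying the potency filter before the QED scores are averaged.

```python
import math
from collections import defaultdict
from statistics import mean

R_KCAL = 0.0019872   # gas constant in kcal/(mol*K)
TEMP_K = 298.15      # assumed temperature; not stated in the text

AFFINITY_ENDPOINTS = {"Ki", "Kd", "IC50"}
POTENCY_CUTOFF_NM = 10_000.0   # 10 uM expressed in nM
LE_CUTOFF = 0.3                # assumed to be kcal/mol per heavy atom


def ligand_efficiency(affinity_nm, heavy_atoms):
    """Free energy of binding per heavy atom (positive means favourable)."""
    affinity_molar = affinity_nm * 1e-9
    delta_g = R_KCAL * TEMP_K * math.log(affinity_molar)  # negative below 1 M
    return -delta_g / heavy_atoms


def keep_activity(record):
    """Keep affinity-type endpoints that are potent, or weak but efficient."""
    if record["endpoint"] not in AFFINITY_ENDPOINTS:
        return False
    if record["value_nm"] <= POTENCY_CUTOFF_NM:
        return True
    # "Rescue" step: weaker than 10 uM, but efficient for its size.
    return ligand_efficiency(record["value_nm"], record["heavy_atoms"]) > LE_CUTOFF


def target_scores(records):
    """Mean QED of the retained compounds, per target."""
    qed_by_target = defaultdict(dict)
    for rec in filter(keep_activity, records):
        # Keying on the compound counts each compound once per target.
        qed_by_target[rec["target"]][rec["compound"]] = rec["qed"]
    return {t: mean(q.values()) for t, q in qed_by_target.items()}
```

Under these assumptions, a 50 μM compound with 10 heavy atoms has a Ligand Efficiency of roughly 0.59 and is rescued, whereas a 50 μM compound with 30 heavy atoms (roughly 0.20) is discarded.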

To assess the likely druggability of proteins in the P. aeruginosa proteome we identify sequence homologs in ChEMBL using BLAST+. The ideal P. aeruginosa target would have high similarity to a ChEMBL target that has a large number of highly desirable compounds. The targets are therefore ranked by performing a Pareto optimization of mean ChEMBL target desirability against % identity between the P. aeruginosa and ChEMBL proteins, followed by a secondary ranking by the number of compounds. This approach has the advantage that, as well as providing a quantitative measure of druggability, each prediction also identifies small-molecule ligands that can act as potential lead compounds, assay controls or crystallization agents, and points to information on synthetic routes and assay protocols.
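One way to read the ranking step above is as repeated extraction of Pareto fronts over the two objectives, with the compound count breaking ties inside each front. The sketch below follows that reading. The `Hit` record and the function names are hypothetical, and the source does not say whether the optimization is implemented as successive fronts, so treat this as one plausible interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    pa_protein: str       # P. aeruginosa protein identifier
    chembl_target: str    # matched ChEMBL target
    desirability: float   # mean QED of the ChEMBL target's compounds
    identity: float       # % identity from the BLAST+ alignment
    n_compounds: int      # number of retained compounds


def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return (
        a.desirability >= b.desirability
        and a.identity >= b.identity
        and (a.desirability > b.desirability or a.identity > b.identity)
    )


def pareto_rank(hits):
    """Return (front_number, hit) pairs; front 1 holds the non-dominated hits."""
    remaining = list(hits)
    ranked = []
    front_no = 1
    while remaining:
        front = [h for h in remaining
                 if not any(dominates(o, h) for o in remaining)]
        # Secondary ranking: more compounds first within a front.
        front.sort(key=lambda h: h.n_compounds, reverse=True)
        ranked.extend((front_no, h) for h in front)
        in_front = set(front)
        remaining = [h for h in remaining if h not in in_front]
        front_no += 1
    return ranked
```

A hit with high desirability but modest identity and a hit with high identity but modest desirability land in the same front, since neither dominates the other. The compound count then decides their order.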


We define “perturbance” as the property of a molecular target whereby modulation of that target will have an impact on the system at the cellular level. In this context we describe perturbance by flagging those proteins that are either i) shown to be essential by means of a gene-knockout experiment or ii) a virulence factor. For P. aeruginosa we are fortunate to have two sets of genome-scale transposon mutagenesis data, published by Liberati et al. and Jacobs et al. The Liberati et al. mutagenesis study was performed on the PAO1 strain of P. aeruginosa and identified 364 genes as being essential, including 30 “potentially essential” genes. The Jacobs et al. study was performed on the PA14 strain and suggests that 773 genes are essential (13.9% of the genome), including 97 “potentially essential” genes. Note that the Jacobs set subsumes the Liberati set.

Information on virulence factors was taken from the Virulence Factor Database (VFDB).
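The flagging rule above amounts to a set-membership test. A minimal sketch follows, assuming the essential-gene sets and the virulence-factor list have already been mapped to a common identifier space. The function name, the dictionary layout and the identifiers in the test data are hypothetical.

```python
def perturbance_flags(proteins, essential_sets, virulence_factors):
    """Flag each protein as essential (per source set) and/or a virulence factor.

    essential_sets maps a label for each mutagenesis study to a set of
    protein identifiers; virulence_factors is a set of identifiers.
    """
    flags = {}
    for protein in proteins:
        essential_in = sorted(
            label for label, members in essential_sets.items()
            if protein in members
        )
        is_virulence = protein in virulence_factors
        flags[protein] = {
            "essential_in": essential_in,
            "virulence_factor": is_virulence,
            # Perturbant if essential in any study OR a virulence factor.
            "perturbant": bool(essential_in) or is_virulence,
        }
    return flags
```

Recording which study supports each essentiality call, rather than a single boolean, keeps the provenance visible when the two mutagenesis sets disagree.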


Selectivity is established by performing homology and orthology searches with each P. aeruginosa protein against the human proteome. The version of the human proteome used is taken from UniProt and comprises canonical sequences for each protein (i.e. no isoforms). Homology was established by means of standard sequence similarity searches using the NCBI C++ toolkit BLAST+ with an E-value threshold of 10^-4. Orthology was determined as the subset of homology relationships that were predicted to occur in the same orthologous group by the OrthoMCL algorithm.
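As a rough illustration of the two-stage definition above, the sketch below filters BLAST+ tabular output (the default 12-column `-outfmt 6` layout) at the E-value threshold and then keeps only pairs that share an orthologous group. The sequence identifiers, the group labels and the `group_of` mapping are hypothetical. Treating the threshold as inclusive (`<=`) is an assumption, and so is representing the OrthoMCL result as a simple protein-to-group dictionary.

```python
import csv
import io

EVALUE_CUTOFF = 1e-4

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_homologs(handle, cutoff=EVALUE_CUTOFF):
    """Return (query, subject) pairs whose E-value passes the cutoff."""
    pairs = set()
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) <= cutoff:
            pairs.add((row["qseqid"], row["sseqid"]))
    return pairs


def orthologs(homolog_pairs, group_of):
    """Subset of homolog pairs whose members share an orthologous group."""
    return {
        (q, s) for q, s in homolog_pairs
        if q in group_of and s in group_of and group_of[q] == group_of[s]
    }


# Made-up example rows: two pass the cutoff, one does not.
SAMPLE = "\n".join([
    "PA_x\tHUMAN_1\t45.0\t200\t100\t2\t1\t200\t5\t204\t1e-50\t180",
    "PA_x\tHUMAN_2\t25.0\t80\t60\t1\t1\t80\t3\t82\t0.5\t30",
    "PA_y\tHUMAN_3\t38.0\t150\t90\t3\t1\t150\t1\t150\t2e-10\t70",
])
GROUPS = {"PA_x": "OG1", "HUMAN_1": "OG1", "PA_y": "OG2", "HUMAN_3": "OG9"}

homologs = read_homologs(io.StringIO(SAMPLE))
ortholog_pairs = orthologs(homologs, GROUPS)
```

Because orthologs are computed as a subset of the homolog pairs, every ortholog is also a homolog by construction, which matches the definition in the text.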


Four bacterial species were identified as being of interest with respect to establishing the likely spectrum of activity of any new lead compounds against the target of interest. The species were selected on the basis that they were i) Gram-negative, ii) pathogenic and iii) had genome sequences that were available and fully annotated. The four selected species were:

Homology and orthology were determined by the same means used for predicting selectivity (i.e. BLAST+ and OrthoMCL respectively).

Structural Biology

The EBI’s SIFTS resource comprehensively maps the sequences found in solved structures deposited in the Protein Data Bank (PDB) to their corresponding UniProt record at the level of individual residues. These data were used to identify P. aeruginosa proteins whose structures have been solved (note that many of these structures will not be full length). Sequence similarity searching, again using BLAST+, was performed to identify protein homologs of known structure.

In addition to identifying structural homologs, we used a published prediction method, XtalPred, to assess the likely crystallizability of each P. aeruginosa protein. The algorithm calculates nine biochemical and biophysical features and compares each one to a background distribution to give a probability. The individual probabilities are combined into a single crystallization score that is used to assign each protein to one of five crystallization classes: optimal, suboptimal, average, difficult and very difficult. These data are pre-calculated for several microbial genomes, and the XtalPred developers kindly provided the data for us to incorporate into the Aeropath Target Database. In addition to the protein-level predictions, a range of secondary programs is run to provide residue-level annotations, which are also incorporated into the database: PSIPRED provides secondary structure prediction; DISOPRED2 provides prediction of structurally disordered regions; COILS predicts coiled-coil regions; TMHMM predicts transmembrane helices; and SEG calculates low-complexity regions.