To test the different methods, we prepared a benchmark set of 227 OMIM phenotypes that are known to be caused by mutations in more than one gene. This allowed us to use the same diseases to test the PHENOTYPE method, the methods that use information from several mapped loci (the GO ENRICHMENT and INTERACTIONS methods), and the method that uses one or more genes already linked to the disease (the KNOWN GENES method). For each disease we picked a fixed target gene and took a region of 30 Mb around it. All methods were run on these regions. The benchmark table gives details for each test case, including the genomic location and the number of genes (Entrez Gene entries) on each band.
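As an illustration of how a 30 Mb test region could be derived from a target gene, here is a minimal sketch. It assumes the window is centred on the gene midpoint and shifted inward when it would run past a chromosome end; the text only says "around this gene", so both choices, as well as the function name `region_around`, are assumptions made for this example.

```python
def region_around(gene_start, gene_end, chrom_length, size=30_000_000):
    """Return (start, end) of a window of `size` bp around a gene.

    Assumption: the window is centred on the gene midpoint and shifted
    inward if it would extend beyond either end of the chromosome.
    """
    mid = (gene_start + gene_end) // 2
    start = mid - size // 2
    # Keep the window inside [0, chrom_length] without shrinking it,
    # unless the chromosome itself is shorter than the window.
    start = max(0, min(start, chrom_length - size))
    end = min(chrom_length, start + size)
    return start, end


def genes_in_region(genes, start, end):
    """Names of genes overlapping [start, end); genes maps name -> (start, end)."""
    return [name for name, (s, e) in genes.items() if s < end and e > start]
```

The genes returned by `genes_in_region` would then be the candidates that each method has to rank.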
The benchmark for the PHENOTYPE, GO ENRICHMENT and KNOWN GENES methods was performed as follows:

PHENOTYPE method: Given an OMIM phenotype, we first extracted the MeSH C (disease category) terms from the bibliography linked in the entry. We then computed a GO scoring system according to our method. We ranked the RefSeq genes by their GO annotations according to the scoring system and compared them by BLASTX with the 30 Mb of genome sequence around the target gene. We sorted the hits according to the rank of the RefSeq genes homologous to them and recorded the position at which the target gene was predicted.
GO ENRICHMENT method: For each disease, we pooled all the GO annotations from the genes in the 30 Mb target region together with the GO annotations from all the genes (Entrez Gene entries) in the 30 Mb regions around the other genes linked to the disease (see the benchmark table). The idea is to detect an enrichment of some GO terms beyond what is expected by chance, which would indicate a group of genes performing the same or similar functions. Because all the regions are linked to the disease, the assumption is that the enriched GO terms describe the features of the responsible genes. This idea was first exploited by Turner et al. in the POCUS method. Here, we used the GO enrichment to score the genes in the target regions. We tried this method using either all GO terms or the GO hierarchy. The all-terms approach worked better, and the corresponding results are reported here.
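The enrichment idea can be sketched as follows. The text does not state which statistic was used, so this example assumes a one-sided hypergeometric test per GO term and scores each gene by summing -log10(p) over its terms; the function names and the scoring rule are hypothetical, and the pooled genes are assumed to be a subset of the background genes.

```python
from collections import Counter
from math import comb, log10


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def score_genes(pooled, background):
    """Score pooled genes by the enrichment of their GO terms.

    pooled, background: dict mapping gene -> set of GO terms.
    Assumption: every pooled gene also appears in background.
    """
    N, n = len(background), len(pooled)
    bg_counts = Counter(t for terms in background.values() for t in terms)
    pool_counts = Counter(t for terms in pooled.values() for t in terms)
    pvals = {t: hypergeom_tail(k, n, bg_counts[t], N) for t, k in pool_counts.items()}
    # Hypothetical scoring rule: sum of -log10(p) over the gene's terms.
    return {g: sum(-log10(pvals[t]) for t in terms) for g, terms in pooled.items()}
```

Genes whose terms recur across the linked regions more often than the background frequency predicts receive higher scores under this rule.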
KNOWN GENES method: For each disease, we pooled the GO terms annotating all the genes that produce the phenotype except the target gene. We scored the genes in the target region by the semantic similarity of their annotations to the pooled GO terms. Semantic similarity was used by Adie et al. in SUSPECTS. To measure semantic similarity we used the Resnik distance.
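For readers unfamiliar with Resnik's measure, the sketch below shows the usual definition on a toy ontology: the similarity of two terms is the information content (IC) of their most informative common ancestor, with IC(t) = -ln p(t). The ontology layout, the annotation counts, and the helper names are assumptions for illustration; how term-level similarities were aggregated into a gene score is not specified in the text and is not shown here.

```python
from math import log


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def information_content(term_counts, parents):
    """IC(t) = -ln p(t); p(t) counts annotations to t or to any descendant."""
    cumulative = {}
    for term, count in term_counts.items():
        for anc in ancestors(term, parents):
            cumulative[anc] = cumulative.get(anc, 0) + count
    total = sum(term_counts.values())
    return {t: -log(c / total) for t, c in cumulative.items() if c > 0}


def resnik(t1, t2, ic, parents):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)
```

Two terms that share only the root get a similarity of zero, while terms that share a specific, rarely used ancestor get a high value.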
INTERACTIONS method: To test the INTERACTIONS method we used, for each disease, the target band and one additional band. Only 130 phenotypes from the set of 227 had at least one interaction, in our prefiltered set of human STRING interactions, between genes in the target region and genes in the rest of the linked bands. The filtering consisted of removing interactions that were predicted only from literature mining, in order to discard less reliable interactions. For the remaining 130 diseases, we considered all the interactions between the genes in the 30 Mb target region and the genes in a second 30 Mb region around another of the linked genes. The genes in the target band that took part in any interaction were sorted by the STRING score associated with the interaction.
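The filtering and ranking steps can be sketched as below. The record layout `(gene_a, gene_b, score, channels)`, the channel label `"textmining"`, and the choice to rank each gene by its best-scoring interaction are assumptions made for this example, not a description of the actual pipeline.

```python
def rank_by_interactions(interactions, target_genes, partner_genes):
    """Rank target-region genes by their best cross-region interaction score.

    interactions: iterable of (gene_a, gene_b, score, channels), where
    channels is a set of evidence labels (hypothetical record layout).
    """
    best = {}
    for a, b, score, channels in interactions:
        # Discard interactions supported only by literature mining
        # (this also drops records with no evidence channel at all).
        if set(channels) <= {"textmining"}:
            continue
        # Interactions are undirected, so check both orientations.
        for x, y in ((a, b), (b, a)):
            if x in target_genes and y in partner_genes:
                best[x] = max(best.get(x, 0), score)
    return sorted(best, key=best.get, reverse=True)
```

A disease contributes to the benchmark only if the returned list is non-empty, which mirrors the reduction from 227 to 130 phenotypes described above.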
Supplementary material for past versions