Supplementary information for G2D

To test the different methods, we prepared a benchmark set of 227 OMIM phenotypes that are known to be produced by mutations in more than one gene. This allowed us to use the same diseases to test the PHENOTYPE method, the methods that use information from several mapped loci (the GO ENRICHMENT and INTERACTIONS methods), and the method that uses another gene or genes already linked to the disease (the KNOWN GENES method).

To test the methods, we picked a fixed target gene for each disease and took a region of 30 Mb around it. All the methods were run on these regions. The benchmark table gives details for each test case, including the genomic location and the number of genes (Entrez Gene entries) on each band:

  • The benchmark table of 227 diseases that correspond to OMIM entries currently linked to at least two different genes.
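The construction of a test region can be sketched as follows. This is a minimal illustration, not the code used for the benchmark: it assumes the 30 Mb window is centred on the midpoint of the target gene, and the function names, the tuple layout of `genes`, and the optional `chrom_length` clamp are all hypothetical.

```python
def region_around(gene_start, gene_end, size=30_000_000, chrom_length=None):
    """Return (start, end) of a window of `size` bp centred on the gene midpoint.

    Assumption: the window is symmetric around the midpoint and is clipped
    at the chromosome ends.
    """
    mid = (gene_start + gene_end) // 2
    half = size // 2
    start = max(1, mid - half)
    end = mid + half
    if chrom_length is not None:
        end = min(end, chrom_length)
    return start, end


def genes_in_region(genes, chrom, start, end):
    """Names of genes that overlap the window.

    genes: iterable of (name, chromosome, gene_start, gene_end) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [name for name, c, s, e in genes
            if c == chrom and s <= end and e >= start]


genes = [
    ("TARGET", "1", 50_000_000, 50_100_000),
    ("NEAR", "1", 60_000_000, 60_050_000),
    ("FAR", "1", 90_000_000, 90_020_000),
    ("OTHER_CHROM", "2", 50_000_000, 50_100_000),
]
start, end = region_around(50_000_000, 50_100_000)
candidates = genes_in_region(genes, "1", start, end)
```

Counting the entries returned by `genes_in_region` gives the kind of per-band gene count listed in the benchmark table.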

The benchmark for PHENOTYPE, GO ENRICHMENT and KNOWN GENES methods was performed as follows:

    PHENOTYPE method

    Given an OMIM phenotype, we first extracted the MeSH C terms from the bibliography linked in the entry. We then computed a GO scoring system according to our method. We ranked the RefSeq genes by their GO annotations according to this scoring system and compared them by BLASTX with the 30 Mb of genome sequence around the target gene. We sorted the hits according to the rank of the RefSeq genes homologous to them, and checked the position at which the target gene was predicted.

    GO ENRICHMENT method

    For each disease, we pooled together all the GO annotations from the genes in the 30 Mb target region plus the GO annotations from all the genes (Entrez Gene entries) in regions of 30 Mb around the rest of the genes also linked to the disease (see the benchmark table). The idea is to look for an enrichment of some GO terms beyond what is expected by chance, which would indicate a group of genes performing the same or similar functions. Because all the regions are linked to the disease, the assumption is that the enriched GO terms describe the features of the responsible genes. This idea was first exploited by Turner et al. in the POCUS method. Here, we used the GO enrichment to score the genes in the target regions. We tried this method using either all GO terms or the GO hierarchy. The all-terms approach worked better, and the corresponding results are reported here.
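    One way to picture the enrichment scoring is the sketch below. The text does not state which statistic was used, so the hypergeometric upper-tail test, the `alpha` cutoff, and the gene score (sum of -log10 p over a gene's enriched terms) are assumptions made only for illustration. All function names are hypothetical, and the pooled genes are assumed to be a subset of the background.

```python
from math import comb, log10


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n annotated, N drawn)."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)


def enriched_terms(pooled, background, alpha=0.01):
    """Map each GO term over-represented in the pooled regions to its p-value.

    pooled, background: dict gene -> set of GO terms.
    """
    M, N = len(background), len(pooled)
    result = {}
    for term in set().union(*pooled.values()):
        n = sum(term in terms for terms in background.values())
        k = sum(term in terms for terms in pooled.values())
        p = hypergeom_sf(k, M, n, N)
        if p < alpha:
            result[term] = p
    return result


def score_genes(target_genes, enriched):
    """Rank target-region genes by the summed -log10(p) of their enriched terms."""
    scores = {
        gene: sum(-log10(enriched[t]) for t in terms if t in enriched)
        for gene, terms in target_genes.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: GO:A is shared by three of the four pooled genes, GO:B is on every gene.
background = {f"g{i}": {"GO:B"} for i in range(20)}
for g in ("g0", "g1", "g2"):
    background[g] = {"GO:A", "GO:B"}
pooled = {g: background[g] for g in ("g0", "g1", "g2", "g3")}
enriched = enriched_terms(pooled, background)
ranking = score_genes({"g0": background["g0"], "g3": background["g3"]}, enriched)
```

    In this toy example only `GO:A` passes the cutoff, so `g0` ranks above `g3`, whose single term is carried by every gene in the background.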

    KNOWN GENES method

    For each disease, we pooled together the GO terms annotating all the genes that produce the phenotype except the target gene. We scored the genes in the target region by the semantic similarity of their annotations to the pooled GO terms. Semantic similarity was previously used by Adie et al. in SUSPECTS. To measure semantic similarity, we used the Resnik measure.
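    The Resnik measure scores two terms by the information content (IC) of their most informative common ancestor. The sketch below applies it to a toy ontology. The term names, the annotation corpus, and the choice to score a gene by its best-matching term pair are assumptions made for illustration; the text does not say how term-level similarities were combined into a gene score.

```python
from math import log


def ancestors(term, parents):
    """Return the term together with all its ancestors (child -> parents DAG)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = -ln(fraction of genes annotated to t or to one of its descendants)."""
    counts = {}
    for terms in annotations.values():
        covered = set()
        for t in terms:
            covered |= ancestors(t, parents)
        for t in covered:
            counts[t] = counts.get(t, 0) + 1
    total = len(annotations)
    return {t: -log(c / total) for t, c in counts.items()}


def resnik(t1, t2, parents, ic):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


def gene_score(gene_terms, pooled_terms, parents, ic):
    """Best pairwise similarity between a gene's terms and the pooled terms
    (an assumed aggregation, chosen for simplicity)."""
    return max((resnik(a, b, parents, ic)
                for a in gene_terms for b in pooled_terms), default=0.0)


# Toy ontology: root -> B -> {D, E}, root -> C
parents = {"B": ["root"], "C": ["root"], "D": ["B"], "E": ["B"]}
annotations = {"g1": {"D"}, "g2": {"E"}, "g3": {"C"}, "g4": {"C"}}
ic = information_content(annotations, parents)
close = gene_score({"D"}, {"E"}, parents, ic)    # share the ancestor B
distant = gene_score({"D"}, {"C"}, parents, ic)  # share only the root
```

    Terms that meet only at the root get a score of zero, because the root annotates every gene and therefore carries no information.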

The PHENOTYPE and KNOWN GENES methods performed better than GO ENRICHMENT (see table 1). The table indicates how many times the responsible gene was predicted as the top candidate (1st), among the best 10 candidates (<10th), among the best 30 candidates (<30th), and the average top percentage (average). The performance of PHENOTYPE and KNOWN GENES was very similar: KNOWN GENES predicted the responsible gene as the top candidate more often, but PHENOTYPE performed slightly better on average. For details on each disease, see the full table below.

table 1

  • Full table Performance of PHENOTYPE, GO ENRICHMENT, and KNOWN GENES methods in the benchmark of 227 diseases.
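The summary columns of table 1 can be derived from per-disease ranks roughly as sketched below. Two points are assumptions: "among the best 10 (30) candidates" is read as rank <= 10 (30), and the "top percentage" of a case is read as 100 * rank / number of candidates. The function name and the input layout are hypothetical.

```python
def summarize(results):
    """Summarize benchmark ranks.

    results: list of (rank, n_candidates) pairs, one per disease, where rank
    is the 1-based position of the responsible gene among n_candidates.
    """
    ranks = [rank for rank, _ in results]
    percentages = [100 * rank / n for rank, n in results]
    return {
        "first": sum(r == 1 for r in ranks),
        "top10": sum(r <= 10 for r in ranks),
        "top30": sum(r <= 30 for r in ranks),
        "average_top_percentage": sum(percentages) / len(percentages),
    }


summary = summarize([(1, 100), (5, 50), (40, 200)])
```

With these three toy cases, one gene is ranked first, two fall within the best 10 and the best 30, and the average top percentage is the mean of 1%, 10% and 20%.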

      INTERACTIONS method

      To test the INTERACTIONS method we used, for each disease, the target band and one additional band. Only 130 phenotypes from the set of 227 had at least one interaction in our prefiltered set of human STRING interactions between genes in the target regions and genes in the rest of the linked bands. The filtering consisted of removing interactions that were predicted only from literature mining, in order to discard less reliable interactions. For the remaining 130 diseases, we considered all the interactions between the genes in the 30 Mb target region and the genes in a second target region of 30 Mb around another of the linked genes. The genes in the target band that had any such interaction were sorted by the STRING score associated with the interaction.
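      The filtering and ranking steps can be sketched as follows. The record layout (`a`, `b`, `score`, `evidence`) and the channel names are hypothetical stand-ins for whatever the STRING export provides, and keeping only the best-scoring interaction per target gene is an assumption made for this sketch.

```python
def rank_by_interaction(interactions, target_genes, second_genes):
    """Rank target-region genes by the STRING score of their interactions
    with genes in the second linked region.

    interactions: list of dicts with keys
        'a', 'b'    interacting genes
        'score'     combined STRING score (maximum 999)
        'evidence'  dict mapping evidence channel -> channel score
    """
    best = {}
    for it in interactions:
        supported = {ch for ch, s in it["evidence"].items() if s > 0}
        if supported <= {"textmining"}:
            continue  # supported by literature mining only (or by nothing)
        # Interactions are undirected, so check both orientations.
        for x, y in ((it["a"], it["b"]), (it["b"], it["a"])):
            if x in target_genes and y in second_genes:
                best[x] = max(best.get(x, 0), it["score"])
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


interactions = [
    {"a": "G1", "b": "H1", "score": 950,
     "evidence": {"experimental": 900, "textmining": 300}},
    {"a": "G2", "b": "H2", "score": 700,
     "evidence": {"textmining": 700}},   # dropped: text mining only
    {"a": "H1", "b": "G3", "score": 500,
     "evidence": {"database": 500}},     # kept: reversed orientation
    {"a": "G1", "b": "X9", "score": 999,
     "evidence": {"experimental": 999}}, # ignored: X9 is not in the second region
]
ranking = rank_by_interaction(interactions, {"G1", "G2", "G3"}, {"H1", "H2"})
```

      The position of the responsible gene in a list such as `ranking` corresponds to what the benchmark reports for this method.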

    In about one in four of the 130 cases, the responsible gene in the target band interacted with a gene in the second locus. This means that, currently, this method can work in fewer than one in four cases. However, in the cases where the method could be applied, we observed a very strong association between having a very high STRING score and being the responsible gene. When ordering the interactions by STRING score, the responsible gene was among the top 3 candidates around 46% of the time and among the top 10 candidates 70% of the time. This means that the method has very low recall but high precision when STRING scores are very high (see figure 1, table 2).

    figure 1

  • figure 1, table 2 Performance of the INTERACTIONS method in the subset of the benchmark of 227 diseases where it could be applied. Columns: gene 1 is the gene responsible for the disease (OMIM id); gene 2 is a gene in a second locus that has also been associated with the disease and interacts with gene 1 according to STRING; STRING score is a score representing how reliable the interaction between gene 1 and gene 2 is, with a maximum of 999; band sizes is the number of genes in the two loci used for the benchmark (30 Mb, 15 Mb around the responsible gene); position gene 1 is the position of the responsible pair of genes in the list of candidates produced by this method, ordered by STRING score.


    Supplementary material for past versions

  • Supplementary material corresponding to version v 2

  • Supplementary material corresponding to version v 1