The similarity of paralogous genes is measured by the identity of their amino acid sequences. The identities were fetched from Ensembl and had been computed with CLUSTAL W, a multiple sequence alignment tool. Similarities of non-paralogous genes are set to zero.
Caution: the sequence identity is an asymmetric measure, i.e. the identity of sequence a aligned to sequence b can differ from the identity of b aligned to a.
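The asymmetry can be illustrated with a minimal sketch. It assumes that the identity is expressed as a percentage of each sequence's own length; this normalization, the function name and the numbers are illustrative assumptions and are not taken from the description above.

```python
def percent_identity(identical_positions, own_length):
    """Percentage of a sequence's residues that are identical in the alignment.

    Assumption: the identity is normalized by the length of the sequence
    it is reported for, which makes the measure direction-dependent.
    """
    if own_length <= 0:
        raise ValueError("sequence length must be positive")
    return 100.0 * identical_positions / own_length

# Hypothetical pair: 180 identical aligned positions,
# sequence a has 200 residues, sequence b has 300 residues.
identity_a = percent_identity(180, 200)
identity_b = percent_identity(180, 300)
print(identity_a, identity_b)  # the two directions give different values
```

If a symmetric value is needed downstream, the two directions have to be combined explicitly, for example by taking their minimum or mean; which choice fits is not stated here.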
Measures, for human genes, the degree of sequence identity to their orthologs in various species (mouse, rat, macaque, fruitfly, dog, guinea_pig, pig, rabbit, worm, cow).
The similarity between two genes is calculated as the correlation of two one-dimensional vectors, one per gene, each consisting of the numbers of orthologs and the maximum protein identities for every species.
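A minimal sketch of this vector comparison follows. The description does not name the correlation coefficient, so Pearson's coefficient is used here as an assumption; the vector layout (ortholog counts followed by maximum identities) and all values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant vector yields no similarity
    return cov / sqrt(var_x * var_y)

# Hypothetical vectors for four species:
# first four entries = number of orthologs, last four = max. protein identity (%)
gene_a = [1, 1, 2, 0, 98.0, 97.5, 99.1, 0.0]
gene_b = [1, 1, 1, 0, 95.0, 96.0, 98.7, 0.0]
similarity = pearson(gene_a, gene_b)
print(round(similarity, 3))
```

Because counts and percentages live on different scales in one vector, the identities dominate the correlation in this sketch; whether the original vectors were rescaled first is not stated.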
This similarity is based on Gene Ontology (GO) terms. GO terms as well as their associations to genes were fetched from Ensembl.
In the first step, the similarity between two GO terms was computed with the R implementation (Bioconductor package GOSemSim) of Resnik's information-content-based measure.
In the second step, the best-match average combination method was applied to the output of the first step in order to assign a similarity to each pair of term sets.
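The second step can be sketched as follows. This is an illustrative Python version, not the GOSemSim code itself: it averages the best match of every term in either set over all terms of both sets. The term names and pairwise scores are hypothetical stand-ins for the output of the first step.

```python
def best_match_average(term_sim, terms_a, terms_b):
    """Combine pairwise term similarities into one score for two term sets.

    For every term, take its best match in the other set, then average
    all of these best-match scores.
    """
    if not terms_a or not terms_b:
        return 0.0  # assumption: genes without annotations get similarity 0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

# Hypothetical term-term scores, standing in for Resnik similarities.
toy_scores = {
    ("term_a1", "term_b1"): 2.0,
    ("term_a1", "term_b2"): 0.5,
    ("term_a2", "term_b1"): 1.0,
    ("term_a2", "term_b2"): 3.0,
}

def toy_sim(a, b):
    return toy_scores[(a, b)]

score = best_match_average(toy_sim, ["term_a1", "term_a2"], ["term_b1", "term_b2"])
print(score)
```

A variant that averages the two per-set means instead of pooling all best matches gives the same result when both sets have equal size, but differs otherwise.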
This similarity is based on Gene Ontology (GO) terms. GO terms as well as their associations to genes were fetched from Ensembl.
In the first step, the similarity between two GO terms was computed with the R implementation (Bioconductor package GOSemSim) of Resnik's information-content-based measure.
In the second step, the best-match average combination method was applied to the output of the first step in order to assign a similarity to each pair of term sets.
This similarity is based on Gene Ontology (GO) terms. GO terms as well as their associations to genes were fetched from Ensembl.
In the first step, the similarity between two GO terms was computed with the R implementation (Bioconductor package GOSemSim) of Resnik's information-content-based measure.
In the second step, the best-match average combination method was applied to the output of the first step in order to assign a similarity to each pair of term sets.
InterPro identifiers were fetched from Ensembl. Sets of identifiers (one set for each gene) were compared with the cosine measure.
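For sets, the cosine measure reduces to the size of the intersection divided by the geometric mean of the two set sizes, which is what the sketch below computes. The function name and the identifiers are hypothetical placeholders, and returning 0 for an empty set is an assumption.

```python
from math import sqrt

def cosine_set_similarity(ids_a, ids_b):
    """Cosine similarity of two sets, treating each set as a binary vector."""
    a, b = set(ids_a), set(ids_b)
    if not a or not b:
        return 0.0  # assumption: a gene without identifiers matches nothing
    return len(a & b) / sqrt(len(a) * len(b))

# Placeholder identifiers, not real database entries.
gene_a_ids = {"domain_1", "domain_2"}
gene_b_ids = {"domain_2", "domain_3"}
print(cosine_set_similarity(gene_a_ids, gene_b_ids))  # one shared identifier out of two each
```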
A binary vector was assigned to each protein-coding gene from the following features:
These features were fetched from the UniProt/Swiss-Prot database. The similarity between two vectors was computed with the cosine measure.
Citations were fetched from the Ensembl Variation database. Ensembl itself derived the citations from dbSNP submissions and from text mining performed by EPMC and UCSC.
The cosine measure was used to compute similarities between sets of citations.
Caution: only publications on genetic variants were considered.
Citations were fetched from PubTator. The cosine measure was used to compute similarities between sets of citations.
Tissue-specific gene expression dataset from The Human Protein Atlas, based on RNA-seq data from 32 tissues.
The square (r^2) of Spearman's rank correlation coefficient was used as the similarity measure.
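A minimal standard-library sketch of this measure follows. It ranks both expression profiles (ties share their mean rank), computes the Pearson correlation of the ranks, and squares the result. The function names are hypothetical and the tie handling is an assumption, since the description does not specify it.

```python
from math import sqrt

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_squared(x, y):
    """Square of Spearman's rank correlation of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile yields no similarity
    r = cov / sqrt(var_x * var_y)
    return r * r
```

Squaring discards the sign, so two genes with perfectly opposite expression patterns across tissues receive the same similarity as two genes with identical patterns.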
Cell line-specific gene expression dataset from The Human Protein Atlas, based on RNA-seq data from 44 cell lines.
The square (r^2) of Spearman's rank correlation coefficient was used as the similarity measure.
The BrainSpan dataset consists of gene expression measurements across different brain tissues/regions, ages and individuals.
The BrainSpan dataset consists of gene expression measurements across different brain tissues/regions, ages and individuals. Only samples from individuals older than 13 years were included.
This similarity reflects the evidence dimension "Experimental/Biochemical Data" in StringDB.
The genomic distance similarity is based on GRCh37/hg19. It equals 0 if the distance is > 1 Mb or if the two genes are located on different chromosomes. Otherwise the similarity is calculated using the formula -1*((distance/1 Mb)-1), i.e. 1 - distance/1 Mb.
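The rule can be written as a short function. This is a sketch under stated assumptions: each gene is reduced to a single coordinate in base pairs, genes whose location labels differ are treated as unlinked, and the function and parameter names are hypothetical. How the distance between two genes is measured (for example between start positions or nearest ends) is not specified above.

```python
ONE_MB = 1_000_000  # window size in base pairs

def genomic_distance_similarity(location_a, pos_a, location_b, pos_b):
    """Linear decay from 1 (same position) to 0 (1 Mb apart or more)."""
    if location_a != location_b:
        return 0.0  # genes that do not share a location label are unlinked
    distance = abs(pos_a - pos_b)
    if distance > ONE_MB:
        return 0.0
    # Equivalent to -1 * ((distance / 1 Mb) - 1)
    return 1.0 - distance / ONE_MB

# Hypothetical coordinates on the same chromosome, 250 kb apart.
print(genomic_distance_similarity("1", 1_000_000, "1", 1_250_000))
```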