kazu.utils.download_gilda_contexts

This script performs the following steps (a sketch of how they might be chained together follows the list):

  1. Query Ensembl BioMart to obtain gene-to-protein ID mappings

  2. Query the Wikidata SPARQL endpoint to map Wikidata IDs to Ensembl gene and Ensembl protein IDs

  3. Query the Wikidata API to get Wikipedia page URLs for the Wikidata IDs from step 2

  4. Query the Wikipedia API to get the page content for each page from step 3

  5. Join the Wikipedia page content to Ensembl gene IDs via the above relationships
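
Putting the steps together: a minimal, hedged sketch of how the documented functions might be chained. It assumes the two SPARQL result frames feed create_wiki_mappings as gene_df and protein_df; the SPARQL queries themselves are illustrative placeholders (P594 and P705 are Wikidata's Ensembl gene ID and Ensembl protein ID properties), not the module's own.

   from kazu.utils.download_gilda_contexts import (
       create_wiki_mappings,
       get_biomart_gene_to_protein,
       get_sparql_df,
       get_wikipedia_contents_from_urls,
       get_wikipedia_url_from_wikidata_id,
   )

   proxies: dict[str, str] = {}  # e.g. {"https": "http://proxy.example:8080"}

   # Step 1: Ensembl gene -> protein ID mappings via BioMart.
   gene_to_protein = get_biomart_gene_to_protein(proxies)

   # Step 2: Wikidata IDs for Ensembl genes and proteins (placeholder queries).
   df_genes = get_sparql_df(
       "SELECT ?item ?id WHERE { ?item wdt:P594 ?id . }", proxies
   )
   df_proteins = get_sparql_df(
       "SELECT ?item ?id WHERE { ?item wdt:P705 ?id . }", proxies
   )

   # Step 3: Wikipedia page URLs for those Wikidata IDs.
   wikidata_to_urls = get_wikipedia_url_from_wikidata_id(df_genes, df_proteins, proxies)

   # Step 4: page content for every URL.
   all_urls = [url for urls in wikidata_to_urls.values() for url in urls]
   wikipage_to_text = get_wikipedia_contents_from_urls(all_urls, proxies)

   # Step 5: join everything into WikipediaEnsemblMapping objects.
   mappings = create_wiki_mappings(
       df_genes, df_proteins, gene_to_protein, wikidata_to_urls, wikipage_to_text
   )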

Functions

create_wiki_mappings(gene_df, protein_df, ...)

divide_chunks(items)

extract_open_targets(path, proxies)

get_biomart_gene_to_protein(proxies)

get_retry()

get_sparql_df(query, proxies)

get_wikipedia_contents_from_urls(urls, proxies)

get_wikipedia_url_from_wikidata_id(df_genes, ...)

retry_wiki_with_maxlag(url, params, proxies)

Classes

WikipediaEnsemblMapping

WikipediaEnsemblMapping(ensembl_gene_id: str, ensembl_protein_ids: set[str] = <factory>, wiki_gene_ids: set[str] = <factory>, wiki_protein_ids: set[str] = <factory>, wiki_gene_urls_to_text: dict[str, typing.Optional[str]] = <factory>, wiki_protein_urls_to_text: dict[str, typing.Optional[str]] = <factory>)

class kazu.utils.download_gilda_contexts.WikipediaEnsemblMapping[source]

Bases: object

Dataclass linking a single Ensembl gene ID to its associated Ensembl protein IDs, Wikidata IDs, and Wikipedia page texts.

__init__(ensembl_gene_id, ensembl_protein_ids=<factory>, wiki_gene_ids=<factory>, wiki_protein_ids=<factory>, wiki_gene_urls_to_text=<factory>, wiki_protein_urls_to_text=<factory>)[source]

Parameters:
   ensembl_gene_id (str)
   ensembl_protein_ids (set[str])
   wiki_gene_ids (set[str])
   wiki_protein_ids (set[str])
   wiki_gene_urls_to_text (dict[str, str | None])
   wiki_protein_urls_to_text (dict[str, str | None])

Return type:
   None

get_context()[source]

Return type:
   str | None

ensembl_gene_id: str
ensembl_protein_ids: set[str]
wiki_gene_ids: set[str]
wiki_gene_urls_to_text: dict[str, str | None]
wiki_protein_ids: set[str]
wiki_protein_urls_to_text: dict[str, str | None]
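
A minimal usage sketch, assuming nothing beyond the dataclass signature above; the Ensembl IDs and page text are illustrative values:

   from kazu.utils.download_gilda_contexts import WikipediaEnsemblMapping

   # Only ensembl_gene_id is required; the other fields default to empty
   # containers via dataclass factories.
   mapping = WikipediaEnsemblMapping(ensembl_gene_id="ENSG00000157764")
   mapping.ensembl_protein_ids.add("ENSP00000288602")
   mapping.wiki_gene_urls_to_text["https://en.wikipedia.org/wiki/BRAF_(gene)"] = (
       "BRAF is a human gene that encodes a protein called B-Raf..."
   )

   context = mapping.get_context()  # str, or None if no usable page text
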
kazu.utils.download_gilda_contexts.create_wiki_mappings(gene_df, protein_df, ensembl_gene_to_protein_mappings, wikidata_id_to_wikipedia_urls, wikipage_to_text)[source]

Return type:
   set[WikipediaEnsemblMapping]

kazu.utils.download_gilda_contexts.divide_chunks(items)[source]

Parameters:
   items (list[str])

Return type:
   Iterable[list[str]]
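
A hedged usage sketch: the chunk size is fixed internally by the module (presumably to batch API requests), so the batches below are whatever size divide_chunks chooses:

   from kazu.utils.download_gilda_contexts import divide_chunks

   urls = [f"https://en.wikipedia.org/wiki/Gene_{i}" for i in range(120)]
   for batch in divide_chunks(urls):
       # e.g. one Wikipedia API request per batch
       print(len(batch))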

kazu.utils.download_gilda_contexts.extract_open_targets(path, proxies)[source]

Return type:
   None

kazu.utils.download_gilda_contexts.get_biomart_gene_to_protein(proxies)[source]

Parameters:
   proxies (dict[str, str])

Return type:
   DataFrame
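
For context, a hedged sketch of the kind of request a BioMart gene-to-protein lookup involves. The martservice endpoint, XML query format, and attribute names are the public Ensembl BioMart REST API; the module's actual query is not reproduced here:

   import io

   import pandas as pd
   import requests

   BIOMART_URL = "https://www.ensembl.org/biomart/martservice"
   QUERY = """<?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE Query>
   <Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">
     <Dataset name="hsapiens_gene_ensembl">
       <Attribute name="ensembl_gene_id"/>
       <Attribute name="ensembl_peptide_id"/>
     </Dataset>
   </Query>"""

   response = requests.get(BIOMART_URL, params={"query": QUERY}, proxies={})
   # BioMart returns TSV: one gene/protein pair per row.
   df = pd.read_csv(
       io.StringIO(response.text),
       sep="\t",
       names=["ensembl_gene_id", "ensembl_peptide_id"],
   )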

kazu.utils.download_gilda_contexts.get_retry()[source]

Return type:
   Retry
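
A hedged sketch of the kind of Retry object such a helper typically builds, using urllib3's real Retry API with illustrative values; mounting it on a requests session makes every call in the pipeline retry transient failures:

   import requests
   from requests.adapters import HTTPAdapter
   from urllib3.util.retry import Retry

   retry = Retry(
       total=5,                      # up to five attempts per request
       backoff_factor=1.0,           # exponential backoff between attempts
       status_forcelist=(429, 500, 502, 503, 504),
   )
   session = requests.Session()
   session.mount("https://", HTTPAdapter(max_retries=retry))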

kazu.utils.download_gilda_contexts.get_sparql_df(query, proxies)[source]

Return type:
   DataFrame
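
A hedged sketch of querying the Wikidata Query Service directly and flattening the JSON bindings into a DataFrame, which is the shape of work such a function performs; the endpoint URL and response layout are the real service, the query and User-Agent are illustrative:

   import pandas as pd
   import requests

   ENDPOINT = "https://query.wikidata.org/sparql"
   query = "SELECT ?item ?id WHERE { ?item wdt:P594 ?id . } LIMIT 5"
   resp = requests.get(
       ENDPOINT,
       params={"query": query, "format": "json"},
       headers={"User-Agent": "kazu-docs-example/0.1"},  # WDQS expects a UA
   )
   rows = [
       {var: cell["value"] for var, cell in binding.items()}
       for binding in resp.json()["results"]["bindings"]
   ]
   df = pd.DataFrame(rows)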

kazu.utils.download_gilda_contexts.get_wikipedia_contents_from_urls(urls, proxies)[source]

Return type:
   dict[str, str]
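
A hedged sketch of one way to fetch plain-text page content from the Wikipedia API; action=query with prop=extracts (the TextExtracts extension) is a real endpoint, but the module's exact request parameters are not shown here:

   import requests

   API = "https://en.wikipedia.org/w/api.php"
   params = {
       "action": "query",
       "prop": "extracts",
       "explaintext": "1",   # plain text rather than HTML
       "titles": "BRAF (gene)",
       "format": "json",
   }
   pages = requests.get(API, params=params, proxies={}).json()["query"]["pages"]
   text = next(iter(pages.values()))["extract"]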

kazu.utils.download_gilda_contexts.get_wikipedia_url_from_wikidata_id(df_genes, df_proteins, proxies)[source]

Return type:
   defaultdict[str, set[str]]
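
A hedged sketch of resolving a Wikidata ID to its English Wikipedia URL; wbgetentities with props=sitelinks/urls is the real Wikidata API mechanism for this, while the item ID below is just a placeholder:

   import requests

   WIKIDATA_API = "https://www.wikidata.org/w/api.php"
   wikidata_id = "Q42"  # placeholder item ID
   params = {
       "action": "wbgetentities",
       "ids": wikidata_id,
       "props": "sitelinks/urls",
       "sitefilter": "enwiki",
       "format": "json",
   }
   data = requests.get(WIKIDATA_API, params=params, proxies={}).json()
   url = data["entities"][wikidata_id]["sitelinks"]["enwiki"]["url"]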

kazu.utils.download_gilda_contexts.retry_wiki_with_maxlag(url, params, proxies)[source]

Return type:
   dict[str, Any]
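
A hedged sketch of the maxlag retry pattern the function name refers to. MediaWiki's documented convention is that a request carrying a maxlag parameter fails with error code "maxlag" when database replication lag is too high, and the client should wait (honouring Retry-After) before trying again; the loop below is illustrative, not the module's implementation:

   import time
   from typing import Any

   import requests

   def retry_with_maxlag_sketch(
       url: str, params: dict[str, str], proxies: dict[str, str]
   ) -> dict[str, Any]:
       params = {**params, "maxlag": "5", "format": "json"}
       while True:
           response = requests.get(url, params=params, proxies=proxies)
           data = response.json()
           if data.get("error", {}).get("code") == "maxlag":
               # server is lagged: back off for the advertised interval
               time.sleep(int(response.headers.get("Retry-After", "5")))
               continue
           return data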