kg_covid_19.transform_utils.scibite_cord package¶

Submodules¶

kg_covid_19.transform_utils.scibite_cord.scibite_cord module¶

class kg_covid_19.transform_utils.scibite_cord.scibite_cord.ScibiteCordTransform(input_dir: str = None, output_dir: str = None)¶

Bases: kg_covid_19.transform_utils.transform.Transform

ScibiteCordTransform parses the SciBite annotations on the CORD-19 dataset to extract concept-to-publication annotations and co-occurrences.

contract_uri(iri) → str¶

Contract a given IRI.

Contract a given IRI, with special parsing and transformations depending on the nature of the IRI.

Args:: iri: IRI as string
Returns:: str.
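
As an illustration of what contracting an IRI means, the sketch below maps an IRI to a CURIE by matching it against a namespace table. It is a minimal sketch and not the method's actual implementation: the function name `contract_iri` and the `PREFIX_MAP` table are hypothetical, and the special-case parsing mentioned above is not reproduced.

```python
from typing import Dict

# Hypothetical prefix map for illustration only; the transform may use a different one.
PREFIX_MAP: Dict[str, str] = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def contract_iri(iri: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Return a CURIE if a known namespace matches, otherwise the IRI unchanged."""
    # Try the longest namespaces first so the most specific one wins.
    candidates = sorted(prefix_map.items(), key=lambda kv: len(kv[1]), reverse=True)
    for prefix, namespace in candidates:
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri


print(contract_iri("http://purl.obolibrary.org/obo/MONDO_0005148"))
```

An IRI with no matching namespace is returned as-is, which keeps the caller's data intact when the prefix map is incomplete.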

extract_termite_hits(data: Dict) → Set¶

Parse termite hits.

Args:: data: A dictionary.
Returns:: Set.

static get_identifier_by_prefix(record, prefix)¶

Get identifier from a list based on prefix.

Args:: record: record from NCBI gene_info. prefix: prefix of the identifier to extract.
Returns:: str
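
The lookup described here can be sketched as follows. This is an assumption-laden illustration, not the library's code: it assumes the record is a dict with a `dbXrefs` field holding pipe-separated `prefix:identifier` entries (the layout used by NCBI gene_info, where `-` marks an empty field), and the name `find_identifier` is hypothetical.

```python
from typing import Dict, Optional


def find_identifier(record: Dict[str, str], prefix: str) -> Optional[str]:
    """Return the first cross-reference in record['dbXrefs'] that has the given prefix."""
    xrefs = record.get("dbXrefs", "-")
    if xrefs == "-":  # gene_info uses "-" for an empty field
        return None
    for xref in xrefs.split("|"):
        if xref.startswith(prefix + ":"):
            # Drop only the leading "prefix:"; the rest may itself contain a colon.
            return xref[len(prefix) + 1:]
    return None


# Illustrative record, not taken from the transform's data.
record = {"dbXrefs": "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410"}
print(find_identifier(record, "Ensembl"))
```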

static is_curie(s: str) → bool¶

Check if a given string is a CURIE.

Args:: s: string
Returns:: bool.

static is_iri(s) → bool¶

Check if a given string is an IRI.

Args:: s: string
Returns:: bool.
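
To make the CURIE/IRI distinction concrete, here is one possible pair of checks. It is a rough sketch under stated assumptions: the names `looks_like_iri` and `looks_like_curie`, the restriction to `http`/`https` schemes, and the regular expression are all illustrative choices, not the behavior of `is_curie` or `is_iri`.

```python
import re
from urllib.parse import urlparse

# Assumed CURIE shape: a prefix, a colon, then a local part that does not start with "/".
_CURIE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*:[^\s/][^\s]*$")


def looks_like_iri(s: str) -> bool:
    """Treat a string as an IRI if it has an http(s) scheme and a host."""
    parsed = urlparse(s)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def looks_like_curie(s: str) -> bool:
    """Treat a string as a CURIE if it matches prefix:local and is not an IRI."""
    return bool(_CURIE_RE.match(s)) and not looks_like_iri(s)


for value in ("MONDO:0005148", "https://example.org/x", "plain text"):
    print(value, looks_like_curie(value), looks_like_iri(value))
```

Requiring the local part not to start with `/` is what keeps `http://...` strings from being misread as CURIEs with the prefix `http`.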

load_country_code(input_dir: str, output_dir: str) → None¶

load_gene_info(input_dir: str, output_dir: str, species_id: List = None) → None¶

Load mappings from NCBI gene_info (gene_info.gz).

Args:: input_dir: A string pointing to the directory to import data from. output_dir: A string pointing to the directory to output data to. species_id: A list with the species IDs.
Returns:: None.

parse_annotation_doc(node_handle, edge_handle, doc: Dict) → None¶

Parse a JSON document corresponding to a publication.

Args:: node_handle: File handle for nodes.csv. edge_handle: File handle for edges.csv. doc: JSON document as dict.
Returns:: None.

parse_annotations(node_handle: IO, edge_handle: IO, data_file1: str, data_file2: str, data_file3: str) → None¶

Parse annotations from CORD-19_1_5.zip.

Args:: node_handle: File handle for nodes.csv. edge_handle: File handle for edges.csv. data_file1: Path to pdf_json_part_1.zip. data_file2: Path to pdf_json_part_2.zip. data_file3: Path to pmc_json.zip.
Returns:: None.

parse_cooccurrence(node_handle: Any, edge_handle: Any, data_file: str) → None¶

Parse term co-occurrences from cv19_scc.zip.

Args:: node_handle: File handle for nodes.csv. edge_handle: File handle for edges.csv. data_file: Path to cv19_scc.zip.
Returns:: None.

parse_cooccurrence_record(node_handle: Any, edge_handle: Any, record: Dict) → None¶

Parse a single term co-occurrence record.

Args:: node_handle: File handle for nodes.csv. edge_handle: File handle for edges.csv. record: A dictionary corresponding to a row from a table.
Returns:: None.

run(pdf_zipfile_1: Optional[str] = None, pdf_zipfile_2: Optional[str] = None, pmc_zipfile: Optional[str] = None, co_occur_zipfile: Optional[str] = None) → None¶

Run the transformations needed to process annotations from SciBite CORD-19.

Args:: pdf_zipfile_1: PDF zip file part 1 [pdf_json_part_1.zip] pdf_zipfile_2: PDF zip file part 2 [pdf_json_part_2.zip] pmc_zipfile: PMC zip file [pmc_json.zip] co_occur_zipfile: co-occurrence data zip file [cv19_scc_1_2.zip]
Returns:: None.
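
A call to `run` might look like the sketch below, using only the constructor and `run` signatures documented on this page. The helper names `zip_paths` and `run_scibite_transform` are hypothetical, and so is the assumption that the zip files sit directly inside `input_dir` and should be passed as full paths.

```python
import os
from typing import Dict


def zip_paths(input_dir: str) -> Dict[str, str]:
    """Build keyword arguments for run(), assuming the zip files sit in input_dir."""
    return {
        "pdf_zipfile_1": os.path.join(input_dir, "pdf_json_part_1.zip"),
        "pdf_zipfile_2": os.path.join(input_dir, "pdf_json_part_2.zip"),
        "pmc_zipfile": os.path.join(input_dir, "pmc_json.zip"),
        "co_occur_zipfile": os.path.join(input_dir, "cv19_scc_1_2.zip"),
    }


def run_scibite_transform(input_dir: str, output_dir: str) -> None:
    """Instantiate the transform and run it on the four input archives."""
    # Imported lazily so this sketch loads even when kg_covid_19 is not installed.
    from kg_covid_19.transform_utils.scibite_cord import ScibiteCordTransform

    transform = ScibiteCordTransform(input_dir=input_dir, output_dir=output_dir)
    transform.run(**zip_paths(input_dir))
```

Whether `run` expects bare file names or full paths is not stated here, so adjust `zip_paths` to match how the input directory is actually laid out.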

Module contents¶

class kg_covid_19.transform_utils.scibite_cord.ScibiteCordTransform(input_dir: str = None, output_dir: str = None)¶