kg_covid_19.transform_utils.scibite_cord package

Submodules

kg_covid_19.transform_utils.scibite_cord.scibite_cord module

class kg_covid_19.transform_utils.scibite_cord.scibite_cord.ScibiteCordTransform(input_dir: str = None, output_dir: str = None)

Bases: kg_covid_19.transform_utils.transform.Transform

ScibiteCordTransform parses the SciBite annotations on the CORD-19 dataset to extract concept-to-publication annotations and term co-occurrences.

contract_uri(iri) → str

Contract a given IRI, with special parsing and transformations depending on the nature of the IRI.

Args:

iri: IRI as string

Returns:

str.
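
The general idea of contracting an IRI can be illustrated with a minimal sketch. It assumes contraction means replacing a known namespace with a short prefix to form a CURIE; the function name `contract_iri_sketch` and the `PREFIX_MAP` entries are hypothetical and are not taken from the transform itself, whose special-case handling is not documented here.

```python
from typing import Dict

# Hypothetical prefix map for illustration only; the real transform
# may rely on a different set of namespaces and special cases.
PREFIX_MAP: Dict[str, str] = {
    "http://purl.obolibrary.org/obo/MONDO_": "MONDO",
    "http://purl.obolibrary.org/obo/HP_": "HP",
    "http://identifiers.org/ncbigene/": "NCBIGene",
}


def contract_iri_sketch(iri: str, prefix_map: Dict[str, str] = PREFIX_MAP) -> str:
    """Return a CURIE for `iri` if a known namespace matches, else `iri` unchanged."""
    # Try longer namespaces first so the most specific match wins.
    for namespace in sorted(prefix_map, key=len, reverse=True):
        if iri.startswith(namespace):
            local_id = iri[len(namespace):]
            return f"{prefix_map[namespace]}:{local_id}"
    # No known namespace: leave the input untouched.
    return iri


print(contract_iri_sketch("http://purl.obolibrary.org/obo/MONDO_0005737"))
```

Returning the input unchanged when no namespace matches is a design choice of this sketch, made so that callers never lose an identifier.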

extract_termite_hits(data: Dict) → Set

Parse TERMite hits.

Args:

data: A dictionary.

Returns:

Set.

static get_identifier_by_prefix(record, prefix)

Get identifier from a list based on prefix.

Args:

record: record from NCBI gene_info.
prefix: prefix of the identifier to extract.

Returns:

str
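
A prefix-based lookup of this kind might look like the sketch below. It assumes the identifiers are available as a list of `prefix:value` strings, as in the pipe-delimited dbXrefs column of NCBI gene_info; the helper name `first_identifier_with_prefix` and its signature are hypothetical and differ from the documented method, which takes a whole record.

```python
from typing import List, Optional


def first_identifier_with_prefix(identifiers: List[str], prefix: str) -> Optional[str]:
    """Return the first identifier whose prefix (text before the first ':') matches."""
    for identifier in identifiers:
        # Split on the first colon only, so 'HGNC:HGNC:5' keeps its local part intact.
        head, sep, _local = identifier.partition(":")
        if sep and head == prefix:
            return identifier
    # Nothing matched the requested prefix.
    return None


# Example cross-references in the pipe-delimited style of the dbXrefs column.
xrefs = "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410".split("|")
print(first_identifier_with_prefix(xrefs, "Ensembl"))
```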

static is_curie(s: str) → bool

Check if a given string is a CURIE.

Args:

s: string

Returns:

bool.

static is_iri(s) → bool

Check if a given string is an IRI.

Args:

s: string

Returns:

bool.
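
The distinction between the two checks, is_curie and is_iri, can be sketched with simple heuristics. The functions `looks_like_iri` and `looks_like_curie` below are hypothetical stand-ins and assume that an IRI is an http(s) URL and that a CURIE is a `prefix:local_id` string; the actual methods may apply different rules.

```python
from urllib.parse import urlparse


def looks_like_iri(s: str) -> bool:
    """Heuristic: an http(s) scheme together with a network location."""
    parsed = urlparse(s)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def looks_like_curie(s: str) -> bool:
    """Heuristic: 'prefix:local_id' with non-empty parts, no spaces, and not an IRI."""
    if looks_like_iri(s):
        return False
    prefix, sep, local_id = s.partition(":")
    return bool(sep) and bool(prefix) and bool(local_id) and " " not in s


for candidate in ("HP:0000001", "http://purl.obolibrary.org/obo/HP_0000001", "plain text"):
    print(candidate, looks_like_curie(candidate), looks_like_iri(candidate))
```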

load_country_code(input_dir: str, output_dir: str) → None

load_gene_info(input_dir: str, output_dir: str, species_id: List = None) → None

Load mappings from NCBI gene_info (gene_info.gz).

Args:

input_dir: A string pointing to the directory to import data from.
output_dir: A string pointing to the directory to output data to.
species_id: A list with the species IDs.

Returns:

None.

parse_annotation_doc(node_handle, edge_handle, doc: Dict) → None

Parse a JSON document corresponding to a publication.

Args:

node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
doc: JSON document as dict.

Returns:

None.

parse_annotations(node_handle: IO, edge_handle: IO, data_file1: str, data_file2: str, data_file3: str) → None

Parse annotations from CORD-19_1_5.zip.

Args:

node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file1: Path to pdf_json_part_1.zip.
data_file2: Path to pdf_json_part_2.zip.
data_file3: Path to pmc_json.zip.

Returns:

None.
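
Reading one JSON document per publication out of a zip archive can be sketched as follows. This is an illustration under assumptions, not the transform's own code: the helper `iter_json_docs`, the member names, and the `paper_id` field in the demo data are all hypothetical, and the real archives may be laid out differently.

```python
import io
import json
import zipfile
from typing import Dict, Iterator


def iter_json_docs(zip_path_or_file) -> Iterator[Dict]:
    """Yield each .json member of a zip archive as a parsed dict."""
    with zipfile.ZipFile(zip_path_or_file) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as member:
                yield json.load(member)


# Build a tiny in-memory archive so the sketch runs without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("pdf_json/doc1.json", json.dumps({"paper_id": "doc1"}))
    archive.writestr("README.txt", "not a document")
buffer.seek(0)

docs = list(iter_json_docs(buffer))
print(docs)
```

Streaming documents one at a time keeps memory use bounded, which matters when an archive holds many publications.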

parse_cooccurrence(node_handle: Any, edge_handle: Any, data_file: str) → None

Parse term co-occurrences from cv19_scc.zip.

Args:

node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
data_file: Path to cv19_scc.zip.

Returns:

None.

parse_cooccurrence_record(node_handle: Any, edge_handle: Any, record: Dict) → None

Parse a term co-occurrence record.

Args:

node_handle: File handle for nodes.csv.
edge_handle: File handle for edges.csv.
record: A dictionary corresponding to a row from a table.

Returns:

None.
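
How a zipped table might be turned into per-row dictionaries can be sketched with the standard library. The helper `iter_table_records`, the member name `cooccurrence.tsv`, the tab delimiter, and the column names are assumptions made for illustration; the actual layout of the co-occurrence archive is not described here.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator


def iter_table_records(zip_file, member_name: str, delimiter: str = "\t") -> Iterator[Dict[str, str]]:
    """Yield each row of a delimited table inside a zip archive as a dict."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary stream so csv can read it as text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text, delimiter=delimiter)


# Build a tiny in-memory archive with hypothetical column names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("cooccurrence.tsv", "term_a\tterm_b\tcount\nHP:1\tMONDO:2\t3\n")
buffer.seek(0)

records = list(iter_table_records(buffer, "cooccurrence.tsv"))
print(records)
```

Each yielded dictionary has the shape that a record-level parser would accept as its `record` argument.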

run(pdf_zipfile_1: Optional[str] = None, pdf_zipfile_2: Optional[str] = None, pmc_zipfile: Optional[str] = None, co_occur_zipfile: Optional[str] = None) → None

Perform the transformations needed to process annotations from SciBite CORD-19.

Args:

pdf_zipfile_1: PDF zip file part 1 [pdf_json_part_1.zip].
pdf_zipfile_2: PDF zip file part 2 [pdf_json_part_2.zip].
pmc_zipfile: PMC zip file [pmc_json.zip].
co_occur_zipfile: Co-occurrence data zip file [cv19_scc_1_2.zip].

Returns:

None.

Module contents

class kg_covid_19.transform_utils.scibite_cord.ScibiteCordTransform(input_dir: str = None, output_dir: str = None)

Bases: kg_covid_19.transform_utils.transform.Transform

ScibiteCordTransform parses the SciBite annotations on the CORD-19 dataset to extract concept-to-publication annotations and term co-occurrences.
