Quick Start
-----------

.. code:: sh

   git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
   cd kg-covid-19
   python3 -m venv venv && source venv/bin/activate  # optional
   pip install .
   python run.py download
   python run.py transform
   python run.py merge

Download the Knowledge Graph
----------------------------

Prebuilt versions of the KG-COVID-19 knowledge graph, built from all
available data, are available in the following serialization formats:

* `RDF N-Triples `__
* `KGX TSV `__

See `here `__ for a description of the KGX TSV format.

Previous builds are available for download `here `__. Each build
contains the following data:

* ``raw``: the data ingested for this build
* ``transformed``: the transformed data from each source
* ``stats``: detailed statistics about the contents of the KG
* ``Jenkinsfile``: the exact commands used to generate the KG
* ``kg-covid-19.nt.gz``: an RDF/N-Triples version of the KG
* ``kg-covid-19.tar.gz``: a KGX TSV version of the KG
* ``kg-covid-19.jnl.gz``: the Blazegraph journal file (for loading an endpoint)

Knowledge Graph Hub concept
---------------------------

A Knowledge Graph Hub (KG Hub) is a framework to download and transform
data to a central location for building knowledge graphs (KGs) from
different combinations of data sources, in an automated, YAML-driven
way. The workflow consists of three steps:

* Download data
* Transform data for each data source into two TSV files
  (``edges.tsv`` and ``nodes.tsv``) as specified `here `__
* Merge the graphs for each data source of interest using
  `KGX `__ to produce a merged knowledge graph

To facilitate interoperability of datasets, `Biolink categories `__
are added to nodes and `Biolink association types `__ are added to
edges during transformation. A more thorough explanation of the KG Hub
concept is available `here `__.

KG-COVID-19 project
-------------------

The `KG-COVID-19 `__ project is the first instantiation of such a KG
Hub. KG-COVID-19 is thus a framework that follows the design patterns
of the KG Hub: it downloads and transforms COVID-19/SARS-CoV-2-related
datasets and emits a knowledge graph that can then be used for machine
learning or other uses to produce actionable knowledge.

The codebase
~~~~~~~~~~~~

- `Here `__ is the GitHub repo for this project.
- `Here `__ is the GitHub repo for Embiggen, an implementation of
  node2vec and other methods to generate embeddings and apply machine
  learning to graphs.
- `Here `__ is the GitHub repo for KGX, a knowledge graph exchange
  tool for working with graphs.

Prerequisites
~~~~~~~~~~~~~

- Java/JDK is required for the transform step to work properly. See
  `here `__ for installation instructions.

Computational requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~

- On a commodity server with 200 GB of memory, generating the
  knowledge graph from all source data takes a total of 3.7 hours
  (0.13, 1.5, and 2.1 hours for the download, transform, and merge
  steps, respectively), with a peak memory usage of 34.4 GB and disk
  usage of 37 GB. An estimate of the current build time on a typical
  server is also available `here `__.

Installation
~~~~~~~~~~~~

.. code:: sh

   git clone https://github.com/Knowledge-Graph-Hub/kg-covid-19
   cd kg-covid-19
   python3 -m venv venv && source venv/bin/activate  # optional
   pip install .

Running the pipeline
~~~~~~~~~~~~~~~~~~~~

.. code:: sh

   python run.py download
   python run.py transform
   python run.py merge
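After the merge step completes, the merged graph can be inspected
directly from the KGX TSV output. Below is a minimal sketch using
pandas; it assumes the merged files are written to
``data/merged/nodes.tsv`` and ``data/merged/edges.tsv``, so adjust the
paths to match the output location configured in your ``merge.yaml``.

.. code-block:: python

   # Minimal sketch: inspect the merged KG. The data/merged/ paths are
   # an assumption -- adjust them to match your merge.yaml output.
   import pandas as pd

   nodes = pd.read_csv("data/merged/nodes.tsv", sep="\t", dtype=str)
   edges = pd.read_csv("data/merged/edges.tsv", sep="\t", dtype=str)

   print(f"{len(nodes)} nodes, {len(edges)} edges")

   # Break the graph down by Biolink category and association type,
   # mirroring the summaries in the stats/ directory of each build
   print(nodes["category"].value_counts().head(10))
   print(edges["edge_label"].value_counts().head(10))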
Jupyter notebook
~~~~~~~~~~~~~~~~

We have also prepared a `Jupyter notebook `__ demonstrating how to run
the pipeline to generate a KG, as well as how to use other tooling
such as graph sampling (for generating holdouts) and graph querying.

A few organizing principles used for data ingest
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- UniProtKB identifiers are used for genes and proteins, where possible
- For drug/compound identifiers, there is a preferred namespace. If a
  dataset provides identifiers from multiple namespaces, the choice is
  determined by this descending order of preference:

  - ``CHEBI`` > ``CHEMBL`` > ``DRUGBANK`` > ``PUBCHEM``

- Less is more: for each data source, we ingest only the subset of
  data that is most relevant to the knowledge graph in question (here,
  KG-COVID-19)
- We avoid ingesting data from a source that isn't authoritative for
  the data in question (e.g. we do not ingest protein interaction data
  from a drug database)
- Each ingest should make an effort to add provenance data by adding a
  ``provided_by`` column for each node and edge in the output TSV
  file, populated with the source of each datum

Querying the graph
------------------

A SPARQL endpoint for the merged knowledge graph is available
`here `__. For a better experience, consider using `yasgui `__ for
your querying needs (for yasgui, set
http://kg-hub-rdf.berkeleybop.io/blazegraph/sparql as your SPARQL
endpoint). If you are not sure where to start, `here `__ are some
example SPARQL queries.

Summary of the data
~~~~~~~~~~~~~~~~~~~

A detailed, up-to-date summary of the data in KG-COVID-19 is available
`here `__, with the contents of the knowledge graph broken down by
Biolink categories and Biolink association types for nodes and edges,
respectively. An interactive dashboard to explore these stats is
available `here `__.

How to Contribute
-----------------

Download and use the code
~~~~~~~~~~~~~~~~~~~~~~~~~

Download and use the code, and report any issues or questions
`here `__.

Write code to ingest data
~~~~~~~~~~~~~~~~~~~~~~~~~

The most urgent need is for code to ingest data from new sources.

**Find a data source to ingest:** An issue tracker with a list of new
data sources is `here `__.

**Look at the data file(s), and plan how you are going to write out
data to nodes and edges:** You'll need to write out a ``nodes.tsv``
file describing each entity you are ingesting, and an ``edges.tsv``
file describing the relationships between entities, as described
`here `__; a minimal sketch of this output follows below.

``nodes.tsv`` should have at least these columns (you can add more
columns if you like):

``id name category``

``id`` should be a CURIE that uses one of `these identifiers `__. They
are enumerated `here `__. For genes, a UniProt ID is preferred, if
available. ``category`` should be a `Biolink category `__ in CURIE
format, for example ``biolink:Gene``.

``edges.tsv`` should have at least these columns:

``subject edge_label object relation``

``subject`` and ``object`` should be ``id``\ s that are present in the
``nodes.tsv`` file (again, as CURIEs that use one of `these `__).
``edge_label`` should be a CURIE for the `biolink edge_label `__ that
describes the relationship. ``relation`` should be a CURIE for the
term from the `relation ontology `__.
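To make the expected output concrete, here is a minimal, hypothetical
sketch of an ingest writing the two TSV files with the required
columns plus the ``provided_by`` provenance column described above;
the identifiers and source name are invented examples, and a real
ingest would derive them from the source data.

.. code-block:: python

   # Hypothetical sketch: write KGX-style nodes.tsv and edges.tsv with
   # the minimum required columns. All values below are invented
   # examples for illustration only.
   import csv

   with open("nodes.tsv", "w", newline="") as f:
       w = csv.writer(f, delimiter="\t")
       w.writerow(["id", "name", "category", "provided_by"])
       w.writerow(["UniProtKB:P0DTC2", "spike glycoprotein",
                   "biolink:Protein", "example_source"])

   with open("edges.tsv", "w", newline="") as f:
       w = csv.writer(f, delimiter="\t")
       w.writerow(["subject", "edge_label", "object", "relation",
                   "provided_by"])
       w.writerow(["UniProtKB:P0DTC2", "biolink:interacts_with",
                   "UniProtKB:Q9BYF1", "RO:0002434", "example_source"])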
**Read how to make a PR, and fork the repo:**

- Read `these `__ instructions about how to make a pull request on
  GitHub. Fork the code and set up your development environment.

**Add a block to ``download.yaml`` to download the data file(s) for
your source:**

- Add a block of YAML containing the URL of each file you need to
  download for the source (and optionally a brief description) in
  `download.yaml `__, like so. Each item will be downloaded when the
  ``run.py download`` command is executed:

.. code-block:: yaml

   #
   # brief comment about this source, one or more blocks with a url:
   # (and optionally a local_name:, to avoid name collisions)
   #
   -
     # first file
     url: http://curefordisease.org/some_data.txt
     local_name: some_data.txt
   -
     # second file
     url: http://curefordisease.org/some_more_data.txt
     local_name: some_more_data.txt

**Add code to ingest and transform data:**

* Add a new sub-directory in `kg_covid_19/transform_utils `__ with a
  unique name for your source. If the data come from a scientific
  paper, consider prepending the PubMed ID to the name of the source
  (e.g. ``pmid28355270_hcov229e_a549_cells``)
* In this sub-directory, write a class that ingests the file(s) you
  added above in the YAML, which will be in
  ``data/raw/[file name without path]``. Your class should have a
  constructor and a ``run()`` function, which is called to perform the
  ingest. It should output data for all nodes and edges into
  ``data/transformed/[source name]``, in TSV format, as described
  `here `__.
* Also add the following metadata in the comments of your script:

  * data source
  * files used
  * release version that you are ingesting
  * documentation on which fields are relevant and how they map to
    node and edge properties

* In `kg_covid_19/transform.py `__, add a key/value pair to
  ``DATA_SOURCES``. The key should be the ``[source name]`` above, and
  the value should be the name of the class above. Also add an import
  statement for the class.
* In `merge.yaml `__, add a block for your new source, something like:

.. code-block:: yaml

   SOURCE_NAME:
     input:
       format: tsv
       filename:
         - data/transformed/[source_name]/nodes.tsv
         - data/transformed/[source_name]/edges.tsv

**Submit your PR on GitHub, and link the GitHub issue for the data
source you ingested.** You may want to run ``pylint`` and ``mypy`` and
fix any issues before submitting your PR.

Develop Jupyter notebooks to show how to use KG-COVID-19
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See `here `__ for an example. Please contact `Justin `__ or anyone on
the development team if you'd like to help!

Contributors
------------

- `Justin Reese `__
- `Deepak Unni `__
- `Marcin Joachimiak `__
- `Peter Robinson `__
- `Chris Mungall `__
- `Tiffany Callahan `__
- `Luca Cappelletti `__
- `Vida Ravanmehr `__

Acknowledgements
----------------

We gratefully acknowledge and thank all COVID-19 data providers for
making their data available.