Validation and inference over LinkML instance data using souffle

Last update: Aug 07, 2022

Overview

linkml-datalog

Requirements

This project requires souffle

After installing souffle, install this Python package in the usual way.

Until this is released to pypi:

poetry install

Running

Pass in a schema and a data file

poetry run python -m linkml_datalog.engines.datalog_engine -d tmp -s personinfo.yaml example_personinfo_data.yaml

The output will be a ValidationReport object serialized as YAML, e.g.

- type: sh:MaxValue
  subject: https://example.org/P/003
  instantiates: Person
  predicate: age_in_years
  object_str: '100001'
  info: Maximum is 999

Currently, to inspect inferred edges, look in the directory you specified with -d. For example, the file

tmp/Person_grandfather_of.csv

will contain a subject-object tuple linking P:005 to P:001.
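
The inferred-edge files can be loaded with a few lines of Python. The sketch below is illustrative and not part of linkml-datalog: `read_edges` is a hypothetical helper, and it assumes each file holds two tab-separated columns (subject, object) with no header row, which is how Souffle writes relations by default. Adjust the delimiter if your files differ.

```python
import csv
from pathlib import Path


def read_edges(path, delimiter="\t"):
    """Return (subject, object) pairs from a two-column relation file.

    Hypothetical helper: assumes tab-separated columns and no header row.
    """
    edges = []
    with Path(path).open(newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) >= 2:  # skip blank or malformed lines
                edges.append((row[0], row[1]))
    return edges
```

For example, `read_edges("tmp/Person_grandfather_of.csv")` would return one pair per inferred edge in that file.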

How it works

The schema is compiled to a Souffle Datalog program (see the generated schema.dl file)
Any logic program embedded in the schema is added to it
The data is converted to generic triple-like tuples (see *.facts)
Souffle is executed
The inferred validation results are turned into objects
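
To make the data-conversion step above concrete, here is a minimal sketch of flattening one instance into triple-like tuples. It is an assumption-laden illustration, not the project's converter: `to_triples`, `id_key`, and `predicate_base` are hypothetical names, nested objects are not handled, and every value is simply stringified.

```python
def to_triples(obj, predicate_base="", id_key="id"):
    """Flatten a dict-shaped instance into (subject, predicate, object) tuples.

    Hypothetical sketch: slot names are turned into predicates by prefixing
    `predicate_base`; multivalued slots yield one triple per value.
    """
    subject = obj[id_key]
    triples = []
    for slot, value in obj.items():
        if slot == id_key:
            continue  # the identifier becomes the subject, not a triple
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, predicate_base + slot, str(item)))
    return triples
```

Called with `{"id": "P:003", "age_in_years": 100001}` and the base `"https://w3id.org/linkml/examples/personinfo/"`, this yields a single tuple whose predicate has the same shape as the one the generated `triple(...)` rule matches on.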

Assuming input like this:

classes:
  Person:
    attributes:
      age_in_years:
        range: integer
        maximum_value: 999

The generated souffle program will look like this:

.decl Person_age_in_years_asserted(i: identifier, v: value)
.decl Person_age_in_years(i: identifier, v: value)
.output Person_age_in_years
.output Person_age_in_years_asserted
Person_age_in_years(i, v) :- 
    Person_age_in_years_asserted(i, v).
Person_age_in_years_asserted(i, v) :- 
    Person(i),
    triple(i, "https://w3id.org/linkml/examples/personinfo/age_in_years", v).

validation_result(
  "sh:MaxValueTODO",
  i,
  "Person",
  "age_in_years",
  v,
  "Maximum is 999") :-
    Person(i),
    Person_age_in_years(i, v),
    literal_number(v,num),
    num > 999.
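
The way a `maximum_value` constraint turns into a rule like the one above can be sketched as simple string templating. This is a hypothetical illustration rather than the generator linkml-datalog actually uses: `max_value_rule` and its parameters are invented for the example, and the result-type label is left as a parameter.

```python
def max_value_rule(cls, slot, maximum, result_type="sh:MaxValue"):
    """Render a Souffle rule that reports values of `slot` above `maximum`.

    Hypothetical template; the relation names follow the Class_slot
    pattern seen in the generated program.
    """
    return (
        "validation_result(\n"
        f'  "{result_type}",\n'
        "  i,\n"
        f'  "{cls}",\n'
        f'  "{slot}",\n'
        "  v,\n"
        f'  "Maximum is {maximum}") :-\n'
        f"    {cls}(i),\n"
        f"    {cls}_{slot}(i, v),\n"
        "    literal_number(v,num),\n"
        f"    num > {maximum}.\n"
    )
```

`print(max_value_rule("Person", "age_in_years", 999))` emits a rule with the same body as the listing above.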

Motivation / Future Extensions

The above example shows functionality that could easily be achieved by other means:

jsonschema
shape languages: shex/shacl

In fact, the core linkml library already has wrappers for these; see "working with data" in the linkml guide.

However, jsonschema in particular offers very limited expressivity. There are many more opportunities for expressivity with linkml.

In particular, LinkML 1.2 introduces autoclassification rules, conditional logic, and complex expressions. THESE ARE NOT TRANSLATED YET, but they will be in the future.

For now, you can also include your own rules in the header of your schema as an annotation. For example, the following rules translate a 'reified' association model of relationships into direct slot assignments and add transitive inferences:

has_familial_relationship_to(i, p, j) :-
    Person_has_familial_relationships(i, r),
    FamilialRelationship_related_to(r, j),
    FamilialRelationship_type(r, p).

Person_parent_of(i, j) :-
    has_familial_relationship_to(i, "https://example.org/FamilialRelations#02", j).

Person_ancestor_of(i, j) :-
        Person_parent_of(i, z),
        Person_ancestor_of(z, j).

Person_ancestor_of(i, j) :-
        Person_parent_of(i, j).
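
For readers less familiar with Datalog, the two `Person_ancestor_of` rules amount to a transitive closure over `Person_parent_of`. The sketch below mimics that with a naive fixpoint loop in Python. It only illustrates the semantics, is not how Souffle evaluates the program, and the function name and sample identifiers are made up.

```python
def ancestor_of(parent_of):
    """Compute the transitive closure of a set of (parent, child) pairs."""
    # Base rule: Person_ancestor_of(i, j) :- Person_parent_of(i, j).
    ancestors = set(parent_of)
    while True:
        # Recursive rule: parent_of(i, z), ancestor_of(z, j) => ancestor_of(i, j)
        derived = {
            (i, j)
            for (i, z) in parent_of
            for (z2, j) in ancestors
            if z == z2
        }
        if derived <= ancestors:  # fixpoint reached: nothing new was derived
            return ancestors
        ancestors |= derived


# Hypothetical data: A is a parent of B, and B is a parent of C.
print(sorted(ancestor_of({("A", "B"), ("B", "C")})))
# → [('A', 'B'), ('A', 'C'), ('B', 'C')]
```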

See tests for more details.

In the future, these will be compilable from higher-level predicates.

Background

See #196

Releases (v0.2.0)

v0.2.0 (Jan 31, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/compare/v0.1.4...v0.2.0

v0.1.5 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/compare/v0.1.2...v0.1.5

v0.1.4 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/compare/v0.1.2...v0.1.4

v0.1.3 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/compare/v0.1.0...v0.1.3

v0.1.2 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/compare/v0.1.0...v0.1.2

v0.1.1 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/commits/v0.1.1

v0.1.0 (Jan 29, 2022)
Full Changelog: https://github.com/linkml/linkml-datalog/commits/v0.1.0

Owner

Linked data Modeling Language
