Curate datasets of any format
Our previous guide explained how to validate, standardize, and annotate DataFrame and AnnData objects. In this guide, we’ll walk through the basic API that lets you work with data of any format.
How do I validate based on a public ontology?
LaminDB makes it easy to validate categorical variables based on registries that inherit from CanValidate.
CanValidate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public().
By default, from_values() considers a match in a public reference a validated value for any bionty entity.
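The lookup order described above can be pictured with plain Python sets. This is a conceptual sketch only, not LaminDB's implementation: `registry_ids`, `public_ids`, and `classify` are hypothetical names introduced for illustration, and the third identifier is a made-up value that matches neither set.

```python
# Conceptual sketch: plain sets stand in for a registry and a public ontology.
registry_ids = {"MONDO:0004975"}                   # hypothetical: already in your instance
public_ids = {"MONDO:0004975", "MONDO:0004980"}    # hypothetical: known to the public reference


def classify(value: str) -> str:
    """Report where a value is found, checking the local registry first."""
    if value in registry_ids:
        return "validated in registry"
    if value in public_ids:
        return "found in public reference"
    return "not validated"


# "MONDO:9999999" is a made-up identifier used to show the fall-through case.
results = {v: classify(v) for v in ["MONDO:0004975", "MONDO:0004980", "MONDO:9999999"]}
for value, status in results.items():
    print(value, "->", status)
```

In this picture, a value that is only present in the public reference is a candidate for being added to your own registry, which is what the rest of this guide does.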
# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --schema bionty
→ connected lamindb: testuser1/test-curate-any
import lamindb as ln
import bionty as bt
import zarr
import numpy as np
data = zarr.create((10,), dtype=[('value', 'f8'), ("gene", "U15"), ('disease', 'U16')], store='data.zarr')
data["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703", "ENSG00000157764", "ENSG00000171862", "ENSG00000091831", "ENSG00000141736", "ENSG00000133056", "ENSG00000146648", "ENSG00000118523"]
data["disease"] = np.random.choice(['MONDO:0004975', 'MONDO:0004980'], 10)
→ connected lamindb: testuser1/test-curate-any
Define validation criteria
Entities that don’t have a dedicated registry (“are not typed”) can be validated & registered using ULabel:
criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
Validate and standardize metadata
validate() validates passed values against reference values in a registry.
It returns a boolean vector indicating, for each passed value, whether it has an exact match in the reference values.
bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
→ use `.import_from_source()` to import records from a source, e.g. a public ontology
array([False, False, False, False, False, False, False, False, False,
False])
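To make "exact match" concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under simplifying assumptions, not the library's code: `validate_exact` is a hypothetical helper and the reference lists stand in for the contents of a registry field.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True only if it matches a reference value exactly."""
    reference = set(reference)
    return [value in reference for value in values]


values = ["MONDO:0004975", "MONDO:0004980", "mondo:0004975"]  # last one differs in case

empty_registry = []                                      # nothing registered yet
populated_registry = ["MONDO:0004975", "MONDO:0004980"]  # after registering both terms

before = validate_exact(values, empty_registry)
after = validate_exact(values, populated_registry)
print(before)
print(after)
```

With an empty reference every value fails, which mirrors the all-False result above. Once the reference is populated, only values that match character for character pass, so the lower-case variant still fails.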
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer result, InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
Note: you can use standardize() to standardize synonyms.
bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
! received 2 unique terms, 8 empty/duplicated terms are ignored
! 2 unique terms (100.00%) are not validated for ontology_id: MONDO:0004980, MONDO:0004975
detected 2 Disease terms in Bionty for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
→ add records from Bionty to your Disease registry via .from_values()
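As a rough picture of what standardizing synonyms means, the sketch below maps synonyms to their canonical names with a plain dictionary. It only illustrates the concept and does not reproduce how standardize() works internally; `synonym_to_name` and `standardize_names` are hypothetical names, and the two synonym pairs are taken from the Disease records printed later in this guide.

```python
# Hypothetical lookup table from synonym to canonical name.
synonym_to_name = {
    "Alzheimers disease": "Alzheimer disease",
    "Atopic dermatitis": "atopic eczema",
}


def standardize_names(values, mapping):
    """Replace known synonyms with their canonical name; leave other values untouched."""
    return [mapping.get(value, value) for value in values]


raw = ["Alzheimers disease", "atopic eczema", "Atopic dermatitis", "unknown disease"]
standardized = standardize_names(raw, synonym_to_name)
print(standardized)
```

Values that are already canonical, or that are not known at all, pass through unchanged, so a subsequent validation step still flags the unknown ones.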
Let’s follow the suggestion and register the new labels.
Bulk creating records using from_values() only returns validated records.
Note: terms validated against a public reference are also created by from_values(); see Manage biological registries for details.
diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)
Repeat the process for more labels:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)
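The difference between the default behavior and `create=True` can be sketched in a few lines of plain Python. This is a simplified model rather than LaminDB's implementation: `Label`, `records_from_values`, and `known_names` are hypothetical, and `known_names` stands in for whatever the registry or a public reference already knows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical stand-in for a registry record."""
    name: str


def records_from_values(values, known_names, create=False):
    """Build records for unique values; skip unknown ones unless create=True."""
    records = []
    for name in dict.fromkeys(values):  # unique values, original order preserved
        if name in known_names or create:
            records.append(Label(name))
    return records


values = ["Project A", "Project B", "Project A"]
loaded = records_from_values(values, known_names=set())                # nothing validates
created = records_from_values(values, known_names=set(), create=True)  # new labels are created
print(loaded)
print(created)
```

In this model, free-text labels such as project names need `create=True` because no reference knows them, whereas ontology-backed values such as the disease identifiers above validate against a reference and are returned without it.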
Annotate and save dataset with validated metadata
Register the dataset as an artifact:
artifact = ln.Artifact("data.zarr", description="a zarr object").save()
! no run & transform got linked, call `ln.track()` & re-run
Link the artifact to validated labels. You could do this directly, e.g., via artifact.ulabels.add(projects) or artifact.diseases.add(diseases).
However, you often want to track which feature each label is a value of. Hence, let’s try to associate our labels with features:
from lamindb.core.exceptions import ValidationError
try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)
! cannot infer feature type of: [ULabel(uid='J7m0vA9O', name='Project A', created_by_id=1, created_at=2024-10-21 15:05:21 UTC), ULabel(uid='umyduVjF', name='Project B', created_by_id=1, created_at=2024-10-21 15:05:21 UTC)], returning '?
! cannot infer feature type of: [Disease(uid='4JmTj6Sn', name='atopic eczema', ontology_id='MONDO:0004980', synonyms='allergic dermatitis|Atopic dermatitis|allergic form of dermatitis|Besnier's prurigo|Atopic neurodermatitis|eczema|allergic|atopic eczema|eczematous dermatitis', description='A Chronic Inflammatory Genetically Determined Disease Of The Skin Marked By Increased Ability To Form Reagin (Ige), With Increased Susceptibility To Allergic Rhinitis And Asthma, And Hereditary Disposition To A Lowered Threshold For Pruritus. It Is Manifested By Lichenification, Excoriation, And Crusting, Mainly On The Flexural Surfaces Of The Elbow And Knee. In Infants It Is Known As Infantile Eczema.', created_by_id=1, source_id=49, created_at=2024-10-21 15:05:21 UTC), Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimers disease|Alzheimer's dementia|Alzheimer's disease|Alzheimers dementia|AD|presenile and senile dementia|Alzheimer dementia|Alzheimer disease', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, source_id=49, created_at=2024-10-21 15:05:21 UTC)], returning '?
These keys could not be validated: ['project', 'disease']
Here is how to create a feature:
ln.Feature(name='project', dtype='?').save()
ln.Feature(name='disease', dtype='?').save()
This errored because we hadn’t yet registered the features. After copying the suggested code from the error message and replacing the placeholder dtype '?' with the matching categorical dtypes, things work out:
ln.Feature(name='project', dtype='cat[ULabel]').save()
ln.Feature(name='disease', dtype='cat[bionty.Disease]').save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features
Features
'disease' = 'atopic eczema', 'Alzheimer disease'
'project' = 'Project A', 'Project B'
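The failure and the fix above follow a simple pattern: annotation keys are checked against registered features before any values are linked. The sketch below models that pattern with a dictionary. It is an assumption-laden illustration, not the library's code: `FeatureNotRegistered`, `add_values`, and `feature_registry` are hypothetical names, while the dtype strings are the ones used in this guide.

```python
class FeatureNotRegistered(Exception):
    """Hypothetical stand-in for the ValidationError raised above."""


def add_values(annotations, feature_registry):
    """Link values only if every key is a registered feature."""
    missing = [key for key in annotations if key not in feature_registry]
    if missing:
        raise FeatureNotRegistered(f"These keys could not be validated: {missing}")
    return {key: list(values) for key, values in annotations.items()}


feature_registry = {}  # feature name -> dtype; empty until features are registered
annotations = {
    "project": ["Project A", "Project B"],
    "disease": ["atopic eczema", "Alzheimer disease"],
}

try:
    add_values(annotations, feature_registry)
    first_attempt = "ok"
except FeatureNotRegistered as error:
    first_attempt = str(error)

# Register the features with explicit categorical dtypes, then retry.
feature_registry["project"] = "cat[ULabel]"
feature_registry["disease"] = "cat[bionty.Disease]"
linked = add_values(annotations, feature_registry)
print(first_attempt)
print(linked)
```

Checking keys up front keeps typos in feature names from silently creating new, unintended features.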
Since genes are the measurements, we register them as features:
feature_set = ln.FeatureSet(genes)
feature_set.save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()
Artifact(uid='3gQ0QSQh9e6zFYP50000', is_latest=True, description='a zarr object', suffix='.zarr', size=973, hash='gWLzZ5-RfFM9zzOXSx98hw', n_objects=2, _hash_type='md5-d', visibility=1, _key_is_virtual=True, created_at=2024-10-21 15:05:23 UTC)
Provenance
.storage = '/home/runner/work/lamindb/lamindb/docs/test-curate-any'
.created_by = 'testuser1'
Labels
.diseases = 'atopic eczema', 'Alzheimer disease'
.ulabels = 'Project A', 'Project B'
Features
'disease' = 'atopic eczema', 'Alzheimer disease'
'project' = 'Project A', 'Project B'
Feature sets
'genes' = 'BRCA2', 'TP53', 'KRAS', 'BRAF', 'PTEN', 'ESR1', 'ERBB2', 'PIK3C2B', 'EGFR', 'CCN2'
# clean up test instance
# delete the artifact first: `lamin delete` refuses to remove an instance whose storage still contains objects
artifact.delete(permanent=True)
!lamin delete --force test-curate-any
!rm -r data.zarr