Update doc, tutorial for move to production Spin cluster
Modify strategy for handling db authorization config file
Reorganize find_datasets for efficiency
- Fix keyword code
- Regularize member names in DataRegistry class
Add workflow to publish to PyPI upon release
- Modify Query.get_all_tables and Query.get_all_columns to return sorted lists rather than sets so that web app displays them sorted
- Add alternate option names using hyphens rather than underscores in the script create_registry_schema
- Add keyword class to manage keywords in the data registry.
- Moved all functionality related to keywords from dataset to new keyword class.
- Bug fix for get_dataset_absolute_path and associated CI tests
- Add optional `schema` argument to `find_datasets`
- Bug fix affecting querying of aliases
- Add test to check that alias querying works even when query_mode is not "both"
- Add Python 3.13 to CI and change Python version restriction to just a lower bound (>=3.9).
- Added logging. There is now a `logger` object in the `DbConnection` class that can be called for logging, e.g., `DbConnection.logger.debug("text")`. This replaces the `verbose` flag.
- Users can now make aggregate queries via, e.g., `datareg.Query.aggregate_datasets("dataset_id", agg_func="avg")`. The aggregate function can be "count", "avg", "sum", "min" or "max". The usual query filters can still be applied. By default it is the `dataset` table that is queried, however setting `table_name=` can change this to "dataset_alias", "keyword", or "dataset_keyword".
- Expanded the list of columns that are modifiable.
- There are now `get_all_tables()` and `get_all_columns()` functions in the `Query` object. These can be refined to return only the columns of a given table (default is `dataset`), or columns from all tables.
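The semantics of the five supported aggregate functions can be illustrated with plain Python (a sketch only; `aggregate_datasets` itself performs the aggregation in the database):

```python
# Sketch of the five aggregate functions supported by Query.aggregate_datasets
# ("count", "avg", "sum", "min", "max"), illustrated over plain Python values
# rather than a live database connection.
AGG_FUNCS = {
    "count": len,
    "avg": lambda vals: sum(vals) / len(vals),
    "sum": sum,
    "min": min,
    "max": max,
}

def aggregate(values, agg_func="count"):
    """Apply one of the supported aggregate functions to a column of values."""
    if agg_func not in AGG_FUNCS:
        raise ValueError(f"Unknown aggregate function: {agg_func}")
    return AGG_FUNCS[agg_func](values)

dataset_ids = [3, 1, 4, 1, 5]
print(aggregate(dataset_ids, "count"))  # 5
print(aggregate(dataset_ids, "avg"))    # 2.8
```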
- Make `dataset.relative_path` nullable
- Adjust code to set `dataset.relative_path` to null for datasets of location type `external` and `meta_only`
- Add documentation for in-place database upgrades
- Add script for making db upgrades
- Move `get_keyword_list()` function to the `Query()` class.
- Fix `dregs ls` for corner case scenarios
- Fix `dregs show` to work with `query_mode` format
- Added `skip_provenance_reflect` to the `DbConnection()` object, which stops the provenance table being queried during database connection. This is designed for schema creation only.
When connecting to the database, a namespace is now connected to, rather than a
single selected schema. The namespace is a combination of a working and a
production schema. On connection, both schemas are connected to and both
schemas are reflected. By default, queries search both the working and
production schemas and their results are combined (this behaviour can be
changed using `query_mode` to limit queries to a single schema).

`register_mode` dictates which schema will be used ("working" or "production")
for write, modify and delete operations on entries during the connection
instance.
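The dispatch on `query_mode` can be sketched as follows (a hypothetical illustration only; the function and variable names here are not the library's internals):

```python
# Hypothetical sketch of query_mode dispatch: results from the working and
# production schemas are combined by default, or restricted to one schema.
# Names are illustrative, not taken from the dataregistry source.
def run_query(working_results, production_results, query_mode="both"):
    if query_mode == "working":
        return list(working_results)
    if query_mode == "production":
        return list(production_results)
    if query_mode == "both":
        return list(working_results) + list(production_results)
    raise ValueError(f"Unknown query_mode: {query_mode}")

working = [{"dataset_id": 1, "name": "sim_a"}]
production = [{"dataset_id": 7, "name": "sim_b"}]
print(len(run_query(working, production)))  # 2
```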
Update delete functionality
- The `delete()` function now takes `name`, `version_string`, `owner` and `owner_type` as arguments, rather than simply the `dataset_id`.
- One can still delete by the `dataset_id` using the CLI, which now includes a confirmation step.
Make more options for querying
- There is now a `~=` query operator that utilises the `.ilike` filter to allow case-insensitive filtering with wildcards (i.e., the `%` character).
- `dregs ls` can now filter on the dataset name, including `%` wildcards, using the `--name` option.
- `dregs ls` can return arbitrary columns using the `--return_cols` option
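What the `~=` operator matches can be emulated in plain Python with a regex (a sketch only; the registry itself delegates the comparison to the database via the `.ilike` filter):

```python
import re

# Sketch of ILIKE-style matching as used by the "~=" operator: "%" is a
# multi-character wildcard and the comparison is case-insensitive. This
# emulates the behaviour with a regex; it is not the library's implementation.
def ilike_match(pattern, value):
    # Escape regex metacharacters, then turn each "%" into ".*".
    regex = re.escape(pattern).replace("%", ".*")
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None

print(ilike_match("my_dataset%", "MY_DATASET_v2"))  # True
```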
Some changes to the way the `relative_path` is automatically generated from the
name and version.
- All automatically generated `relative_path`s are placed in a top level `.gen_paths` directory, e.g., `<root_dir>/<schema>/<owner_type>/<owner>/.gen_paths`. This is to prevent clashes with user specified `relative_path`s.
- Single files no longer have their filename changed when automatically generating the `relative_path`. Instead a directory containing the `name` and `version` is created, and the file is copied there. This preserves the filename suffix in the relative path.
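One plausible reading of the layout described above can be sketched as follows (the exact directory structure below the `.gen_paths` level is an assumption for illustration, not confirmed by this changelog):

```python
import posixpath

# Hypothetical sketch of the auto-generated relative_path: files land under a
# top-level .gen_paths directory, inside a directory built from the name and
# version, keeping the original filename (and hence its suffix). The
# <name>/<version> nesting shown here is an assumed layout.
def gen_relative_path(name, version, filename):
    return posixpath.join(".gen_paths", name, version, filename)

print(gen_relative_path("my_sim", "1.0.0", "catalog.hdf5"))
# .gen_paths/my_sim/1.0.0/catalog.hdf5
```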
Update documentation for release, new installation instructions etc.
- Update default NERSC site to `/global/cfs/cdirs/lsst/utilities/desc-data-registry`
- Update default schema names (now stored in `src/dataregistry/schema/default_schema_names.yaml`)
- There is now a `reg_admin` account, which is the only account able to create the initial schemas. The schema creation script has been updated to give the correct `reg_writer` and `reg_reader` privileges.
- Remove `version_suffix`
- Update `dregs ls` to be a bit cleaner. It also has a `dregs ls --extended` option to give back more quantities, and can now query on a keyword using `dregs ls --keyword <keyword>`
- Added `modify` to the CLI to update datasets from the command line
There cannot be a unique constraint in the database for `owner`, `owner_type`
and `relative_path`, as multiple entries can share those values; however, we
require that at any one time only one dataset has its data at this location.
Added a check during register to ensure the `relative_path` is available.
- Bump database version to 3.3.0; removed the `is_overwritten`, `replace_date` and `replace_uid` columns
- Added a `replaced` bit to the `valid` bitmask
The `tables_required` list, when doing a query, was only built from the return
column list. This meant that if a filter used a table not in the return column
list, the proper join would not be made. This has been corrected.
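The fix can be sketched as collecting the tables from the filter columns as well as the return columns (column names assumed to be `table.column` strings; this mirrors the idea, not the library's actual query builder):

```python
# Sketch of the join fix described above: the set of tables requiring a join
# is built from both the return columns and the filter columns, so a filter
# on a table that is not being returned still triggers the proper join.
def tables_required(return_columns, filter_columns):
    tables = set()
    for col in list(return_columns) + list(filter_columns):
        table, _, _ = col.partition(".")
        tables.add(table)
    return tables

# A filter on dataset_alias now pulls in that table even though no
# dataset_alias column is returned.
print(sorted(tables_required(["dataset.name"], ["dataset_alias.alias"])))
# ['dataset', 'dataset_alias']
```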
- Added a `replace()` function for datasets. This is functionally very similar to `register()`, but it allows users to overwrite previous datasets whilst keeping the same name/version/suffix/owner/owner_type combination. Documentation updated.
- Datasets now have a `replace_iteration` counter and a `replace_id` value which points to the dataset that replaced them. To reflect this, the unique constraints now include the `replace_iteration` column.
- Database version bumped to 3.2.0
- Tests now use the `property_dict` return type and first make sure that the correct number of results was found before checking the results.
- Update the `schema.yaml` file to include unique constraints and table indexes.
- Update the unique constraints for the dataset table to be `owner`, `owner_type`, `name`, `version`, `version_suffix`.
When registering a dataset that is overwriting a previous dataset, don't tag
the previous dataset as `valid=False` until any data copying is successful.
Add ability to tag datasets with keywords/labels to make them easier to categorize.
- Can tag keywords when registering datasets through the Python API or CLI. Can add keywords after registration using the `add_keywords()` method in the Python API.
- Database version bumped to 3.0.0
- New table `keyword` that stores both the system and user keywords.
- New table `dataset_keyword` that links keywords to datasets.
- System keywords are stored in `src/dataregistry/schema/keywords.yaml`, which is used to populate the `keywords` table during database creation.
- Added `datareg.Registrar.dataset.get_keywords()` function to return the list of currently registered keywords.
- When the keyword table is queried, an automatic join is made with the dataset-keyword association table, so the user can query for all datasets with a given keyword, for example.
- Added keywords information to the documentation
- Can run `dregs show keywords` from CLI to display all pre-registered keywords
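The keyword-to-dataset lookup enabled by the automatic join can be sketched with plain dicts standing in for the real database tables (names mirror the tables described above; this is an illustration, not the library's query code):

```python
# Sketch of the lookup enabled by the automatic join with the dataset_keyword
# association table: find all datasets carrying a given keyword.
keywords = {1: "simulation", 2: "observation"}          # keyword table
datasets = {10: "my_sim", 11: "my_obs", 12: "my_sim2"}  # dataset table
dataset_keyword = [(10, 1), (12, 1), (11, 2)]           # (dataset_id, keyword_id)

def datasets_with_keyword(word):
    """Return the sorted dataset_ids linked to a keyword via the join table."""
    key_ids = {kid for kid, kw in keywords.items() if kw == word}
    return sorted(ds for ds, kid in dataset_keyword if kid in key_ids)

print(datasets_with_keyword("simulation"))  # [10, 12]
```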
Separate out creation of production schema and non-production schema since, under normal circumstances, there will be a single "real" production schema (owner type == production only) but possibly multiple non-production schemas to keep track of entries for the other owner types. Add a field to the provenance table so a schema can discover the name of its associated production schema and form foreign key constraints correctly.
Bumped database version to 2.3.0. This code requires database version >= 2.3.0
- Add check during dataset registration to raise an exception if the `root_dir` does not exist
- Add check before copying any data (i.e., `old_location != None`) that the user has write permission to the `root_dir` folder.
Add ability to register "external" datasets, for example datasets that are not physically managed by the registry, or are offsite; for these, only a database entry is created.
- Database version bumped to 2.2.0
- Added `location_type` column to the `dataset` table (can be either "onsite", "external" or "dummy").
- Added `contact_email` and `url` columns to the `dataset` table. One of these is required when registering a `location_type="external"` dataset.
- Removed `is_external_link` column from the `dataset` table as it is redundant.
- Renamed `execution.locale` to `execution.site` in the `execution` table.
Version 0.4.0 focuses on being able to manipulate data already within the dataregistry, i.e., adding the ability to delete and modify previous datasets.
- `Registrar` now has a class for each table. They inherit from a `BaseTable` class; this means that shared functions, like deleting entries, are available for all tables. (#92)
- Working with tables via the Python interface has slightly different syntax (see user changelog below). (#92)
- `is_valid` is removed as a `dataset` property. It has been replaced with `status`, which is a bitmask (bit 0="valid", bit 1="deleted" and bit 2="archived"), so now datasets can be in a combination of multiple states. (#93)
- `archive_date`, `archive_path`, `delete_date`, `delete_uid` and `move_date` have been added as new `dataset` fields. (#93)
- Database version bumped to `2.0.1` (#93)
- `dataset` entries can be deleted (see below) (#94)
- The CI for the CLI is now pure Python (i.e., there is no more bash script to ingest dummy entries into the registry for testing).
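The `status` bitmask semantics described above (bit 0 = "valid", bit 1 = "deleted", bit 2 = "archived") can be sketched in plain Python (helper names here are illustrative, not the library's API):

```python
# Sketch of the status bitmask that replaced is_valid: a dataset can carry
# several states at once by combining bits.
VALID, DELETED, ARCHIVED = 1 << 0, 1 << 1, 1 << 2

def has_state(status, flag):
    """Check whether a given state bit is set in the status bitmask."""
    return bool(status & flag)

# A dataset that is both valid and archived:
status = VALID | ARCHIVED
print(has_state(status, VALID), has_state(status, DELETED))  # True False
```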
- Can no longer "bump" a dataset that has a version suffix (trying to do so will raise an error). If a user wants to make a new version of a dataset with a suffix they can still do so by manually specifying the version and suffix (#97).
- Dataset entries can be modified (see below, #100)
- All database tables (`dataset`, `execution`, etc.) have a more universal syntax. The functionality is still accessed via the `Registrar` class, but now, for example, to register a dataset it's `Registrar.dataset.register()`, and similarly for an execution `Registrar.execution.register()` (#92). The docs and tutorials have been updated (#95).
- `dataset` entries can now be deleted using the `Registrar.dataset.delete(dataset_id=...)` function. This will also delete the raw data within the `root_dir`. Note that the entry in the database will always remain (with an updated `status` field to indicate it has been deleted). (#94)
- Documentation has been updated to make things a bit clearer. Now split into more focused tutorials (#95).
- Certain dataset quantities can be modified after registration (#100). Documentation has been updated with examples.