This Python tool enumerates all the items and collections in ReDATA, downloads the associated files and metadata into a predefined structure on ingest staging storage, gathers curation information from curation storage, and adds the gathered information to the predefined structure.
ReBACH is run via the command line as outlined in the 'How to Run' section of this readme. During its run, ReBACH enumerates all published items and their versions on UArizona's Figshare using the Figshare API and downloads their metadata into memory. ReBACH then downloads files into the ingest staging storage for items that have a matching curation storage folder, and validates the files and folder structure in the curation storage for those items. For items whose matching curation storage folders pass validation, ReBACH copies the files from the curation storage into the corresponding ingest staging storage folder; otherwise, the ingest staging storage folder and its contents are deleted. Information and errors are logged to a file, and some of that output is also displayed in the terminal.
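For orientation, the following is a minimal sketch (not ReBACH's actual code) of the enumeration step: paging through an institution's public articles with the Figshare API and the `requests` library. `TOKEN` and `INSTITUTION_ID` are placeholders for the values configured in your .env.ini.

```python
import requests

FIGSHARE_URL = "https://api.figshare.com/v2"
TOKEN = "your-api-token"   # placeholder: the figshare_api token from .env.ini
INSTITUTION_ID = 123       # placeholder: the figshare_api institution from .env.ini

def list_institution_articles(page_size=100):
    """Page through all public articles for the institution."""
    headers = {"Authorization": f"token {TOKEN}"}
    page = 1
    while True:
        resp = requests.get(
            f"{FIGSHARE_URL}/articles",
            headers=headers,
            params={"institution": INSTITUTION_ID, "page": page, "page_size": page_size},
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:  # an empty page means everything has been seen
            break
        yield from batch
        page += 1

for article in list_institution_articles():
    print(article["id"], article["title"])
```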
- Python >= 3.9
- requests Python library >= 2.18.4
- Ubuntu >= 20.04
- Slugify >= 7.0.0
- Figshare organization number
- Figshare API token for respective organization
- Archival storage (AP Trust) user email
- Archival storage (AP Trust) user secret
- Read privileges to curation storage location
- Write privileges to ingest staging storage and archival staging storage
- Write privileges to logs location
- Copy the .env.sample.ini file and give it a name of your choice (e.g. .env.ini).
- Fill out the .env.ini file (IMPORTANT: make sure not to commit this file to GitHub).
- figshare_api
  - url - required: The Figshare API URL.
  - token - required: Your auth token for your organization's API.
  - retries - required: Number of times the script should retry API or file system calls if it is unable to connect. Defaults to 3 (a sketch of this retry behavior appears after this settings list).
  - retries_wait - required: Number of seconds the script should wait between call retries if it is unable to connect. Defaults to 10.
  - institution - required: The Figshare institution ID for your organization.
- ingest_staging_storage - required: The file system location where the preservation folders/packages should be created for ingest into UAL's preservation workflow. Ensure this location is different from the archival_staging_storage location in `bagger/config/default.toml`.
- logs_location - required: The file system location where logs should be created. This value overrides the one in `bagger/config/default.toml` when bagger is used for post-processing (see the post_process_script_command setting below).
- additional_percentage_required - required: How much extra space, as a percentage, the `ingest_staging_storage` should have in order to handle files. This percentage is applied to the total storage needed for all files; e.g., if the value of this field is 10 and the amount of storage needed for files is 1 GB, the script will make sure that the `ingest_staging_storage` has at least 1.1 GB free. Defaults to 10 (see the sketch after this list).
- pre_process_script_command - optional: The terminal command (including arguments) to invoke a script to be run BEFORE the files are copied and logic is applied to the `ingest_staging_storage` (note: this action is not currently implemented).
- post_process_script_command - required: Specifies the method of performing post-processing steps. This can take only two values: the string 'Bagger', or the path to an external script. If the value is 'Bagger', the post-processing steps consist of running the internal `bagger` module. If the value is a path to an external script, the post-processing steps are executed by invoking the external script through the function 'post_process_script_function'. The post-processing steps are executed AFTER the files are copied and logic is applied to the `ingest_staging_storage`.
- curation_storage_location - required: The file system location where the curation files reside.
- bag_name_prefix - required: The prefix for bag names: the set of characters before the underscore ("_") that precedes the article_id in the bag name. Defaults to "azu" in the .env.ini file if not changed.
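The worked example for additional_percentage_required can be expressed in a few lines. Below is a minimal sketch, assuming the settings are readable with Python's `configparser`; the section name "system" for the storage settings is an assumption, so use the section names from your .env.sample.ini.

```python
import configparser
import shutil

config = configparser.ConfigParser()
config.read(".env.ini")

# "system" is an assumed section name; check .env.sample.ini for the real one.
staging = config["system"]["ingest_staging_storage"]
extra_pct = config["system"].getfloat("additional_percentage_required", fallback=10)

def has_enough_space(total_bytes_needed: int) -> bool:
    """True if staging storage can hold the files plus the extra margin.
    With extra_pct = 10 and 1 GB of files, at least 1.1 GB must be free."""
    required = total_bytes_needed * (1 + extra_pct / 100)
    return shutil.disk_usage(staging).free >= required
```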
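The retries and retries_wait settings describe a simple wait-and-retry loop. A hedged sketch of that behavior (ReBACH's actual implementation may differ):

```python
import time
import requests

def with_retries(func, retries=3, retries_wait=10):
    """Call func(), retrying on connection failures up to `retries` times,
    sleeping `retries_wait` seconds between attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except (requests.ConnectionError, OSError):
            if attempt == retries:
                raise  # out of retries; surface the error
            time.sleep(retries_wait)
```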
- Ensure the aforementioned Dependencies and Requirements are met.
- Navigate to the root directory of ReBACH via the terminal and start the script by entering the command `python3 app.py --xfg /path/of/.env.ini` or `python app.py --xfg /path/of/.env.ini`, depending on your system configuration (note: the script must be run using Python 3.9 or greater).
- Informational and error output will appear in the terminal. The same output is appended to a file in the logs location named with today's date, with some additional information and error logging occurring only in the file. The log details are described in Description of ReBACH Log Messages.
- Final preservation package output will be written to the `ingest_staging_storage` you specified in the .env.ini file.
These parameters are only available on the command line.
| Parameter | Description |
|---|---|
| `--xfg` | The path to the configuration file to use. |
| `--ids` | A comma-separated list of article IDs to process, e.g. `12345,12356`. |
| `--continue-on-error` | If there is an error during the item processing stage for a given item, skip it and continue to the next item. |
| `--dry-run` | Runs all operations, excluding any that involve writing to any storage medium. |
| `--check-remote-staging` | Checks alternative remote staging storage for duplicate bags. |
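For example, a dry run limited to two specific items could be invoked as `python3 app.py --xfg /path/of/.env.ini --ids 12345,12356 --dry-run`.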
- ReBACH will attempt to fetch all items in the institutional instance. Items that are not published (curation_status != 'approved') will be ignored.
- Items that are embargoed are also fetched; however, due to limitations in the API, only the latest version can be fetched until the embargo expires or is removed.
- While fetching, ReBACH checks `archival_staging_storage` in `bagger/config/default.toml` and archival storage for duplicate bags of each item. If a duplicate of an item is found and confirmed in any of the locations, the item will be ignored in subsequent stages, except when Bagger's DART workflow JSON file is configured to upload to an S3 storage.
- Checking archival storage for duplicate bags of an article requires the size of the article's curation storage folder. If an error occurs while calculating the size of an article's curation folder, the error will be recorded and execution will stop, unless the `--continue-on-error` flag is set.
- Remote archival staging storage will be checked for duplicate bags if the DART workflow JSON file is configured to upload to an S3 storage, even if the `--check-remote-staging` flag is not set.
- When processing collections, ReBACH records which items are part of the collection by appending them to the collection's JSON as returned by the Figshare API.
- If an item encounters errors, it will not be processed, and any partial files in ingest staging storage are deleted.
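To make the duplicate-bag check concrete, here is a hedged sketch of name-based duplicate detection. The exact bag naming convention is an assumption extrapolated from the bag_name_prefix setting above; names are assumed to start with `<prefix>_<article_id>_`.

```python
def is_duplicate(article_id: int, existing_bag_names: list[str],
                 prefix: str = "azu") -> bool:
    """Return True if a bag for this article already exists in storage.
    Assumes bag names start with "<prefix>_<article_id>_"."""
    marker = f"{prefix}_{article_id}_"
    return any(name.startswith(marker) for name in existing_bag_names)

# Example: "azu_12345_v1..." would count as a duplicate of article 12345.
print(is_duplicate(12345, ["azu_12345_v1_bag"]))  # True
```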