CSV Importer

Bulkrax can import from a CSV file that follows the guidelines below.

Required fields

The CSV MUST have a header row.
The header row MUST include a column representing the source_identifier, which holds a unique identifier for each item (see Source Identifier below for more detail).
The CSV MUST have a title column.
Every work MUST have a value in both the source_identifier column and the title column (unless you are auto-generating source_identifiers in the Bulkrax config file).
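
The rules above can be checked before an import is started. The following is a minimal sketch, not part of Bulkrax: required_field_errors is a hypothetical helper, and it assumes the source identifier column is literally named source_identifier and that auto-generation of identifiers is turned off.

```ruby
require 'csv'

# Hypothetical helper (not a Bulkrax API): returns a list of problems found
# in CSV text, based on the required-field rules described above.
def required_field_errors(csv_text, source_identifier: 'source_identifier')
  table = CSV.parse(csv_text, headers: true)
  errors = []

  # Both required columns must appear in the header row.
  [source_identifier, 'title'].each do |column|
    errors << "missing column: #{column}" unless table.headers.include?(column)
  end
  return errors unless errors.empty?

  seen = {}
  table.each_with_index do |row, index|
    line = index + 2 # line 1 is the header row
    id = row[source_identifier].to_s.strip
    errors << "line #{line}: blank #{source_identifier}" if id.empty?
    errors << "line #{line}: blank title" if row['title'].to_s.strip.empty?
    errors << "line #{line}: duplicate #{source_identifier} #{id}" if !id.empty? && seen[id]
    seen[id] = true
  end
  errors
end

sample = "source_identifier,title\nwork_1,First Work\nwork_1,\n"
puts required_field_errors(sample)
```

An empty result means the file satisfies the checks sketched here; it does not guarantee that the import itself will succeed.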

Source Identifier

Refer to https://github.com/samvera-labs/bulkrax/wiki/Configuring-Bulkrax#source-identifier.

Supported fields

All columns will be imported if the column name matches an existing metadata property in Hyrax, e.g., title or creator.

In addition, the following columns will be imported:

collection or collection_# (deprecated in v3.0.0)
file or file_#
file_url or file_url_#
remote_files
model

Properties with multiple values

A property's value is most often a single string or an array of strings. Bulkrax also accounts for a value that is an array of hashes. Refer to the field mapping for more configuration details on how to handle these use cases.

There are two ways that a property with multiple values can be imported.

Single Header

contributor	language	license
Aaliyah; Ruth	En	cc3.0

Multiple Headers

contributor_1	contributor_2	language	license
Aaliyah	Ruth	En	cc3.0
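
To illustrate how the two layouts lead to the same set of values, here is a small sketch. It is not Bulkrax's parser: values_for is a hypothetical helper, and the semicolon delimiter is an assumption (the real delimiter depends on your field mapping).

```ruby
# Hypothetical helper (not Bulkrax's parser): gathers every value for one
# property from a row, whether it uses a single header ("contributor") or
# numbered headers ("contributor_1", "contributor_2").
def values_for(row, property, delimiter: /\s*;\s*/)
  numbered = /\A#{Regexp.escape(property)}_\d+\z/
  row.select { |header, _| header == property || header.match?(numbered) }
     .values
     .compact
     .flat_map { |cell| cell.split(delimiter) }
     .map(&:strip)
     .reject(&:empty?)
end

single   = { 'contributor' => 'Aaliyah; Ruth', 'language' => 'En', 'license' => 'cc3.0' }
multiple = { 'contributor_1' => 'Aaliyah', 'contributor_2' => 'Ruth', 'language' => 'En', 'license' => 'cc3.0' }

p values_for(single, 'contributor')
p values_for(multiple, 'contributor')
```

Both calls yield the same two contributors, which is why either layout can be used for the same property.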

Collections

As of v3.0.0, each collection is imported as its own row rather than as a column header. Use the format below to create or edit your CSV (the columns can be in any order).

In the example below, Second Work is a child of First Work, and both works are children of First Collection.
- If you don't want Second Work to be a child of First Collection, don't add the collection's source_identifier as a parent.
Collections can also be children of other collections.
A "children" column can also be used to establish relationships, but use either "parents" or "children", not both.
The character separating multiple source identifiers can be ;, | or whatever value has been set as the delimiter for the parents/children field in your bulkrax.rb mapping.

source_identifier	model	title	description	parents
collection_1	Collection	First Collection	This will be the collection's description
work_1	Work	First Work	This is a work	collection_1
	Work	Second Work	This is another work	work_1 ; collection_1
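
The relationships expressed by the "parents" column can be pictured with a short sketch. This is illustrative only and is not how Bulkrax builds relationships internally: children_by_parent and PARENT_DELIMITER are hypothetical names, and the delimiter is assumed to be ; or |.

```ruby
# Assumed delimiter: a semicolon or pipe, with optional surrounding spaces.
PARENT_DELIMITER = /\s*[;|]\s*/

# Hypothetical helper: maps each parent source_identifier to the titles of
# the rows that list it in their "parents" column.
def children_by_parent(rows)
  rows.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |row, tree|
    parent_ids = row['parents'].to_s.split(PARENT_DELIMITER).map(&:strip).reject(&:empty?)
    parent_ids.each { |parent_id| tree[parent_id] << row['title'] }
  end
end

rows = [
  { 'source_identifier' => 'collection_1', 'title' => 'First Collection', 'parents' => nil },
  { 'source_identifier' => 'work_1', 'title' => 'First Work', 'parents' => 'collection_1' },
  { 'source_identifier' => nil, 'title' => 'Second Work', 'parents' => 'work_1 ; collection_1' }
]

p children_by_parent(rows)
```

Removing collection_1 from the last row's "parents" value would leave Second Work as a child of First Work only.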

Caveats

Because a collection is now imported as a row with its own metadata, you must give the collection a source_identifier value to reference in the "parents" column of whatever work(s) you want to belong to it.
If you are importing works into an existing collection, you don't need the collection row. You must still reference either the source_identifier or id already attached to that collection in the "parents" column of the work(s) and/or collection(s).

Deprecated in v3.0.0

A column titled collection will be used to define which collection imported works should be added to. Works are added to collections based on the collection's source_identifier, which is provided in the collection field of the CSV. To create a new collection, put the title in the collection field.

Multiple collections can be supplied.

If the value provided matches a value found in the system_identifier_field of an existing collection, then works will be added to that collection. If not, a new collection will be created and both title and system_identifier_field will be set to the value supplied in the collection column.

For example:

source_identifier	title	collection
imported_work_1	Work One	Collection One
imported_work_2	Work Two	Collection One; Collection Two

In the first row (after the header), the Work being imported will be added to Collection One, and in the second, to both Collection One and Collection Two.

If either of those already exists, then the existing collection is used. If not, a new one is created.

Model

The model column is used to determine the work type. It is not required. In its absence, either the field mapping or default_work_type will be used. Read more about these in the Configuration guide.

Importing Files

Method 1

This method can import files and assign them to works, but it cannot assign metadata to the files themselves.

Files will be imported from a column called file_#, file_url_# or remote_files if they are present.

The file_# columns will each contain a single filename (these must be unique). Multiple files can be imported by using additional numbered headers.

The file_url_# columns will each contain a single URL to a file which will be downloaded and imported (these must be unique). Multiple files can be imported by using additional numbered headers.

The remote_files column will contain one or more URLs to files which will be downloaded and imported. Multiple files can be imported if the URLs are separated by a pipe (|). (Semicolons are valid URL syntax, so don't use them as the separator. URLs themselves MUST NOT contain pipes.)
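
The reason for the pipe separator can be seen in a short sketch. remote_file_urls is a hypothetical helper and the URLs are made-up examples; this is not Bulkrax's own parsing code.

```ruby
# Hypothetical helper: splits a remote_files cell on pipes only, so that
# semicolons inside a URL are left untouched.
def remote_file_urls(cell)
  cell.to_s.split('|').map(&:strip).reject(&:empty?)
end

# Made-up example URLs; the first one contains a semicolon in its query string.
cell = 'https://example.org/a.pdf?x=1;y=2 | https://example.org/b.tif'
p remote_file_urls(cell)
```

Splitting the same cell on semicolons would cut the first URL in two, which is why semicolons are not used as the separator for this column.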

Method 2

This method can import files, assign them to works, and assign metadata to the files themselves.

One or more files can be uploaded into a set that contains metadata. This is referred to as a "File Set".

NOTE: Currently (as of v2.1), this method does not support the file_url_# and remote_files columns mentioned in Method 1. Only the file_# column is supported. See the Important Configuration Details section below for more details on how to use the file_# column.

The following are required to import File Sets:

A unique source_identifier
The value "FileSet" (no spaces) in the configured model column
One or more file names in the file column
An identifier for the parent work that the File Set will be assigned to
- This identifier could be the Work's Bulkrax source_identifier or its ID

Example CSV:

source_identifier	model	file	parent	title	description
work_1	Work			My Work	This is a work
file_set_1	FileSet	image_1.png	work_1	My FileSet	This is a file set

Important Configuration Details

Regardless of which method you choose to ingest files, the following rules apply:

Files Location

If imported from a pre-existing server location, files MUST be placed in a directory called files relative to the location of the CSV file. By default, Bulkrax will process the file column in the provided CSV or treat all file_<number> columns (e.g. file_0, file_1, file_2) as the columns for filenames.

Below is an example of the current directory with the file metadata.csv and the sub-directory files containing two .tif files. When the metadata.csv has a column file, you can provide one or more filenames (separated by ;).

.
├── files
│   ├── P000001.tif
│   └── P000002.tif
└── metadata.csv

With the above current directory, when our metadata.csv looks as follows, we'd ingest one work and attach those two files to the work.

file,                      title
P000001.tif; P000002.tif,  My Work

Alternatively, with the same current directory, when our metadata.csv looks as follows, we'd ingest two works and attach one file to each work.

file,         title
P000001.tif,  My Work
P000002.tif,  My Other Work

With another example:

source_identifier	title	creator	publisher	file
first_work	First work title	Smith, John	Faber and Faber	document.pdf
second_work	Second work title	Jones, David	Macmillan	firstdocument.docx; seconddocument.pdf
third_work	Third work title	Other, A.N.	Penguin

If the CSV to be imported is written to the server at:

/tmp/imports/1/csv-to-be-imported.csv

The files would be at:

/tmp/imports/1/files/document.pdf
/tmp/imports/1/files/firstdocument.docx
/tmp/imports/1/files/seconddocument.pdf

The third_work does not have any associated files.
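
The path rule above can be sketched as follows. file_paths_for is a hypothetical helper, not a Bulkrax method, and it assumes filenames in the file column are separated by semicolons as in the examples.

```ruby
# Hypothetical helper: resolves the names in a "file" cell to paths inside
# the files directory that sits next to the CSV.
def file_paths_for(csv_path, file_cell)
  files_dir = File.join(File.dirname(csv_path), 'files')
  file_cell.to_s.split(';').map(&:strip).reject(&:empty?).map do |name|
    File.join(files_dir, name)
  end
end

csv_path = '/tmp/imports/1/csv-to-be-imported.csv'
p file_paths_for(csv_path, 'document.pdf')
p file_paths_for(csv_path, 'firstdocument.docx; seconddocument.pdf')
p file_paths_for(csv_path, nil) # third_work: no files
```

A row with an empty file cell, such as third_work, resolves to an empty list.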

If uploading using Browse Everything, the location of the files will be handled by the system.

Importing from a Zip file

A Zip file containing a single CSV and a folder named files/ can be imported by the CSV Importer. The structure of the Zip is very important and is as follows:

metadata.csv
files/
  |
  file_1.png
  file_2.jpg

See the Files Location section above for how to reference the files within the CSV.

In Finder, select the CSV and the files/ folder (cmd + click to select multiple items), right click, and select Compress. This will create the Zip file that will be imported.

NOTE: The names of the files themselves don't matter, as long as they match what's in the file column in the CSV. Likewise, the name of the CSV does not matter. However, the name of the folder containing the files does matter and should be written exactly as "files" (lowercase and plural). Also, the structure of the Zip is important; for example, if you compress a directory containing the CSV and the files/ folder, it will not import properly.
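
The layout requirement can be expressed as a small check over the entry names inside a Zip. This is a sketch under assumptions: zip_layout_problems is a hypothetical helper, and the list of entry names is assumed to come from whatever Zip library you use (none is called here).

```ruby
# Hypothetical helper: given the entry names inside a Zip, reports layout
# problems based on the structure described above.
def zip_layout_problems(entry_names)
  top_level = entry_names.reject { |name| name.include?('/') } # entries at the Zip root
  csvs = top_level.select { |name| name.downcase.end_with?('.csv') }

  problems = []
  problems << "expected exactly one CSV at the top level, found #{csvs.size}" unless csvs.size == 1
  problems << 'expected a top-level files/ folder' unless entry_names.any? { |name| name.start_with?('files/') }
  problems
end

good = ['metadata.csv', 'files/', 'files/file_1.png', 'files/file_2.jpg']
bad  = ['import/', 'import/metadata.csv', 'import/files/', 'import/files/file_1.png']

p zip_layout_problems(good)
p zip_layout_problems(bad)
```

The second list mirrors the mistake described in the note: compressing a parent directory pushes both the CSV and files/ one level down.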

Configuration and Customization

Please see the Configuration guide for information on how to configure and customize import, for example by excluding columns from import or splitting data on specific delimiters.

Bulkrax.setup do |config|
  # Map the CSV's original_identifier column to bulkrax_identifier and treat it as the
  # source identifier (note: the field must be available on all works and collections).
  config.field_mappings['Bulkrax::CsvParser'] = {
    'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
  }
end

Allow Bulkrax to create the source_identifier

If there isn't a field that's available and unique across all Works and Collections, Bulkrax can generate one. This can be configured in the local application as follows:

Bulkrax.setup do |config|
  # Generate an identifier for any record whose source_identifier is blank.
  config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
  config.field_mappings['Bulkrax::CsvParser'] = {
    'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
  }
end

You will also need to add the following to "app/indexers/shared_indexer" in your local app:

solr_doc[Solrizer.solr_name('bulkrax_identifier', :facetable)] = object.bulkrax_identifier

