CSV Importer
Bulkrax can import from a CSV file that follows these guidelines:
- The CSV MUST have a header row to uniquely identify the record.
- This header row MUST have a field representing the `source_identifier`, containing a unique identifier for the item (see the link below for more detail).
- The CSV MUST have a `title` column.
- Every work MUST have a value in the fields representing the `source_identifier` and `title` (unless you are auto-generating source identifiers in the Bulkrax config file).

Refer to https://github.com/samvera-labs/bulkrax/wiki/Configuring-Bulkrax#source-identifier.
All columns will be imported if the column name matches an existing metadata property in Hyrax, e.g. `title`, `creator`, etc.
In addition, the following columns will be imported:
- `collection` or `collection_#` (deprecated in v3.0.0)
- `file` or `file_#`
- `file_url` or `file_url_#`
- `remote_files`
- `model`
A property's value is most often a single string or an array of strings. We are also accounting for the value being an array of hashes. Refer to the field mapping for more configuration details on how to handle these use cases.
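There are two ways that a property with multiple values can be imported.

The first is to put all values in a single column, separated by a delimiter:

contributor | language | license |
---|---|---|
Aaliyah; Ruth | En | cc3.0 |

The second is to use numbered columns, with one value per column:

contributor_1 | contributor_2 | language | license |
---|---|---|---|
Aaliyah | Ruth | En | cc3.0 |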
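Both layouts carry the same information. As a rough illustration only (this is not Bulkrax's own parsing code), the sketch below collects the values of one property from a row that has already been parsed into a Hash. The helper name `multi_values` and the `;` default delimiter are assumptions made for the example.

```ruby
# Hypothetical helper, not part of Bulkrax: gathers every value for one
# property from a parsed CSV row (a Hash of header => cell).
def multi_values(row, property, delimiter: ';')
  numbered = /\A#{Regexp.escape(property)}_(\d+)\z/
  # Keep the bare column plus any numbered variants, in numeric order.
  columns = row.keys.select { |key| key == property || key.match?(numbered) }
  columns = columns.sort_by { |key| key[numbered, 1].to_i }
  columns.flat_map { |key| row[key].to_s.split(delimiter) }
         .map(&:strip)
         .reject(&:empty?)
end

delimited_row = { 'contributor' => 'Aaliyah; Ruth', 'language' => 'En', 'license' => 'cc3.0' }
numbered_row  = { 'contributor_1' => 'Aaliyah', 'contributor_2' => 'Ruth', 'language' => 'En', 'license' => 'cc3.0' }

p multi_values(delimited_row, 'contributor') # => ["Aaliyah", "Ruth"]
p multi_values(numbered_row, 'contributor')  # => ["Aaliyah", "Ruth"]
```

Either form yields the same list, so the choice between them is mostly a matter of which is easier to maintain in your spreadsheet.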
As of v3.0.0, collections are imported as their own row instead of as a column header. Use the format below to create or edit your CSV (the order of the columns can be different).
- In the example below, Second Work is a child of First Work, while both works are children of Collection.
- If you don't want Second Work to be a child of Collection, don't add the Collection `source_identifier` as a parent.
- Collections can also be children of other collections.
- A "children" column can also be used to establish relationships, but use either "parents" or "children", not both.
- The character separating multiple source identifiers can be a `;`, a `|`, or whatever value has been established as the delimiter for the parents/children field in your `bulkrax.rb` mapping.
| source_identifier | model | title | description | parents |
|---|---|---|---|---|
| collection_1 | Collection | First Collection | This will be the collection's description | |
| work_1 | Work | First Work | This is a work | collection_1 |
| | Work | Second Work | This is another work | work_1 ; collection_1 |
- Since a collection is now imported as a row with its own metadata, you must give the collection a `source_identifier` value to reference in the "parents" column of whatever work(s) you want to belong to it.
- If you are importing works into an existing collection, you don't need the collection row. You must still reference either the `source_identifier` or `id` already attached to that collection in the "parents" column of the work(s) and/or collection(s).
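To make the "parents" column concrete, here is a minimal sketch that turns rows like those in the table above into a parent-to-children lookup. It is an illustration under stated assumptions, not Bulkrax's relationship code: the helper `children_by_parent`, the `PARENT_DELIMITER` pattern, and the identifier `work_2` for the second work are all hypothetical.

```ruby
# Accept either ";" or "|" between identifiers, with optional whitespace.
PARENT_DELIMITER = /\s*[;|]\s*/

# Hypothetical helper: maps each parent identifier to the identifiers of
# the rows that name it in their "parents" cell.
def children_by_parent(rows)
  rows.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |row, tree|
    row['parents'].to_s.split(PARENT_DELIMITER).reject(&:empty?).each do |parent_id|
      tree[parent_id] << row['source_identifier']
    end
  end
end

rows = [
  { 'source_identifier' => 'collection_1', 'model' => 'Collection', 'parents' => '' },
  { 'source_identifier' => 'work_1', 'model' => 'Work', 'parents' => 'collection_1' },
  { 'source_identifier' => 'work_2', 'model' => 'Work', 'parents' => 'work_1 ; collection_1' } # work_2 is an assumed identifier
]

p children_by_parent(rows)
# => {"collection_1"=>["work_1", "work_2"], "work_1"=>["work_2"]}
```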
A column titled `collection` (deprecated in v3.0.0) defines which collections the imported works should be added to. Works are added to collections based on the collection's source_identifier, which is provided in the `collection` field of the CSV. To create a new collection, put its title in the `collection` field.
Multiple collections can be supplied.
If the value provided matches a value found in the `system_identifier_field` of an existing collection, the works will be added to that collection. If not, a new collection will be created and both its title and `system_identifier_field` will be set to the value supplied in the `collection` column.
For example:
source_identifier | title | collection |
---|---|---|
imported_work_1 | Work One | Collection One |
imported_work_2 | Work Two | Collection One; Collection Two |
In the first row (after the header), the work being imported will be added to Collection One, and in the second, to both Collection One and Collection Two.
If either of those collections already exists, the existing collection is used. If not, a new one is created.
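The find-or-create behaviour described above can be sketched with an in-memory Hash standing in for the repository. Everything here is hypothetical (the helper `assign_collections`, the Hash layout, and the `created_by_import` flag); it only illustrates the matching rule, not how Bulkrax persists collections.

```ruby
# Hypothetical helper: returns the collections named in a "collection"
# cell, creating an entry for any identifier that is not known yet.
def assign_collections(cell, existing, delimiter: ';')
  cell.to_s.split(delimiter).map(&:strip).reject(&:empty?).map do |value|
    # A new collection gets the supplied value as both title and identifier.
    existing[value] ||= { title: value, identifier: value, created_by_import: true }
  end
end

existing = {
  'Collection One' => { title: 'Collection One', identifier: 'Collection One', created_by_import: false }
}

assigned = assign_collections('Collection One; Collection Two', existing)
p assigned.map { |collection| collection[:title] } # => ["Collection One", "Collection Two"]
p existing.keys                                    # => ["Collection One", "Collection Two"]
```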
The `model` column is used to determine the work type. It is not required. In its absence, either the field mapping or `default_work_type` will be used. Read more about these in the Configuration guide.
Method 1 can import files and assign them to works, but cannot assign metadata to the files themselves.
Files will be imported from columns called `file_#`, `file_url_#` or `remote_files` if they are present.
The `file_#` columns each contain a single filename (these must be unique). Multiple files can be imported by using additional numbered headers.
The `file_url_#` columns each contain a single URL to a file, which will be downloaded and imported (these must be unique). Multiple files can be imported by using additional numbered headers.
The `remote_files` column contains one or more URLs to files, which will be downloaded and imported. Multiple files can be imported if separated by a pipe (`|`). Semicolons are valid URL syntax, so don't use them as the separator. URLs themselves MUST NOT contain pipes.
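The reason for the pipe separator is easier to see with a URL that contains a semicolon. The sketch below is illustrative only; `remote_file_urls` is a hypothetical helper and the example.org URLs are placeholders.

```ruby
# Hypothetical helper: splits a "remote_files" cell on pipes only, so
# semicolons inside a URL are left untouched.
def remote_file_urls(cell)
  cell.to_s.split('|').map(&:strip).reject(&:empty?)
end

cell = 'https://example.org/report;v=2.pdf | https://example.org/cover.png'
p remote_file_urls(cell)
# => ["https://example.org/report;v=2.pdf", "https://example.org/cover.png"]
```

Splitting the same cell on `;` would cut the first URL in two, which is why the pipe is the required separator here.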
The second method can import files, assign them to works, and assign metadata to the files themselves.
One or more files can be uploaded into a set that contains metadata. This is referred to as a "File Set".
NOTE: Currently (as of v2.1), this method does not support the `file_url_#` and `remote_files` columns mentioned in Method 1. Only the `file_#` column is supported. See the Important Configuration Details section below for more details on how to use the `file_#` column.
The following are required to import File Sets:
- A unique source_identifier
- The value "FileSet" (no spaces) in the configured model column
- One or more file names in the file column
- An identifier for the parent work that the File Set will be assigned to
  - This identifier can be the Work's Bulkrax `source_identifier` or its ID
Example CSV:
source_identifier | model | file | parent | title | description |
---|---|---|---|---|---|
work_1 | Work | | | My Work | This is a work |
file_set_1 | FileSet | image_1.png | work_1 | My FileSet | This is a file set |
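A pre-flight check for the File Set requirements listed above might look like the following. This is a sketch under assumptions, not a Bulkrax feature: `file_set_errors` and `REQUIRED_FILE_SET_COLUMNS` are hypothetical names, and the column headers `model`, `file` and `parent` are taken from the example CSV (your field mapping may use different ones).

```ruby
# Columns a FileSet row needs, using the headers from the example CSV.
REQUIRED_FILE_SET_COLUMNS = %w[source_identifier file parent].freeze

# Hypothetical check: returns a message for each required cell that is
# blank on a FileSet row, and nothing for any other kind of row.
def file_set_errors(row)
  return [] unless row['model'] == 'FileSet'

  REQUIRED_FILE_SET_COLUMNS
    .select { |column| row[column].to_s.strip.empty? }
    .map { |column| "#{column} is required for a FileSet row" }
end

complete   = { 'source_identifier' => 'file_set_1', 'model' => 'FileSet', 'file' => 'image_1.png', 'parent' => 'work_1' }
incomplete = { 'source_identifier' => 'file_set_2', 'model' => 'FileSet', 'file' => '', 'parent' => nil }

p file_set_errors(complete)   # => []
p file_set_errors(incomplete) # => ["file is required for a FileSet row", "parent is required for a FileSet row"]
```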
Regardless of which method you choose to ingest files, the following rules apply.
If imported from a pre-existing server location, files MUST be placed in a directory called `files` relative to the location of the CSV file. By default, Bulkrax will process the `file` column in the provided CSV or treat all `file_<number>` columns (e.g. `file_0`, `file_1`, `file_2`) as the columns for filenames.
Below is an example of the current directory with the file `metadata.csv` and the sub-directory `files` containing two `.tif` files. When `metadata.csv` has a `file` column, you can provide one or more filenames (separated by `;`).
.
├── files
│ ├── P000001.tif
│ └── P000002.tif
└── metadata.csv
With the above current directory, when our `metadata.csv` looks as follows, we'd ingest one work and attach both files to that work.
file, title
P000001.tif; P000002.tif, My Work
Alternatively, with the same current directory, when our `metadata.csv` looks as follows, we'd ingest two works and attach one file to each work.
file, title
P000001.tif, My Work
P000002.tif, My Other Work
As another example:
source_identifier | title | creator | publisher | file |
---|---|---|---|---|
first_work | First work title | Smith, John | Faber and Faber | document.pdf |
second_work | Second work title | Jones, David | Macmillan | firstdocument.docx; seconddocument.pdf |
third_work | Third work title | Other, A.N. | Penguin |
If the CSV to be imported is written to the server at:
/tmp/imports/1/csv-to-be-imported.csv
The files would be at:
/tmp/imports/1/files/document.pdf
/tmp/imports/1/files/firstdocument.docx
/tmp/imports/1/files/seconddocument.pdf
The third_work does not have any associated files.
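The path rule above (a `files` directory next to the CSV) can be expressed in a few lines. This is only a sketch of the lookup convention; `file_paths` is a hypothetical helper and the `;` delimiter is the one used in the examples.

```ruby
# Hypothetical helper: resolves the filenames in a "file" cell against
# the "files" directory that sits beside the CSV.
def file_paths(csv_path, file_cell, delimiter: ';')
  files_dir = File.join(File.dirname(csv_path), 'files')
  file_cell.to_s.split(delimiter).map(&:strip).reject(&:empty?)
           .map { |name| File.join(files_dir, name) }
end

csv = '/tmp/imports/1/csv-to-be-imported.csv'
p file_paths(csv, 'firstdocument.docx; seconddocument.pdf')
# => ["/tmp/imports/1/files/firstdocument.docx", "/tmp/imports/1/files/seconddocument.pdf"]
p file_paths(csv, nil) # => []
```

A row with an empty `file` cell, such as third_work, simply resolves to no paths.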
If uploading using Browse Everything, the location of the files will be handled by the system.
A Zip file containing a single CSV and a folder named `files/` can be imported by the CSV Importer. The structure of the Zip is very important and is as follows:
metadata.csv
files/
├── file_1.png
└── file_2.jpg
See the Files Location guide for how to reference the files within the CSV.
In Finder (macOS), select the CSV and the `files/` folder (`cmd + click` to select multiple items), right click, and select Compress. This will create the Zip file that will be imported.
NOTE: The names of the files themselves don't matter, as long as they match what's in the `file` column in the CSV. Likewise, the name of the CSV does not matter. However, the name of the folder containing the files does matter and must be written exactly as "files" (lowercase and plural). The structure of the Zip is also important; for example, if you compress a directory containing the CSV and the `files/` folder, it will not import properly.
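One way to catch a wrongly structured Zip before uploading is to inspect its entry names. The sketch below works on a plain list of names (obtained however you like, for example from your archive tool's listing) rather than on the Zip itself; `valid_zip_layout?` is a hypothetical helper, not something Bulkrax provides.

```ruby
# Hypothetical check: the archive needs exactly one CSV at the top level
# and a top-level "files/" directory (lowercase, plural).
def valid_zip_layout?(entry_names)
  top_level_csvs = entry_names.select do |name|
    !name.include?('/') && name.downcase.end_with?('.csv')
  end
  has_files_dir = entry_names.any? { |name| name.start_with?('files/') }
  top_level_csvs.size == 1 && has_files_dir
end

good    = ['metadata.csv', 'files/', 'files/file_1.png', 'files/file_2.jpg']
wrapped = ['export/', 'export/metadata.csv', 'export/files/', 'export/files/file_1.png']

p valid_zip_layout?(good)    # => true
p valid_zip_layout?(wrapped) # => false
```

The second list is what you get when a containing directory is compressed instead of its contents, which is the case the note above warns about.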
Please see the Configuration guide for information on how to configure and customize import, for example by excluding columns from import or splitting data on specific delimiters.
Bulkrax.setup do |config|
  # Map the CSV's original_identifier column to bulkrax_identifier and mark it as the
  # source identifier (note: the field must be available on all works and collections).
  config.field_mappings['Bulkrax::CsvParser'] = {
    'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
  }
end
- Allow Bulkrax to create the `source_identifier`
- If there isn't a field that's available and unique across all Works and Collections, Bulkrax can make a custom field. An example of how this can be configured in the local application is as follows:
config.fill_in_blank_source_identifiers = ->(obj, index) { "#{Site.instance.account.name}-#{obj.importerexporter.id}-#{index}" }
config.field_mappings['Bulkrax::CsvParser'] = {
  'bulkrax_identifier' => { from: ['original_identifier'], source_identifier: true }
}
- You will also need to add the following to "app/indexers/shared_indexer" in your local app:
solr_doc[Solrizer.solr_name('bulkrax_identifier', :facetable)] = object.bulkrax_identifier
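To get a feel for the identifiers such a lambda produces, it can be exercised outside the application with stand-in objects. This is a sketch only: `ImporterStub` and `EntryStub` are hypothetical Structs, and the fixed prefix `demo` replaces `Site.instance.account.name`, which is only available inside the application.

```ruby
# Stand-ins for the objects passed to the lambda; both are hypothetical.
ImporterStub = Struct.new(:id)
EntryStub = Struct.new(:importerexporter)

# Same shape as the configured lambda, with a fixed prefix in place of
# the account name.
FILL_IN_BLANK = ->(obj, index) { "demo-#{obj.importerexporter.id}-#{index}" }

entry = EntryStub.new(ImporterStub.new(7))
ids = 3.times.map { |index| FILL_IN_BLANK.call(entry, index) }
p ids # => ["demo-7-0", "demo-7-1", "demo-7-2"]
```

Because the row index is part of the value, the generated identifiers are unique within one importer.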