This document explains, step by step, how to ingest documents from an archive at a URL so that they become searchable with this AWS semantic search solution.
If you are on macOS, make sure that compressing your documents into the archive does not add AppleDouble (._*) files; see also this StackExchange answer to the question "Create tar archive of a directory, except for hidden files?".
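One way to build such an archive is sketched below. The function name `make_docs_archive` is hypothetical and not part of this solution. The sketch assumes a `tar` that supports `--exclude` (GNU tar and the bsdtar shipped with macOS both do) and relies on the `COPYFILE_DISABLE` environment variable, which macOS tar honors to skip AppleDouble entries.

```shell
# Hypothetical helper (not part of this solution): pack SRC_DIR into a
# tar.gz archive without AppleDouble (._*) or .DS_Store entries.
make_docs_archive() {
  src_dir="$1"    # directory that holds the documents
  out_file="$2"   # archive to create, for example data.tar.gz

  # COPYFILE_DISABLE=1 asks macOS tar not to emit ._* entries; other
  # platforms ignore the variable. The --exclude patterns additionally
  # drop any ._* or .DS_Store files that already exist on disk.
  COPYFILE_DISABLE=1 tar -czf "$out_file" \
    --exclude='._*' --exclude='.DS_Store' \
    -C "$src_dir" .
}
```

For example, `make_docs_archive ~/my-docs data.tar.gz` (both paths are placeholders) writes the archive, and `tar -tzf data.tar.gz` lists its entries so you can confirm that no `._*` files are present.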
The next steps guide you through ingesting documents from a URL into the Amazon OpenSearch index. The URL must point to an archive (zip, gz or tar.gz) and must be accessible from the NLPSearchPrivateSubnet subnet. The subnet can reach the internet through a NAT gateway.
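Before deploying, a rough local pre-check of the URL can catch an unsupported file type early. The helper below is a hypothetical sketch: it only inspects the file extension in the URL path (ignoring any query string, such as the one on a presigned URL) and makes no claim about how the ingestion task itself detects the archive type or whether the URL is reachable from the subnet.

```shell
# Hypothetical helper (not part of this solution): succeed if the URL path
# ends in one of the archive extensions listed above (zip, gz, tar.gz).
is_supported_archive_url() {
  url_path="${1%%\?*}"   # strip the query string, if there is one
  case "$url_path" in
    *.zip|*.gz) return 0 ;;   # *.gz also covers *.tar.gz
    *) return 1 ;;
  esac
}
```

For example, `is_supported_archive_url "$DOCS_SRC" || echo "unsupported archive type"` prints a warning when the extension does not match.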
- In your terminal, navigate to the ingestion directory
cd ~/semantic-search-aws-docs/ingestion
- Initialize Terraform
terraform init
- Set the DOCS_SRC variable to the URL from which you want to ingest the documents. For example, if your documents are in Amazon S3, you could create a presigned URL for the archive and assign it, quoted so that the shell does not interpret the & characters:
DOCS_SRC="https://<BUCKET_NAME>.s3.<REGION>.amazonaws.com/data.zip?response-content-disposition=inline&X-Amz-Security-Token=[...]"
- Deploy the ingestion resources
terraform apply -var-file="urldocs.tfvars" -var="infra_region=$REGION" -var="infra_tf_state_s3_bucket=$S3_BUCKET" -var="docs_src=$DOCS_SRC"
- Enter yes when Terraform prompts you "to perform these actions".
- After the ingestion resources deploy successfully, wait until the ingestion task completes. Follow the Wait for Ingestion to Complete instructions to check for completion from your terminal.
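For the DOCS_SRC step above, a presigned URL can be generated with the AWS CLI rather than copied from the console. The wrapper below is a hypothetical sketch around `aws s3 presign`; it assumes the AWS CLI is installed and configured with credentials that can read the object, and the one-hour default lifetime is an arbitrary choice.

```shell
# Hypothetical helper (not part of this solution): print a presigned GET URL
# for s3://BUCKET/KEY. The optional third argument is the URL lifetime in
# seconds and defaults to 3600 (one hour).
presign_docs_url() {
  bucket="$1"
  key="$2"
  expires="${3:-3600}"
  aws s3 presign "s3://$bucket/$key" --expires-in "$expires"
}
```

For example, `DOCS_SRC="$(presign_docs_url <BUCKET_NAME> data.zip)"` assigns the URL, already safely quoted, for use in the terraform apply step. The URL has to remain valid until the ingestion task has downloaded the archive, so choose the lifetime accordingly.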
After ingesting your documents, you can remove the ingestion resources by following the Clean up Ingestion Resources instructions.