PoC: Add support for indexing the content of attachments using sunspot_cell #6
This PR adds support for text extraction from binary attachments using Tika. Regrettably, there is currently no centrally maintained version of `sunspot_cell`. The waterfield fork seems to be the most advanced one.

It is currently necessary to use `xml` as the sunspot `update_format`. The json format can be fixed with waterfield/sunspot_cell#1.

The following `config/sunspot.yml` is working for me:

I'm using docker solr 7 for testing purposes, with the core config from the sunspot examples plus libs, fields, fieldTypes and an ExtractingRequestHandler added (see attached sunspot-configset_cell.patch.txt).
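As a rough illustration of the `update_format` setting mentioned above, a `config/sunspot.yml` could look like the sketch below. This is not the exact file from this PR: the hostname, port and core path are placeholder assumptions, and placing `update_format` under the `solr` key is assumed from the usual sunspot_rails layout.

```yaml
# Hypothetical sketch, not the exact config used in this PR.
development:
  solr:
    hostname: localhost        # placeholder: host of the docker solr container
    port: 8983                 # placeholder: Solr's default port
    path: /solr/development    # placeholder: path to the core
    log_level: INFO
    update_format: xml         # json does not work until waterfield/sunspot_cell#1 is applied
```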
This PR is at the proof-of-concept stage. The necessary config changes could be incorporated into the included Solr config example. It would also be nice if the search results displayed a result snippet and not only the attachment...