Hi all,
I'm using Elasticsearch v6.6 and have a suggest feature implemented in my app. I don't want to move to the search_as_you_type field for now (it is only available from 7.2).
I have a suggestions field [^1] which I use for a completion query, with a custom analyzer declared in the index template settings [^2]. This field is filled (via an ingest plugin, see below) with a transformed version of the content of a description field (full-text content), plus several other simple text fields which are copied into it via the mapping (using copy_to).
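For reference, here is a minimal sketch of the kind of completion request I send, written as a Python function that builds the `_search` body. The suggester name `name-suggest`, the default size, and the use of `skip_duplicates` are illustrative assumptions, not my exact production query.

```python
import json


def completion_request(prefix, size=5):
    """Build a _search body that runs a completion suggester on `suggestions`."""
    return {
        "suggest": {
            # "name-suggest" is a hypothetical suggester name.
            "name-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": "suggestions",
                    "size": size,
                    # Collapse identical suggestion texts coming from different documents.
                    "skip_duplicates": True,
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(completion_request("ela"), indent=2))
```

Note that `skip_duplicates` only filters the results; the duplicated inputs still take up heap, which is the actual problem described below.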
A few months ago we did not have much data, but we are now hitting a wall in terms of JVM heap usage, because the completion field is very greedy (18 GB of RAM for this field alone).
I believe the data ingested into this field would be better stored in a separate index used only for completion, instead of duplicating many of the same words across most of the 50M documents in the current index.
So I was thinking about a dedicated suggest index containing only the suggestions field, holding all the suggestions but deduplicated in order to limit the RAM usage. To do so, I plan to update my ingest plugin [^3] (which analyzes each document on the fly and stores the result in its suggestions field) so that it copies the analyzed data to another index, where each document is a single possible suggestion.
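To make the deduplication idea concrete, here is a minimal sketch in Python. It is not the ingest plugin itself but a client-side equivalent: it takes the token lists produced for several documents and builds a `_bulk` body with one document per distinct suggestion. The index name `suggestions-dedup`, the SHA-1 based `_id`, and using the occurrence count as the completion `weight` are all assumptions made for illustration.

```python
import hashlib
import json
from collections import Counter

SUGGEST_INDEX = "suggestions-dedup"  # hypothetical name for the dedicated index


def suggestion_id(text):
    """Deterministic _id, so re-indexing the same suggestion overwrites one document."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_bulk_body(token_lists):
    """Return an NDJSON _bulk body with one document per distinct lowercased token."""
    # Lowercasing mirrors the lowercase filter of custom_suggestion_analyzer.
    counts = Counter(tok.lower() for tokens in token_lists for tok in tokens)
    lines = []
    for text, count in sorted(counts.items()):
        action = {
            "index": {
                "_index": SUGGEST_INDEX,
                "_type": "_doc",
                "_id": suggestion_id(text),
            }
        }
        # Assumption: the number of occurrences is used as the suggestion weight.
        source = {"suggestions": {"input": text, "weight": count}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body([["Some", "content"], ["some", "other"]]), end="")
```

With this layout each distinct suggestion is stored once, whatever the number of source documents containing it, which is where I would expect the memory savings to come from.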
Question 1: does this approach of indexing all the content to suggest in a dedicated index look relevant, and should it reduce RAM consumption?
Question 2: is it possible to get a configured Elasticsearch Java client from a custom processor in an ingest plugin, in order to copy the transformed data into another index?
Thanks for any advice!
[^1]:
{
  "dynamic": "strict",
  "properties": {
    "all other fields": "are not shown for brevity",
    "name": {
      "type": "text",
      "copy_to": "suggestions",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "description": {
      "analyzer": "english",
      "type": "text"
    },
    "suggestions": {
      "type": "completion",
      "analyzer": "custom_suggestion_analyzer"
    }
  }
}
[^2]:
{
  "number_of_replicas": 0,
  "number_of_shards": 4,
  "refresh_interval": "30s",
  "analysis": {
    "analyzer": {
      "custom_suggestion_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      }
    }
  }
}
[^3]:
# Declare the pipeline that uses the custom processor
PUT _ingest/pipeline/description-tokenizer-pipeline
{
  "description": "Convert a description field to make it usable by an auto-completion query",
  "processors": [
    {
      "description_tokenizer": {
        "field": "my_description_field_containing_phrases",
        "target_field": "the_field_where_processed_data_will_be_copied"
      }
    }
  ]
}

# Index a document through the pipeline
PUT /my-index/my-type/1?pipeline=description-tokenizer-pipeline
{
  "my_description_field_containing_phrases": "Some content to be analyzed and tokenized."
}

# Fetch it back
GET /my-index/my-type/1

# Response (abbreviated)
{
  "my_description_field_containing_phrases": "Some content to be analyzed and tokenized.",
  "the_field_where_processed_data_will_be_copied": ["Some", "content", "analyzed", "tokenized"]
}