Index data into another index from an ingest processor, for a suggestion feature using a completion field

Hi all,

I'm using Elasticsearch v6.6, and I have a suggest feature implemented in my app. I don't want to move to the search_as_you_type field for now (it only became available in 7.2).

I have a suggestions field [^1] which I use for a completion query, with a custom analyzer declared in the index template settings [^2]. This field is filled (via an ingest plugin, see below) with a transformed version of the content of a description field (which contains full-text content), plus several other simple text fields that are copied into it via the mapping (using copy_to).

A few months ago we did not have much data, but we are now hitting a hard limit in terms of JVM heap usage, because the completion field is very greedy (18 GB of RAM for this field alone).

I believe the data ingested into this field would be better stored in a separate index used only for completion, instead of duplicating lots of words across most of the 50M documents in the current index.

So I was thinking about a dedicated suggest index containing only the suggestions field, with all the suggestions deduplicated in order to limit the huge RAM usage. To do so, I plan to update my ingest plugin [^3] (which analyzes each document on the fly and stores the result in its suggestions field) so that it also copies the analyzed data to another index, where each document holds a single suggestion.
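
To make the idea concrete, here is a rough sketch of the dedicated suggest index I have in mind (index name, type name and shard settings below are placeholders, not my real ones); each unique suggestion would be stored once, as its own small document:

PUT suggestions-only-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "custom_suggestion_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "suggestions": {
          "type": "completion",
          "analyzer": "custom_suggestion_analyzer"
        }
      }
    }
  }
}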

Question 1: does this approach of indexing all the content to suggest into a dedicated index look relevant, and should it reduce RAM consumption?

Question 2: is it possible to get a configured Elasticsearch Java client from a custom processor in an ingest plugin, in order to copy the transformed data into another index?

Thanks for any advice!

[^1] :

{
    "dynamic": "strict",
    "properties": {
        "all other fields": "are not shown for brevity",
        "name": {
            "type": "text",
            "copy_to": "suggestions",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        },
        "description": {
            "analyzer": "english",
            "type": "text"
        },
        "suggestions": {
            "type" : "completion",
            "analyzer": "custom_suggestion_analyzer"
        }
    }
}
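
For context, this is roughly how the suggestions field above is queried today (the prefix and suggestion name are just examples):

GET my-index/_search
{
  "suggest": {
    "my-suggestions": {
      "prefix": "som",
      "completion": {
        "field": "suggestions"
      }
    }
  }
}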

[^2]:

{
    "number_of_replicas": 0,
    "number_of_shards": 4,
    "refresh_interval": "30s",
    "analysis": {
        "analyzer": {
            "custom_suggestion_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase"
                ]
            }
        }
    }
}
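
If it helps, the analyzer output can be checked with the _analyze API; with this configuration I simply get lowercased standard tokens:

GET my-index/_analyze
{
  "analyzer": "custom_suggestion_analyzer",
  "text": "Some content to be analyzed and tokenized."
}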

[^3]:

PUT _ingest/pipeline/description-tokenizer-pipeline
{
  "description": "Convert a description field to make it usable by an auto-completion query",
  "processors": [
    {
      "description_tokenizer" : {
        "field" : "my_description_field_containing_phrases",
        "target_field": " the_field_where_processed_data_will_be_copied"
      }
    }
  ]
}

PUT /my-index/my-type/1?pipeline=description-tokenizer-pipeline
{
  "my_description_field_containing_phrases" : "Some content to be analyzed and tokenized."
}

GET /my-index/my-type/1
{
  "my_field_containing_phrases" : "Some content to be analyzed and tokenized.",
  "tokenized_field": ["Some", "content", "analyzed", "tokenized"]
}
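
(For anyone wanting to reproduce this, the custom processor can also be tested without indexing anything, via the simulate API:)

POST _ingest/pipeline/description-tokenizer-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "my_description_field_containing_phrases": "Some content to be analyzed and tokenized."
      }
    }
  ]
}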

Bumping the thread, hoping someone from Elastic can answer at least one of my two questions.

Cheers.

In the end, I transformed the data externally and indexed it elsewhere.

I transformed the data with Unix tools (parallel, sed, tr, jq) to remove a lot of irrelevant information, which drastically decreased the size of the Lucene index, by a factor of 10!
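
Concretely, the deduplicated suggestions are then loaded into the dedicated index as one tiny document each, via the bulk API (index and type names here are illustrative, not my real ones):

POST _bulk
{ "index": { "_index": "suggestions-only-index", "_type": "_doc" } }
{ "suggestions": "some deduplicated suggestion" }
{ "index": { "_index": "suggestions-only-index", "_type": "_doc" } }
{ "suggestions": "another deduplicated suggestion" }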

I did not find a way to get a client from an ingest plugin.
