Understanding the fielddata mapping parameter

I'm attempting to build a kind of "next word prediction" using Elasticsearch. The goal is to suggest search terms based on different fields of the index.

I found a solution based on the approach described in "Search like a Google with Elasticsearch. Autocomplete, Did you mean and search for items." by Volodymyr Bilyachat, which runs aggregations on a "synthetic" suggestion field that uses a shingle filter.

This works as I expected, but I had to enable fielddata: true on the suggestions field to get access to the tokens created by the shingle filter and to aggregate on them.
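For context, here is a minimal sketch of what such an index definition could look like, written as a Python dict that serializes to the request body. The filter name `shingle_2_4` and the analyzer name `suggestion_analyzer` are placeholders chosen for illustration, and the `standard` tokenizer plus `lowercase` filter are assumptions; only the `suggestions` field name, the 2/4 shingle sizes, and `fielddata: true` come from the setup described here.

```python
import json

# Hypothetical index body: the analyzer/filter names are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Shingle filter producing 2- to 4-word shingles.
                # output_unigrams defaults to true, so single words are kept too.
                "shingle_2_4": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 4,
                }
            },
            "analyzer": {
                "suggestion_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",  # assumed tokenizer
                    "filter": ["lowercase", "shingle_2_4"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "suggestions": {
                "type": "text",
                "analyzer": "suggestion_analyzer",
                # Required to run a terms aggregation on an analyzed text field.
                "fielddata": True,
            }
        }
    },
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```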

According to the documentation (Text field type | Elasticsearch Reference [7.11] | Elastic), enabling this will "... load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory..."

Can someone help me quantify what that can mean in terms of memory usage?

Let's say I have an index with 1 million documents, and the suggestion field contains English sentences averaging 20 words. My 2/4 shingle filter would then create 74 tokens for each of those sentences (20 single words plus 19 + 18 + 17 shingles of two, three and four words).
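As a sanity check on that figure, here is a small sketch of the arithmetic. It assumes the shingle filter's `output_unigrams` option is left at its default of true, so the single words are emitted alongside the shingles; the helper name `shingle_token_count` is made up for this example.

```python
def shingle_token_count(num_words, min_size=2, max_size=4, output_unigrams=True):
    """Count the tokens a shingle filter emits for one sentence of num_words words."""
    # Unigrams are the original words, emitted only when output_unigrams is on.
    count = num_words if output_unigrams else 0
    # A sentence of n words contains n - k + 1 shingles of k words each.
    for size in range(min_size, max_size + 1):
        count += max(num_words - size + 1, 0)
    return count


tokens_per_doc = shingle_token_count(20)
print(tokens_per_doc)              # 74
print(tokens_per_doc * 1_000_000)  # 74000000 token occurrences across the index
```

Note that this counts token occurrences, not distinct terms; the number of unique terms in the field depends on how much the sentences overlap.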

Now, when I query and aggregate terms on that field with few restrictions, as in the request below, what gets loaded into memory when fielddata is enabled, and what can help me quantify that?

{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "suggestions",
        "include": "c.*"
      }
    }
  },
  "query": {
    "prefix": {
      "suggestions": {
        "value": "c"
      }
    }
  }
}
