Doc_values all the things! But, what about complex analyzers?


(Marshall Powers) #1

A few weeks ago, our Found instance stopped serving queries because it ran out of memory; in particular, we were seeing errors about our fielddata getting too large. We managed to recover, but we want to minimize our memory usage going forward.

We read that using doc_values can help keep the fielddata small, so we decided to try them. Our first pass was to enable doc_values on every field we possibly could. However, that still left us with several analyzed string fields that are being used for sorting. For some of these string fields, I figured we could do the "analysis" on our end, before inserting the data into Elasticsearch. For example, we have a "sortable" analyzer:

{
      "tokenizer": "keyword",
      "filter": ["lowercase"]
}

When we analyze a string in Marvel, it looks like this:

GET ali-development/_analyze?analyzer=sortable&text=Jon+Lee

{
   "tokens": [
      {
         "token": "jon lee",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 1
      }
   ]
}

So my plan is: add a new field to the mapping, say "name_sortable", with type "string" and index "not_analyzed", then pre-process my documents before sending them to Elasticsearch so they include a downcased copy of the "name" field. My first question: is this idea actually correct? Will it give us an essentially case-insensitive sort without creating huge fielddata?
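For the "sortable" case, the pre-processing I have in mind seems trivial, since the keyword tokenizer emits the whole string as one token and the lowercase filter just downcases it. A rough sketch of what I'd run before indexing (the field name "name_sortable" is just what I'd call it; the real mapping would mark it not_analyzed with doc_values enabled):

```python
def make_sortable(value):
    # Mimic the "sortable" analyzer: the keyword tokenizer emits the whole
    # string as a single token, and the lowercase filter downcases it, so
    # the client-side equivalent is just str.lower().
    return value.lower()

doc = {"name": "Jon Lee"}
doc["name_sortable"] = make_sortable(doc["name"])
print(doc["name_sortable"])  # jon lee -- same as the single token above
```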

My next concern is that we also use some more complex analyzers, and I don't know how to pre-process things on my end to get the same result. For example, we have an "autocomplete" analyzer that I might need to replace with a pre-processed not_analyzed doc_values field:

{
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase", "edge_ngram"],
      "stopwords": "_none_",
      "char_filter": []
}

When I look at the result in Marvel/Sense, I get this:

GET ali-development/_analyze?analyzer=autocomplete&text=Jon+Lee

{
   "tokens": [
      {
         "token": "j",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "jo",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "jon",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "l",
         "start_offset": 4,
         "end_offset": 7,
         "type": "word",
         "position": 2
      },
      {
         "token": "le",
         "start_offset": 4,
         "end_offset": 7,
         "type": "word",
         "position": 2
      },
      {
         "token": "lee",
         "start_offset": 4,
         "end_offset": 7,
         "type": "word",
         "position": 2
      }
   ]
}
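If I had to reproduce this on my side, I imagine something like the sketch below: lowercase, split on whitespace, then emit every prefix of each token. The min_gram/max_gram values are guesses, since I haven't checked what our edge_ngram filter is actually configured with:

```python
def autocomplete_terms(text, min_gram=1, max_gram=20):
    # Mimic the "autocomplete" analyzer: whitespace tokenizer + lowercase
    # filter + edge_ngram filter (prefixes of each token, from min_gram
    # up to max_gram characters).
    terms = []
    for token in text.lower().split():
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            terms.append(token[:n])
    return terms

print(autocomplete_terms("Jon Lee"))
# ['j', 'jo', 'jon', 'l', 'le', 'lee'] -- the same tokens as above
```

I could store that list in an array field, but I lose the offsets and positions that the real analyzer produces, which is part of why I'm unsure this is equivalent.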

Is it even possible to produce a not_analyzed string field that will behave the same as this autocomplete-analyzed string? If so, how?

Thanks for making it all the way to the end,
Marshall
