Elasticsearch ngram tokenizer

Hi,

[Elasticsearch version 6.7.2]

I am trying to index my data using the ngram tokenizer, but sometimes it takes too much time to index.

What I am trying to do is make it possible for the user to search for any word or part of a word. So if I have the text "This is my text" and the user writes "my text" or "s my", that document should come up as a result. Using the ngram tokenizer worked for me and it seems to be doing what I want, but sometimes I have very long text, because it is a description of something and it can be arbitrarily long (even 10000 chars). Indexing this has to be painful for Elasticsearch I guess, but I need it indexed the way I described.
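
For reference, this is roughly the behaviour I rely on (a quick _analyze sketch using the same min_gram/max_gram as in my settings, not the exact request I run):

GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 40
  },
  "text": "This is my text"
}

Because no token_chars are configured, the output contains grams that span whitespace, such as "s my" and "my text", which is why those searches match.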

These are my initial index settings:

{
  "settings": {
    "index": {
      "blocks": {"read_only_allow_delete": "false"},
      "max_ngram_diff": 150,
      "number_of_shards": 3,
      "number_of_replicas": 2
    },
    "analysis": {
      "filter":{
        "synonym":{
          "type":"synonym",
          "synonyms_path":"thesaurus.conf"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 40
        }
      },
      "analyzer": {
        "my_analyzer_lowercase": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "synonym"
          ]
        },
        "my_analyzer_case_sensitive": {
          "filter":[
            "synonym"
          ],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "modules": {
      "properties": {
        "module": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "organization": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "argument": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "description": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

So is there any better way to do this? Would it help to upgrade to a newer version to make the indexing faster?

Also, I am getting a warning that max_ngram_diff is too big, which I have ignored so far.

Any suggestions and help are appreciated, especially on why it fails to index long texts. I am indexing it using the Python library:

from elasticsearch.helpers import parallel_bulk

for key in yindexes:
    for success, info in parallel_bulk(es, yindexes[key], thread_count=int(threads),
                                       index='yindex', doc_type='modules', request_timeout=40):
        if not success:
            LOGGER.error('An Elasticsearch document failed with info: {}'.format(info))

If I change the request_timeout to 300 then it will keep trying and crash the application. And again, this happens only with long description texts.

Thank you
Miroslav Kovac

You are running into a common full-text search problem: if you want to store everything at index time with fine granularity, you will store an insane number of tokens in combination, which also means your search results will be pretty generic.

First, you may want to take a look at the new search-as-you-type datatype. This, however, uses only shingles and edge n-grams.
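
If you go that route, this is a minimal sketch of how that datatype is usually mapped and queried (index and field names are made up here, and it needs a 7.x cluster):

PUT my_index
{
  "mappings": {
    "properties": {
      "description": {
        "type": "search_as_you_type"
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "my text",
      "type": "bool_prefix",
      "fields": [
        "description",
        "description._2gram",
        "description._3gram"
      ]
    }
  }
}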

However, if you want "o bar" to match "foo bar", this is indeed a different beast again, and would probably require some more thinking. You could use ngrams, or maybe think about storing shingles that are reversed to support searching for this. Take a look at this _analyze output and sample:

GET _analyze
{
  "text": "This is a foo bar test",
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "shingle",
      "output_unigrams": false
    },
    "reverse",
    {
      "type": "edge_ngram",
      "max_gram": 20,
      "min_gram": 5
    }
  ]
}

GET _analyze
{
  "text": "oo bar",
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "shingle",
      "output_unigrams": false
    },
    "reverse",
    {
      "type": "edge_ngram",
      "max_gram": 20,
      "min_gram": 5
    }
  ]
}

DELETE test 

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle",
            "reverse",
            "my_edge_ngram"
          ]
        }
      },
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "output_unigrams": false
        },
        "my_edge_ngram": {
          "type": "edge_ngram",
          "max_gram": 20,
          "min_gram": 5
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_test_field": {
        "type" : "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

PUT test/_doc/1
{
  "my_test_field" : "This is a foo bar test"
}

GET test/_search
{
  "query": {
    "match": {
      "my_test_field": "oo bar"
    }
  }
}

This will not store ngrams but reversed edge ngrams, and will also search like that - maybe you can work with that. (I was not able to get this working properly at some point in the past, but maybe I have been missing something :slight_smile:)

Hey,

Thank you for your response, it clears some things up, although I don't really understand why it is a problem to store everything at index time. If the time it takes to index does not matter to me and I have the space, I should be okay, right? And about the generic output: I always receive a _score in the output for each result, which helps me sort the output from most relevant to least relevant.

What other option do I have if I don't want to store everything at index time? I thought that it always stores all the data at index time. We don't have to take only my case as an example. Sorry if you have already answered this, but maybe I am missing something here.

Just to clarify: if space is not a problem, always go for storing as much as possible at index time :slight_smile:
