Hi,
[Elasticsearch version 6.7.2]
I am trying to index my data using the ngram tokenizer, but sometimes indexing takes far too long.
What I am trying to do is let the user search for any word or any part of a word. So if I have the text "This is my text" and the user types "my text" or "s my", that document should come up as a result. Using the ngram tokenizer worked for me and it seems to do what I want, but sometimes the text is very long, because it is a description of something and can be arbitrarily long (even 10,000 characters). I guess indexing this has to be painful for Elasticsearch, but I need it indexed the way I described.
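For context, this is roughly the kind of query that should find those fragments (a simplified sketch; the field name comes from the mapping below, `es` is the same elasticsearch-py client I use for indexing, and the query text goes through the same ngram analyzer since I have not set a separate search_analyzer):

resp = es.search(
    index='yindex',
    body={'query': {'match': {'description.lowercase': 's my'}}}
)
# documents whose description contains the fragment should come back
print(resp['hits']['total'])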
These are my initial index settings:
{
  "settings": {
    "index": {
      "blocks": {"read_only_allow_delete": "false"},
      "max_ngram_diff": 150,
      "number_of_shards": 3,
      "number_of_replicas": 2
    },
    "analysis": {
      "filter":{
        "synonym":{
          "type":"synonym",
          "synonyms_path":"thesaurus.conf"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 40
        }
      },
      "analyzer": {
        "my_analyzer_lowercase": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "synonym"
          ]
        },
        "my_analyzer_case_sensitive": {
          "filter":[
            "synonym"
          ],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "modules": {
      "properties": {
        "module": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "organization": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "argument": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "description": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
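For completeness, this is roughly how I create the index and sanity-check the analyzer (a sketch; `settings` is assumed to hold the JSON above as a Python dict):

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])
es.indices.create(index='yindex', body=settings)

# _analyze shows what the ngram analyzer emits for a short input;
# a 10,000-character description produces a vastly longer token list
resp = es.indices.analyze(
    index='yindex',
    body={'analyzer': 'my_analyzer_lowercase', 'text': 'This is my text'}
)
print([t['token'] for t in resp['tokens']])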
So is there a better way to do this? Would upgrading to a newer version make the indexing faster?
Also, I am getting a warning that max_ngram_diff is too big, which I have ignored so far.
Any suggestions and help are appreciated, especially on why it fails to index long texts. I am indexing with the Python library:
from elasticsearch.helpers import parallel_bulk

# yindexes maps each key to an iterable of bulk actions for that module
for key in yindexes:
    for success, info in parallel_bulk(es, yindexes[key], thread_count=int(threads), index='yindex', doc_type='modules', request_timeout=40):
        if not success:
            LOGGER.error('An Elasticsearch document failed with info: {}'.format(info))
If I change the request_timeout to 300, it keeps retrying and eventually crashes the application. And again, this happens only with long description texts.
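To give an idea of why the long descriptions hurt, here is a back-of-the-envelope count of the tokens my tokenizer emits for a single 10,000-character description (min_gram=2, max_gram=40 as in the settings above):

# each gram size n yields (length - n + 1) tokens over the raw text
length, min_gram, max_gram = 10000, 2, 40
tokens = sum(length - n + 1 for n in range(min_gram, max_gram + 1))
print(tokens)  # 389220 tokens for one field of one document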
Thank you
Miroslav Kovac
