Hi,
[Elasticsearch version 6.7.2]
I am trying to index my data using the ngram tokenizer, but sometimes it takes too much time to index.
What I am trying to do is make the user able to search for any word or part of a word. So if I have the text "This is my text" and the user types "my text" or "s my", that document should come up as a result. Using the ngram tokenizer works for me and it seems to do what I want, but sometimes the text is very long: it is a description of something and can be as long as it wants to be (even 10,000 characters). Indexing that has to be painful for Elasticsearch, I guess, but I need it indexed the way I described.
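To give a sense of the scale: back-of-the-envelope, and assuming the tokenizer emits every gram length from min_gram to max_gram over the raw text, a single long description explodes into a huge number of tokens. A quick sketch of the arithmetic:

def ngram_token_count(n_chars, min_gram=2, max_gram=40):
    # each gram length g yields (n_chars - g + 1) tokens once the text is long enough
    return sum(n_chars - g + 1 for g in range(min_gram, max_gram + 1) if n_chars >= g)

print(ngram_token_count(10000))  # ~389,000 tokens for one 10,000-character description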
These are my initial settings:
{
  "settings": {
    "index": {
      "blocks": {"read_only_allow_delete": "false"},
      "max_ngram_diff": 150,
      "number_of_shards": 3,
      "number_of_replicas": 2
    },
    "analysis": {
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "thesaurus.conf"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 40
        }
      },
      "analyzer": {
        "my_analyzer_lowercase": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "lowercase",
            "synonym"
          ]
        },
        "my_analyzer_case_sensitive": {
          "tokenizer": "my_tokenizer",
          "filter": [
            "synonym"
          ]
        }
      }
    }
  },
  "mappings": {
    "modules": {
      "properties": {
        "module": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "organization": {
          "type": "text",
          "analyzer": "my_analyzer_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "argument": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "description": {
          "type": "text",
          "fields": {
            "sensitive": {
              "type": "text",
              "analyzer": "my_analyzer_case_sensitive"
            },
            "lowercase": {
              "type": "text",
              "analyzer": "my_analyzer_lowercase"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
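To see how many tokens a single description actually turns into, I can ask the analyzer directly. A sketch using the _analyze API through the Python client (assuming es is a connected Elasticsearch client and the index was created as 'yindex', as in the bulk call further down):

resp = es.indices.analyze(index='yindex', body={
    'analyzer': 'my_analyzer_lowercase',
    'text': 'This is my text',  # in my case this would be one of the long descriptions
})
print(len(resp['tokens']))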
So is there any better way to do this? Would it help to upgrade to a newer version to make the indexing faster?
Also, I am getting a "max_ngram_diff is too big" warning, which I have ignored so far.
Any suggestions and help are appreciated, especially on why it fails to index long texts. I am indexing with the Python elasticsearch library:
from elasticsearch.helpers import parallel_bulk

for key in yindexes:
    for success, info in parallel_bulk(es, yindexes[key], thread_count=int(threads),
                                       index='yindex', doc_type='modules', request_timeout=40):
        if not success:
            LOGGER.error('An Elasticsearch document failed with info: {}'.format(info))
If I change the request_timeout to 300, it keeps retrying and then crashes the application. And again, this happens only with the long description texts.
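One thing I have been wondering about is whether smaller bulk requests would help, so that each request carries fewer of these huge ngram'd documents. A sketch of what I mean (chunk_size and max_chunk_bytes are standard parallel_bulk parameters; the values here are guesses, not something I have verified fixes it):

for key in yindexes:
    for success, info in parallel_bulk(es, yindexes[key], thread_count=int(threads),
                                       chunk_size=100,                     # helper default is 500 docs per bulk request
                                       max_chunk_bytes=10 * 1024 * 1024,   # helper default is 100 MB
                                       index='yindex', doc_type='modules', request_timeout=40):
        if not success:
            LOGGER.error('An Elasticsearch document failed with info: {}'.format(info))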
Thank you
Miroslav Kovac