I have set the maximum shingle size to 3 for my index. After indexing, I find that the shingles "yield", "_ yield", and "_ _ yield" all have the same start and end offsets. So if I need the original substring, I always get "yield"! Why is that, and can it be corrected?
Another thing I noticed is that the offsets of shingles like "_ yield _", "yield _", and "yield _ _" have a length such that the substring has an extra space at the end. For example, for "_ yield _", if the original string is "not yield until", the offsets, when applied to the original text, return "yield until " (notice the trailing space). I would actually want "not yield until".
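Both observations above are consistent with a simple model in which each filler ("_") is a zero-width token sitting at the start offset of the next real token. The Python sketch below implements only that model. It is an assumption that happens to fit the offsets described here, not Lucene's actual shingle code, and the sample text, the stop-word set, and the function names are hypothetical.

```python
import re

def token_stream(text, stop_words):
    """Return (term, start, end) tuples, with a "_" filler for each removed stop word.

    Assumption: a filler is zero-width and sits at the start of the next real token.
    Stop words at the very end of the text are simply dropped in this sketch.
    """
    stream, pending = [], 0
    for m in re.finditer(r"\w+", text):
        term = m.group().lower()
        if term in stop_words:
            pending += 1
            continue
        stream.extend([("_", m.start(), m.start())] * pending)
        pending = 0
        stream.append((term, m.start(), m.end()))
    return stream

def shingles(stream, min_size=2, max_size=3):
    """Emit unigrams plus shingles; offsets run from first token start to last token end."""
    out = []
    for i in range(len(stream)):
        for size in [1] + list(range(min_size, max_size + 1)):
            window = stream[i:i + size]
            if len(window) < size:
                break
            if all(term == "_" for term, _, _ in window):
                continue  # skip windows made only of fillers
            joined = " ".join(term for term, _, _ in window)
            out.append((joined, window[0][1], window[-1][2]))
    return out

text = "do not yield until tomorrow"
result = shingles(token_stream(text, {"do", "not", "until"}))
for term, start, end in result:
    print(f"{term!r:20} {start:2}-{end:2} {text[start:end]!r}")
```

Under this model, leading fillers add nothing to the left of "yield", while a trailing filler stretches the end offset up to the start of the next kept token, which is where the trailing space comes from.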
Finally, is there a way we can avoid stop words while creating shingles? In my application the stop words don't play any role.
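To make the desired result concrete: the shingles wanted here would be built only from the tokens that survive stop-word removal, with no filler positions at all. The Python sketch below produces that output on the client side, purely as an illustration of the goal. It is not an Elasticsearch setting, and the stop-word set and the function name are hypothetical.

```python
import re

def shingles_without_stop_words(text, stop_words, min_size=2, max_size=3):
    """Build shingles from the kept tokens only, ignoring the gaps left by stop words."""
    kept = [
        (m.group().lower(), m.start(), m.end())
        for m in re.finditer(r"\w+", text)
        if m.group().lower() not in stop_words
    ]
    out = []
    for i in range(len(kept)):
        for size in range(min_size, max_size + 1):
            window = kept[i:i + size]
            if len(window) < size:
                break
            joined = " ".join(term for term, _, _ in window)
            # Offsets still span the original text, including any skipped stop words.
            out.append((joined, window[0][1], window[-1][2]))
    return out

text = "do not yield until tomorrow morning"
clean = shingles_without_stop_words(text, {"do", "not", "until"})
for term, start, end in clean:
    print(f"{term!r:26} {text[start:end]!r}")
```

Note that the offsets of a shingle such as "yield tomorrow" still cover the skipped word in the original text, so the substring read back is "yield until tomorrow".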
These are the settings I am using for the index:
{
  "test_index": {
    "settings": {
      "index": {
        "refresh_interval": "60s",
        "number_of_shards": "5",
        "store": {
          "type": "default"
        },
        "creation_date": "1439324341816",
        "analysis": {
          "filter": {
            "snowball_stop_words_en_EN": {
              "type": "stop",
              "stopwords_path": "snowball.stop"
            },
            "smart_stop_words_en_EN": {
              "type": "stop",
              "stopwords_path": "smart.stop"
            },
            "porter_stemmer_en_EN": {
              "name": "porter",
              "type": "stemmer"
            },
            "word_delimiter_en_EN": {
              "type": "word_delimiter",
              "stem_english_possessive": "true"
            },
            "default_stop_name_en_EN": {
              "name": "_english_",
              "type": "stop"
            },
            "preserve_original_en_EN": {
              "type": "word_delimiter",
              "preserve_original": "true"
            },
            "apos_replace_en_EN": {
              "pattern": ".*\\'$",
              "type": "pattern_replace",
              "replacement": ""
            },
            "shingle_filter_en_EN": {
              "max_shingle_size": "3",
              "min_shingle_size": "2",
              "type": "shingle",
              "output_unigrams": "true"
            }
          },
          "analyzer": {
            "test_analyzer": {
              "filter": [
                "lowercase",
                "smart_stop_words_en_EN",
                "preserve_original_en_EN",
                "porter_stemmer_en_EN",
                "asciifolding",
                "apos_replace_en_EN",
                "shingle_filter_en_EN"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "version": {
          "created": "1060099"
        },
        "uuid": "urrFTCEoThyuPjjBYCBgYQ"
      }
    }
  }
}
I use test_analyzer on the documents while indexing.
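The offsets an analyzer emits can be inspected through the _analyze API, which returns each token together with its start_offset and end_offset. The Python sketch below assumes a node reachable at localhost:9200 and the 1.x request form (analyzer as a query parameter, raw text as the request body). The helper names are hypothetical, and the sample response is hand-written to match the offsets described above, not captured from a real cluster.

```python
import json
import urllib.request

def analyze(text, index="test_index", analyzer="test_analyzer",
            host="http://localhost:9200"):
    """Send text to the _analyze endpoint (host and port are assumptions)."""
    url = f"{host}/{index}/_analyze?analyzer={analyzer}"
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def offset_substrings(text, response):
    """Pair every returned token with the slice of text its offsets point at."""
    return [
        (tok["token"], text[tok["start_offset"]:tok["end_offset"]])
        for tok in response["tokens"]
    ]

# Hand-written sample in the shape of an _analyze response, for illustration only.
sample_text = "not yield until tomorrow"
sample_response = {
    "tokens": [
        {"token": "yield", "start_offset": 4, "end_offset": 9,
         "type": "<ALPHANUM>", "position": 2},
        {"token": "yield _", "start_offset": 4, "end_offset": 16,
         "type": "shingle", "position": 2},
    ]
}
pairs = offset_substrings(sample_text, sample_response)
for token, substring in pairs:
    print(f"{token!r:12} {substring!r}")
```

Against a live index, `offset_substrings(doc, analyze(doc))` would list every shingle next to the text its offsets select, which makes the duplicated offsets and trailing spaces easy to spot.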
I am using Elasticsearch 1.6.0.
As an example, I have also attached a link to the text file I was using.
Thanks.
Any suggestions are welcome.