Are these ES bugs or just wrong implementations by me?


(apanimesh061) #1

I have set the maximum shingle size to 3 for my index. After indexing, I find that the shingles:

  1. yield
  2. _ yield
  3. _ _ yield

have the same start and end offsets. So if I need the original substring, I always get "yield". Why is that, and can it be corrected?
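
To make the symptom concrete, here is a minimal Python sketch of mapping token offsets back to the source text. The dictionaries mimic the `token`, `start_offset`, and `end_offset` fields reported for analyzed tokens; the sample sentence and the offset values are made up for illustration and are not output captured from the index.

```python
def original_substring(text, token):
    """Slice the source text using a token's character offsets."""
    return text[token["start_offset"]:token["end_offset"]]

# Hypothetical sentence and offsets, chosen only to illustrate the symptom.
text = "do not yield until"
shingles = [
    {"token": "yield", "start_offset": 7, "end_offset": 12},
    {"token": "_ yield", "start_offset": 7, "end_offset": 12},
    {"token": "_ _ yield", "start_offset": 7, "end_offset": 12},
]

recovered = [original_substring(text, s) for s in shingles]
for shingle, substring in zip(shingles, recovered):
    print(repr(shingle["token"]), "->", repr(substring))
```

Because all three shingles carry the same offsets, the slice cannot distinguish them.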

Another thing I noticed is that the offsets of shingles like:

  1. _ yield _
  2. yield _
  3. yield _ _

span a substring with an extra space at the end. For example, for _ yield _ with the string "not yield until", the offsets, when applied to the original text, return "yield until " (note the trailing space). I would actually want "not yield until".
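
A small Python check along these lines can flag such spans. The helper name `padded_spans`, the sample text, and the offsets are hypothetical; real offsets would come from the analyzer output.

```python
def padded_spans(text, tokens):
    """Return (token, substring) pairs whose offset slice starts or ends with whitespace."""
    flagged = []
    for tok in tokens:
        substring = text[tok["start_offset"]:tok["end_offset"]]
        if substring != substring.strip():
            flagged.append((tok["token"], substring))
    return flagged

# Hypothetical text and offsets, for illustration only.
text = "not yield until then"
tokens = [
    {"token": "yield", "start_offset": 4, "end_offset": 9},
    {"token": "yield _", "start_offset": 4, "end_offset": 16},
]

flagged = padded_spans(text, tokens)
print(flagged)
```

Calling `.rstrip()` on the slice would hide the extra space on the client side, but it would not bring back a leading stop word such as "not".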

Finally, is there a way to exclude stop words when creating shingles? In my application, stop words play no role.
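
The desired result can be illustrated with a Python sketch that drops stop words before building shingles, so no `_` filler appears. This is a client-side illustration only, not how the Elasticsearch shingle filter works internally; the stop word set, the sample sentence, and the function name `shingles_without_stop_words` are assumptions. The size defaults mirror the `min_shingle_size` and `max_shingle_size` values in the settings.

```python
def shingles_without_stop_words(tokens, stop_words, min_size=2, max_size=3):
    """Build word n-grams over the tokens that remain after dropping stop words."""
    kept = [t for t in tokens if t not in stop_words]
    shingles = []
    for start in range(len(kept)):
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough tokens remain to fill it.
            if start + size <= len(kept):
                shingles.append(" ".join(kept[start:start + size]))
    return shingles

# Hypothetical stop word set and sentence, for illustration only.
stop_words = {"do", "not", "until", "the"}
tokens = "do not yield until the lock clears".split()

result = shingles_without_stop_words(tokens, stop_words)
print(result)
```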

These are the settings I am using for the index:

{
   "test_index": {
      "settings": {
         "index": {
            "refresh_interval": "60s",
            "number_of_shards": "5",
            "store": {
               "type": "default"
            },
            "creation_date": "1439324341816",
            "analysis": {
               "filter": {
                  "snowball_stop_words_en_EN": {
                     "type": "stop",
                     "stopwords_path": "snowball.stop"
                  },
                  "smart_stop_words_en_EN": {
                     "type": "stop",
                     "stopwords_path": "smart.stop"
                  },
                  "porter_stemmer_en_EN": {
                     "name": "porter",
                     "type": "stemmer"
                  },
                  "word_delimiter_en_EN": {
                     "type": "word_delimiter",
                     "stem_english_possessive": "true"
                  },
                  "default_stop_name_en_EN": {
                     "name": "_english_",
                     "type": "stop"
                  },
                  "preserve_original_en_EN": {
                     "type": "word_delimiter",
                     "preserve_original": "true"
                  },
                  "apos_replace_en_EN": {
                     "pattern": ".*\\'$",
                     "type": "pattern_replace",
                     "replacement": ""
                  },
                  "shingle_filter_en_EN": {
                     "max_shingle_size": "3",
                     "min_shingle_size": "2",
                     "type": "shingle",
                     "output_unigrams": "true"
                  }
               },
               "analyzer": {
                  "test_analyzer": {
                     "filter": [
                        "lowercase",
                        "smart_stop_words_en_EN",
                        "preserve_original_en_EN",
                        "porter_stemmer_en_EN",
                        "asciifolding",
                        "apos_replace_en_EN",
                        "shingle_filter_en_EN"
                     ],
                     "type": "custom",
                     "tokenizer": "standard"
                  }
               }
            },
            "number_of_replicas": "1",
            "version": {
               "created": "1060099"
            },
            "uuid": "urrFTCEoThyuPjjBYCBgYQ"
         }
      }
   }
}

I use the test_analyzer on the documents while indexing.

I am using Elasticsearch 1.6.0.

As an example, I have attached a link to a text file that I was using.

Thanks.
Any suggestions are welcome.

