Span_near fails to work as expected

I have an Elasticsearch index structured as follows:

Example entry:

  {
    'ID': '1',
    'creator': 'Sam',
    'instance': 'court of cassation',
    'text': [
      {
        'depth': 0,
        'role': 'introduction',
        'caseText': ['27', 'october', 'decision', 'of', 'the', 'high', 'court', 'in', 'the', 'case', 'of'],
        'ngrams': ['in_the', 'case_of']
      },
      {
        'depth': 1,
        'role': 'decision',
        'caseText': ['we', 'decide', 'a', 'sentence', 'of', 'three', 'weeks'],
        'ngrams': ['three_weeks']
      }
    ]
  }

In my example, the problem is that 'high court' is not registered as an ngram, so I want to run a search that forces 'high' and 'court' to be next to each other in the caseText field.

I tried the span_near query, but it returns no results:

  {
    'query': {
      'span_near': {
        'clauses': [
          {'span_term': {'text.caseText': 'high'}},
          {'span_term': {'text.caseText': 'court'}}
        ],
        'slop': 0,
        'in_order': True
      }
    }
  }

I'm afraid my understanding of the span_near query is somehow wrong, but I really need this to work. Any help is greatly appreciated!

Hi Kasper,
The problem is in your use of arrays. Normally, text is indexed as a single string and tokenized into multiple words by an analyzer. Words are considered near each other because the analyzer assigns a "position" value to each word, incremented for each word found in the string.
When you present words in an array, a (configurable) "position_increment_gap" is added between the elements of the array so that words in different elements do not appear next to each other. The default gap in positions is 100. This is designed to stop the following document from matching a search for "new york":

 "paragraphs" : [ ".. this was nothing new.", "York minster was first..."]

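To show where that gap is configured: it is the `position_increment_gap` parameter on a text field in the mapping. Below is a minimal sketch of an index-creation body as a Python dict. The field layout mirrors the example entry in the question; the variable names and the choice of 0 are assumptions for illustration, not a recommendation.

```python
# Hypothetical index-creation body. Only the field names come from the
# question; everything else is an assumption for illustration.
GAP = 0  # 0 places consecutive array elements next to each other; default is 100

index_body = {
    "mappings": {
        "properties": {
            "text": {
                "properties": {
                    "caseText": {
                        "type": "text",
                        "position_increment_gap": GAP,
                    }
                }
            }
        }
    }
}
```

Note the trade-off: with a gap of 0, the last word of one array element and the first word of the next become adjacent, which is exactly the kind of "new york" match the default gap is meant to prevent.
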
You can see the position information using the _analyze API e.g.

POST test/_doc/1
{
  "message": "my string value"
}
POST test/_analyze
{
  "text": ["foo", "bar"]
}
POST test/_analyze
{
  "text": ["foo bar"]
}

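The requests above need a running cluster. As a rough offline illustration, here is a simplified Python model of how positions could be assigned. It is not Elasticsearch code: the function names are made up, whitespace splitting stands in for a real analyzer, and it assumes the first token of each later array element lands gap + 1 positions after the last token of the previous element.

```python
def assign_positions(values, gap=100):
    """Return (token, position) pairs for a multi-valued field (simplified)."""
    positions = []
    pos = -1
    for index, value in enumerate(values):
        if index > 0:
            pos += gap  # extra increment inserted between array elements
        for token in value.split():  # stand-in for a real analyzer
            pos += 1
            positions.append((token, pos))
    return positions


def required_slop(positions, first, second):
    """Smallest number of positions separating `first` from a later `second`."""
    gaps = [p2 - p1 - 1
            for t1, p1 in positions if t1 == first
            for t2, p2 in positions if t2 == second and p2 > p1]
    return min(gaps) if gaps else None


print(assign_positions(["foo bar"]))     # [('foo', 0), ('bar', 1)]
print(assign_positions(["foo", "bar"]))  # [('foo', 0), ('bar', 101)]

case_text = ['27', 'october', 'decision', 'of', 'the', 'high', 'court']
print(required_slop(assign_positions(case_text), 'high', 'court'))  # 100
```

Under this model, 'high' and 'court' stored as separate array elements are 100 positions apart rather than 0, which is why a slop of 0 finds nothing.
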
Hi Mark,

Thanks for the response! With the help of the _analyze API and your insight on the default gap in positions (100), the quick fix seems to be setting slop to 100.
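For reference, a sketch of that quick fix as a Python request body. The helper name `span_near_body` is made up for illustration; the field and terms are the ones from my original query, and the slop of 100 assumes the default position gap has not been changed in the mapping.

```python
def span_near_body(field, terms, slop, in_order=True):
    """Build a span_near request body matching `terms` in the given field."""
    return {
        "query": {
            "span_near": {
                "clauses": [{"span_term": {field: term}} for term in terms],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


# slop=100 absorbs the default gap between two consecutive array elements
body = span_near_body("text.caseText", ["high", "court"], slop=100)
```
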

Are there any unwanted consequences I can expect by doing this? If I want to tokenize my strings using my own preprocessing technique, instead of Elastic's Analyzer, how should I index my text correctly? Or can I expect the way I handled it now to not be such a big deal?

Honestly, I don’t know what the full impact is because it’s the first time I’ve seen this attempted. I expect highlighting will not work well. There may be added inefficiencies in the way things are stored in the index and retrieved but, again, I’m not sure because this is not a use case we advocate.
