Span_near fails to work as expected

Kasper_Kooijman · December 1, 2020, 8:40am

I have an Elasticsearch index structured as follows:

Example entry:

{
  'ID': '1',
  'creator': 'Sam',
  'instance': 'court of cassation',
  'text': [
    {
      'depth': 0,
      'role': 'introduction',
      'caseText': ['27', 'october', 'decision', 'of', 'the', 'high', 'court', 'in', 'the', 'case', 'of'],
      'ngrams': ['in_the', 'case_of']
    },
    {
      'depth': 1,
      'role': 'decision',
      'caseText': ['we', 'decide', 'a', 'sentence', 'of', 'three', 'weeks'],
      'ngrams': ['three_weeks']
    }
  ]
}

In this example, 'high court' is not registered as an ngram, so I want to run a search that requires 'high' and 'court' to be next to each other in the caseText field.

I tried using the span_near query, but without success:

{
  'query': {
    'span_near': {
      'clauses': [
        {'span_term': {'text.caseText': 'high'}},
        {'span_term': {'text.caseText': 'court'}}
      ],
      'slop': 0,
      'in_order': True
    }
  }
}

I'm afraid my understanding of the span_near query is somehow wrong, but I really need this to work. Any help is greatly appreciated!

Mark_Harwood · December 1, 2020, 10:17am

Hi Kasper,
The problem is in your use of arrays. Normally text is indexed as a single string and tokenized into multiple words by an Analyzer. Words are considered near to each other because the analyzer assigns a "position" value to each word, incremented for each one found in a string.
When you present words in an array, a (configurable) "position_increment_gap" is added between each element in the array so that words in different elements do not appear next to each other. The default gap in positions is 100. This is designed to stop this document matching a search for "new york":

 "paragraphs" : [ ".. this was nothing new.", "York minster was first..."]
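
To make the effect of the gap concrete, here is a small Python sketch that mimics how positions are assigned. It does not call Elasticsearch: `assign_positions` is a hypothetical helper, whitespace splitting stands in for a real analyzer, and the gap of 100 is the default mentioned above.

```python
def assign_positions(values, gap=100):
    """Mimic position assignment for a multi-valued text field.

    Each array element is split on whitespace (a stand-in for a real
    analyzer). Tokens inside one element are one position apart; the first
    token of each later element is pushed a further `gap` positions away.
    Assumes every element yields at least one token.
    """
    positions = []
    last = -1  # position of the most recent token; -1 before any token
    for index, value in enumerate(values):
        for offset, token in enumerate(value.split()):
            # Add the gap only when crossing into a new array element.
            crossing = index > 0 and offset == 0
            last += 1 + (gap if crossing else 0)
            positions.append((token, last))
    return positions


as_one_string = assign_positions(["the high court"])
as_array = assign_positions(["the", "high", "court"])
print(as_one_string)  # [('the', 0), ('high', 1), ('court', 2)]
print(as_array)       # [('the', 0), ('high', 101), ('court', 202)]
```

In the single-string case 'high' and 'court' are one position apart, so a slop of 0 matches. In the array case they are 101 positions apart, which is why the original query found nothing.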

You can see the position information using the _analyze API e.g.

DELETE test
POST test/_doc/1 
{
  "message":"my string value"
}
POST test/_analyze
{
  "field":"message",
  "text": ["foo", "bar"]
}
POST test/_analyze
{
  "field":"message",
  "text": ["foo bar"]
}

Kasper_Kooijman · December 8, 2020, 7:18am

Hi Mark,

Thanks for the response! With the help of the _analyze API and your insight on the default position gap (100), the quick fix seems to be setting slop to 100.
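
For reference, the adjusted query might look like the sketch below, written as a Python dict like the original query. `span_near_query` is a hypothetical helper, not part of any client library. The sketch assumes every array element is a single token and that the default gap of 100 applies, so neighbouring tokens sit 101 positions apart and a slop of 100 bridges exactly one gap.

```python
def span_near_query(field, terms, slop):
    """Build an ordered span_near query body for the given terms."""
    return {
        "query": {
            "span_near": {
                # One span_term clause per term, in the order they must appear.
                "clauses": [{"span_term": {field: term}} for term in terms],
                "slop": slop,
                "in_order": True,
            }
        }
    }


# slop=100 is an assumption tied to the default gap between array elements.
query = span_near_query("text.caseText", ["high", "court"], slop=100)
print(query)
```

Under the same assumption, a slop below 100 would match nothing, and a slop of 201 would also accept one intervening token between 'high' and 'court'.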

Are there any unwanted consequences I can expect by doing this? If I want to tokenize my strings using my own preprocessing technique, instead of Elastic's Analyzer, how should I index my text correctly? Or can I expect the way I handled it now to not be such a big deal?

Mark_Harwood · December 8, 2020, 8:34am

Honestly, I don’t know what the full impact is because it’s the first time I’ve seen this attempted. I expect highlighting will not work well. There may be added inefficiencies in the way things are stored in the index and retrieved but, again, I’m not sure because this is not a use case we advocate.
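
Since the gap is described above as configurable, one alternative to a large slop may be to set `position_increment_gap` to 0 on the field, or to join the pre-tokenized words back into one string before indexing. The sketch below is untested in this thread: the field layout is inferred from the example document, plain object (not `nested`) fields are assumed, and `join_tokens` is a hypothetical helper.

```python
# Option 1 (assumed mapping): remove the artificial gap between array elements.
mapping = {
    "mappings": {
        "properties": {
            "text": {
                "properties": {
                    "caseText": {
                        "type": "text",
                        # Array elements are indexed at consecutive positions.
                        "position_increment_gap": 0,
                    }
                }
            }
        }
    }
}


# Option 2: index one string per caseText instead of an array of words.
def join_tokens(tokens):
    """Join pre-tokenized words with single spaces."""
    return " ".join(tokens)


print(join_tokens(["the", "high", "court"]))  # the high court
```

With either option the field's analyzer still processes the text, so tokens containing punctuation or whitespace may be split again unless a suitable analyzer is configured. With a gap of 0, the last word of one caseText array and the first word of the next would also count as adjacent.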

system · January 5, 2021, 8:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
