Proximity with intervals query question

Hi all! I'm trying to understand the inner workings of the intervals query. I'm confused and hope someone can explain this to me. I came across this piece of data (note the bold words, which are used in the query later on):

benzoyl}-methyl-amino)-acetic acid, 4- {4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyri din-3-yl)-l-o-tolyl-propyl]benzoylamino}-butyric acid ethyl ester, 4-{4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]benzoylamino}-butyric acid, 3-{4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]- benzoylamino}-propionic acid, {4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]benzoylamino}-acetic acid, (Ε/Ζ)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyridin-3 yl)propyl)-2-fluoro-A-(2-hydroxyethyl)benzamide, trans-4-[l-(2-chloro-phenyl)-3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin- 3-yl)-propyl]-2-fluoro-jV-(4-hydroxy-cyclohexyl)-benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyri din-3 yl)propyl)-2-fluoro-A-methyl-Af -( l -methyl pi peri din-4-yl)benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyridin-3 yl)propyl)-2-fluoro-A-(tetrahydro-2H-pyran-4-yl)benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyri din-3 yl)propyl)-2-fluoro-A-(oxetan-3-yl)benzamide, { 3 -fluoro-4-[3 -[(E)-hydroxyimino]-3 -(1 -methyl-6-oxo-1,6-dihydro-pyri din-3 -yl)-1 -o-tolylpropyl]-phenoxy}-acetic acid, 2-fluoro-4- { 3 -fluoro-4-[3 -[(E)-hydroxyimino]-3 -(1 -methyl-6-oxo-1,6-dihydro-pyri din-3 -yl)l-o-tolyl-propyl]-phenoxy}-benzoic acid, 5 - { (R)-3 -(4-bromo-phenyl)-1

Now this data is indexed with the following mapping:

PUT textanalysis
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "word_delim",
        "type": "text",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "ngram_analyzer"
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "max_ngram_diff": "24",
    "analysis": {
      "analyzer": {
        "word_delim": {
          "filter": [
            "filter_word_delim",
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "filter_word_delim",
            "lowercase",
            "attach_anchors_to_token_boundary",
            "3_25_ngram_filter"
          ]
        }
      },
      "filter": {
        "filter_word_delim": {
          "split_on_numerics": "false",
          "split_on_case_change": "false",
          "generate_word_parts": "true",
          "type": "word_delimiter_graph",
          "preserve_original": "true",
          "generate_number_parts": "true"
        },
        "3_25_ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 25
        },
        "attach_anchors_to_token_boundary": {
          "type": "pattern_replace",
          "pattern": "(.+)",
          "flags": "UNICODE_CHARACTER_CLASS",
          "replacement": "\u00AC$1\u00AC"
        }
      }
    }
  }
}

POST textanalysis/_doc
{
  "text": "benzoyl}-methyl-amino)-acetic acid, 4- {4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyri din-3-yl)-l-o-tolyl-propyl]benzoylamino}-butyric acid ethyl ester, 4-{4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]benzoylamino}-butyric acid, 3-{4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]- benzoylamino}-propionic acid, {4-[3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin-3-yl)-l-o-tolyl-propyl]benzoylamino}-acetic acid, (Ε/Ζ)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyridin-3 yl)propyl)-2-fluoro-A-(2-hydroxyethyl)benzamide, trans-4-[l-(2-chloro-phenyl)-3-[(E)-hydroxyimino]-3-(l-methyl-6-oxo-l,6-dihydro-pyridin- 3-yl)-propyl]-2-fluoro-jV-(4-hydroxy-cyclohexyl)-benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyri din-3 yl)propyl)-2-fluoro-A-methyl-Af -( l -methyl pi peri din-4-yl)benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyridin-3 yl)propyl)-2-fluoro-A-(tetrahydro-2H-pyran-4-yl)benzamide, (Ε)-4-( 1 -(2-chlorophenyl)-3 -(hydroxyimino)-3 -(1 -methyl-6-oxo-1,6-dihydropyri din-3 yl)propyl)-2-fluoro-A-(oxetan-3-yl)benzamide, { 3 -fluoro-4-[3 -[(E)-hydroxyimino]-3 -(1 -methyl-6-oxo-1,6-dihydro-pyri din-3 -yl)-1 -o-tolylpropyl]-phenoxy}-acetic acid, 2-fluoro-4- { 3 -fluoro-4-[3 -[(E)-hydroxyimino]-3 -(1 -methyl-6-oxo-1,6-dihydro-pyri din-3 -yl)l-o-tolyl-propyl]-phenoxy}-benzoic acid, 5 - { (R)-3 -(4-bromo-phenyl)-1"
}

Now I have the following query:

GET textanalysis/_search
{
  "query": {
    "intervals": {
      "text": {
        "all_of": {
          "ordered": false,
          "max_gaps": 259,
          "intervals": [
            {
              "match": {
                "query": "amino",
                "max_gaps": 0,
                "ordered": false
              }
            },
            {
              "all_of": {
                "ordered": false,
                "max_gaps": 0,
                "intervals": [
                  {
                    "wildcard": {
                      "pattern": "¬phenox",
                      "use_field": "text.ngram"
                    }
                  },
                  {
                    "all_of": {
                      "ordered": false,
                      "max_gaps": 255,
                      "intervals": [
                        {
                          "match": {
                            "query": "benzoic",
                            "max_gaps": 0,
                            "ordered": false
                          }
                        },
                        {
                          "match": {
                            "query": "acid",
                            "max_gaps": 0,
                            "ordered": false
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

This query returns the document and can be read as amino NEAR259 (phenox* NEAR0 (benzoic NEAR255 acid)), where NEAR N means the positions of the two words may differ by at most N+1. When I change the top-level max_gaps (the first NEAR) from 259 to 258, the query no longer returns the document. This made me look at the positions of all the words in the data. I'll list them here, each word with its position, in the order they appear:

  • amino: 2
  • acid: 4
  • acid: 28
  • acid: 53
  • acid: 76
  • acid: 98
  • phenoxy: 260
  • acid: 262
  • phenoxy: 289
  • benzoic: 290
  • acid: 291

Working from the inside out and taking the benzoic NEAR255 acid part first, the possible benzoic/acid position pairs are 290/53, 290/76, 290/98, 290/262 and 290/291. Next comes phenox* NEAR0, which can then be either 260 or 289, depending on the pair chosen. Adding the amino NEAR259 part makes it clear that the innermost part uses 290/262 and that phenoxy is taken at 289: amino is at 2 and acid at 262, a position difference of 260 = 259 + 1, and amino NEAR258 does not return the document while amino NEAR259 does.
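The position arithmetic above can be checked with a small Python sketch. It only reproduces the NEAR reading used in this post (NEAR N allows a position difference of at most N+1) on the positions listed earlier. The helper name `gaps_between` is made up for illustration, and the sketch is not a model of how the intervals query actually selects or combines matches.

```python
# Token positions copied from the list in the post.
positions = {
    "amino": [2],
    "acid": [4, 28, 53, 76, 98, 262, 291],
    "phenoxy": [260, 289],
    "benzoic": [290],
}


def gaps_between(a: int, b: int) -> int:
    """Positions strictly between two single tokens (hypothetical helper)."""
    return abs(a - b) - 1


amino = positions["amino"][0]
benzoic = positions["benzoic"][0]

# benzoic NEAR255 acid: keep acid positions with at most 255 positions in between.
acid_candidates = [p for p in positions["acid"] if gaps_between(benzoic, p) <= 255]

# Gap between amino (2) and the acid at 262.
threshold_gap = gaps_between(amino, 262)

# For comparison: gap between amino (2) and the phenoxy at 289,
# which sits directly before benzoic at 290.
tight_group_gap = gaps_between(amino, 289)

print(acid_candidates)   # [53, 76, 98, 262, 291]
print(threshold_gap)     # 259
print(tight_group_gap)   # 286
```

The value 259 is exactly the point at which the query stops matching when max_gaps is lowered to 258, which is what suggests the acid at 262 is the one being used. A group starting at phenoxy 289 would need max_gaps of at least 286 under this reading.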

So my question is: why does it use 290/262 for the benzoic NEAR255 acid part? The pair 290/291 is clearly closer together.

Edit: I have changed the numbers in the query a bit.
