Understanding a match query on a search-as-you-type index

I'm curious whether there is any way to see the Lucene query that's generated for a particular Elasticsearch DSL query.

For example, if I create an index:

  await client.indices.create({
    index: 'activities',
    mappings: {
      dynamic: false,
      properties: {
        title: { type: 'search_as_you_type', analyzer: 'english' },
        description: { type: 'search_as_you_type', analyzer: 'english' },
        keywords: {
          type: 'search_as_you_type',
          analyzer: 'english',
          fields: {
            exact: {
              type: 'keyword'
            }
          }
        },
        searchable: { type: 'boolean' }
      }
    }
  });

and then query it:

  {
    index: 'activities',
    query: {
      multi_match: {
        query,
        type: 'bool_prefix',
        minimum_should_match: '75%',
        fields: [
          'title^2',
          'title._2gram^2',
          'title._3gram^2',
          'description',
          'description._2gram',
          'description._3gram'
        ]
      }
    }
  }

There's a lot going on here. Fields are analyzed and tokenized. There are the "operator" and "minimum_should_match" settings on the bool query, and the last term is treated differently (targeting the _index_prefix subfield, I assume, though that's implicit in the query).
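For orientation, the Elasticsearch documentation describes match_bool_prefix (which multi_match runs per field when type is bool_prefix) as analyzing the input, using a term query for every term except the last, and a prefix query for the last one. A minimal sketch of that description for a single field follows. The function name boolPrefixSketch is made up, the terms are assumed to be already analyzed, and mapping "operator": "and" to must-clauses is an assumption based on how that option is documented, not something taken from the source code.

```javascript
// Sketch only: models the documented match_bool_prefix behaviour for ONE field.
// `terms` are assumed to be already analyzed (e.g. stemmed by the english analyzer).
function boolPrefixSketch(field, terms, operator = 'or') {
  if (terms.length === 0) {
    throw new Error('needs at least one analyzed term');
  }
  // Every term except the last becomes an exact term clause.
  const leading = terms.slice(0, -1).map((t) => ({ term: { [field]: t } }));
  // The last term is treated as a prefix.
  const last = { prefix: { [field]: terms[terms.length - 1] } };
  // Assumption: operator "and" makes the clauses required, "or" leaves them optional.
  const occur = operator === 'and' ? 'must' : 'should';
  return { bool: { [occur]: [...leading, last] } };
}

const sketch = boolPrefixSketch('title', ['turtl', 'time', 'tri'], 'and');
console.log(JSON.stringify(sketch, null, 2));
```

This only covers the plain field. How the _2gram, _3gram and _index_prefix subfields come into play is not modelled here.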

I know that I can use GET /activities/_explain/5ddbf9ae009cd90bcdeaadd7 to explain the scoring for any particular document returned, but I'm curious if there's a way to see the Lucene query that's generated from this DSL. Reading through MultiMatchQueryBuilder is proving rather convoluted, though that's the best source of info on this I've found.

Is there a description of the algorithm that converts the DSL into a Lucene query? Or better yet, a tool like the explain endpoint that can show me the Lucene query generated from the DSL?

Answering my own question, since ES support was very helpful with this.

The Profile API exposes a query description and query structure. So for a query like the example above (here with "operator": "and" instead of minimum_should_match),

GET /activities/_search
{
  "profile": true,
  "query": {
    "multi_match": {
      "query": "turtle time tri",
      "type": "bool_prefix",
      "operator": "and",
      "fields": [
        "title^2",
        "title._2gram^2",
        "title._3gram^2",
        "description",
        "description._2gram",
        "description._3gram"
      ]
    }
  }
}

Returns the following structure:

BooleanQuery
  ConstantScore(description._index_prefix:turtl time tri)
  (ConstantScore(title._index_prefix:turtl time tri))^2.0
  (+title:turtl +title:time +ConstantScore(title._index_prefix:tri))^2.0
  (+description:turtl +description:time +ConstantScore(description._index_prefix:tri))
  (+description._2gram:turtl time +ConstantScore(description._index_prefix:time tri))
  (+title._2gram:turtl time +ConstantScore(title._index_prefix:time tri))^2.0
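To get a listing like the one above out of the raw response, a small helper can walk the profile section. This is a sketch, assuming the documented Profile API layout (profile.shards[].searches[].query[], where each node has type, description and optional children). The helper name renderProfile and the sample object, including its node types and shard id, are hand-written for illustration and are not real output.

```javascript
// Walk the `profile` section of a search response and collect one indented
// line per query node, formatted as "<type>: <description>".
function renderProfile(profile) {
  const lines = [];
  const walk = (node, depth) => {
    lines.push(`${'  '.repeat(depth)}${node.type}: ${node.description}`);
    for (const child of node.children ?? []) walk(child, depth + 1);
  };
  for (const shard of profile.shards ?? []) {
    for (const search of shard.searches ?? []) {
      for (const root of search.query ?? []) walk(root, 0);
    }
  }
  return lines;
}

// Hand-written stand-in for `response.profile`. A real response carries more
// fields (timings, breakdowns); the shard id and node types here are made up.
const sampleProfile = {
  shards: [{
    id: '[hypothetical-node][activities][0]',
    searches: [{
      query: [{
        type: 'BooleanQuery',
        description: '+title:turtl +title:time +ConstantScore(title._index_prefix:tri)',
        children: [
          { type: 'TermQuery', description: 'title:turtl' },
          { type: 'TermQuery', description: 'title:time' },
          { type: 'ConstantScoreQuery', description: 'ConstantScore(title._index_prefix:tri)' },
        ],
      }],
    }],
  }],
};

console.log(renderProfile(sampleProfile).join('\n'));
```

With a client that returns the response body directly, this would presumably be called as renderProfile(response.profile) on the result of a search sent with profile: true.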

Conceptually, this seems to break down into the following queries:

  • For each target field, try to match the full phrase against the _index_prefix field.
  • For each target field, match every term except the last against the single-term field, and match just the last term as a constant-score query against the _index_prefix field.
  • If there are at least three terms (as in this example), match pairs of terms against the _2gram field, and the final pair against the _index_prefix field.

I still wish for some more detail here. For example, I can't tell whether the clauses in the BooleanQueries are "shoulds" or "musts", and if they are shoulds, what the minimum_should_match value is.
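One partial answer, assuming the profile description is the string form of the Lucene query: Lucene's BooleanQuery prints required clauses with a leading "+", prohibited ones with "-", filters with "#", and optional (should) clauses with no prefix. A nonzero minimum-should-match is, as far as I know, printed as a "~N" suffix. The helper below is a sketch of that reading. The name occurOf is made up, and it only inspects the first character of a single clause rather than parsing a whole description.

```javascript
// Map the leading character of ONE clause, as printed in a Lucene
// BooleanQuery description, to its occur type.
function occurOf(clause) {
  switch (clause.trim()[0]) {
    case '+': return 'MUST';
    case '-': return 'MUST_NOT';
    case '#': return 'FILTER';
    default: return 'SHOULD';
  }
}

const clauses = [
  // a top-level clause from the profile output: no prefix
  '(+title:turtl +title:time +ConstantScore(title._index_prefix:tri))^2.0',
  // clauses inside it: leading "+"
  '+title:turtl',
  '+ConstantScore(title._index_prefix:tri)',
];

for (const c of clauses) console.log(occurOf(c), '<-', c);
```

Read this way, the per-field clauses at the top level would be shoulds, while the terms inside each of them are musts, which fits the "operator": "and" in the request.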

It's also surprising that in the _2gram query, (+description._2gram:turtl time +ConstantScore(description._index_prefix:time tri)), the second word of the query, "time", is used twice: as part of the 2gram and also as part of the _index_prefix.
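A likely explanation for the repetition, and for the three patterns listed earlier, is shingling: the _2gram and _3gram subfields of a search_as_you_type field are documented as using shingle filters of size 2 and 3, and consecutive shingles overlap by construction. The sketch below models one per-field clause on that assumption. The names shingles and clauseSketch are hypothetical, and the rule "every shingle but the last is required on the field, the last goes to _index_prefix" is inferred from the profile output above, not taken from the Elasticsearch source.

```javascript
// Overlapping word n-grams ("shingles") of a fixed size.
function shingles(terms, size) {
  const grams = [];
  for (let i = 0; i + size <= terms.length; i++) {
    grams.push(terms.slice(i, i + size).join(' '));
  }
  return grams;
}

// Model of one per-field clause: every shingle but the last is a required term
// on the (sub)field; the last one becomes a match on `<base>._index_prefix`.
function clauseSketch(base, size, terms) {
  const grams = shingles(terms, size);
  if (grams.length === 0) return null; // too few terms: not modelled here
  const field = size === 1 ? base : `${base}._${size}gram`;
  const required = grams.slice(0, -1).map((g) => `+${field}:${g}`);
  const prefix = `ConstantScore(${base}._index_prefix:${grams[grams.length - 1]})`;
  return required.length === 0 ? prefix : [...required, `+${prefix}`].join(' ');
}

const analyzedTerms = ['turtl', 'time', 'tri'];
for (const size of [1, 2, 3]) {
  console.log(clauseSketch('description', size, analyzedTerms));
}
```

For the three analyzed terms in this example, the sketch reproduces the three description clauses from the profile output (without the outer parentheses). The two shingles of size 2 are "turtl time" and "time tri", which is why "time" shows up in both halves of the _2gram clause.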
