Is there a way to explain the More Like This query? (not the result, but the pre-process)

I am trying to find out how exactly the More Like This query works. In my scenario I have two indices:

  • the first one is an official list of products, each one with a description and a barcode
  • the other one is the client's list of products, each one with some attributes (not exactly the same fields as the official database, but the same information, e.g. model, brand, amount, unit, ...) and also a barcode (kind of)

It would be very easy to match the barcodes, but the thing is, it is not guaranteed that the client-side barcode is correct or even valid. So I am trying to get the MLT query to solve this.

  • In both indices I create a concatenated field with all the available attribute values, separated by "/"
  • I then run an MLT query, which gives me some satisfactory results

But the thing is, I want to debug how the MLT chooses the terms for the disjunctive query.

(Sorry, from this point on the examples are kept in Portuguese.)

I know that for a given hit I can use the _explain API to find out why it was chosen...

GET shopping-br-base-interna/_explain/233687
{
  "query": {
    "more_like_this": {
      "fields": [
        "completo"
      ],
      "like": "7891010026264/Higiêne Pessoal/Repelentes e Proteção Solar/Protetor Solar Sundown Kids Fps60 120Ml protetor",
      "min_term_freq": 1,
      "min_doc_freq": 0,
      "minimum_should_match": "30%"
    }
  }
}

And with the _termvectors API I can get statistics about the content's terms...

GET shopping-br-base-interna/_termvectors/233687
{
  "fields": [
    "completo",
    "7891010026264/Higiêne Pessoal/Repelentes e Proteção Solar/Protetor Solar Sundown Kids Fps60 120Ml"
  ],
  "offsets": true,
  "positions": true,
  "term_statistics": true,
  "field_statistics": true
}

But I would like to debug the MLT at the moment it chooses the terms. For example, although I have "min_term_freq: 1" and the default "max_query_terms: 25", why does it only seem to have chosen "e", "pessoal", "inf" (synonym applied), "120ml", "fps60", "sundown", "prot" (synonym) and "solar" for the above example? Why didn't it pick "7891010026264" (the barcode)?

Anyway, I guess what I am trying to ask here is:

  • Is there a way to see what the disjunctive query created by MLT is?
  • Is there a way to ask it to explain the process that got to this disjunctive query?

Thanks!

It basically tries to pick terms with high TF-IDF (common in the doc, rare in the index).
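
The selection described above can be sketched roughly as follows. This is only an illustration, not the actual Elasticsearch/Lucene code: the function name `select_mlt_terms`, the idf formula (a classic `log((N + 1) / (df + 1)) + 1`), the rule that terms missing from the index are skipped, and all the frequency numbers in the example are assumptions made for the sketch. The defaults mirror the documented MLT defaults (`min_term_freq: 2`, `min_doc_freq: 5`, `max_query_terms: 25`).

```python
import math
from collections import Counter


def select_mlt_terms(like_tokens, doc_freq, num_docs,
                     min_term_freq=2, min_doc_freq=5,
                     max_doc_freq=None, max_query_terms=25):
    """Filter analyzed tokens by frequency, then keep the top tf*idf terms."""
    scored = []
    for term, tf in Counter(like_tokens).items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq:
            continue  # too rare inside the "like" text
        if df < min_doc_freq:
            continue  # too rare in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index
        if df == 0:
            continue  # assumption: a term unknown to the index cannot match
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0  # assumed idf formula
        scored.append((tf * idf, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


# Hypothetical analyzed tokens and document frequencies (invented numbers).
TOKENS = ["7891010026264", "protetor", "solar", "sundown",
          "kids", "fps60", "120ml", "protetor"]
DOC_FREQ = {"protetor": 120, "solar": 300, "sundown": 15,
            "kids": 500, "fps60": 40, "120ml": 200}

selected = select_mlt_terms(TOKENS, DOC_FREQ, num_docs=10000,
                            min_term_freq=1, min_doc_freq=0)
```

In this made-up example the barcode never reaches the query even with `min_doc_freq=0`, simply because no indexed document contains it; with the default thresholds only "protetor" would survive.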
I suspect the reason it didn't pick the barcode is the default 'min_doc_freq' - see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html#mlt-query-term-selection
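
One way to get closer to the terms MLT would consider is the `filter` option of the _termvectors API, which keeps only the top terms by tf-idf and accepts thresholds with similar names. Below is a minimal sketch that builds such a request body for an artificial document; the helper name `termvectors_filter_body` is hypothetical, and whether the returned terms match the MLT choice exactly is an assumption, not something verified here.

```python
import json


def termvectors_filter_body(field, text, max_num_terms=25,
                            min_term_freq=1, min_doc_freq=0):
    """Build a _termvectors body that scores an ad hoc text against the index."""
    return {
        "doc": {field: text},        # artificial document, analyzed on the fly
        "fields": [field],
        "term_statistics": True,     # needed to see doc_freq per term
        "field_statistics": True,
        "positions": False,
        "offsets": False,
        "filter": {                  # keep only the best terms by tf-idf
            "max_num_terms": max_num_terms,
            "min_term_freq": min_term_freq,
            "min_doc_freq": min_doc_freq,
        },
    }


body = termvectors_filter_body(
    "completo",
    "7891010026264/Higiêne Pessoal/Repelentes e Proteção Solar/"
    "Protetor Solar Sundown Kids Fps60 120Ml protetor",
)
request_json = json.dumps(body, ensure_ascii=False, indent=2)
```

The resulting JSON would be sent as the body of `GET shopping-br-base-interna/_termvectors` (no document id, since the text is supplied in `doc`). Each returned term should come with a score, which makes it easier to see which terms rank highest and whether the barcode appears at all.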