Question about TF and prefix search

Allen7_Skillz · March 20, 2024, 1:49am

Hello everyone, how are you?

I'm having some problems with a search.

Currently, I am searching for the word "Cracha", and the purpose of the search is to bring up all the results that start with that word.

It works well, but there is a small inversion in the order of the results returned, for example:

Cracha xxx
Cracha xxxz
Cracha xxxxzx
Cordao p cracha
Cracha xxxxxxx

Using the explain method, I verified that the score of "Cordao p cracha" is higher than that of "Cracha xxxxxxx".

Basically all the values are the same, but when it comes to TF, the value for "Cordao" is slightly higher, which makes it more relevant and ranks it higher than it should be.

Is there anything I can do to get around this problem? I really don't understand why TF is higher for the "Cordao" record.

Kathleen_DeRusso · March 20, 2024, 12:28pm

Hi there @Allen7_Skillz - this will depend on a little more information. For example, what is the mapping of the field you are searching, what query are you running, and what is the output of _explain?

I can tell you that for a very simple example, cordao is returned at the bottom of the result set, so it has to be something else about how you're storing or querying the data.

PUT rank-test/_doc/1
{
  "text": "Cracha xxx"
}
PUT rank-test/_doc/2
{
  "text": "Cracha xxxz"
}

PUT rank-test/_doc/3
{
  "text": "Cracha xxxxzx"
}

PUT rank-test/_doc/4
{
  "text": "Cordao p cracha"
}

PUT rank-test/_doc/5
{
  "text": "Cracha xxxxxxx"
}

GET rank-test/_search
{
  "query": {
    "query_string": {
      "query": "cracha"
    }
  }
}

Allen7_Skillz · March 20, 2024, 2:28pm

Hello!

Sorry, I ended up not including the code itself.

Basically, I am importing products of many different types.

Regarding the field mapping, I'm working with a field I named "normDescription", which holds a normalized description. It is of type "text" with 2 sub-fields: raw (keyword) and folded.

Currently my search targets the normDescription.folded field.

As I mentioned previously, I am uploading data that is already normalized: before importing it into Elasticsearch, I convert it to lowercase and replace accents and special characters.

I did this to avoid ordering problems and problems related to case sensitivity.

        {
          "bool": {
            "should": [
              {
                "match_phrase_prefix": {
                  "normDescription": {
                    "query": "Cracha",
                    "boost": 2
                  }
                }
              },
              {
                "query_string": {
                  "query": "Cracha",
                  "fields": ["sku","uuid"],
                  "minimum_should_match": "2"
                }
              },
              {
                "match": {
                  "normDescription.fuzzy": {
                    "query": "Cracha",
                    "fuzziness": 2,
                    "prefix_length": 3,
                    "minimum_should_match": 2
                  }
                }
              }
            ]
          }
        }

In this scenario, the results come back in this order:

Cracha aleatorio
Cracha ika
Cordão para Cracha
Cracha de outro fornecedor
Kit de cordão p/ Cracha

Where the correct order should be:

Cracha aleatorio
Cracha ika
Cracha de outro fornecedor
Cordão para Cracha
Kit de cordão p/ Cracha

I tried some sort combinations, for example score desc and normDescrip asc. This works, but the order still ends up wrong, this time for grammatical reasons.

My intention is for the search to return first the records that start with the word I searched for, and then the other records that contain that word somewhere.
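One way to express this intent, as a minimal sketch rather than a tested fix, is to add a boosted `prefix` clause on the keyword sub-field so that descriptions starting with the term outrank those that merely contain it. This assumes the raw (keyword) sub-field holds the lowercased, accent-stripped value described above. The helper name `build_prefix_first_query` and the boost of 10 are hypothetical.

```python
import json


def build_prefix_first_query(term: str, prefix_boost: float = 10.0) -> dict:
    """Build a bool query that favors 'starts with term' over 'contains term'."""
    # The indexed data is described as lowercased; accent stripping is NOT done here.
    normalized = term.lower()
    return {
        "query": {
            "bool": {
                "should": [
                    # Matches only when the whole keyword value starts with the term.
                    {
                        "prefix": {
                            "normDescription.raw": {
                                "value": normalized,
                                "boost": prefix_boost,
                            }
                        }
                    },
                    # Matches when the term appears anywhere in the description.
                    {"match": {"normDescription": {"query": normalized}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    # Print the request body so it can be pasted after GET <index>/_search.
    print(json.dumps(build_prefix_first_query("Cracha"), indent=2))
```

Documents matching both clauses receive the prefix boost on top of the regular match score, so the boost would need tuning against real data.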

I also tried appending * to the end of the typed word.

Kathleen_DeRusso · March 20, 2024, 2:50pm

Thanks for the additional context.

What would be really helpful is to distill this into a simple example that I could reproduce - with mappings, a couple sample documents and a query.

Without more information, for example the output of the explain API and the actual mapping, it's hard to say exactly what is happening.

Can you narrow it down to one specific boolean clause?

Allen7_Skillz · March 21, 2024, 2:05pm

Hello again!

Thanks for the answer.

I will provide some more information and also some conclusions.

First, these are my mapping settings:

{
  "properties": {
    "id": {"type": "long"},
    "uuid": {"type": "keyword"},
    "sku": {"type": "keyword"},
    "description": {
      "type": "text",
      "fields": {
        "folded": {"type": "text", "analyzer": "folding"},
        "fuzzy": {"type": "text", "analyzer": "simple"},
        "raw": {"type": "keyword"}
      }
    },
    "normDescription": {
      "type": "text",
      "fields": {
        "folded": {"type": "text", "analyzer": "folding"},
        "fuzzy": {"type": "text", "analyzer": "simple"},
        "raw": {"type": "keyword"}
      }
    }
  }
}

And my Search:

{
  "from": 0,
  "size": 10,
  "_source": ["description"],
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match_phrase_prefix": {
                  "normDescription": {
                    "query": "Cracha*",
                    "boost": 2
                  }
                }
              },
              {
                "query_string": {
                  "query": "Cracha",
                  "fields": ["sku","uuid"],
                  "minimum_should_match": "1"
                }
              },
              {
                "match": {
                  "normDescription.fuzzy": {
                    "query": "Cracha",
                    "fuzziness": 2,
                    "prefix_length": 3,
                    "minimum_should_match": 2
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

I'm using the _source setting only to return the description and keep the response smaller. It's just for testing and will not be used in the future.

The return was like this:

      {
        "_index": "sb_product",
        "_type": "_doc",
        "_id": "pb01sb_41392",
        "_score": 72.157036,
        "_source": {
          "description": "Cordão para Cracha"
        },
        "sort": [
          72.157036,
          "Cordão para Cracha"
        ]
      },
      {
        "_index": "sb_product",
        "_type": "_doc",
        "_id": "pb01sb_41536",
        "_score": 64.49333,
        "_source": {
          "description": "Cracha de outro fornecedor"
        },
        "sort": [
          64.49333,
          "Cracha de outro fornecedor"
        ]
      }

Record "Cordão para Cracha":

_score: 7.330924
IDF (7.9352293), TF (0.41992968)

Record "Cracha de outro fornecedor":

_score: 6.3671646
IDF (7.9352293), TF (0.36472368)
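These two TF values are consistent with BM25 length normalization, where a shorter field gets a higher TF for the same term frequency. The sketch below uses the default BM25 parameters (k1 = 1.2, b = 0.75) and assumes the scored field sees 3 tokens for "Cordão para Cracha" and 4 for "Cracha de outro fornecedor". The average field length of 2.5 tokens is an assumption picked for illustration, not a value read from the index.

```python
def bm25_tf(freq: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """TF component of BM25:
    freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    """
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return freq / (freq + k1 * length_norm)


if __name__ == "__main__":
    AVG_LEN = 2.5  # assumed average field length, for illustration only

    # The term occurs once in both descriptions; only the field length differs.
    tf_three_tokens = bm25_tf(freq=1, doc_len=3, avg_doc_len=AVG_LEN)
    tf_four_tokens = bm25_tf(freq=1, doc_len=4, avg_doc_len=AVG_LEN)

    print(f"3 tokens: {tf_three_tokens:.4f}")
    print(f"4 tokens: {tf_four_tokens:.4f}")

    # With b = 0 the length term drops out and both TF values are equal.
    print(bm25_tf(1, 3, AVG_LEN, b=0.0) == bm25_tf(1, 4, AVG_LEN, b=0.0))
```

Under these assumptions the 3-token description gets a TF of about 0.42 and the 4-token one about 0.36, close to the values reported above.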

So I had the idea of simply changing the product name from "Cordão para Cracha" to "Cordão diversos para cracha".

It seems that, for some reason, the length of the record affects the TF calculation.

So, the result was this:

      {
        "_index": "sb_product",
        "_type": "_doc",
        "_id": "pb01sb_41536",
        "_score": 15.399356,
        "_source": {
          "description": "Cracha de outro fornecedor"
        }
      },
      {
        "_index": "sb_product",
        "_type": "_doc",
        "_id": "pb01sb_41392",
        "_score": 15.399356,
        "_source": {
          "description": "Cordão diversos para cracha"
        }
      }
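The identical scores are consistent with both descriptions now having the same number of tokens. If field length should not influence ranking at all, two options are to set "norms": false on the text field being scored, or to give it a BM25 similarity with b = 0. The sketch below builds index settings and a mapping for the second option as a Python dict. The similarity name `bm25_no_length`, the helper name, and the choice of the fuzzy sub-field are hypothetical, and the effect would need to be validated on a test index.

```python
import json


def build_length_insensitive_index() -> dict:
    """Index body where the fuzzy sub-field ignores field length when scoring."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Hypothetical name; b = 0 turns off BM25 length normalization.
                    "bm25_no_length": {"type": "BM25", "b": 0}
                }
            }
        },
        "mappings": {
            "properties": {
                "normDescription": {
                    "type": "text",
                    "fields": {
                        "fuzzy": {
                            "type": "text",
                            "analyzer": "simple",
                            "similarity": "bm25_no_length",
                        },
                        "raw": {"type": "keyword"},
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be used when creating a new test index.
    print(json.dumps(build_length_insensitive_index(), indent=2))
```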

system · April 18, 2024, 2:05pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.