Wildcard query returning results in random order

frugal-derision · August 3, 2016, 3:33pm

Hi,

I have an index and a field in the index has a custom analyzer. The custom analyzer definition is:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "anl_email": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase", "asciifolding"] 
                }
            }
        }
    }
}

and I'm doing my wildcard query as:

{
    "query": {
        "wildcard": {
            "Username": "*wa*"
        }
    }
}

(the Username field has the custom analyzer anl_email defined above)

The problem is that, with the default size of 10, this query returns the results in a random order: executing the same query twice produces two different sets of results.

In all those results, each document has a score of 1. I also tried sorting on the _score and _doc properties, e.g. "sort": ["_score", "_doc"] (as suggested in #elasticsearch), but with the same outcome.

I need a consistent and predictable result set so that I can then paginate over it. Any ideas what I am doing wrong here?

cbuescher · August 3, 2016, 4:10pm

Hi,

Multi-term queries like "wildcard" are rewritten before being executed. The default setting for this is to rewrite to a constant_score query. That's why your documents are all scored 1. If you want scoring, try one of the other options for rewrite mentioned here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-term-rewrite.html
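
As a rough sketch of what choosing another rewrite option could look like, the snippet below builds the request body for the wildcard query from the first post with an explicit rewrite value. The helper name wildcard_body is made up for illustration, and scoring_boolean is just one of the documented options, picked here as an assumption about what might suit this case.

```python
import json


def wildcard_body(field, pattern, rewrite="scoring_boolean"):
    """Build a wildcard search body that overrides the default rewrite.

    `wildcard_body` is a hypothetical helper, not part of any client library.
    """
    return {
        "query": {
            "wildcard": {
                field: {
                    "value": pattern,    # long form of "Username": "*wa*"
                    "rewrite": rewrite,  # default would be constant_score
                }
            }
        }
    }


body = wildcard_body("Username", "*wa*")
print(json.dumps(body, indent=4))
```

Note that scoring_boolean expands the pattern into one clause per matching term, so a broad pattern such as *wa* may run into the boolean clause limit on a large index. Scores can also tie, so scoring by itself may not give a stable order.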

frugal-derision · August 3, 2016, 4:22pm

I need to allow partial search over email addresses. Is there any way I can get a predictable sort order?

And also is wildcard appropriate for what I'm trying to achieve? My goal is to a) allow partial search b) paginate those results.

a) partial search is working pretty well with wildcard query but not b).

cbuescher · August 3, 2016, 4:40pm

Can you simply sort by the email address field? I saw you tried "_score" and "_doc", but why not return the results in alphabetical order? If that's not possible because the field is analyzed, why not store an un-analyzed version alongside it (using multi-fields) to sort on?
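
A minimal sketch of the multi-field idea, assuming a 2.x-style string mapping (newer versions would use a keyword sub-field instead). The sub-field name raw and the helper page_body are hypothetical; only Username, anl_email and the *wa* pattern come from the posts above.

```python
import json

# Hypothetical sub-field name; "raw" is a common convention, not from this thread.
RAW = "raw"

# Mapping fragment: the analyzed field for searching, plus an un-analyzed
# copy of the same value that is only used for sorting.
mapping = {
    "properties": {
        "Username": {
            "type": "string",
            "analyzer": "anl_email",
            "fields": {
                RAW: {"type": "string", "index": "not_analyzed"}
            },
        }
    }
}


def page_body(pattern, page, size=10):
    """Search body for one page, sorted on the un-analyzed sub-field."""
    return {
        "from": page * size,  # zero-based page number
        "size": size,
        "query": {"wildcard": {"Username": pattern}},
        "sort": [{"Username." + RAW: {"order": "asc"}}],
    }


print(json.dumps(mapping, indent=4))
print(json.dumps(page_body("*wa*", page=0), indent=4))
```

Because the sort key is the same for every request, consecutive pages should line up as long as the index does not change between requests. If two documents can share the same Username, a second sort key would be needed to break ties.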

frugal-derision · August 3, 2016, 5:29pm

I tried sorting on the Username field and it seems like that's exactly what I want, so thanks.

One question, though. You mentioned: "If thats not possible [i.e. sorting] because the field is analyzed". Even though my field is custom analyzed, I can still sort on it. Is this a bug, or not normal?

My custom analyzer's definition:

"anl_email": {
    "type": "custom",
    "tokenizer": "uax_url_email",
    "filter": ["lowercase", "asciifolding"] 
}

cbuescher · August 3, 2016, 8:08pm

Analyzed fields are usually broken into tokens, so by "cannot be sorted" I mean the results are unpredictable. Take the following example of a string field (analyzed by default using the Standard Analyzer):

# analyzed sorting

PUT /index_a/type/1
{
  "content" : "X B A"
}

PUT /index_a/type/2
{
  "content" : "C Y"
}

GET /index_a/type/_search
{
  "sort": [
    {
      "content": {
        "order": "asc"
      }
    }
  ]
}

===>

"hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "index_a",
        "_type": "type",
        "_id": "1",
        "_score": null,
        "_source": {
          "content": "X B A"
        },
        "sort": [
          "a"
        ]
      },
      {
        "_index": "index_a",
        "_type": "type",
        "_id": "2",
        "_score": null,
        "_source": {
          "content": "C Y"
        },
        "sort": [
          "c"
        ]
      }
    ]
  }

Now you could argue that doc 2 should be sorted first because it starts with "C", but it isn't. You don't get an error, just unexpected behaviour. Also, I'm not sure if this will work consistently between versions. Here are some more thoughts on that: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html
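
The behaviour in the example above can be imitated in a few lines of plain Python. This is only a simulation under simplifying assumptions: lowercasing plus whitespace splitting stands in for the Standard Analyzer, and the ascending sort is assumed to use the smallest token of each document, which matches the sort values "a" and "c" in the response.

```python
# The two documents from the example, keyed by _id.
docs = {"1": "X B A", "2": "C Y"}


def tokens(text):
    """Crude stand-in for the Standard Analyzer: lowercase, split on spaces."""
    return text.lower().split()


def sort_value_asc(text):
    """Ascending sort on an analyzed field: use the smallest token."""
    return min(tokens(text))


# Order as seen in the response: doc 1 sorts on "a", doc 2 on "c".
by_tokens = sorted(docs, key=lambda doc_id: sort_value_asc(docs[doc_id]))
print(by_tokens)  # ['1', '2']

# Order a reader might expect: compare the whole original string.
by_whole_value = sorted(docs, key=lambda doc_id: docs[doc_id].lower())
print(by_whole_value)  # ['2', '1']
```

The two orders differ because the token-based sort never sees "X B A" as a whole; it only sees the individual tokens x, b and a.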

frugal-derision · August 3, 2016, 8:38pm

Thanks for the great explanation and the link, I went through that page, it was helpful.

I now understand why it can be unpredictable. For my specific case, though, assuming that the Username field is always a single valid email address, uax_url_email would tokenize the complete email (e.g. example@example.com) into a single token, so the field would always contain only one token. Would the behavior still be unpredictable or unexpected then?

Or am I better off storing the raw email address as another string field (as suggested in the link)?
