Wildcard query returning results in random order

Hi,

I have an index and a field in the index has a custom analyzer. The custom analyzer definition is:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "anl_email": {
                    "type": "custom",
                    "tokenizer": "uax_url_email",
                    "filter": ["lowercase", "asciifolding"] 
                }
            }
        }
    }
}

and I'm doing my wildcard query as:

{
    "query": {
        "wildcard": {
            "Username": "*wa*"
        }
    }
}

(the Username field has the custom analyzer anl_email defined above)

The problem is that (assuming the default size of 10) this query returns results in a random order: running the same query twice produces a different set of results each time.

In all those results, each document has a score of 1. I also tried sorting on the _score and _doc properties e.g. "sort": ["_score", "_doc"] (as suggested in #elasticsearch), but same results.

I need a consistent and predictable result set so that I can paginate over it. Any idea what I'm doing wrong here?

Hi,

Multi-term queries like "wildcard" are rewritten before being executed. The default is to rewrite to a constant_score query, which is why all your documents are scored 1. If you want scoring, try one of the other rewrite options described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-term-rewrite.html
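
As a rough sketch of what that could look like for the query above: the long form of the wildcard query accepts a rewrite parameter. scoring_boolean is one of the documented rewrite values; whether it is the right choice here is an assumption, and on a large index it can hit the boolean clause limit because every matching term becomes its own clause.

```json
# Same wildcard query as above, but rewritten to a scoring boolean query
# instead of the default constant-score rewrite (the rewrite value is an assumption).
{
    "query": {
        "wildcard": {
            "Username": {
                "value": "*wa*",
                "rewrite": "scoring_boolean"
            }
        }
    }
}
```

Even with real scores, ties between documents are still possible, so scoring alone may not give a stable order for pagination.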

I need to allow partial search over email addresses. Is there any way I can get a predictable sort order?

And also is wildcard appropriate for what I'm trying to achieve? My goal is to a) allow partial search b) paginate those results.

Partial search (a) works pretty well with the wildcard query, but pagination (b) does not.

Can you simply sort by the email address field? I saw you tried "_score" and "_doc", but why not return the results in alphabetical order? If that's not possible because the field is analyzed, why not store an un-analyzed version alongside it (using multi-fields) to sort on?
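
A minimal sketch of the multi-field idea, assuming an Elasticsearch 1.x/2.x style string mapping. The index name my_index, the type name user and the sub-field name raw are made up for illustration; only Username and anl_email come from this thread. On 5.x and later the sub-field would be of type keyword instead.

```json
# Hypothetical index and type names; "raw" is an illustrative sub-field name.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "anl_email": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "Username": {
          "type": "string",
          "analyzer": "anl_email",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

# Search on the analyzed field, sort on the un-analyzed copy.
GET /my_index/user/_search
{
  "query": {
    "wildcard": { "Username": "*wa*" }
  },
  "sort": [
    { "Username.raw": { "order": "asc" } }
  ]
}
```

Note that the raw copy is not lowercased, so the sort on it is case-sensitive.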

I tried sorting on the Username field and it seems like that's exactly what I want, so thanks.
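
For reference, a sketch of what the paginated query might look like with that sort in place. from and size are the standard paging parameters; the page size of 10 matches the default mentioned earlier, and the offset is just an example value.

```json
# Second page of 10 results, ordered by the Username field.
{
    "from": 10,
    "size": 10,
    "query": {
        "wildcard": {
            "Username": "*wa*"
        }
    },
    "sort": [
        { "Username": { "order": "asc" } }
    ]
}
```

If two documents can share the same Username value, a second sort key as a tiebreaker would help keep the pages stable.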

One question, though. You mentioned "If that's not possible [i.e. sorting] because the field is analyzed", yet even though my field uses a custom analyzer, I can still sort on it. Is this a bug, or is it not normal?

My custom analyzer's definition:

"anl_email": {
    "type": "custom",
    "tokenizer": "uax_url_email",
    "filter": ["lowercase", "asciifolding"] 
}

Analyzed fields are usually broken into tokens, so by "cannot be sorted" I mean the results are unpredictable. Take the following example of a string field (analyzed by default using the Standard Analyzer):

# analyzed sorting

PUT /index_a/type/1
{
  "content" : "X B A"
}

PUT /index_a/type/2
{
  "content" : "C Y"
}

GET /index_a/type/_search
{
  "sort": [
    {
      "content": {
        "order": "asc"
      }
    }
  ]
}

===>

"hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "index_a",
        "_type": "type",
        "_id": "1",
        "_score": null,
        "_source": {
          "content": "X B A"
        },
        "sort": [
          "a"
        ]
      },
      {
        "_index": "index_a",
        "_type": "type",
        "_id": "2",
        "_score": null,
        "_source": {
          "content": "C Y"
        },
        "sort": [
          "c"
        ]
      }
    ]
  }

Now you could argue that doc 2 should be sorted first because it starts with "C", but it isn't: as the sort values show, each document is ordered by its lowest token ("a" and "c"), not by the original string. You don't get an error, just unexpected behaviour. Also, I'm not sure this works consistently between versions. Here are some more thoughts on that: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html

Thanks for the great explanation and the link. I went through that page and it was helpful.

I now understand why it can be unpredictable. In my specific case, though, assuming the Username field is always a single valid email address, uax_url_email would tokenize the complete email (e.g. example@example.com) into a single token, so the field would always contain only one token. Would the behavior still be unpredictable or unexpected then?

Or am I better off storing the raw email address as another string field (as suggested in the link)?
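
One way to check the single-token assumption is the _analyze API, which shows the tokens an analyzer produces for a given text. A sketch, assuming the request-body form of _analyze (available from Elasticsearch 2.0 as far as I know; 1.x takes analyzer and text as URL parameters instead) and a hypothetical index name my_index that has anl_email in its settings:

```json
# my_index is a hypothetical index name.
GET /my_index/_analyze
{
  "analyzer": "anl_email",
  "text": "example@example.com"
}
```

If the response lists exactly one token for the kinds of addresses being indexed, the sort has only that one value per document to work with.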