Filtered wildcard query

Hi,

I need your help/advice to increase performance of wildcard query. The same question was here Performance of filtered wildcard queries , but is closed now.

We have to use wildcard for data that is filtered by customer id. So data for applying wildcard is really small, about 200-300 records and should not be big deal for ES. But time it takes is about 5-10 seconds, while just filtering by customerId is less than second.

What can we do to increase performance?

I need to run query like this one:
SELECT * FROM Transactions WHERE (creditCusomerId = 123 OR debitCustomerId=123) AND search_field LIKE '%FOO%'

Here is ES query:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "term": {
                      "payload.creditCustomerId": 123
                    }
                  },
                  {
                    "term": {
                      "payload.debitCustomerId": 123
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "wildcard": {
                      "search_field": "*FOO*"   //type: KEYWORD
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }
  }
}

Thanks

1st of all: don't use wildcard queries.
Then try something more like:

GET index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "payload.creditCustomerId": 123
                }
              },
              {
                "term": {
                  "payload.debitCustomerId": 123
                }
              }
            ]
          }
        }
      ],
      "must": [
              {
                "wildcard": {
                  "search_field": "*FOO*"
                }
              }
      ]
    }
  }
}

Thanks David,

We know wildcard is expensive feature. But requirements forces us to search for a substring. On other hand we expected that wildcard will be applied to tiny subset, about 300 docs. We tried post_filter, results were just slightly better

I used Profile API and noticed that most time is spent on build_scorer. In our case we use constant_score. Why does build_scorer take so much time, when we do not need it?

{
	"type": "MultiTermQueryConstantScoreWrapper",
	"description": "search_field:*FOO*",
	"time_in_nanos": 10514827001,
	"breakdown": {
		"score": 0,
		"build_scorer_count": 15,
		"match_count": 0,
		"create_weight": 1535,
		"next_doc": 0,
		"match": 0,
		"create_weight_count": 1,
		"next_doc_count": 0,
		"score_count": 0,
		"build_scorer": 10514821338,
		"advance": 4093,
		"advance_count": 19
	}
}

That's why I'd encourage you looking at ngrams instead. You'll pay the price at index time (disk space wise and index time) but don't pay it at search time.

Did you try my query proposal?

Could you also try this one?

GET index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "payload.creditCustomerId": 123
                }
              },
              {
                "term": {
                  "payload.debitCustomerId": 123
                }
              }
            ]
          }
        },
        {
          "wildcard": {
            "search_field": "*FOO*"
          }
        }
      ]
    }
  }
}

BTW which version are you using?

BTW I forgot, you can use the Rescore API to first filter by ID which should be fast then apply the wildcard in the rescore part.

See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html

ES version is 6.3.1
Yes, I tried your query - same results.
We have not experience yet with ngrams, not quite understand how it searches longer substrings than ngram length. If you know some good article, please post.
Thanks for Rescore API, will try it also

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.