We are currently using regexp filters to support "contains" queries for an application. As expected, performance is abysmal because the regex has no literal leading prefix. Below is an example query that we are using:
[root@machine ~]# time curl -XGET 'localhost:9200/items_search/_count?routing=123&pretty' -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "cat_id" : "123"}},
            { "term" : { "availability" : "in stock"}},
            { "regexp": { "category.lowercase" : ".*women .*"}}
          ]
        }
      }
    }
  }
}'
{
  "count" : 11323,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  }
}
real 0m41.518s
user 0m0.005s
sys 0m0.001s
If I change the regex to "women .*", the query returns within a couple of seconds. One thing to note is that the shard this query is routed to is around 80 GB.
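To illustrate why I think the leading ".*" hurts so much, here is a simplified model in Python. It assumes the field's terms are kept in a sorted list: a literal prefix lets the search jump to a small range, while a pattern with no prefix has to be tested against every term. The function names and sample terms are made up for illustration, and this is only a sketch of the idea, not how Lucene actually evaluates regexp queries.

```python
import bisect
import re


def match_with_prefix(sorted_terms, prefix, pattern):
    """Test only the terms that start with the literal prefix."""
    regex = re.compile(pattern)
    # Binary search to the first term that could start with the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        examined += 1
        if regex.fullmatch(term):
            matches.append(term)
    return matches, examined


def match_full_scan(sorted_terms, pattern):
    """No usable prefix: every term has to be tested."""
    regex = re.compile(pattern)
    matches = [term for term in sorted_terms if regex.fullmatch(term)]
    return matches, len(sorted_terms)


# Hypothetical sample values for the category field.
terms = sorted([
    "kids shoes",
    "men shirts",
    "clothing women dresses",
    "women shoes",
    "women tops",
])

prefix_matches, prefix_examined = match_with_prefix(terms, "women ", r"women .*")
scan_matches, scan_examined = match_full_scan(terms, r".*women .*")
```

With these five sample terms the prefixed pattern examines two terms and the unprefixed one examines all five. On a field with many distinct values, that difference is presumably what separates the two timings I am seeing.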
I understand that the right way to do this would be to use ngram analyzers on the fields we want "contains" matching on, which would speed up search queries. However, we currently have over 3.5 billion documents in our index, and very few queries use regex, so changing the analyzer for the field (it is currently not analyzed) would hurt our indexing rate.
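For context, this is my rough understanding of what the ngram approach buys, as a minimal Python sketch. It assumes fixed-size character trigrams and a plain in-memory dictionary as the inverted index; the helper names and sample documents are hypothetical, and it does not reflect how Elasticsearch implements its ngram tokenizer.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Map each n-gram to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for gram in ngrams(value, n):
            index[gram].add(doc_id)
    return index


def contains(index, docs, needle, n=3):
    """Answer a 'contains' query with n-gram lookups plus a final check."""
    grams = ngrams(needle, n)
    if not grams:
        # Needle is shorter than n, so the index cannot help: scan everything.
        return {doc_id for doc_id, value in docs.items() if needle in value}
    # Candidates must contain every n-gram of the needle.
    candidates = set.intersection(*(index.get(gram, set()) for gram in grams))
    # Sharing all n-grams does not guarantee a contiguous match, so verify.
    return {doc_id for doc_id in candidates if needle in docs[doc_id]}


# Hypothetical sample documents: id -> category value.
docs = {
    1: "clothing women dresses",
    2: "men shirts",
    3: "women shoes",
}
index = build_index(docs)
result = contains(index, docs, "women ")
```

The lookup no longer walks every term, but each value of length L adds up to L - n + 1 entries to the index, which is the extra indexing cost I am worried about at our document count.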
Are there any workarounds for this? Any pointers or resources would be much appreciated.
Thanks