Queries on keyword field are extremely slow

Mujo · December 2, 2021, 11:22pm

I use Elasticsearch for text search.
Before indexing, each sentence is split into words and each word is joined with its POS tag, as shown below:
Raw sentence: word1 word2 word3 word4.
Sentence for indexing: (word1_tagA) (word2_tagB) (word3_tagA) (word4_tagC)

I mapped the sentence field as keyword because I need to search with regex,
for example: search for sentences containing a word of tagA that follows word2 of tagB

res = es.search(index = INDEX_NAME, body = {"query": {"regexp":{"parsed_text": "@(word2_tagB)([a-zA-Z]+_tagA)@"}}})

which should return sentences such as,
(word1_tagA) (word2_tagB) (word3_tagA) (word4_tagC)

My problem is that the speed is extremely slow.
The deadline for my final project is approaching.

There are about 5 million indexed sentences.
I'm running Elasticsearch on a MacBook Air.

Christian_Dahlqvist · December 3, 2021, 5:20am

Using regex queries on standard keyword fields like this likely means that you will not use any indices and therefore need to process all strings, which will not scale well and likely perform badly. I would recommend you try changing to the wildcard field type and see if that helps as this was designed to help with this scenario.

Another alternative could be to index all word_tag combinations as individual keywords in a separate array field, and add a filter that excludes all documents that do not contain the token you need to match exactly. This would use an index and hopefully reduce the number of documents that the regex needs to run against.
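A minimal sketch of that second idea, assuming a hypothetical extra keyword field called `tokens` and hypothetical helper names. It splits the stored sentence into its word_tag tokens for indexing, then combines a `term` filter on the fully known token with the regexp on `parsed_text`.

```python
import re


def extract_tokens(parsed_text):
    """Return the word_tag tokens from a string like '(w1_tagA) (w2_tagB)'."""
    return re.findall(r"\(([^()\s]+)\)", parsed_text)


def build_document(parsed_text):
    """Document body with the original string plus a keyword array of tokens."""
    return {"parsed_text": parsed_text, "tokens": extract_tokens(parsed_text)}


def build_filtered_query(required_token, pattern):
    """Bool query: exact term match on `tokens` plus the regexp on `parsed_text`."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Index-backed lookup for the token that is fully known.
                    {"term": {"tokens": required_token}},
                    # The regexp still checks order and adjacency.
                    {"regexp": {"parsed_text": {"value": pattern}}},
                ]
            }
        }
    }
```

For the example in the question, `build_filtered_query("word2_tagB", r".*\(word2_tagB\) \([a-zA-Z]+_tagA\).*")` would be the search body. The order of the clauses does not force the term filter to run first; the expectation is only that the cheap term clause lets Elasticsearch skip most documents.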

Mujo · December 4, 2021, 2:59am

Thanks for your help!
I'm using Elasticsearch 7.10, which should support the wildcard field type.
But when I changed the mapping to wildcard, I got this error:
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'No handler for type [wildcard] declared on field [parsed_text]')

I just found that in order to use this feature, I need to buy an X-Pack subscription.
Is that the reason for the error above?
Is there any way to use this feature for free?

dadoonet · December 4, 2021, 5:50am

It's probably because you are using the OSS distribution.

You should switch to the default distribution and upgrade to 7.15.2, while you are at it.

Mark_Harwood · December 4, 2021, 8:52am

As Christian mentions, the wildcard field would help with accelerating regex matches on character sequences found in big strings.

However there is another approach to consider which lets you stick with word-based indices and word-sequence queries (phrase/span/interval). The annotated_text field would allow you to store your POS tokens as a form of inline synonym positioned at the same indexed word-position as the words they describe.
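A hedged sketch of how this could look, assuming the `mapper-annotated-text` plugin is installed and the field is mapped as `annotated_text`. That field type reads markdown-like `[text](annotation)` markup with URL-encoded annotation values. The helper names are hypothetical, and the span query is one possible way to express "word2 followed by any word tagged tagA", not a tested recipe.

```python
from urllib.parse import quote


def to_annotated_text(pairs):
    """Turn [(word, tag), ...] into '[word](tag) [word](tag)' markup."""
    return " ".join(f"[{word}]({quote(tag, safe='')})" for word, tag in pairs)


def build_span_query(word, next_tag, field="parsed_text"):
    """Span query: `word` immediately followed by a token annotated `next_tag`."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    # Assumption: words are lowercased by the field's analyzer.
                    {"span_term": {field: word.lower()}},
                    # Annotations are indexed verbatim, so the tag keeps its case.
                    {"span_term": {field: next_tag}},
                ],
                "slop": 0,
                "in_order": True,
            }
        }
    }
```

For example, `to_annotated_text([("word1", "tagA"), ("word2", "tagB")])` yields `[word1](tagA) [word2](tagB)`, and `build_span_query("word2", "tagA")` builds the search body for the example in the question.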
