I use Elasticsearch for text search.
Before indexing, each sentence is split into words, and each word is joined with its POS tag as below:
Raw sentence: word1 word2 word3 word4.
Sentence for indexing: (word1_tagA) (word2_tagB) (word3_tagA) (word4_tagC)
I mapped the sentence field as keyword because I need to search with regex, for example for sentences in which word2 with tagB is followed by a word with tagA:
res = es.search(index = INDEX_NAME, body = {"query": {"regexp":{"parsed_text": "@(word2_tagB)([a-zA-Z]+_tagA)@"}}})
which should return sentences such as,
(word1_tagA) (word2_tagB) (word3_tagA) (word4_tagC)
My problem is that the speed is extremely slow.
The deadline for my final project is approaching.
There are about 5 million sentences in the index.
I'm running Elasticsearch on a MacBook Air.
Running regex queries on a standard keyword field like this likely means that the index cannot be used to narrow the search, so every string has to be checked against the pattern. That will not scale well and will likely perform badly. I would recommend switching to the wildcard field type, which was designed for this scenario, and seeing if that helps.
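To make the wildcard suggestion concrete, here is a minimal sketch of the mapping and query bodies, assuming the cluster's distribution includes the wildcard field type. The field name parsed_text comes from the query in the question; the helper names are hypothetical. The pattern also assumes that the stored string contains literal parentheses and single spaces, so the parentheses are escaped and the pattern is anchored with .* on both sides because a regexp query has to match the whole value.

```python
import json

FIELD = "parsed_text"  # field name taken from the query in the question


def wildcard_mapping(field=FIELD):
    """Index-creation body that maps `field` as the wildcard type."""
    return {"mappings": {"properties": {field: {"type": "wildcard"}}}}


def regexp_query(pattern, field=FIELD):
    """Search body for a regexp query on `field`."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


# Assumption: tokens are stored as "(word_tag)" separated by one space.
# Unescaped parentheses would be treated as grouping, not as literals.
PATTERN = r".*\(word2_tagB\) \([a-zA-Z]+_tagA\).*"

if __name__ == "__main__":
    print(json.dumps(wildcard_mapping(), indent=2))
    print(json.dumps(regexp_query(PATTERN), indent=2))
    # With a 7.x Python client these bodies would be passed as, for example:
    #   es.indices.create(index=INDEX_NAME, body=wildcard_mapping())
    #   es.search(index=INDEX_NAME, body=regexp_query(PATTERN))
```

Existing documents would need to be reindexed into an index created with this mapping, since the type of an existing field cannot be changed in place.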
Another alternative could be to index all word_tag combinations as individual keywords in a separate array field, and add a filter that excludes every document that does not contain the combination you need to match exactly. This would use an index and hopefully reduce the number of documents that the regex needs to run against.
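A rough sketch of that second idea follows. The array field name tokens and both helper names are hypothetical, and tokens is assumed to be mapped as keyword. The exact-match token goes into a term filter next to the regexp clause; whether this reduces the regex work in practice would need to be measured.

```python
import re

# Matches one "(word_tag)" token and captures the part inside the parentheses.
TOKEN_RE = re.compile(r"\(([^()\s]+)\)")


def build_doc(parsed_text):
    """Return a document holding the original string plus its unique tokens."""
    tokens = sorted(set(TOKEN_RE.findall(parsed_text)))
    return {"parsed_text": parsed_text, "tokens": tokens}


def filtered_regexp_query(required_token, pattern):
    """Bool query: exact term filter on `tokens` plus the regexp clause."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"tokens": required_token}},
                    {"regexp": {"parsed_text": {"value": pattern}}},
                ]
            }
        }
    }


if __name__ == "__main__":
    doc = build_doc("(word1_tagA) (word2_tagB) (word3_tagA) (word4_tagC)")
    print(doc["tokens"])
    print(filtered_regexp_query("word2_tagB", r".*\(word2_tagB\) \([a-zA-Z]+_tagA\).*"))
```

Only the fully known part of the pattern (here word2_tagB) can go into the term filter; the part that varies still has to be handled by the regex.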
Thanks for your help!
I'm using Elasticsearch 7.10, which should support the wildcard field type.
But when I changed the mapping to wildcard, I got this error:
elasticsearch.exceptions.RequestError: RequestError(400, 'mapper_parsing_exception', 'No handler for type [wildcard] declared on field [parsed_text]')
I just found out that in order to use this feature, I need to buy an X-Pack subscription.
Is that the cause of the error above?
Is there any way to use this feature for free?
As Christian mentions, the wildcard field would help with accelerating regex matches on character sequences found in big strings.
However, there is another approach to consider that lets you stick with word-based indices and word-sequence queries (phrase/span/interval). The annotated_text field would allow you to store your POS tags as a form of inline synonym, indexed at the same word position as the words they describe.
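As a hedged illustration of the annotated_text idea, the sketch below converts (word, tag) pairs into the [word](annotation) markup that the field expects and builds a span_near query for "a given word immediately followed by a word carrying a given tag". It assumes the mapper-annotated-text plugin is installed, that words contain no square brackets, and that words are already in the form the analyzer produces (for example lowercase). The helper names are hypothetical.

```python
from urllib.parse import quote_plus

# Mapping body for an index whose parsed_text field holds annotated text.
ANNOTATED_MAPPING = {
    "mappings": {"properties": {"parsed_text": {"type": "annotated_text"}}}
}


def to_annotated_text(tagged_words):
    """Turn [(word, tag), ...] into "[word](tag) [word](tag) ..." markup."""
    # Annotation values are URL-encoded inside the parentheses.
    return " ".join(f"[{word}]({quote_plus(tag)})" for word, tag in tagged_words)


def word_then_tag_query(word, tag, field="parsed_text"):
    """span_near query: `word` directly followed by a token annotated `tag`."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: word}},
                    {"span_term": {field: tag}},
                ],
                "slop": 0,
                "in_order": True,
            }
        }
    }


if __name__ == "__main__":
    pairs = [("word1", "tagA"), ("word2", "tagB"), ("word3", "tagA")]
    print(to_annotated_text(pairs))
    print(word_then_tag_query("word2", "tagA"))
```

This sketch does not check that word2 itself carries tagB, and tag names that could also occur as ordinary words may need a distinguishing prefix so they do not clash with the text tokens.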