Help: Elasticsearch Regexp query

Hello All,
I am new to Elasticsearch and recently started using ES 6.x, which is integrated with a third-party application that stores all of its data in ES indices. I would like to read documents from ES that contain specific key=value pairs, so I have started exploring ES search queries. I began with an exact-match search query:

{"query":{"bool":{"must":[{"match_phrase":{"input":{"query":"key=values"}}}],"filter":[{"range":{"startTime":{"gte":"now-10d","lt":"now"}}}]}}}

The query above returns the correct documents, i.e. those matching "key=value". The field 'input' is of type 'text', which is controlled by the third-party app. Here is a sample of what the "input" field looks like in ES output:

"input":"{key1=76f435fe-ac81-49aa-8050-8c647922e51d, key2={key3=1234, key4=AB}}"

Now I would like to write a generic query that matches a specific regex pattern for the values and returns all matching documents. I tried the query below, but it does not return any results:

{"query":{"regexp":{"input":"key4=[A-Z]{2}"}}}

Could anyone please advise where I am going wrong with the regexp query?

Thanks, Ketan

Hi Ketan.
Two possibilities -

  1. Case sensitivity - try searching for lowercase, e.g. key4=[a-z]{2}
  2. There is no single "word" in the index of that form.

Fields of the type text are typically chopped into word tokens or "terms" and normalized. The exact behaviour depends on your choice of "Analyzer" but typically lower-casing and removing all punctuation are things an Analyzer will do. A regexp query is matched against each individual term, not against the original string. In your case a document with key4=AB may well have been tokenized into terms key4 and ab, neither of which matches the regexp key4=[A-Z]{2}.
If the field was mapped as a keyword type then the matching process would be much less mysterious - the doc content would be a single term in the index.
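To make the point above concrete, here is a minimal Python sketch. It assumes an analyzer that lowercases and splits on anything that is not a letter or digit, which is only a rough stand-in for what a real analyzer does (the actual tokens depend on the analyzer configured for the field). Python's `re.fullmatch` is used here to imitate the fact that a regexp query must match a whole term; the helper names are made up for illustration.

```python
import re


def rough_analyze(text):
    """Rough stand-in for a text analyzer (an assumption, not the real thing):
    lowercase the input and split on any run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def regexp_matches_any_term(pattern, terms):
    """A regexp query is tested against each whole term, so use fullmatch."""
    return any(re.fullmatch(pattern, term) for term in terms)


doc = "{key1=76f435fe-ac81-49aa-8050-8c647922e51d, key2={key3=1234, key4=AB}}"
terms = rough_analyze(doc)
print(terms)

# No single term contains "=", so neither variant of the original pattern matches.
print(regexp_matches_any_term(r"key4=[A-Z]{2}", terms))
print(regexp_matches_any_term(r"key4=[a-z]{2}", terms))

# The value on its own does match one term, but only in lowercase form.
print(regexp_matches_any_term(r"[a-z]{2}", terms))
```

Under these assumptions the key and the value end up as separate terms, which is why lowercasing the pattern alone does not help.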

Thanks @Mark_Harwood for your inputs.

Option 1: I have already tried this, but it didn't work.
Option 2: yes, that is a possibility, but since the third-party application controls the mapping in Elasticsearch, I cannot easily modify the field mapping, and the field "key4" is of type text as shown in the ES mapping results. Let me know your thoughts.

Thanks, Ketan.

Let’s assume option 2 - my guess is your search string is two terms in the index - key4 and ab.
If so, it’s not enough to search for documents with a key4 AND regex [a-z]{2}. This might match a document where key3=ab,key4=99. The association between keys and values is jumbled up. To get correct matching you need to search for the term key4 NEAR the regex.
This can be done using span queries. You would need to wrap a span term and regexp query inside a span_near query.
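The difference between "AND" and "NEAR" can be simulated on plain token lists. The sketch below is only an illustration under assumptions: `rough_analyze` is a hypothetical stand-in for the real analyzer (lowercase, split on non-alphanumerics), and `near_match` imitates a span_near with slop 0 and in_order true by requiring the value term to come directly after the key term.

```python
import re


def rough_analyze(text):
    """Hypothetical analyzer stand-in: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def and_match(terms, key, value_pattern):
    """AND semantics: the key term and a regex-matching term anywhere in the doc."""
    return key in terms and any(re.fullmatch(value_pattern, t) for t in terms)


def near_match(terms, key, value_pattern):
    """NEAR semantics (slop 0, in order): the regex-matching term must
    immediately follow the key term."""
    return any(
        first == key and re.fullmatch(value_pattern, second) is not None
        for first, second in zip(terms, terms[1:])
    )


wanted = rough_analyze("{key3=1234, key4=AB}")
unwanted = rough_analyze("{key3=ab, key4=99}")

for name, terms in (("wanted", wanted), ("unwanted", unwanted)):
    print(name, terms,
          "AND:", and_match(terms, "key4", r"[a-z]{2}"),
          "NEAR:", near_match(terms, "key4", r"[a-z]{2}"))
```

In this simulation the AND check accepts both documents, while the NEAR check accepts only the one where the two-letter value belongs to key4.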

Hi @Mark_Harwood, thanks for your inputs and appreciate your help.

I have tried a span_near query with a span_multi and regexp query, but it did not return any results ("total": 0 documents). Details are below.

Elasticsearch source indexed data:

[
  {
    "_index": "test",
    "_id": "44a7b79d-9de6-4592-b0b8-7adae762b51e",
    "_score": 1,
    "_source": {
      "input": "{key1=8d1ad005-19ce-4a4e-bdff-ff18140fcbcb, key2={key3=1.2.528.1.1001.100.2.10477.1673.111189553.20200727204531629, key4=DX}}",
      "output": "{result=success}"
    }
  },
  {
    "_index": "test",
    "_id": "55a7b79d-9de6-4592-b0b8-7adae762b51e",
    "_score": 1,
    "_source": {
      "input": "{key1=5a1ad005-19ce-4a4e-bdff-ff18140fcbcb, key2={key3=1.2.528.1.1001.100.2.10477.1673.111189553.20200727204531629, key4=CR}}",
      "output": "{result=success}"
    }
  }
]

Here, the "input" field is mapped as type 'text', and 'key4' can have any two-character value matching [A-Z].

Here is the span_near query I used, which returns 0 documents:

{"query":{"span_near":{"clauses":[{"span_multi":{"match":{"regexp":{"input":"key4=[A-Z]{2}"}}}}],"slop":0,"in_order":true}}}   

Just to check the documents, I tried the regexp directly on 'key4', and this returns the correct number of documents, but this is not what I want. Based on your inputs, you are right that Elasticsearch internally creates tokens using analyzers, which breaks the "input" string into multiple tokens such as "key4", "=", "CR", etc. But then how can someone run a specific search query given this behavior, especially when I have no control over index creation and mapping types because they are driven by a third-party application?

I am not sure what I am missing in the span_near query above; could you please advise? I also tried a span_term query inside span_near, but that didn't work either.

Thanks, Ketan.

Please re-read my last comment.
If you use the _analyze API you can see how your example documents will be tokenised into multiple terms held in the index. Your span_near query will likely need 2 clauses - one for the term “key4” (we hope the analyzer kept the number part of words) and one for the [a-z]{2} regexp wrapped in a span_multi.
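As a rough sketch of that check, the Python below builds a request to the index's `_analyze` endpoint using the field's own analyzer. The index name "test" and field "input" come from the example above; the host `http://localhost:9200` and the helper name `build_analyze_request` are assumptions for illustration. The request is only built here, not sent.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, field, text):
    """Build (but do not send) a POST to <index>/_analyze that asks
    Elasticsearch to tokenize `text` the way it would for `field`."""
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Host is an assumption; adjust to wherever the cluster is reachable.
req = build_analyze_request(
    "http://localhost:9200", "test", "input", "{key3=1234, key4=DX}"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# Against a live cluster, the response lists each token with its position:
# with urllib.request.urlopen(req) as resp:
#     for tok in json.load(resp)["tokens"]:
#         print(tok["position"], tok["token"])
```

Tokens at consecutive positions are the ones a span_near with slop 0 can join, so the output shows directly whether "key4" and its value sit next to each other.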

Hi @Mark_Harwood,
Thanks for all your inputs. After I added the 2nd clause to the span_near query, it worked!

Query:
{"query":{"span_near":{"clauses":[{"span_term":{"input":"key4"}},{"span_multi":{"match":{"regexp":{"input":"[a-z]{2}"}}}}],"slop":0,"in_order":true}}}
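For anyone who needs the same query for other keys, here is a small Python sketch that builds the same JSON body. The function name `span_key_value_query` is made up for illustration. It assumes the key and its value are indexed as adjacent lowercase terms; span_term and regexp are not analyzed, so both arguments must be given in their indexed (lowercase) form.

```python
import json


def span_key_value_query(field, key, value_regex):
    """Build a span_near query: the exact term `key` immediately followed
    (slop 0, in order) by a term matching `value_regex`."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: key}},
                    {"span_multi": {"match": {"regexp": {field: value_regex}}}},
                ],
                "slop": 0,
                "in_order": True,
            }
        }
    }


query = span_key_value_query("input", "key4", "[a-z]{2}")
print(json.dumps(query, separators=(",", ":")))
```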

Appreciate your help!

