Regexp Query cannot escape reserved symbols

Hayk_Hovhanisyan · November 21, 2017, 7:46am

My environment:

ElasticSearch "2.4.4"
Java "1.8.0_151"

Hi there, could you please clarify how escaping works for the reserved symbols . ? + * | { } [ ] ( ) "
I escaped them with a backslash, but the result is still empty.

I stored displayName=AB*CD, and I am trying to search for b*c.

The piece of code responsible for adding the filter:

queryBuilder.filter(boolQuery().must(regexpQuery(columnName, ".*" + regexValue.toLowerCase() + ".*")));

{  
    "from":0,
    "size":10,
    "query":{  
        "bool":{  
            "must":{  
                "match_all":{  

                }
            },
            "filter":[  
                {  
                    "bool":{  
                        "must":{  
                            "regexp":{  
                                "displayName":{  
                                    "value":".*b\\*c.*",
                                    "flags_value":65535
                                }
                            }
                        }
                    }
                },
                {  
                    "terms":{  
                        "deleted":[  
                            "false"
                        ]
                    }
                }
            ]
        }
    },
    "_source":{  
        "includes":[  
            "id",
            "displayName"
        ],
        "excludes":[  

        ]
    },
    "sort":[  
        {  
            "displayName":{  
                "order":"asc"
            }
        }
    ]
}

What's wrong here?
Do we need to use the whitespace analyzer instead of the standard analyzer?
Do we need to change the tokenizer as well?

jpountz · November 21, 2017, 1:37pm

The standard analyzer splits on * so a regular expression that expects to find a * would never match any documents. Indeed, a whitespace tokenizer might work.
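
To make the effect concrete, here is a rough sketch in plain Java. It is not Elasticsearch or Lucene code: `StandardAnalyzerSketch` is a hypothetical class that approximates the standard analyzer by lowercasing and splitting on anything that is not a letter or digit, and it uses `java.util.regex` as a stand-in for the Lucene regexp syntax. The point it shows is that the pattern is tested against each indexed term as a whole, and no term contains a `*` any more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Assumption: a simplified stand-in for the standard analyzer. The real one
// follows Unicode text segmentation rules; this only approximates it.
final class StandardAnalyzerSketch {

    // Lowercase the text and split it on every run of non-letter, non-digit characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // A regexp query has to match an entire indexed term, not the original field value.
    static boolean anyTermMatches(List<String> terms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> analyzed = analyze("AB*CD");
        System.out.println(analyzed);                                // [ab, cd]
        System.out.println(anyTermMatches(analyzed, ".*b\\*c.*"));   // false

        // If the whole value were kept as a single term, the same pattern would match.
        List<String> singleTerm = Arrays.asList("ab*cd");
        System.out.println(anyTermMatches(singleTerm, ".*b\\*c.*")); // true
    }
}
```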

I'd also like to point out that regexp queries are super slow, especially if there are wildcards at the beginning like here, so I would advise doing things differently if possible. For instance, this particular use-case could be solved by indexing 3-grams.
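
A minimal sketch of the 3-gram idea, again in plain Java rather than real analyzer code. `TrigramSketch` and its methods are hypothetical names; in Elasticsearch the grams would come from an n-gram tokenizer or filter in the mapping. The stored value is cut into overlapping 3-character terms, so a search for `b*c` becomes an exact term lookup and no escaping or leading wildcard is involved.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Assumption: illustrates n-gram indexing only; names are made up for this example.
final class TrigramSketch {

    // Emit every substring of length n, lowercased, in order of appearance.
    static Set<String> ngrams(String text, int n) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            grams.add(lower.substring(i, i + n));
        }
        return grams;
    }

    // True when every n-gram of the search text is among the indexed n-grams.
    static boolean matches(String stored, String search, int n) {
        Set<String> queryGrams = ngrams(search, n);
        return !queryGrams.isEmpty() && ngrams(stored, n).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        System.out.println(ngrams("AB*CD", 3));          // [ab*, b*c, *cd]
        System.out.println(matches("AB*CD", "b*c", 3));  // true
        System.out.println(matches("AB*CD", "b*x", 3));  // false
    }
}
```

Two caveats with this sketch: a search string shorter than 3 characters produces no grams at all, and for longer inputs "all grams present" is only a necessary condition for a substring match, so it can produce false positives.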

Hayk_Hovhanisyan · November 22, 2017, 2:28pm

Thanks for your quick answer, Adrien.
As I understand it, if we use the default analyzer, which is the standard analyzer, there is no way to search for reserved symbols at all: . ? + * | { } [ ] ( ) " \

Solution 1: Index 3-grams, which means adding an n-gram tokenizer (3-grams in our case) and so on.
Solution 2: Use a pattern tokenizer with a new custom analyzer?

One more question:

Does this part of the documentation not apply when we use the standard analyzer?

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \

If you enable optional features (see below) then these characters may also be reserved:

# @ & < > ~

Any reserved character can be escaped with a backslash "\*" including a literal backslash character: "\\"

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"

Thanks a lot for your time.

jpountz · November 22, 2017, 3:33pm

This is correct.

This is generic advice. Obviously it is irrelevant with analyzers that split on those characters. For instance, if your field is mapped as a keyword, it applies.
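
For a field where the escaping rule does apply, the backslash escaping quoted above could be done with a small helper before the pattern is built. This is only a sketch: `RegexpEscaper` is a hypothetical class, not part of the Elasticsearch client. Since the query earlier in the thread is sent with all optional flags on, it escapes the optional reserved characters as well.

```java
import java.util.Locale;

// Assumption: a hand-written helper for illustration, not an Elasticsearch API.
final class RegexpEscaper {

    // Standard reserved characters from the documentation: . ? + * | { } [ ] ( ) " \
    private static final String STANDARD_RESERVED = ".?+*|{}[]()\"\\";
    // Reserved only when optional features are enabled: # @ & < > ~
    private static final String OPTIONAL_RESERVED = "#@&<>~";

    // Prefix every reserved character with a backslash so it is matched literally.
    static String escape(String literal) {
        StringBuilder escaped = new StringBuilder(literal.length() * 2);
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (STANDARD_RESERVED.indexOf(c) >= 0 || OPTIONAL_RESERVED.indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        String regexValue = "B*C"; // user input, as in the original post
        String pattern = ".*" + escape(regexValue.toLowerCase(Locale.ROOT)) + ".*";
        System.out.println(pattern); // .*b\*c.*
    }
}
```

The resulting string would then be passed to `regexpQuery(columnName, pattern)` as in the original snippet. It still only helps when the indexed terms actually contain the escaped characters.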

Hayk_Hovhanisyan · November 22, 2017, 4:17pm

Thanks a lot Adrien, I will think about the solution.
If there is any update from my side, I will post it here.

Thanks again for your time and support.

Hayk_Hovhanisyan · December 20, 2017, 7:00am

Hi Adrien, one more question about searching.

When we use the standard analyzer and the field is analyzed, is there also no way to search with a space character?
For instance, first name [space] surname (e.g. Hayk Hovhannisyan) does not return any results.

If the user enters, for example, "Ha Hov", then it should find "Hayk Hovhannisyan" as well, and the terms need to be combined with AND.

At the moment I use a regexp query and have an analyzed field for that.
What can you suggest for this?

Thanks in advance.
Regards, Hayk Hovhannisyan

Hayk_Hovhanisyan · December 21, 2017, 12:41pm

What do you think about this?

One idea is to use a fuzzy match query instead of the regexp query and play with the fuzziness and operator properties.
Regards, Hayk Hovhannisyan

jpountz · December 21, 2017, 2:05pm

The usual way to do this is to use an edge-ngram filter in the index analyzer (but not in the search analyzer) and then use regular match queries for searching.
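
Roughly, the mechanics look like the sketch below. It is plain Java with hypothetical names (`EdgeNgramSketch`, `indexTerms`, `matchesAll`), not actual analyzer code, and it assumes whitespace tokenization plus lowercasing on both sides. At index time every prefix of each token is stored as a term. At search time the query tokens are left as they are and each one must be found among the indexed terms, which corresponds to a match query with operator AND.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Assumption: a simplified model of edge n-grams at index time and a
// match query with operator AND at search time.
final class EdgeNgramSketch {

    // Index side: for each token, store its prefixes of length minGram..maxGram.
    static Set<String> indexTerms(String text, int minGram, int maxGram) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
                terms.add(token.substring(0, len));
            }
        }
        return terms;
    }

    // Search side: no n-grams, only lowercased tokens, and all of them are required.
    static boolean matchesAll(Set<String> indexed, String query) {
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!indexed.contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Hayk Hovhannisyan", 1, 20);
        System.out.println(matchesAll(indexed, "Ha Hov"));   // true
        System.out.println(matchesAll(indexed, "Ha Smith")); // false
    }
}
```

Keeping the n-grams out of the search analyzer matters here: if the query were also split into prefixes, "Ha Smith" would produce "h" and "s" and start matching far too much.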

Hayk_Hovhanisyan · December 22, 2017, 8:56am

Hi Adrien, thanks for the quick response.

Did you mean this?
Index Time Search As You Type

jpountz · December 22, 2017, 9:50am

Yes, this is a good example.

system · January 19, 2018, 1:27pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
