Regexp Query cannot escape reserved symbols


(Hayk Hovhanisyan) #1

My environment:

ElasticSearch "2.4.4"
Java "1.8.0_151"

Hi there, could you please clarify how to escape the reserved symbols . ? + * | { } [ ] ( ) "
I escaped them with a backslash, but the result is still empty.

I stored displayName = AB*CD, and I am trying to search for b*c.

The piece of code responsible for adding the filter:

queryBuilder.filter(boolQuery().must(regexpQuery(columnName, ".*" + regexValue.toLowerCase() + ".*")));

{  
    "from":0,
    "size":10,
    "query":{  
        "bool":{  
            "must":{  
                "match_all":{  

                }
            },
            "filter":[  
                {  
                    "bool":{  
                        "must":{  
                            "regexp":{  
                                "displayName":{  
                                    "value":".*b\\*c.*",
                                    "flags_value":65535
                                }
                            }
                        }
                    }
                },
                {  
                    "terms":{  
                        "deleted":[  
                            "false"
                        ]
                    }
                }
            ]
        }
    },
    "_source":{  
        "includes":[  
            "id",
            "displayName"
        ],
        "excludes":[  

        ]
    },
    "sort":[  
        {  
            "displayName":{  
                "order":"asc"
            }
        }
    ]
}
  1. What's wrong here?
  2. Do we need to use the whitespace analyzer instead of the standard analyzer?
  3. Do we need to change the tokenizer as well?

(Adrien Grand) #2

The standard analyzer splits on * so a regular expression that expects to find a * would never match any documents. Indeed, a whitespace tokenizer might work.
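
A rough way to see this without a cluster is the sketch below. It is only an approximation: `standardLike` and `whitespaceLowercase` are hypothetical stand-ins for the real analyzers (not Elasticsearch APIs), and `java.util.regex` is used in place of the Lucene regexp engine, which is an assumption that holds for this simple pattern. The point is that a regexp query is tested against each indexed term as a whole, so once `*` has been dropped at index time no term contains it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizerSketch {
    // Hypothetical stand-in for the standard analyzer:
    // split on anything that is not a letter or digit, then lowercase.
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Hypothetical stand-in for a whitespace tokenizer plus a lowercase filter.
    static List<String> whitespaceLowercase(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // The pattern must match an entire term, so the anchored String.matches is used.
    static boolean anyTermMatches(List<String> terms, String regex) {
        for (String term : terms) {
            if (term.matches(regex)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String regex = ".*b\\*c.*"; // the pattern .*b\*c.* from the question
        System.out.println(standardLike("AB*CD") + " -> "
                + anyTermMatches(standardLike("AB*CD"), regex));
        System.out.println(whitespaceLowercase("AB*CD") + " -> "
                + anyTermMatches(whitespaceLowercase("AB*CD"), regex));
    }
}
```

Note that the built-in whitespace analyzer does not lowercase terms, so with a lowercased pattern like the one in the question a lowercase filter would probably be needed as well.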

I'd also like to point out that regexp queries are super slow, especially if there are wildcards in the beginning like here, so I would advise to do things differently if possible. For instance, this particular use-case could be solved by indexing 3-grams.
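
To illustrate the 3-gram idea, here is a minimal sketch in plain Java. It only simulates what an n-gram analyzer would produce; `trigrams` and `matches` are hypothetical helper names, not Elasticsearch APIs. The indexed value and the search input are both cut into 3-character grams, and a document is a candidate when every gram of the input was indexed.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class TrigramSketch {
    // All lowercase substrings of length 3, in order of appearance.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Candidate match: every 3-gram of the user input is among the indexed grams.
    // Inputs shorter than 3 characters produce no grams and never match here.
    static boolean matches(String indexedValue, String userInput) {
        Set<String> queryGrams = trigrams(userInput);
        return !queryGrams.isEmpty() && trigrams(indexedValue).containsAll(queryGrams);
    }

    public static void main(String[] args) {
        System.out.println(trigrams("AB*CD"));
        System.out.println(matches("AB*CD", "b*c"));
    }
}
```

Because the grams are looked up as plain terms, no leading-wildcard regexp is needed and `*` requires no escaping. The containment check alone ignores gram order, so for longer inputs it can admit false positives unless positions are also taken into account.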


(Hayk Hovhanisyan) #3

Thanks for your quick answer Adrien.
As I understand it, if we use the default analyzer, which is the standard analyzer, there is no way to search for reserved symbols at all: . ? + * | { } [ ] ( ) " \

Solution 1: Index 3-grams, which means adding an n-gram tokenizer (3-grams in our case) in a custom analyzer.
Solution 2: Use a pattern tokenizer in a new custom analyzer?

One more question:

Does this part of the documentation not apply when we use the standard analyzer?

Allowed characters

Any Unicode characters may be used in the pattern, but certain characters are reserved and must be escaped. The standard reserved characters are:

. ? + * | { } [ ] ( ) " \
If you enable optional features (see below) then these characters may also be reserved:

# @ & < > ~
Any reserved character can be escaped with a backslash "\*" including a literal backslash character: "\\"

Additionally, any characters (except double quotes) are interpreted literally when surrounded by double quotes:

john"@smith.com"

Thanks a lot for your time.


(Adrien Grand) #4

This is correct.

This is generic advice. Obviously it is irrelevant with analyzers that split on those chars. For instance, if your field is mapped as a keyword, it applies.
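
For a field whose terms still contain these characters, the escaping rule from the quoted documentation could be applied with a small helper like the sketch below. `RegexpEscapeSketch`, `escape` and `containsPattern` are hypothetical names, not part of the Elasticsearch client. The optional characters # @ & < > ~ are escaped too, on the assumption that optional features are enabled (the query in the question sends flags_value 65535).

```java
class RegexpEscapeSketch {
    // Standard reserved characters from the quoted documentation,
    // followed by the optional ones: # @ & < > ~
    private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

    // Prefix every reserved character with a backslash.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (RESERVED.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build a "contains" pattern around the escaped user input.
    static String containsPattern(String userInput) {
        return ".*" + escape(userInput) + ".*";
    }

    public static void main(String[] args) {
        System.out.println(containsPattern("b*c"));
    }
}
```

In the code from the first post, the result would be passed in place of the hand-built string, for example `regexpQuery(columnName, RegexpEscapeSketch.containsPattern(regexValue.toLowerCase()))`. Escaping only helps when the indexed terms actually contain the symbol.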


(Hayk Hovhanisyan) #5

Thanks a lot Adrien, I will think about a solution.
If there is any update from my side, I will post it here.

Thanks again for your time and support


(Hayk Hovhanisyan) #6

Hi Adrien, one more question about searching.

When we use the standard analyzer and the field is analyzed, is there no way to search with a space either?
For instance, first name [space] surname (e.g. Hayk Hovhannisyan) does not return any results.

If the user enters, for example, "Ha Hov", then it should find "Hayk Hovhannisyan" as well. The terms need an AND relation.

At the moment I use a regexp query and have an analyzed field for that.
What can you suggest for this?

thanks in advance
regards Hayk Hovhannisyan


(Hayk Hovhanisyan) #7

What do you think about this?

One idea is to use a fuzzy match query instead of the regexp query and play with the fuzziness and operator properties.
regards Hayk Hovhannisyan


(Adrien Grand) #8

The usual way that this would be done would be to use an edge-ngram filter in the index analyzer (but not in the search analyzer) and then use regular match queries for searching.
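
A minimal simulation of that setup in plain Java, assuming whitespace tokenization and lowercasing on both sides (the class and method names are hypothetical, and the real analysis chain is configured in the index settings, not in application code): at index time every token is expanded into its prefixes, while at search time the input is only tokenized and all tokens must be present.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class EdgeNgramSketch {
    // Index time: lowercase, split on whitespace, then emit every prefix
    // of each token with a length between minGram and maxGram.
    static Set<String> indexTerms(String text, int minGram, int maxGram) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            int limit = Math.min(maxGram, token.length());
            for (int len = minGram; len <= limit; len++) {
                terms.add(token.substring(0, len));
            }
        }
        return terms;
    }

    // Search time: no n-grams, only tokens; all of them must be indexed (AND).
    static boolean matchAnd(Set<String> indexed, String query) {
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || !indexed.contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Hayk Hovhannisyan", 1, 20);
        System.out.println(matchAnd(indexed, "Ha Hov"));
    }
}
```

With the example from the question, "Ha Hov" becomes the tokens "ha" and "hov", both of which are prefixes stored for "Hayk Hovhannisyan", so the document matches with an AND relation and without any regexp.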


(Hayk Hovhanisyan) #9

Hi Adrien, thanks for the quick response.

Did you mean this?
Index Time Search As You Type


(Adrien Grand) #10

Yes, this is a good example.

