Regexps with lots of punctuation

goochjs · March 18, 2019, 4:13pm

Hi,

We've got a text field in Elasticsearch into which an escaped JSON document is being written. We're trying to find instances containing a particular pattern of fields, but cannot formulate the regexp required.

The ES field contains this precise string:-

,\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"19029355471\"},

The ID number can change; we want to find documents with similar JSON structures.

Using the Dev Tools console in Kibana, no matter how we create the regexp, we either get a Bad string Syntax error in the browser or an error from the API such as json_parse_exception / Unrecognized character escape ':' (code 58)\n.

Example of a failing regexp:-

,\\\"jobs\\\"\:\{\\\"data\\\"\:\[\{\\\"type\\\"\:\\\"job\\\",\\\"id\\\"\:\\\"19029355471\\\"\},

We've even tried replacing each punctuation character after the comma with a "." (dot). This does not return an error, but nor does it return any results.

This is what the full query looks like in this case:-

GET myindex*/_search
{
  "query": {
    "regexp": {
      "fields.payload": ".*,..jobs......data.......type.....job.....id.....19029355471...,.*"
    }
  }
}

I've done some searching, and some people seem to think that the json_parse_exception problem is with ES's JSON parser rather than Lucene. Other people seem to say that Lucene's regexp language is a little bit specialised.

How can we formulate this search?

Thanks,

J.
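
For reference, the json_parse_exception in the question comes from the JSON layer, not from Lucene: \: is not a valid JSON escape sequence. A minimal sketch of the two escaping layers follows, assuming the reserved-character set from the Lucene regexp syntax and reusing the field name from the query above. The helper name lucene_escape is hypothetical. Correct escaping only removes the parse error; whether the pattern then matches anything depends on how the field is indexed.

```python
import json

# Characters that Lucene's regexp syntax treats as operators
# (the standard set plus the optional ones).
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def lucene_escape(literal):
    """Backslash-escape every reserved character so it matches literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in literal)


# The stored text exactly as it sits in the field, backslashes included.
prefix = r',\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"'
suffix = r'\"},'

# Layer 1: the Lucene regexp. Literal prefix, any run of digits, literal suffix.
pattern = ".*" + lucene_escape(prefix) + "[0-9]+" + lucene_escape(suffix) + ".*"

# Layer 2: JSON. json.dumps applies the second round of escaping,
# so the pattern never has to be escaped by hand.
body = json.dumps({"query": {"regexp": {"fields.payload": pattern}}})
print(body)
```

Note that the colon and comma are not escaped at all, since neither layer treats them as special. The printed body can be pasted after GET myindex*/_search in the console.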

xavierfacq · March 19, 2019, 7:33am

Hi,

And why don't you store your JSON as a document or a nested document? That way, you will be able to search on any field, and it will be super fast.

Bye,
Xavier

goochjs · March 19, 2019, 8:35am

Because the data is already in there and we were hoping it would be quicker to search it than to write something to extract and re-import.

dadoonet · March 19, 2019, 8:52am

Do you mean that it's only a one time search operation?

goochjs · March 19, 2019, 10:34am

Who knows?

It's data logged by an application, which we're trying to decipher in order to investigate a problem. We've found another place where this data has been logged, this time without all of the \s but we're still struggling to get a working regexp.

xavierfacq · March 19, 2019, 10:46am

Ok, and if you try a simple regexp:

".*jobs.*data.*type.*job.*id.*19029355471.*"

Does it work? If not, what is the error or result?

dadoonet · March 19, 2019, 10:49am

What is the mapping for this field?

goochjs · March 19, 2019, 12:35pm

No results...

goochjs · March 19, 2019, 12:38pm

It's "text".

"payload": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
},

Looking at that - I wonder what ignore_above means. The data is definitely over 256 characters and is definitely being stored.

xavierfacq · March 19, 2019, 1:20pm

dadoonet · March 19, 2019, 1:52pm

payload uses the text type, so it is analyzed with the default analyzer, i.e. broken into tokens.
payload.keyword is indexed as a whole string, but only for "small" texts (up to ignore_above characters).

Running a regexp (which is not analyzed) against payload, which is analyzed, is not going to produce what you are expecting.

I'd highly recommend using the right mapping for your use case (it depends on what you want to do) and reindex.

Have a look at the _analyze API to see how your field has been indexed behind the scenes. You'd be able to search within the tokens that have been generated, but not in the whole initial string.
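
The effect of analysis can be illustrated offline. The sketch below is only a rough stand-in for the default analyzer (lowercase, split on non-word characters), not the real tokenizer, and the function name rough_standard_tokens is hypothetical. It shows why a pattern written against the whole string finds nothing: a regexp query has to match an entire indexed term, and each term here is a single token.

```python
import re


def rough_standard_tokens(text):
    """Very rough approximation of the default analyzer:
    lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())


stored = r',\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"19029355471\"},'
tokens = rough_standard_tokens(stored)
print(tokens)

# A regexp query must match a whole term, so each token is tested on its own.
whole_string_pattern = r".*jobs.*data.*type.*job.*id.*19029355471.*"
whole_string_hits = [t for t in tokens if re.fullmatch(whole_string_pattern, t)]
print(whole_string_hits)

# A pattern aimed at a single token can match.
single_term_pattern = r"1902[0-9]+"
single_term_hits = [t for t in tokens if re.fullmatch(single_term_pattern, t)]
print(single_term_hits)
```

The real token list for a given value can be checked with the _analyze API on a short excerpt of the field rather than the whole payload.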

goochjs · March 19, 2019, 2:35pm

@xavierfacq Obviously I read that. It's not entirely clear, however, as the data is being stored, despite what the docs say.

@dadoonet Thanks. I wasn't aware of the analyze API. The data field we've got is huge, however, and it's hard to POST into the API. We'll have to reindex, I think.

All-in-all, I guess @xavierfacq gave the right answer to begin with - we will have to get this data in as a proper JSON document so that it is fully indexed. The next issue we will hit, however, is ES's limited support for JSON arrays.

dadoonet · March 19, 2019, 3:14pm

Stored does not mean indexed. The _source document stores the whole JSON content, whatever then happens to each individual field.

We'll have to reindex, I think.

Yes. That's indeed what @xavierfacq proposed at the beginning.

Not sure this article is very accurate. That's what nested type has been made for.

goochjs · March 19, 2019, 4:26pm

Thanks again.

Re: documentation, that page says (with my emphasis):-

Strings longer than the ignore_above setting will not be indexed or stored

Re: arrays - we've had a problem storing these in ES elsewhere and have had to flatten them into hashes. I guess the nested data type is not being picked automatically by ES.

Anyway, all help much appreciated. As recommended, we'll go back and fix this at source.

dadoonet · March 19, 2019, 4:41pm

Re: documentation, that page says (with my emphasis):-

Strings longer than the ignore_above setting will not be indexed or stored

This is true when you use the store: true option on a given field. It defaults to false.

I guess the nested data type is not being picked automatically by ES.

True. You need to be explicit.
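
A minimal sketch of what "explicit" could look like, written as Python dicts that serialize to request bodies. Everything here is an assumption for illustration: the field path jobs.data is taken from the snippet in the first post, the sub-fields are mapped as keyword, and the mapping uses the typeless format of Elasticsearch 7 and later (6.x would need a type name under mappings).

```python
import json

# Hypothetical mapping: declare jobs.data as nested so that each array
# element is indexed as its own hidden document.
mapping = {
    "mappings": {
        "properties": {
            "jobs": {
                "properties": {
                    "data": {
                        "type": "nested",
                        "properties": {
                            "type": {"type": "keyword"},
                            "id": {"type": "keyword"},
                        },
                    }
                }
            }
        }
    }
}

# Nested query: both conditions must hold within the same array element.
query = {
    "query": {
        "nested": {
            "path": "jobs.data",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"jobs.data.type": "job"}},
                        {"term": {"jobs.data.id": "19029355471"}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

Without the nested type, the array elements are flattened into one document, so a type from one element and an id from another could satisfy the same query.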
