Regexps with lots of punctuation

Hi,

We have a text field in Elasticsearch that holds an escaped JSON document. We're trying to find documents that contain a particular pattern of fields, but cannot work out the regexp required.

The ES field contains this precise string:-

,\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"19029355471\"},

The ID number can change; we want to find documents with similar JSON structures.

Using the Dev Tools console in Kibana, no matter how we create the regexp, we either get a Bad string Syntax error in the browser or an error from the API such as json_parse_exception / Unrecognized character escape ':' (code 58)\n.

Example of a failing regexp:-

,\\\"jobs\\\"\:\{\\\"data\\\"\:\[\{\\\"type\\\"\:\\\"job\\\",\\\"id\\\"\:\\\"19029355471\\\"\},

We've even tried replacing each punctuation character after the comma with a single "." wildcard. This does not return an error, but it does not return any results either.

This is what the full query looks like in this case:-

GET myindex*/_search
{
  "query": {
    "regexp": {
      "fields.payload": ".*,..jobs......data.......type.....job.....id.....19029355471...,.*"
    }
  }
}

I've done some searching, and some people seem to think that the json_parse_exception problem is with ES's JSON parser rather than Lucene. Other people seem to say that Lucene's regexp language is a little bit specialised.
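Both observations can hold at once: the request body is parsed as JSON before Lucene ever sees the pattern, and JSON only allows a fixed set of backslash escapes (\", \\, \/, \b, \f, \n, \r, \t, \uXXXX), so a sequence like \: is rejected at that first stage. A minimal Python sketch of the two layers of escaping follows. The helper names lucene_escape and is_valid_json are hypothetical, and the reserved-character list is an assumption based on the operators described in the Elasticsearch regexp syntax documentation, not an official API.

```python
import json

# Characters treated as operators by Lucene's regexp syntax (assumed list,
# including the optional operators # @ & < > ~).
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def lucene_escape(text: str) -> str:
    """Backslash-escape every reserved character so it matches literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in text)


def is_valid_json(body: str) -> bool:
    """Return True if the body parses as JSON, False otherwise."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False


# Layer 1: JSON rejects "\:" because it is not a legal JSON escape.
bad_body = r'{"query": {"regexp": {"fields.payload": "jobs\:"}}}'
print(is_valid_json(bad_body))  # False

# Layer 2: escape for Lucene first, then let json.dumps add the JSON escaping.
literal_prefix = r',\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"'
literal_suffix = r'\"},'
pattern = (
    ".*" + lucene_escape(literal_prefix) + "[0-9]+"
    + lucene_escape(literal_suffix) + ".*"
)
good_body = json.dumps({"query": {"regexp": {"fields.payload": pattern}}})
print(good_body)
```

Escaping in this order only removes the parse error. Whether the pattern then finds anything still depends on how the field is indexed.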

How can we formulate this search?

Thanks,

J.

Hi,

Why don't you store your JSON as a document or a nested document? That way you can search on any field, and it will be very fast.

Bye,
Xavier

Because the data is already in there and we were hoping it would be quicker to search it than to write something to extract and re-import.

Do you mean that it's only a one time search operation?

Who knows? :)

It's data logged by an application, which we're trying to decipher in order to investigate a problem. We've found another place where this data has been logged, this time without all of the backslashes, but we're still struggling to get a working regexp.

Ok, and if you try a simple regexp:

".*jobs.*data.*type.*job.*id.*19029355471.*"

Does it work? If not, what is the error or result?

What is the mapping for this field?

No results...

It's "text".

"payload": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
},

Looking at that - I wonder what ignore_above means. The data is definitely over 256 characters and is definitely being stored.

payload uses the text type, so it is analyzed with the default analyzer, i.e. broken into tokens.
payload.keyword is indexed as a whole string, but only for "small" texts.

Running a regexp (which is not analyzed) against payload (which is analyzed) is not going to produce what you are expecting.

I'd highly recommend using the right mapping for your use case (it depends on what you want to do) and reindex.

Have a look at the _analyze API to see how your field has been indexed behind the scenes. You can search within the tokens that were generated, but not within the whole original string.
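To illustrate why the whole-string patterns return nothing on the analyzed field: a regexp query is compared with each indexed term on its own, and the pattern has to match the entire term. The Python sketch below is only an approximation. rough_tokens is a hypothetical stand-in for the standard analyzer (lowercase, split on anything that is not a letter or digit), and Python's re module stands in for Lucene's regexp engine. The _analyze API remains the authoritative way to see the real tokens.

```python
import re
from typing import List


def rough_tokens(text: str) -> List[str]:
    """Crude approximation of the default analyzer: lowercase, then keep
    runs of ASCII letters and digits as separate tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_terms(pattern: str, terms: List[str]) -> List[str]:
    """Return the terms that the pattern matches in full, mimicking the
    whole-term anchoring of a regexp query."""
    return [term for term in terms if re.fullmatch(pattern, term)]


payload = r',\"jobs\":{\"data\":[{\"type\":\"job\",\"id\":\"19029355471\"},'
tokens = rough_tokens(payload)
print(tokens)

# The whole-string pattern cannot match any single token.
print(matching_terms(r".*jobs.*data.*type.*job.*id.*19029355471.*", tokens))

# A pattern aimed at one token does match.
print(matching_terms(r"1902[0-9]+", tokens))
```

Under these assumptions the first pattern matches no term at all, while the second matches the ID token, which is consistent with the "no results" seen earlier in the thread.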

@xavierfacq Obviously I read that. It's not entirely clear, however, as the data is being stored, despite what the docs say.

@dadoonet Thanks. I wasn't aware of the analyze API. The data field we've got is huge, however, and it's hard to POST into the API. We'll have to reindex, I think.

All in all, I guess @xavierfacq gave the right answer to begin with: we will have to get this data in as a proper JSON document so that it is fully indexed. The next issue we will hit, however, is ES's limited support for JSON arrays.

Stored does not mean indexed. The _source document keeps the whole JSON content, regardless of what happens to each individual field.

We'll have to reindex, I think.

Yes. That's indeed what @xavierfacq proposed at the beginning.

I'm not sure this article is very accurate. That's what the nested type was made for.

Thanks again.

Re: documentation, that page says (with my emphasis):-

Strings longer than the ignore_above setting will not be indexed or stored

Re: arrays - we've had a problem storing these in ES elsewhere and have had to flatten them into hashes. I guess the nested data type is not being picked automatically by ES.

Anyway, all help much appreciated. As recommended, we'll go back and fix this at source.

Re: documentation, that page says (with my emphasis):-

Strings longer than the ignore_above setting will not be indexed or stored

This is true, but "stored" there refers to the store: true option on a given field, which defaults to false.

I guess the nested data type is not being picked automatically by ES.

True. You need to be explicit.
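Being explicit might look something like the Python sketch below, which builds a mapping and a query as plain dicts ready to be sent as JSON. The field names (jobs.data with type and id) are assumptions taken from the payload fragment quoted in the first post, the keyword sub-types are a guess, nested_job_query is a hypothetical helper, and the mapping layout assumes typeless mappings (Elasticsearch 7 or later).

```python
import json

# Assumed mapping: "nested" must be declared explicitly, otherwise an array
# of objects is indexed as ordinary object fields.
mapping = {
    "mappings": {
        "properties": {
            "jobs": {
                "properties": {
                    "data": {
                        "type": "nested",
                        "properties": {
                            "type": {"type": "keyword"},
                            "id": {"type": "keyword"},
                        },
                    }
                }
            }
        }
    }
}


def nested_job_query(job_id: str) -> dict:
    """Build a nested query that requires type and id to match within the
    same element of the jobs.data array."""
    return {
        "query": {
            "nested": {
                "path": "jobs.data",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"jobs.data.type": "job"}},
                            {"term": {"jobs.data.id": job_id}},
                        ]
                    }
                },
            }
        }
    }


print(json.dumps(mapping, indent=2))
print(json.dumps(nested_job_query("19029355471"), indent=2))
```

With the payload indexed this way, the ID lookup becomes an exact term match rather than a regexp over an escaped string.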

