Searching ES regex with space/colon/hyphen etc.

Hi,

I am trying to find a way to search our ES cluster for a substring contained within a doc field's string, where the substring may contain a space, a colon, a hyphen, etc.
I think it is best to demonstrate with an example, so below is a scenario showing what I'm trying to accomplish, specifically with a space. I assume that once I have a solution for that, I will be able to apply it to strings with a colon, hyphen, etc.

So for example let's say I have 2 documents:

  • {"_id": 1, "_source": {"email.subject": "one two three four"}}
  • {"_id": 2, "_source": {"email.subject": "two one three four"}}

I would like to search for the substring "wo thre" so that it matches only the first document (i.e. with the regex ".*wo thre.*"). The second document should not match, of course.

Please help me understand:

  1. Am I doing something wrong here?
  2. How could I accomplish such substring search?
  3. What is the best way to set up the doc structure (using some analyzers or something?) so that it allows this substring search?

Thanks!

Here's the sample scenario I mentioned earlier, which demonstrates what I'm failing to achieve:

### Insert first doc
$ curl -X PUT "https://es-host/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d'
> {
>   "email.subject": "one two three four"
> }
> '
{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

### Insert second doc
$ curl -X PUT "https://es-host/customer/_doc/2?pretty" -H 'Content-Type: application/json' -d'
> {
>   "email.subject": "two one three four"
> }
> '
{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

### Search with regex with space: ".*wo thre.*" - but no hits
$ curl -X POST "https://es-host/customer/_search?pretty" -H 'Content-Type: application/json' -d'
> {
>   "query": {
>     "regexp": {
>       "email.subject": {
>         "value": ".*wo thre.*"
>       }
>     } 
>   }
> }'
{
  "took" : 17,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

### Search with regex without space: "thre.*" - 2 hits
$ curl -X POST "https://es-host/customer/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "regexp": {
      "email.subject": {
        "value": "thre.*"
      }
    } 
  }
}'
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "customer",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "email.subject" : "two one three four"
        }
      },
      {
        "_index" : "customer",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "email.subject" : "one two three four"
        }
      }
    ]
  }
}

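It looks like you are using the default mappings, so try searching the ‘email.subject.keyword’ field, which is mapped as keyword and not analysed. The ‘email.subject’ text field is analysed into separate terms, and a regexp query is matched against each term on its own, so a pattern containing a space can never match there. Be aware that searching using regexp can be slow and may not scale well.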
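
To illustrate why the three searches behave differently, here is a rough Python model of the behaviour, not Elasticsearch itself. It assumes the analysed field can be approximated by lowercasing and splitting on whitespace (the real standard analyzer does more), and it uses `re.fullmatch` as a stand-in for Lucene's anchored regexp matching. The helper names `analyzed_terms` and `regexp_hits` are made up for this sketch.

```python
import re

# The two sample documents from the question, keyed by _id.
docs = {
    1: "one two three four",
    2: "two one three four",
}

def analyzed_terms(value):
    # Simplified stand-in for the analysed text field:
    # lowercase the value and split it into terms on whitespace.
    return value.lower().split()

def regexp_hits(pattern, use_keyword):
    # A regexp query is anchored, so a term only matches if the
    # pattern covers the whole term (hence fullmatch, not search).
    compiled = re.compile(pattern)
    hits = []
    for doc_id, subject in docs.items():
        # The keyword sub-field holds the whole string as a single term;
        # the analysed field holds one term per word.
        terms = [subject] if use_keyword else analyzed_terms(subject)
        if any(compiled.fullmatch(term) for term in terms):
            hits.append(doc_id)
    return hits

print(regexp_hits(".*wo thre.*", use_keyword=False))  # [] - no single term contains a space
print(regexp_hits("thre.*", use_keyword=False))       # [1, 2] - the term "three" matches in both docs
print(regexp_hits(".*wo thre.*", use_keyword=True))   # [1] - whole string is one term
```

In the real cluster, the corresponding change is to point the same regexp query at "email.subject.keyword" instead of "email.subject".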

Yes that worked. Thanks @Christian_Dahlqvist!

Can you think of another way (that might be faster and more scalable) to allow searching for a substring within the email.subject field without regex? I am open to suggestions.
