Documents with Regex fields

krish3 · August 20, 2019, 2:21am

I have documents whose fields are themselves regular expressions.
For example:
doc1: regexp: 1111.*01011
doc2: regexp: 111.*01011

So a query with regexp:1111010111101011 should return doc1 and doc2, while a query with regexp:111011101011 should return only doc2. Is this type of query possible with Elastic? If not, is there an alternative way to achieve this with Elastic?

Thanks

abdon · August 20, 2019, 11:21am

Yes, this is possible! You can use the percolator for that (one of my favorite Elasticsearch features!). The percolator allows you to index queries, and then later ask Elasticsearch if a given document matches those indexed queries.

To use the percolator, you first need to define a field of type percolator in the index's mapping. Here I'm defining a field my_query of that type, as well as a field my_expression that the regular expressions are matched against:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_expression": {
        "type": "keyword"
      },
      "my_query": {
        "type": "percolator"
      }
    }
  }
}

Now, you can index your regular expressions. Here I do so in the form of regexp queries that I index into the my_query field:

PUT /my_index/_doc/1
{
  "my_query": {
    "regexp": {
      "my_expression": "1111.*01011"
    }
  }
}

PUT /my_index/_doc/2
{
  "my_query": {
    "regexp": {
      "my_expression": "111.*01011"
    }
  }
}

Finally, you can test a given string using the percolate query. For the string below, only document 2 should come back:

GET /my_index/_search
{
  "query": {
    "percolate": {
      "field": "my_query",
      "document": {
        "my_expression": "111011101011"
      }
    }
  }
}
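
For a quick sanity check outside Elasticsearch, the expected hits for both strings from the original question can be reproduced with a small Python sketch. It assumes that Python's re.fullmatch is a fair stand-in here: the regexp query has to match the whole field value, and these expressions only use literal digits and .*, which mean the same thing in both syntaxes. The names STORED and percolate are made up for this illustration and are not part of any Elasticsearch API.

```python
import re

# The two regular expressions indexed above, keyed by document id.
STORED = {
    "1": "1111.*01011",
    "2": "111.*01011",
}

def percolate(value, stored=STORED):
    """Return the ids of the stored expressions that match the whole value."""
    # fullmatch mimics the anchored matching of the regexp query (assumption:
    # the patterns only use syntax that Lucene and Python agree on).
    return sorted(doc_id for doc_id, pattern in stored.items()
                  if re.fullmatch(pattern, value))

print(percolate("1111010111101011"))  # ['1', '2']
print(percolate("111011101011"))      # ['2']
```

More elaborate expressions may behave differently, since Lucene's regular expression syntax is not the same as Python's, so the real percolate query remains the reference.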

krish3 · August 20, 2019, 3:51pm

Abdon, thanks for the reply. I think this should solve my use case. What performance should we expect with, say, a million documents, each having a percolator field?

Krish

abdon · August 21, 2019, 6:50am

Good question. The percolator doesn't scale quite the same way as the other queries in Elasticsearch. The response time will basically be linear in the number of stored percolator queries (although there are some optimizations, as detailed in the documentation).

The percolator is one of the few cases where it may be better to have more shards. That's because each shard holds a subset of the stored percolator queries, so with multiple shards a document can be percolated against those subsets in parallel.
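
The reasoning can be pictured with a toy model in Python. This is only a sketch of the idea, not how Elasticsearch is implemented: it ignores the optimizations mentioned above, distributes the stored expressions round-robin instead of by real document routing, and the function names are invented for the illustration.

```python
import re

def split_into_shards(patterns, num_shards):
    """Distribute stored patterns round-robin (a simplification of real routing)."""
    shards = [[] for _ in range(num_shards)]
    for i, pattern in enumerate(patterns):
        shards[i % num_shards].append(pattern)
    return shards

def percolate_shard(shard, value):
    """Worst case: a shard checks every pattern it holds, one after another."""
    return [p for p in shard if re.fullmatch(p, value)]

patterns = ["1111.*01011", "111.*01011", "000.*", "1.*0"]
for num_shards in (1, 2, 4):
    shards = split_into_shards(patterns, num_shards)
    hits = sorted(h for shard in shards
                  for h in percolate_shard(shard, "111011101011"))
    # Print: shard count, patterns held by the busiest shard, merged hits.
    print(num_shards, max(len(s) for s in shards), hits)
```

The merged hits stay the same for every shard count, while the number of expressions the busiest shard has to check shrinks as shards are added. In a real cluster that only helps as long as there are enough nodes and CPU to work on the shards at the same time.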

You probably need to do some testing with different numbers of documents and shards to see what an optimum for your cluster would be.
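
As a starting point for such a test, here is a small Python sketch that prints one index-creation request per shard count, reusing the mapping from earlier in this thread. The number of primary shards is fixed when an index is created, so each variant gets its own index. The index names (my_index_1_shards and so on) and the shard counts are just placeholders I picked, and the sketch only prints the requests rather than sending them.

```python
import json

def index_body(num_shards):
    """Body for creating the index with a given number of primary shards."""
    return {
        "settings": {"number_of_shards": num_shards},
        "mappings": {
            "properties": {
                "my_expression": {"type": "keyword"},
                "my_query": {"type": "percolator"},
            }
        },
    }

# Hypothetical index names, one per shard count under test.
for num_shards in (1, 2, 4, 8):
    print(f"PUT my_index_{num_shards}_shards")
    print(json.dumps(index_body(num_shards), indent=2))
```

Each printed request can be pasted into the Kibana console. After loading the same set of stored queries into every index, the response times of the percolate query can be compared across them.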

system · September 18, 2019, 6:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
