Documents with Regex fields

I have documents whose fields are themselves regular expressions.
For example:
doc1: regexp:1111.*01011
doc2: regexp:111.*01011

So a query with regexp:1111010111101011 should return doc1 and doc2, while a query with regexp:111011101011 should return only doc2. Is this type of query possible with Elastic? If not, is there an alternative way to achieve this with Elastic?

Thanks

Yes, this is possible! You can use the percolator for that (one of my favorite Elasticsearch features!). The percolator allows you to index queries, and then later ask Elasticsearch if a given document matches those indexed queries.

To use the percolator, you first need to define a field of type percolator in the index's mapping. Here I'm defining a field my_query of that type, as well as a field my_expression that the regular expressions will be matched against:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_expression": {
        "type": "keyword"
      },
      "my_query": {
        "type": "percolator"
      }
    }
  }
}

Now, you can index your regular expressions. Here I do so in the form of regexp queries that I index into the my_query field:

PUT /my_index/_doc/1
{
  "my_query": {
    "regexp": {
      "my_expression": "1111.*01011"
    }
  }
}

PUT /my_index/_doc/2
{
  "my_query": {
    "regexp": {
      "my_expression": "111.*01011"
    }
  }
}

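Finally, you can test a given value using the percolate query. The value goes into the my_expression field of the document being percolated, and the search returns the stored queries whose regular expression matches it: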

GET /my_index/_search
{
  "query": {
    "percolate": {
      "field": "my_query",
      "document": {
        "my_expression": "111011101011"
      }
    }
  }
}
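To sanity-check which stored queries you should expect back, here is a small Python sketch that mimics the matching outside Elasticsearch. It assumes that Python's re module behaves like the Lucene regular expression engine for these simple patterns (the two syntaxes differ in general), and it uses re.fullmatch because the regexp query has to match the whole value, not just part of it. The stored_patterns dict and the percolate helper are made up for this illustration and are not part of any Elasticsearch client.

```python
import re

# Stand-in for the indexed percolator queries: document ID -> regular expression.
stored_patterns = {
    "1": "1111.*01011",
    "2": "111.*01011",
}


def percolate(value):
    """Return the IDs of the stored patterns that match the whole value."""
    return sorted(
        doc_id
        for doc_id, pattern in stored_patterns.items()
        # fullmatch mirrors the anchored behaviour of the regexp query.
        if re.fullmatch(pattern, value)
    )


print(percolate("1111010111101011"))  # ['1', '2']
print(percolate("111011101011"))      # ['2']
```

This lines up with the expectation in the question: the first value matches both stored expressions, the second only document 2.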

Andon, thanks for the reply. I think this should solve my use case. What is the performance like if we have, say, a million documents, each with a percolator field?

  • Krish

Good question. The percolator doesn't scale quite the same way as other queries in Elasticsearch. The response time will be roughly linear in the number of stored percolator queries (although there are some optimizations, as detailed in the documentation).

The percolator is one of the few cases where it may be better to have more shards. That's because each shard holds a subset of the stored percolator queries, so with multiple shards a document can be percolated against those subsets in parallel.

You probably need to do some testing with different numbers of documents and shards to see what an optimum for your cluster would be.
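One way to set up such a test is to create the same index several times with different shard counts and compare response times. The Python sketch below only builds and prints the create-index requests. The build_index_body helper, the shard counts, and the my_index_N_shards naming scheme are assumptions for illustration; how you send the requests depends on the client you use.

```python
import json


def build_index_body(number_of_shards):
    """Build a create-index body with the percolator mapping from above."""
    return {
        # number_of_shards is fixed at index creation time.
        "settings": {"number_of_shards": number_of_shards},
        "mappings": {
            "properties": {
                "my_expression": {"type": "keyword"},
                "my_query": {"type": "percolator"},
            }
        },
    }


# Hypothetical shard counts to compare; adjust them to your cluster.
for shards in (1, 2, 4):
    print(f"PUT my_index_{shards}_shards")
    print(json.dumps(build_index_body(shards), indent=2))
```

You would then index the same set of percolator queries into each index and run the same percolate searches against all of them.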
