Best practice to search for a regular expression next to a list of terms

I am trying to write a query with the following criteria:

  1. The query contains a regular expression, and the document must contain a match for it.
  2. The query contains a list of words, and at least one of those words must appear within N words of the regular expression match. I want to use slop between the regular expression and the list of terms.

Here is what I have come up with so far. It finds hits on the regexp and the list of terms but not within N words of each other.

{
  "from": 0,
  "size": 100,
  "explain": true,
  "_source": {
    "includes": [
      "*"
    ],
    "excludes": [
      "FileText"
    ]
  },
  "query": {
    "bool": {
      "must": {
        "regexp": {
          "FileText": {
            "value": "[0-9]{3}"
          }
        }
      },
      "should": {
        "match": {
          "FileText": {
            "query": "list words find please",
            "minimum_should_match": "1%"
          }
        }
      }
    }
  }
}

I also tried

{
  "from": 0,
  "size": 100,
  "explain": false,
  "_source": {
    "includes": [
      "*"
    ],
    "excludes": [
      "FileText"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "FileText": {
              "value": "[0-9]{3}"
            }
          }
        },
        {
          "match": {
            "FileText": {
              "query": "list words find please"
            }
          }
        }
      ]
    }
  }
}

This finds documents with the regex and terms but they are not in proximity to each other.

Thanks in advance for any guidance you can provide.

Have you looked at span queries? Note that span queries are already I/O-heavy on their own, since they must gather and compute position data at multiple levels of query execution. Combined with regexp queries, they are likely to be very slow.

I don't think you can use regexp with span queries, even though the documentation says you can.

Someone else mentioned this over here on the Lucene list.

I have been trying to use span queries and you are correct, regexp is not supported:

{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[span_multi] query does not support [regexp]",
        "line": 1,
        "col": 59
      }
    ],
    "type": "parsing_exception",
    "reason": "[span_multi] query does not support [regexp]",
    "line": 1,
    "col": 59
  },
  "status": 400
}

That was looking really promising too. Without proximity searching we get thousands of false positives.

Actually, the docs are right, but it's confusing: the regexp has to be nested under the "match" key of span_multi. Here's a working example for you:

DELETE my_index
PUT my_index/doc/2
{
  "content": "I got 99 problems but this query ain't one"
}
GET my_index/doc/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_multi": {
            "match": {
              "regexp": {
                "content": "[0-9]{2}"
              }
            }
          }
        },
        {
          "span_term": {
            "content": "query"
          }
        }
      ],
      "slop": 3,
      "in_order": true
    }
  }
}

Change slop to 2 and there's no match.
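
The example above uses a single span_term, while the original question asks for any one of a list of words near the regexp hit. One way to sketch that is to wrap the words in a span_or clause and set in_order to false so the word may appear on either side of the number. The helper below only builds the request body; the function name build_proximity_query and the slop value of 5 are made up for illustration, and the lowercasing assumes the field is indexed with an analyzer that lowercases tokens (span_term values are not analyzed):

```python
import json


def build_proximity_query(field, pattern, terms, slop, in_order=False):
    """Build a span_near body: a regexp hit within `slop` positions of any term."""
    if not terms:
        raise ValueError("terms must contain at least one word")

    # span_multi wraps the multi-term regexp query so it can act as a span clause.
    regexp_clause = {"span_multi": {"match": {"regexp": {field: pattern}}}}

    # span_or matches when any one of the listed terms occurs.
    # Assumption: the field's analyzer lowercases tokens, so lowercase here too.
    any_term_clause = {
        "span_or": {
            "clauses": [{"span_term": {field: term.lower()}} for term in terms]
        }
    }

    return {
        "query": {
            "span_near": {
                "clauses": [regexp_clause, any_term_clause],
                "slop": slop,
                "in_order": in_order,
            }
        }
    }


# Field name, pattern, and word list are taken from the original question.
body = build_proximity_query(
    "FileText", "[0-9]{3}", ["list", "words", "find", "please"], slop=5
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a _search request. Keep in mind that the regexp is matched against individual indexed terms, so "[0-9]{3}" only hits tokens that consist of exactly three digits.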

That works great. With our actual regexp and a search term at a customer installation, the performance wasn't horrible. These queries will be run nightly by a backend service, so I don't need sub-second performance.

Thank you both for the assistance.