Minimise dataset for regexp query


(Hrathore) #1

I am running a regexp query that starts with .* on the filetext field for a specific employeeId.
The searched text can be anywhere inside the token, so I am wrapping the regex with .* on both sides.

Index Size: 100GB
Mapping for Index is: employeeId, filetext
Search Type: Scan
Scroll=5m

{
  "query": {
    "filtered" : {
      "query" : { "regexp" : { "filetext" : ".*1234.*" }},
      "filter" : {
        "bool" : {
          "must": { "term" : { "employeeId" : 5725 }}
        }
      }
    }
  }
}

This query scans the full index instead of only the documents for that employeeId (the IO stats show the whole disk being read), and because it is a regexp query, it is very slow.

  1. How do I make sure the query runs only on the specific employeeId, and not on the complete dataset?
  2. How do I speed this up?

(Christoph) #2

Hi,

I assume this is Elasticsearch 1.7, since the filtered query was deprecated in 2.0. Have you tried moving the regexp query after the term query in the must section of the bool filter, like this? I haven't tested this with a large number of documents, but the must clauses should be executed in order, so the term filter should reduce the number of documents the regexp query runs on.

{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "must": [
            { "term" : { "employeeId" : 5725 }},
            { "regexp" : { "filetext" : ".*1234.*" }}
          ]
        }
      }
    }
  }
}
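
For reference, since the filtered query is deprecated from 2.0 onwards, the same request there would be written as a bool query with a filter clause. This is only a rough sketch that reuses the field names and values from your post; I have not run it against an index of this size, and I can't say whether it changes how much of the index gets read.

```json
{
  "query" : {
    "bool" : {
      "must" : { "regexp" : { "filetext" : ".*1234.*" }},
      "filter" : { "term" : { "employeeId" : 5725 }}
    }
  }
}
```

The term clause sits under filter, so it does not contribute to scoring, while the regexp stays under must.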


(Hrathore) #4

@Christoph
I am using version 1.5.2.
I have tried the query you suggested, but it didn't make any difference. The new query took the same time as the earlier one and still searched the whole index instead of only the documents matching the employeeId filter.

Does this change in version 2.0?

