Combine regex with match_phrase

Hey all!

I ran into a problem with my Elasticsearch query.
I'm trying to match log patterns from different tests that were run on a specific app.
My goal is to find documents where all of the patterns appear in any one of the tests.
The full query I built is this:

GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "auto.bundle_id": "com.app.myapp"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2023-07-09T12:20:11.722304"
            }
          }
        }
      ],
      "should": [
        {
          "bool": {
            "must": [
              {
                "regexp": {
                  "auto.test_1.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              },
              {
                "match_phrase": {
                  "auto.test_1.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_1.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "regexp": {
                  "auto.test_2.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              },
              {
                "match_phrase": {
                  "auto.test_2.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_2.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ]
          }
        }
      ],
      "must_not": [],
      "minimum_should_match": 1
    }
  }
}

This gives 0 hits.
If I remove the regex parts, like this:

GET index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "auto.bundle_id": "com.app.myapp"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2023-07-09T12:20:11.722304"
            }
          }
        }
      ],
      "should": [
        {
          "bool": {
            "must": [
              {
                "match_phrase": {
                  "auto.test_1.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_1.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "match_phrase": {
                  "auto.test_2.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_2.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ]
          }
        }
      ],
      "must_not": [],
      "minimum_should_match": 1
    }
  }
}

I get hits. Some of them contain the regex pattern, which are the results I want, but other hits don't have that pattern.
On the other hand, if I remove everything else and keep just the regex, like this:

{
  "query": {
    "bool": {
      "must": [],
      "should": [
        {
          "bool": {
            "must": [
              {
                "regexp": {
                  "auto.test_1.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must": [
              {
                "regexp": {
                  "auto.test_2.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              }
            ]
          }
        }
      ],
      "must_not": [],
      "minimum_should_match": 1
    }
  }
}

I also get hits, including the ones that satisfy all of the original query.
I looked for a way to run the combined query. I believe it should be possible, but maybe I was looking in the wrong places.

Any help will be appreciated.

Hey @Eldan_Vaknin:

I think you need to use should clauses with minimum_should_match: 1 to signal that you want the terms to be optional, but at least one of them should match in order to get hits.

Something similar to:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "auto.bundle_id": "com.app.myapp"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2023-07-09T12:20:11.722304"
            }
          }
        }
      ],
      "should": [
        {
          "bool": {
            "should": [
              {
                "regexp": {
                  "auto.test_1.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              },
              {
                "match_phrase": {
                  "auto.test_1.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_1.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ],
            "minimum_should_match": 1
          }
        },
        {
          "bool": {
            "should": [
              {
                "regexp": {
                  "auto.test_2.device_log.keyword": ".*ActivityManager: Dumping to \\/data\\/anr\\/anr_[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
                }
              },
              {
                "match_phrase": {
                  "auto.test_2.device_log": "--------- beginning of crash"
                }
              },
              {
                "match_phrase": {
                  "auto.test_2.device_log": ": crash_dump helper failed to exec, or was killed "
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ],
      "must_not": [],
      "minimum_should_match": 1
    }
  }
}

Also, consider using the bool query's filter clause for this kind of filtering if you're not interested in scoring.
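To illustrate the filter suggestion, here is a minimal sketch in Python that builds the request body as a dict. The field names and values are copied from the thread; the single should entry is only a placeholder for the existing per-test clauses, and moving the term and range clauses into filter is an assumption about which parts don't need scoring.

```python
import json

# Placeholder for the existing per-test clauses from the thread (shortened here).
per_test_clauses = [
    {
        "bool": {
            "must": [
                {"match_phrase": {"auto.test_1.device_log": "--------- beginning of crash"}}
            ]
        }
    }
]

query = {
    "query": {
        "bool": {
            # filter clauses must match, like must, but they do not affect _score
            "filter": [
                {"term": {"auto.bundle_id": "com.app.myapp"}},
                {"range": {"@timestamp": {"gte": "2023-07-09T12:20:11.722304"}}},
            ],
            "should": per_test_clauses,
            "minimum_should_match": 1,
        }
    }
}

# Print the JSON body that would go after GET index/_search
print(json.dumps(query, indent=2))
```

This only changes how the term and range clauses are evaluated (no scoring); it does not change which documents match.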

Hope that helps!

Thank you for the response.

A should clause wouldn't work for me, as I'm trying to find documents that contain all of the patterns.
My should is across the fields, since the patterns can appear in different logs.
The problem is that, while I know for a fact that there are documents with all patterns in a single log field, mixing the regex with any of the other clauses in the query results in 0 hits.

Perhaps I don't understand something about how the regex works.
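To make the intended logic explicit (all patterns within one log field, with any one field being enough), here is a rough Python sketch that generates the same query shape as the original one. The helper names clauses_for_field and build_query are hypothetical; the patterns, field names, and values are copied from the thread. It only reproduces the structure and does not by itself explain the 0 hits.

```python
import json

# Regex and phrases copied from the original query.
ANR_REGEX = (
    r".*ActivityManager: Dumping to \/data\/anr\/anr_"
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
)
PHRASES = [
    "--------- beginning of crash",
    ": crash_dump helper failed to exec, or was killed ",
]


def clauses_for_field(field):
    """AND: every pattern must occur in this one log field."""
    must = [{"regexp": {f"{field}.keyword": ANR_REGEX}}]
    must += [{"match_phrase": {field: phrase}} for phrase in PHRASES]
    return {"bool": {"must": must}}


def build_query(fields):
    """OR across fields: at least one field must satisfy all of its patterns."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"auto.bundle_id": "com.app.myapp"}},
                    {"range": {"@timestamp": {"gte": "2023-07-09T12:20:11.722304"}}},
                ],
                "should": [clauses_for_field(f) for f in fields],
                "minimum_should_match": 1,
            }
        }
    }


query = build_query(["auto.test_1.device_log", "auto.test_2.device_log"])
print(json.dumps(query, indent=2))
```

Adding another test is then a matter of appending its log field to the list passed to build_query.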

Here is a log that matches my query (it's the value of the field "auto.test_1.device_log"):

06-19 16:53:53.150 12410 19782 I ActivityManager: Dumping to /data/anr/anr_2024-06-19-16-53-53-148 
 --------- beginning of crash 
 06-19 16:53:55.554   671   671 F libc    : crash_dump helper failed to exec, or was killed 
 06-19 16:53:55.556 12410 19782 I ActivityManager: Collecting stacks for native pid 683 
----------------------------------------

Maybe it will help clarify what I'm after.
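As a local sanity check, the three patterns can be tried against the sample log in plain Python. This is only an approximation and rests on several assumptions: Python's re module is not Lucene's regexp engine, the tokens helper is a crude stand-in for whatever analyzer the field actually uses (the mapping isn't shown in the thread), and the line breaks in SAMPLE_LOG are guessed from how the log is displayed above. re.fullmatch is used because, as far as I know, a regexp query has to match the whole indexed term, which for a keyword field is the entire value.

```python
import re

# Sample value of auto.test_1.device_log, copied from the thread.
SAMPLE_LOG = (
    "06-19 16:53:53.150 12410 19782 I ActivityManager: Dumping to "
    "/data/anr/anr_2024-06-19-16-53-53-148 \n"
    " --------- beginning of crash \n"
    " 06-19 16:53:55.554   671   671 F libc    : crash_dump helper failed to exec, or was killed \n"
    " 06-19 16:53:55.556 12410 19782 I ActivityManager: Collecting stacks for native pid 683 "
)

ANR_REGEX = (
    r".*ActivityManager: Dumping to \/data\/anr\/anr_"
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{2}-[0-9]{3}.*"
)
PHRASES = [
    "--------- beginning of crash",
    ": crash_dump helper failed to exec, or was killed ",
]


def tokens(text):
    # Crude analyzer stand-in: lowercase, then split on non-word characters.
    return re.findall(r"\w+", text.lower())


def contains_phrase(text, phrase):
    # True if the phrase's tokens appear consecutively and in order in the text.
    haystack, needle = tokens(text), tokens(phrase)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


# DOTALL lets "." span line breaks (an assumption about the real engine and data).
regex_ok = re.fullmatch(ANR_REGEX, SAMPLE_LOG, flags=re.DOTALL) is not None
phrases_ok = [contains_phrase(SAMPLE_LOG, p) for p in PHRASES]
print(regex_ok, phrases_ok)
```

If all checks pass locally, the patterns themselves are at least consistent with the sample, so the next place to look would be how the .keyword sub-field is actually mapped and indexed.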