Simple Query to search within log text (not keyword)

I am trying to search within text that originates from log files.

Am I using the correct query? Any suggestions on how to restrict the results to exact matches only (e.g. show only results with a higher score)?

I am using Elasticsearch 7.17 (no wildcard search available, and I cannot upgrade).
Since I was unable to use grok to properly extract keyword fields (lots of different log files, with different formats), I have to search mainly within a string, with something like this:

GET /log*/_search
{
  "sort" : [
    { "@timestamp": {}}
  ],
  "query" : {
    "match_phrase": {
      "message": "identifier=54345243"       
    }
  }
}

And here is a sample of what the log field "message" looks like:

      "message" : "2023-02-09 13:12:12.156 DEBUG Info:[replacement=xydfre1, usrid=test2, uname=bolkojt2, process=activerqw, entrypoint=, app_server=main, identifier=54345243] 2752271 --- [Camel (testing) thread #2 - timer://externalConfigurationRefresh] g.u.m.g.c.filters.RestClientFilters      : Request: Method: GET, URL: http://test-env12:8384/test-jbpm-confmap/dev",
          "@timestamp" : "2023-02-09T13:18:12.230Z"
        },

Note that the log output can vary a lot, or else I would have used grok to extract keyword fields for these values... Perhaps this is possible, but I was not able to create a flexible enough grok matching pattern (that is a different question, though...).

Hi,
Given the constraints you describe, it looks like you're on the right track using a phrase query on a text field.
The trick to successful matching is making sure that your search tokens match the tokens indexed from the docs. The choice of "analyzer" used on the field determines how text is chopped into search tokens (typically, words).

Taking this part of your document's text:

      identifier=54345243] 2752271

Depending on the choice of analyzer used at index time, that might be tokenised into 1, 2 or 3 words. Also, the number might include the ']' in the indexed token, which could explain why your search might not match.
To remove the mystery behind matching, use the _analyze API to see how searches and documents are turned into tokens.
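
As a rough sketch (the index name below is just a placeholder for one of your log indices), you can compare how a fragment of the stored message and your search text are tokenised by the analyzer mapped on the message field:

# Tokens produced for a fragment of the stored document
GET /log-2023.02.09/_analyze
{
  "field": "message",
  "text": "identifier=54345243] 2752271"
}

# Tokens produced for the search text used in the match_phrase query
GET /log-2023.02.09/_analyze
{
  "field": "message",
  "text": "identifier=54345243"
}

If the two token lists don't line up (for example, if the ']' ends up glued to the number in the indexed token), the phrase query won't match.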

Thanks! I have not explicitly set a tokenizer for my indices, so I guess the default one applies. Is it OK to change the tokenizer for all of my indices now, or would something like remapping be needed?
(Maybe the pattern analyzer with \W+ is better suited for logs?)

That would require reindexing your content.
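
As a sketch of what that could look like (the index names log-new and log-2023.02.09 are placeholders, and the pattern analyzer with \W+ is only one possible choice), you would define the analyzer on a new index, map the message field to use it, and then copy the data across with _reindex:

# New index with a pattern analyzer applied to the message field
PUT /log-new
{
  "settings": {
    "analysis": {
      "analyzer": {
        "log_message_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "lowercase": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message": {
        "type": "text",
        "analyzer": "log_message_analyzer"
      }
    }
  }
}

# Copy the existing documents into the new index
POST /_reindex
{
  "source": { "index": "log-2023.02.09" },
  "dest": { "index": "log-new" }
}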

Ultimately, the issue with indexing log messages is that there's a vocabulary problem. Unlike human-authored text, there's no common agreement between searchers and the search engine on what constitutes a word. The numbers-and-brackets snippet from my last post is a good illustration of not knowing how this content might be stored in the index.
This blog post gives a detailed breakdown of the options.

