I am trying to search within text that originates from log files.
Am I using the correct query? Any suggestions on how to restrict results to exact matches only (e.g. show only results with a higher score)?
I am using Elasticsearch 7.17 (no wildcard search available, and I cannot upgrade).
Since I was unable to use grok to properly make keywords (lots of different log files, with different formats), I have to search mainly within a string, with something like this:
Note that the log output can vary a lot, or else I would have used grok to make keywords for these words... Perhaps this is possible, but I was not able to create a flexible enough grok matching pattern (but that is a different question...)
Hi,
Given the constraints you describe, it looks like you’re on the right track using a phrase query on a text field.
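For reference, a minimal sketch of what such a phrase query body might look like, built in Python. The field name "message" and the search phrase are hypothetical placeholders, not taken from your mapping; `match_phrase` and its `slop` option are standard query DSL.

```python
import json


def phrase_query(field, phrase, slop=0):
    """Build a match_phrase search body.

    slop=0 requires the analyzed terms to appear adjacent and in order;
    a larger slop tolerates that many position moves between them.
    """
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}


if __name__ == "__main__":
    # "message" and the phrase are illustrative assumptions only.
    print(json.dumps(phrase_query("message", "connection timed out"), indent=2))
```

With slop left at 0, a document only matches when all the phrase's tokens appear next to each other in the indexed field, which is usually closer to "exact match" than filtering on score.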
The trick to successful matching is to be certain that your search tokens match those tokens indexed from the docs. The choice of “analyzer” used on the field will influence how text is chopped into search tokens (typically, words).
Taking this part of your document’s text:
Depending on the choice of analyzer used at index time, that might be tokenised into 1, 2 or 3 words. Also, the number might include the ']' in the indexed token, which could explain why your search might not match.
To remove the mystery behind matching, use the _analyze API to see how searches and documents are turned into tokens.
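A rough sketch of calling the _analyze API from Python using only the standard library. The base URL, index name and field name you pass in are assumptions about your cluster, and there is no authentication here, so adapt as needed.

```python
import json
import urllib.request


def build_analyze_body(text, analyzer=None, field=None):
    """Build the JSON body for _analyze.

    Pass `field` to reuse the analyzer configured for that field
    (this needs an index in the URL), or `analyzer` to name one directly.
    With neither, Elasticsearch falls back to its default analyzer.
    """
    body = {"text": text}
    if field is not None:
        body["field"] = field
    elif analyzer is not None:
        body["analyzer"] = analyzer
    return body


def analyze(base_url, text, analyzer=None, field=None, index=None):
    """POST to _analyze and return the token strings from the response."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(build_analyze_body(text, analyzer, field)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Example call (hypothetical URL, index and field):
# analyze("http://localhost:9200", "some log text", field="message", index="my-logs")
```

Running the same text through it once as a document snippet and once as your search phrase shows directly whether the two produce the same tokens.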
Thanks! I have not explicitly set a tokenizer for my indices. So I guess the default one applies. Is it ok to change the tokenizer for all of my indices now? Or would something like Remapping be needed?
(maybe pattern analyzer \W+ is better suited for logs?)
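To get a feel for what splitting on \W+ would do compared with splitting on whitespace, here is a small offline approximation in Python. It only mimics the idea (split on non-word characters, then lowercase) and is not the Elasticsearch implementation; the sample log line is made up for illustration.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, keeping punctuation and case."""
    return text.split()


def pattern_tokens(text, pattern=r"\W+"):
    """Approximate a \\W+ pattern split with lowercasing.

    re.ASCII keeps \\W limited to ASCII word characters; this is an
    approximation, so confirm real behaviour with the _analyze API.
    """
    return [t for t in re.split(pattern, text.lower(), flags=re.ASCII) if t]


if __name__ == "__main__":
    sample = "ERROR [worker-12] timeout after 30s"  # hypothetical log line
    print(whitespace_tokens(sample))
    # ['ERROR', '[worker-12]', 'timeout', 'after', '30s']
    print(pattern_tokens(sample))
    # ['error', 'worker', '12', 'timeout', 'after', '30s']
```

In this sketch the bracketed part is one token under whitespace splitting but two clean tokens under \W+, so a phrase search for "worker 12" could only line up with the second form.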
Ultimately, the issue with indexing log messages is that there’s a vocabulary problem. Unlike human-authored text, there’s no common agreement between searchers and search engine on what constitutes a word. The numbers and brackets in my last post are a good example of not knowing how this content might be stored in the index.
This blog post gives a detailed breakdown of the options.