Strange search results


(Patrick Proniewski) #1

Hello,

Disclaimer: I'm a total newbie with Elasticsearch. I've installed a dedicated ES 1.1.0 server (FreeBSD port) and Logstash 1.4.0 (with its bundled Kibana 3.x). Everything is working fine, except for some particular searches.

I'm indexing server logs (postfix, apache, and so on), with some grok pattern matching. My problem arises when I try some queries, either in Kibana or in the Sense interface. In a few of my postfix log lines, the strings "a79.e.ipso1978.fr" or "e.ipso1978.fr" appear:

Apr 24 06:26:53 rack postfix/smtpd[73065]: 7F32D47C: client=localhost[127.0.0.1], orig_client=a79.e.ipso1978.fr[178.32.165.79]
Apr 24 06:26:53 rack postfix/smtpd[73057]: ... from=news@e.ipso1978.fr to=... helo=<e.ipso1978.fr>

The vast majority of log lines contain neither string.
Each line is stored verbatim in a field named "message"; I have more fields, of course, corresponding to the various patterns extracted.

Doing a search for a79.e.ipso1978.fr (w/o quotes) in Kibana returns 21048 results: absolutely not good.
a79.e.ipso1978.fr* (w/o quotes): 0 results, not good.
"a79.e.ipso1978.fr" (w/ quotes) in ES returns 4 results: good.
"79.e.ipso1978.fr": 0 results, not good.
".e.ipso1978.fr": 10 results, good.
"e.ipso1978.fr": 10 results, good.
".ipso1978.fr": 0 results, not good.
ipso1978: 0 results, not good.
*ipso1978: 10 results, good.
*ipso1978.fr: 0 results, not good.
"ipso1978": 0 results, not good.

Basically, I expect any of these searches to return every log line containing the query, and only those lines (as grep or awk would do).
Obviously, I'm missing something here. I don't understand why a simple string search can go so wrong. I've been struggling with this for more than a day now. It doesn't look like a Kibana problem, because I get the same irrelevant results using Sense.

Any help is greatly appreciated,
Patrick

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/EFE80EB8-E6C7-4F46-A522-B2AB915BEEFB%40patpro.net.
For more options, visit https://groups.google.com/d/optout.


(Patrick Proniewski) #2

Hello,

Any idea?



(Hannes Korte) #3

Hi Patrick,

As you didn't mention your Elasticsearch type mapping, I guess you are
using the default one, which analyzes your "message" field. As a result,
the original string is split into terms.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html

You can see this behavior using the analyze API:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'from=news@e.ipso1978.fr to=... helo=<e.ipso1978.fr>'

Using the standard analyzer, this text consists of the terms "from", "news",
"e.ipso1978", "fr", etc. This analyzer is actually meant to be used with
natural language text. If you now search for the query string
a79.e.ipso1978.fr without quotes, you are actually searching for "a79 OR
e.ipso1978 OR fr", because the query string gets analyzed as well, so every
document containing any one of these terms matches. Enclosing your query
in double quotes gives you a phrase search. This works because the search
terms then have to appear contiguously and in order in the documents.

So, using phrase queries you will get what you want, as long as your query
string starts and ends at term borders. You can see this in your examples:
"a79.e.ipso1978.fr" -> 4 results, "79.e.ipso1978.fr" -> 0 results (the
indexed term is "a79", not "79").
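
To make the term-border behaviour concrete, here is a minimal Python sketch. It does not call Elasticsearch: standard_tokens is a hand-written, rough stand-in for the standard analyzer (an assumption, not Lucene's real tokenizer), and the three match helpers are hypothetical names that only mimic how unquoted, quoted and wildcard queries compare against indexed terms.

```python
from fnmatch import fnmatchcase


def standard_tokens(text):
    """Rough stand-in for the standard analyzer (NOT Lucene's real tokenizer).

    Lowercases, splits on anything that is not a letter, digit or underscore,
    and keeps a '.' only when it sits between two letters or two digits.
    """
    text = text.lower()
    tokens, current = [], ""
    for i, ch in enumerate(text):
        if ch.isalnum() or ch == "_":
            current += ch
            continue
        if ch == "." and current and i + 1 < len(text):
            prev, nxt = text[i - 1], text[i + 1]
            if (prev.isalpha() and nxt.isalpha()) or (prev.isdigit() and nxt.isdigit()):
                current += ch
                continue
        if current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


def matches_any_term(query, doc_tokens):
    """Unquoted query: the analyzed query terms are ORed together."""
    return any(term in doc_tokens for term in standard_tokens(query))


def matches_phrase(query, doc_tokens):
    """Quoted query: the analyzed terms must be contiguous and in order."""
    terms = standard_tokens(query)
    n = len(terms)
    return n > 0 and any(
        doc_tokens[i:i + n] == terms for i in range(len(doc_tokens) - n + 1)
    )


def matches_wildcard(pattern, doc_tokens):
    """Wildcard query: the pattern is compared with each single indexed term."""
    return any(fnmatchcase(token, pattern.lower()) for token in doc_tokens)


# The two sample log lines from the first post (timestamps omitted).
line_a = "client=localhost[127.0.0.1], orig_client=a79.e.ipso1978.fr[178.32.165.79]"
line_b = "from=news@e.ipso1978.fr to=... helo=<e.ipso1978.fr>"
docs = {"a": standard_tokens(line_a), "b": standard_tokens(line_b)}


def hits(match, query):
    """Names of the sample lines that the given kind of query would match."""
    return sorted(name for name, tokens in docs.items() if match(query, tokens))


print(standard_tokens(line_b))
print("unquoted a79.e.ipso1978.fr:", hits(matches_any_term, "a79.e.ipso1978.fr"))
print("quoted a79.e.ipso1978.fr:  ", hits(matches_phrase, "a79.e.ipso1978.fr"))
print("quoted 79.e.ipso1978.fr:   ", hits(matches_phrase, "79.e.ipso1978.fr"))
print("quoted .ipso1978.fr:       ", hits(matches_phrase, ".ipso1978.fr"))
print("wildcard *ipso1978:        ", hits(matches_wildcard, "*ipso1978"))
print("wildcard *ipso1978.fr:     ", hits(matches_wildcard, "*ipso1978.fr"))
```

Under these assumptions the unquoted query hits both sample lines (line b shares the terms "e.ipso1978" and "fr"), the quoted "a79.e.ipso1978.fr" hits only line a, and "79.e.ipso1978.fr" and ".ipso1978.fr" hit nothing because "79" and "ipso1978" are not indexed terms. The wildcard *ipso1978 matches the single term "e.ipso1978", while *ipso1978.fr matches no single term.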

A theoretically possible, but in practice not advisable, way to get an exact
substring search would be to set the field to "not_analyzed" and search
it with a regexp query like this:

"regexp": { "message": ".*a79\\.e\\.ipso1978\\.fr.*" }

The problem with this scenario is that you end up with one unique term per
document, and this does not scale.
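
As an illustration only, the request body for such a regexp query could be built like this in Python. The helper names are hypothetical, and the set of characters escaped is my assumption about what Lucene's regexp syntax treats as special; the sketch only produces JSON and does not talk to a server.

```python
import json

# Characters assumed to be special in Lucene's regexp syntax.
LUCENE_REGEXP_SPECIAL = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regexp(literal):
    """Backslash-escape every character assumed to be special."""
    return "".join("\\" + ch if ch in LUCENE_REGEXP_SPECIAL else ch for ch in literal)


def substring_regexp_query(field, literal):
    """Build a search body that matches terms containing `literal`.

    The regexp has to match the whole term, hence the leading and
    trailing '.*' around the escaped literal.
    """
    pattern = ".*" + escape_lucene_regexp(literal) + ".*"
    return {"query": {"regexp": {field: pattern}}}


body = substring_regexp_query("message", "a79.e.ipso1978.fr")
# json.dumps doubles the backslashes, as JSON requires.
print(json.dumps(body, indent=2))
```

Keep in mind that a not_analyzed field is not lowercased either, so a query like this would be case-sensitive.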

So, if you want to have a pure substring search, this blog post might help
you:
http://blog.rnf.me/2013/exact-substring-search-in-elasticsearch.html

And here are some links about how to set the mapping for your logstash
indices:

http://www.elasticsearch.org/blog/new-in-logstash-1-3-elasticsearch-index-template-management/
http://logstash.net/docs/1.4.0/outputs/elasticsearch
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html
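
For orientation, here is a rough Python sketch of what an index template body might look like. It assumes the ES 1.x template format ("template" pattern, "_default_" mapping, "string" type with "index": "not_analyzed"); the function name and the "logstash-*" pattern are illustrative assumptions, so check the pages above before relying on it. It shows the not_analyzed variant mentioned earlier, with the same scaling caveat.

```python
import json


def message_template(index_pattern="logstash-*"):
    """Hypothetical template body mapping "message" as not_analyzed.

    Assumes the ES 1.x index template format; the pattern is an assumption.
    """
    return {
        "template": index_pattern,
        "mappings": {
            "_default_": {
                "properties": {
                    "message": {"type": "string", "index": "not_analyzed"}
                }
            }
        },
    }


# The resulting JSON would be sent to the index templates API
# described in the links above.
print(json.dumps(message_template(), indent=2))
```

A template only applies to indices created after it is registered, so existing logstash indices would keep their current mapping.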

I hope this was helpful.

Best regards,
Hannes



(Patrick Proniewski) #4

Hannes,

Thank you very much for this detailed answer. You are right, I'm using the default mapping (I was not even aware of mappings as I'm very new to ES).
I'll read the links you've provided ASAP and see what's best for me (and for my future users).
I've already taken a look at the blog post. It uses concepts I'll have to learn before trying to understand how ES really works on the inside.

I've noticed that my first example (a79.e.ipso1978.fr w/o quotes in Kibana returns 21048 results) was in fact interesting, because the 21048 results were ordered by score, and this score was significantly higher for the 4 meaningful results I was looking for.
Is there any way to filter results in Kibana using a score range? My attempts in Sense failed miserably, but I guess that's because filtering occurs before results are known.

Thanks again for your very helpful reply.

Patrick



(Hannes Korte) #5

Hi Patrick,

Actually, there are two types of filters: filters inside queries are
applied before the query, filters outside the query are applied after
the query. That is why the latter have been renamed to post_filter in ES 1.0:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_search_requests.html#_search_requests

To answer your question: I am not sure whether it is possible to filter
results by score. I guess it could be possible using a script filter,
but in a quick test I didn't find a way to access the score value in the
filter script:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html#query-dsl-script-filter

But you shouldn't do it anyway as the scores might vary strongly between
searches. Scores are only meaningful relative to each other within one
search result. So it does not make sense to define an absolute limit.

Best regards,
Hannes


