Weird results from Elasticsearch facets on the Apache log field "path"

Hi there,

I am loading Apache access logs into Elasticsearch for searching and some statistics.

A sample of my log documents looks like this:

curl "localhost:9200/logstash-2013.07.16/_search?pretty" -d '{"size":4}'
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 45,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-2013.07.16",
"_type" : "fluentd",
"_id" : "dCi2ALWDSj6tA4I0W6kZnQ",
"_score" : 1.0, "_source" :
{"host":"10.0.0.1","user":null,"method":"GET","path":"/somepath01/somefile01.html","code":200,"size":12,"referer":null,"agent":"curl/7.19.7 (x86_64-unknown-linux-gnu) libcurl/7.19.7 NSS/3.12.7.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2","@timestamp":"2013-07-16T11:25:01+07:00"}
}, {
"_index" : "logstash-2013.07.16",
"_type" : "fluentd",
"_id" : "udCTJUP8Sce5N-A5QSdCDw",
"_score" : 1.0, "_source" :
{"host":"10.0.0.2","user":null,"method":"GET","path":"/somepath02/somefile02.html","code":200,"size":12,"referer":null,"agent":"curl/7.19.7 (x86_64-unknown-linux-gnu) libcurl/7.19.7 NSS/3.12.7.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2","@timestamp":"2013-07-16T11:25:02+07:00"}
}, {
"_index" : "logstash-2013.07.16",
"_type" : "fluentd",
"_id" : "Yghj0YQtTYWaS_kGy8T7YQ",
"_score" : 1.0, "_source" :
{"host":"10.0.0.3","user":null,"method":"GET","path":"/somepath03/somefile03.html","code":200,"size":12,"referer":null,"agent":"curl/7.19.7 (x86_64-unknown-linux-gnu) libcurl/7.19.7 NSS/3.12.7.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2","@timestamp":"2013-07-16T11:25:03+07:00"}
}, {
"_index" : "logstash-2013.07.16",
"_type" : "fluentd",
"_id" : "WkrnKPd5Sc2qvE1fmRuEDA",
"_score" : 1.0, "_source" :
{"host":"10.0.0.4","user":null,"method":"GET","path":"/somepath04/somefile04.html","code":200,"size":12,"referer":null,"agent":"curl/7.19.7 (x86_64-unknown-linux-gnu) libcurl/7.19.7 NSS/3.12.7.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2","@timestamp":"2013-07-16T11:25:04+07:00"}
} ]
}
}

Now I want to list the IPs (field: host) that occur most often in the log,
so I use this facet query:

curl "localhost:9200/logstash-2013.07.16/_search?pretty" -d '{"facets":{"Top IP":{"terms":{"field":"host"}}}}'

"facets" : {
"Top IP" : {
"_type" : "terms",
"missing" : 0,
"total" : 45,
"other" : 0,
"terms" : [ {
"term" : "10.0.0.9",
"count" : 9
}, {
"term" : "10.0.0.8",
"count" : 8
}, {
"term" : "10.0.0.7",
"count" : 7
}, {
"term" : "10.0.0.6",
"count" : 6
}, {
"term" : "10.0.0.5",
"count" : 5
}, {
"term" : "10.0.0.4",
"count" : 4
}, {
"term" : "10.0.0.3",
"count" : 3
}, {
"term" : "10.0.0.2",
"count" : 2
}, {
"term" : "10.0.0.1",
"count" : 1
} ]
}
}

The result is perfect! Elasticsearch worked well.

Now I want to do the same thing with the path field, to list the most
frequently accessed URLs.

curl "localhost:9200/logstash-2013.07.16/_search?pretty" -d '{"facets":{"Top URL":{"terms":{"field":"path"}}}}'

"facets" : {
"Top URL" : {
"_type" : "terms",
"missing" : 0,
"total" : 135,
"other" : 28,
"terms" : [ {
"term" : "html",
"count" : 45
}, {
"term" : "somepath09",
"count" : 9
}, {
"term" : "somefile09",
"count" : 9
}, {
"term" : "somepath08",
"count" : 8
}, {
"term" : "somepath07",
"count" : 7
}, {
"term" : "somefile08",
"count" : 7
}, {
"term" : "somepath06",
"count" : 6
}, {
"term" : "somefile07",
"count" : 6
}, {
"term" : "somepath05",
"count" : 5
}, {
"term" : "somefile06",
"count" : 5
} ]
}
}

This time the result is weird. I expected entries like this:
"term" : "/somepath05/somefile05.html", "count" : 5

I guess Elasticsearch has a problem with the forward slash "/" in the path field.

I don't know how to fix this.

Could someone explain the problem and help me fix it?

Much appreciated.

Atrus@

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The standard tokenizer splits your "path" field into several terms (words), treating the "/" character as a separator.
I'm not sure this is the right solution, but you could try another tokenizer, for example:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/whitespace-tokenizer/
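
To illustrate the splitting described above, here is a minimal Python sketch. It does not call Elasticsearch. The function analyzed_terms is only a rough stand-in for an analyzed string field (lowercase, split on every non-alphanumeric character); the real standard tokenizer follows Unicode word-break rules and differs in edge cases. All function names and the sample paths are made up for the illustration.

```python
import re
from collections import Counter


def analyzed_terms(value):
    """Rough approximation of an analyzed field: lowercase the value and
    split it on every run of non-alphanumeric characters ("/", ".", ...)."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def not_analyzed_terms(value):
    """A not_analyzed field keeps the whole value as a single term."""
    return [value]


def terms_facet(values, to_terms):
    """Count terms across all values, most frequent first (like a terms facet)."""
    counts = Counter()
    for value in values:
        counts.update(to_terms(value))
    return counts.most_common()


# Hypothetical sample: one path requested twice, another requested once.
paths = ["/somepath02/somefile02.html"] * 2 + ["/somepath01/somefile01.html"]

print(analyzed_terms("/somepath05/somefile05.html"))
print(terms_facet(paths, analyzed_terms))
print(terms_facet(paths, not_analyzed_terms))
```

With the approximated analysis, "html" is counted once per document and each path fragment becomes its own term, which matches the shape of the weird facet output. Keeping the value as a single term gives one count per full path.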

Other "skilled" suggestions are appreciated

Hi Atrus,

Can you create a mapping for the field "path" before indexing?
It should be "not_analyzed". After that, your faceting should work.
You can see how to configure "not_analyzed" fields here:
http://www.elasticsearch.org/guide/reference/mapping/core-types/.
The path part of your mapping could look like this:

....
"path": {
"type": "string",
"index": "not_analyzed"
}
....
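
For context, here is a small Python sketch of how that fragment might sit inside a complete request body for creating the index. The type name "fluentd" is taken from the search output earlier in the thread; the helper name path_mapping_body is made up, and the assumption is that the body is sent when the index is created, before any documents are indexed.

```python
import json


def path_mapping_body(doc_type="fluentd"):
    """Build an index-creation body that maps "path" as a not_analyzed string.

    doc_type defaults to the "_type" seen in the search results above;
    adjust it if your documents use a different type.
    """
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # Same fragment as in the message above.
                    "path": {"type": "string", "index": "not_analyzed"}
                }
            }
        }
    }


# Print the JSON so it can be pasted into a curl -d '...' argument.
print(json.dumps(path_mapping_body(), indent=2))
```

Fields that are not listed under "properties" would still be mapped dynamically, so only "path" changes behaviour here.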

Greetings

Christian

Thank you both for your replies!

Christian's way looks simpler. I tried it and it
worked perfectly:

"facets" : {
"Top path" : {
"_type" : "terms",
"missing" : 0,
"total" : 21,
"other" : 0,
"terms" : [ {
"term" : "/somepath06/somefile06",
"count" : 6
}, {
"term" : "/somepath05/somefile05",
"count" : 5
}, {
"term" : "/somepath04/somefile04",
"count" : 4
}, {
"term" : "/somepath03/somefile03",
...

I have two further questions:

  • Is it possible to change an already indexed field from analyzed to
    not_analyzed without re-creating the index?
  • If that is not possible, how can I configure Elasticsearch to
    default to not_analyzed for string fields instead of analyzed?

Atrus@

  1. It is not possible to update existing fields without re-indexing.
  2. You can use dynamic templates for exactly that purpose. See the
    template_2 example in the dynamic template documentation (scroll down).
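
As a rough sketch of point 2, the Python below builds a template body that applies a dynamic template to every string field. It uses the string / not_analyzed syntax from this thread. The template entry name "strings_not_analyzed", the index pattern "logstash-*" and the helper name are assumptions for illustration, and the body has not been checked against a particular Elasticsearch version.

```python
import json


def strings_not_analyzed_template(index_pattern="logstash-*"):
    """Build an index-template body that maps every dynamically detected
    string field as not_analyzed.

    index_pattern is an assumed pattern for the daily logstash-YYYY.MM.DD
    indices seen in this thread.
    """
    return {
        "template": index_pattern,
        "mappings": {
            # "_default_" applies the rule to every document type.
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings_not_analyzed": {
                            "match": "*",
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                            },
                        }
                    }
                ]
            }
        },
    }


# Print the JSON so it can be sent as the body of a template request.
print(json.dumps(strings_not_analyzed_template(), indent=2))
```

A template like this only affects indices created after it is registered, and making every string not_analyzed means searches on those fields match only the exact full value, so it may be worth limiting "match" to the fields you facet on.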

Cheers,

Ivan

On Tue, Jul 16, 2013 at 7:21 PM, Atrus Klaus linuxbyexamples@gmail.com wrote:

I have two further questions:

  • Is it possible to change an already indexed field from analyzed to
    not_analyzed without re-creating the index?
  • If that is not possible, how can I configure Elasticsearch to
    default to not_analyzed for string fields instead of analyzed?
