Range and numeric_range and field types problem


(mobsniuk) #1

I'm encountering an odd issue. When I use a range query I get back more
than I expect. The following curl request specifies a range of 221-222,
yet I get back values of 2219, 2212, and 2210.

curl -XPOST 'localhost:9200/_search?pretty=true' -d '
{
  "filter": {
    "range": {
      "hits": {
        "from": "221",
        "to": "222",
        "include_lower": true,
        "include_upper": false
      }
    }
  }
}'

Example of returned data.

{
  "took" : 28,
  "timed_out" : false,
  "_shards" : {
    "total" : 50,
    "successful" : 50,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash-2011.06.17",
      "_type" : "syslog",
      "_id" : "FwetdW95QXmPku8pC5Ap-g",
      "_score" : 1.0, "_source" :
{"@source":"file://localhost/var/log/messages","@type":"syslog","@tags":[],"@fields":{"timestamp":["2011-06-17T11:09:48-07:00","2011-06-17T11:09:48-07:00"],"YEAR":["2011","2011"],"MONTHNUM":["06","06"],"MONTHDAY":["17","17"],"HOUR":["11","07","11","07"],"MINUTE":["09","00","09","00"],"SECOND":["48","48"],"ISO8601_TIMEZONE":["-07:00","-07:00"],"caller":["daemon","daemon"],"process":["named","named"],"pid":["26353","26353"],"BASE10NUM":["26353","26353"],"view":["_default"],"size":["170010"],"hits":["2212"],"misses":["38"],"data":["
info Recursion cache view "_default": size = 170010, hits = 2212,
misses = 38"]},"@timestamp":"2011-06-17T18:09:49.350000Z","@source_host":"localhost","@source_path":"/var/log/messages","@message":"2011-06-17T11:09:48-07:00
daemon (none) named[26353]: info Recursion cache view "_default":
size = 170010, hits = 2212, misses = 38"}
    }, {
      "_index" : "logstash-2011.06.21",
      "_type" : "syslog",
      "_id" : "cte-3dmFTpmFlwpTcJs3XA",
      "_score" : 1.0, "_source" :
{"@source":"file://localhost/var/log/messages","@type":"syslog","@tags":[],"@fields":{"view":["_default"],"size":[173208],"hits":[2210],"misses":[1094],"timestamp":["2011-06-21T04:12:10-07:00"],"YEAR":["2011"],"MONTHNUM":["06"],"MONTHDAY":["21"],"HOUR":["04","07"],"MINUTE":["12","00"],"SECOND":["10"],"ISO8601_TIMEZONE":["-07:00"],"caller":["daemon"],"process":["named"],"pid":["11540"],"BASE10NUM":["11540"],"data":["
info Recursion cache view "_default": size = 173208, hits = 2210,
misses = 1094"]},"@timestamp":"2011-06-21T11:12:11.396000Z","@source_host":"localhost","@source_path":"/var/log/messages","@message":"2011-06-21T04:12:10-07:00
daemon (none) named[11540]: info Recursion cache view "_default":
size = 173208, hits = 2210, misses = 1094"}
    }, {
      "_index" : "logstash-2011.06.21",
      "_type" : "syslog",
      "_id" : "x37lujEMTfGgrj-dIcXotw",
      "_score" : 1.0, "_source" :
{"@source":"file://localhost/var/log/messages","@type":"syslog","@tags":[],"@fields":{"view":["_default"],"size":[173208],"hits":[2219],"misses":[1099],"timestamp":["2011-06-21T04:17:16-07:00"],"YEAR":["2011"],"MONTHNUM":["06"],"MONTHDAY":["21"],"HOUR":["04","07"],"MINUTE":["17","00"],"SECOND":["16"],"ISO8601_TIMEZONE":["-07:00"],"caller":["daemon"],"process":["named"],"pid":["11540"],"BASE10NUM":["11540"],"data":["
info Recursion cache view "_default": size = 173208, hits = 2219,
misses = 1099"]},"@timestamp":"2011-06-21T11:17:17.473000Z","@source_host":"localhost","@source_path":"/var/log/messages","@message":"2011-06-21T04:17:16-07:00
daemon (none) named[11540]: info Recursion cache view "_default":
size = 173208, hits = 2219, misses = 1099"}
    } ]
  }
}

I thought the data might be being stored incorrectly. I found that I
could give logstash a hint to store values as ints, and you can see
that the data is now stored differently: index logstash-2011.06.17
shows "hits":["2212"], while index logstash-2011.06.21 shows
"hits":[2219] and "hits":[2210]. So the change I made to logstash to
store these values as ints looks to have succeeded, or at least it
changed something. When I try the numeric_range filter on today's data,
I get an error saying that hits is not numeric.

The entries come from log files, so there isn't a uniform format. I
read that you can set up an index with types. I need to extract a
particular set of values to do analysis on. Any clarification on this
would be appreciated. More fields will be extracted for analysis, so we
need to find a solution to this numeric_range and field types issue.

Thanks,

Mark


(Clinton Gormley) #2

Hi Mark

> I thought the data might be being stored incorrectly. I found that I
> could give logstash a hint to store values as ints, and you can see
> that the data is now stored differently: index logstash-2011.06.17
> shows "hits":["2212"], while index logstash-2011.06.21 shows
> "hits":[2219] and "hits":[2210]. So the change I made to logstash to
> store these values as ints looks to have succeeded, or at least it
> changed something. When I try the numeric_range filter on today's
> data, I get an error saying that hits is not numeric.

I'm guessing that your mapping is still incorrect.

You can't look at the _source field that is returned from a search and
deduce from that how the data is stored in ES, because you get back
exactly what you put in.

If the first doc that you index has { count: "123" } then the count
field will be mapped as a string, and it stays a string in that index
even if later docs send { count: 123 }.
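
To illustrate why a string-mapped field returns 2210, 2212 and 2219 for
a 221-222 range, here is a rough shell sketch. It is not ES code; the
helper names in_string_range and in_numeric_range are made up for this
example, and the bounds simply mirror the query in the first post
(lower bound inclusive, upper bound exclusive).

```shell
# String comparison, as applies to a field mapped as string.
in_string_range() {
  # \< inside [ ] compares the operands as strings, character by character.
  [ ! "$1" \< "221" ] && [ "$1" \< "222" ]
}

# Numeric comparison, as applies to a field mapped as integer.
in_numeric_range() {
  [ "$1" -ge 221 ] && [ "$1" -lt 222 ]
}

for v in 221 2210 2212 2219 222; do
  if in_string_range "$v"; then s=match; else s=no; fi
  if in_numeric_range "$v"; then n=match; else n=no; fi
  echo "$v string=$s numeric=$n"
done
```

As strings, "2210", "2212" and "2219" all sort between "221" and "222",
so they match; as numbers only 221 itself falls in the range.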

You can use the get_mapping API to see what ES thinks each field's type
is:
http://www.elasticsearch.org/guide/reference/api/admin-indices-get-mapping.html
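
For example, something along these lines should show the mapping for
one of the daily indices. This is only a sketch: get_mapping_url is a
helper invented here, and the host and index name are assumed from the
first post.

```shell
# Hypothetical helper: build the get-mapping URL for one index.
# Arguments: host:port, index name (both assumed from the thread).
get_mapping_url() {
  printf 'http://%s/%s/_mapping?pretty=true' "$1" "$2"
}

# Usage against a running node:
#   curl -XGET "$(get_mapping_url localhost:9200 logstash-2011.06.21)"
```

Comparing the output for logstash-2011.06.17 and logstash-2011.06.21
should show whether hits is mapped as a string or as a number in each.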

I don't know the internals of logstash, but it looks like you create a
new index each day. You may just need to start fresh with the correct
form that you're using now (i.e. { count: 123 }) and all will be OK.

Or you may need to specify the correct mapping for each field, when you
create your index:
http://www.elasticsearch.org/guide/reference/api/admin-indices-create-index.html
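
A minimal sketch of such a mapping, assuming the "syslog" type and the
@fields layout shown in the first post. The helper name
syslog_mapping_body and the index name in the usage line are made up
for illustration, and the choice of integer for hits, misses and size
is a guess at what is wanted. Newer ES versions have since dropped
mapping types, so this shape applies to versions of this era.

```shell
# Hypothetical helper: print a create-index body that maps
# @fields.hits, @fields.misses and @fields.size as integers
# for the "syslog" type.
syslog_mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "syslog": {
      "properties": {
        "@fields": {
          "properties": {
            "hits":   { "type": "integer" },
            "misses": { "type": "integer" },
            "size":   { "type": "integer" }
          }
        }
      }
    }
  }
}
EOF
}

# Usage against a running node (the index name is only an example):
#   curl -XPUT 'localhost:9200/logstash-2011.06.22' -d "$(syslog_mapping_body)"
```

The index has to be created this way before logstash writes its first
doc to it, otherwise the type of each field is guessed from that doc.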

clint

