I have some long string documents stored in Elasticsearch (version 2.3).
I want to run a terms aggregation over my documents. The field I chose for the terms holds long strings.
The total count of the docs is 718, but the doc counts in the aggregation result add up to far less than that.
I suspect some long docs are filtered or dropped because the string is too long; some are longer than 256 characters.
So I want to ask: is there any length limitation on the key in a terms aggregation?
Is there any way to get around this limitation?
These are the aggregation results:
{
  "took": 100,
  "timed_out": false,
  "_shards": {
    "total": 36,
    "successful": 36,
    "failed": 0
  },
  "hits": {
    "total": 718,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "select () from t_intg_dm_00v3 where XXX = XXX and ((((((fm_office in (XXX) and task_type in (XXX)) and task_status != XXX) and task_status != XXX) and task_status != XXX) and auto_schedulable != XXX) and change_time > XXX) and active = XXX limit XXX, XXX;",
          "doc_count": 12
        },
        {
          "key": "select () from t_intg_dm_00v3 where XXX = XXX and task_id = 'XX-XXXXXXX-XXXXX' and active = XXX order by operate_time asc limit XXX, XXX;",
          "doc_count": 2
        },
        {
          "key": "select () from t_intg_dm_00v3 where XXX = XXX and (task_id = 'XX-XXXXXXX-XXXXX' and (operate_type = XXX or operate_type = XXX)) and active = XXX order by task_log_id asc limit XXX, XXX;",
          "doc_count": 1
        },
        {
          "key": "select () from t_intg_dm_00v3 where XXX = XXX and (task_id = 'XX-XXXXXXX-XXXXX' and (operate_type = XXX or operate_type = XXX)) and active = XXX order by task_log_id asc limit XXX, XXX;",
          "doc_count": 1
        },
        {
          "key": "select count(1) count from t_intg_dm_00ui where XXX = XXX and create_time <= '2016-08-15 XXX:59:59' and active = XXX;",
          "doc_count": 1
        }
      ]
    }
  }
}
What's the mapping for the field you are aggregating on?
The mapping type is 'slowquery'.
It has many fields, such as 'Pattern', 'Database', and 'Schema'.
I'm doing the terms aggregation on the field 'Pattern'. It's a string.
From your example it's hard to tell; can you add how you are doing the aggregation?
My query body is like this. The field is 'Pattern' and the mapping is 'slowquery':
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "type:slowquery",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1471299986398,
                  "lte": 1471314386399,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "Pattern.raw",
        "size": 500,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}
If you run the query without the aggregation (and size set to some higher value), do all the results have the field set that you are aggregating on?
If I query without the aggregation, the number of results is correct.
For example, with curl -XGET 'http://esurl/_all/_search?size=200' the returned size is correct.
When I add the aggregation to the query body, the buckets cover far fewer documents.
Please provide the mapping for the field. You can retrieve this through the get mapping API.
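For example, something like:

curl -XGET 'http://esurl/_all/_mapping/slowquery?pretty'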
{"logstash-2016.08.16":{"mappings":{"slowquery":{"_all":{"enabled":true,"omit_norms":true},"dynamic_templates":[{"message_field":{"mapping":{"fielddata":{"format":"disabled"},"index":"analyzed","omit_norms":true,"type":"string"},"match":"message","match_mapping_type":"string"}},{"string_fields":{"mapping":{"fielddata":{"format":"disabled"},"index":"analyzed","omit_norms":true,"type":"string","fields":{"raw":{"ignore_above":256,"index":"not_analyzed","type":"string","doc_values":true}}},"match":"","match_mapping_type":"string"}},{"float_fields":{"mapping":{"type":"float","doc_values":true},"match":"","match_mapping_type":"float"}},{"double_fields":{"mapping":{"type":"double","doc_values":true},"match":"","match_mapping_type":"double"}},{"byte_fields":{"mapping":{"type":"byte","doc_values":true},"match":"","match_mapping_type":"byte"}},{"short_fields":{"mapping":{"type":"short","doc_values":true},"match":"","match_mapping_type":"short"}},{"integer_fields":{"mapping":{"type":"integer","doc_values":true},"match":"","match_mapping_type":"integer"}},{"long_fields":{"mapping":{"type":"long","doc_values":true},"match":"","match_mapping_type":"long"}},{"date_fields":{"mapping":{"type":"date","doc_values":true},"match":"","match_mapping_type":"date"}},{"geo_point_fields":{"mapping":{"type":"geo_point","doc_values":true},"match":"*","match_mapping_type":"geo_point"}}],"properties":{"@timestamp":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"@version":{"type":"string","index":"not_analyzed"},"Database":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"Lock Time":{"type":"double"},"Pattern":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"Query Time":{"type":"double"},"Rows Examined":{"type":"long"},"Rows 
Sent":{"type":"long"},"SQL":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"Schema":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"Timestamp":{"type":"date","format":"strict_date_optional_time||epoch_millis"},"beat":{"properties":{"hostname":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"name":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}}}},"count":{"type":"long"},"geoip":{"dynamic":"true","properties":{"ip":{"type":"ip"},"latitude":{"type":"float"},"location":{"type":"geo_point"},"longitude":{"type":"float"}}},"host":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"input_type":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"log_format":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"message":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"}},"offset":{"type":"long"},"source":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"tags":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}},"type":{"type":"string","norms":{"enabled":false},"fielddata":{"format":"disabled"},"fields":{"raw":{"type":"string","index":"not_analyzed","ignore_above":256}}}}}}}}
Sorry, it's a little bit long.
It seems the field has the attribute "ignore_above": 256.
Could that be causing the problem? If a string is longer than 256 characters, will it be ignored when doing the aggregation?
I believe a not_analyzed string longer than 256 characters will not be indexed into the raw sub-field at all, which could explain why documents with long Pattern values are missing from the aggregation entirely: there is no term for them to bucket on.
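If that is right, you should be able to verify it by counting the documents for which no Pattern.raw term was indexed. In 2.x an exists query inside a must_not should do it; if this count roughly matches the documents missing from your buckets, ignore_above is the culprit:

curl -XGET 'http://esurl/_all/_search?size=0' -d '{
  "query": {
    "bool": {
      "must": { "term": { "_type": "slowquery" } },
      "must_not": { "exists": { "field": "Pattern.raw" } }
    }
  }
}'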
How do I change it, e.g. to 1024 characters?
You will need to change this in your index template.
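Roughly like this: fetch the template that Logstash installed (by default it is named "logstash"), raise ignore_above inside the string_fields dynamic template, and put it back. A sketch, assuming the default template name:

curl -XGET 'http://esurl/_template/logstash?pretty' > template.json
# edit template.json: remove the outer {"logstash": ...} wrapper, then change
# "ignore_above": 256 to e.g. "ignore_above": 1024 in the string_fields template
curl -XPUT 'http://esurl/_template/logstash' -d @template.json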
I'm using Logstash to send data to ES.
It seems that the index template has no impact on existing indices.
If I want to change the existing indices, what can I do?
Also, when I tried to update the mapping with
curl -XPUT 'http://myesurl/logstash-2016.08.16/_mapping/slowquery' -d '
"properties": {
"Pattern": {
"type": "string",
"norms": {
"enabled": false
},
"fielddata": {
"format": "disabled"
},
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 2560
}
}
},
"SQL": {
"type": "string",
"norms": {
"enabled": false
},
"fielddata": {
"format": "disabled"
},
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 2560
}
}
}
}
'
it returned an error:
{"error":{"root_cause":[{"type":"not_x_content_exception","reason":"not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}],"type":"not_x_content_exception","reason":"not_x_content_exception: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"},"status":500}
You cannot change existing mappings, so you will need to reindex that data.
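A sketch of that, assuming the experimental _reindex API that shipped in 2.3: first create a new index with the larger ignore_above (or let your updated template create it), then copy the data over. The destination name here is just an example:

curl -XPOST 'http://myesurl/_reindex' -d '{
  "source": { "index": "logstash-2016.08.16" },
  "dest":   { "index": "logstash-2016.08.16-fixed" }
}'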
The data is sent to ES in JSON format by Logstash.
If the mapping does not exist, ES creates it automatically from the structure of the JSON.
So I wonder: is it possible to configure the value of ignore_above from the Logstash side, so that the mapping is created correctly from the very start?
Also, a new index is created each day; does the index template affect the slowquery mapping of all those indices?
Any changes to the index template will only apply to new indices that are created. It will not affect existing ones, so the data in those may need to be reindexed.
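As for setting this up from the Logstash side: the elasticsearch output plugin can install the template for you, so you can point it at a template file in which you have already raised ignore_above. A sketch (the file path is a placeholder):

output {
  elasticsearch {
    hosts              => ["esurl:9200"]
    template           => "/etc/logstash/templates/logstash-template.json"
    template_name      => "logstash"
    template_overwrite => true
  }
}

Every new daily index created after that change will pick up the larger limit automatically; only the indices that already exist need the reindex.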