Terms aggregation returns no buckets if some records are missing the field?


(Andrew Swan) #1

I'm using Elasticsearch 1.5.0 and I have a bunch of web access logs. Due to a glitch during insertion, I omitted the response field from some records. But this seems to have broken terms aggregations on that field:

$ curl -H 'Content-type: application/json' 'http://localhost:9200/events-default@2015.06.03/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "group": {
      "terms": {
        "field": "response"
      }
    }
  }
}'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 19939,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ ]
    }
  }
}

As you can see, there are many records in that index, but the aggregation results have no buckets! I don't think this is a simple matter of issue #5324 since there are definitely records with non-null values for that field:

$ curl -H 'Content-type: application/json' 'http://localhost:9200/events-default@2015.06.03/_search?pretty' -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "not": {
          "missing": {
            "field": "response"
          }
        }
      }    
    }  
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 15807,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

The mapping for that field is:

          "response" : {
            "type" : "string",
            "index" : "not_analyzed",
            "doc_values" : true,
            "fielddata" : {
              "format" : "doc_values"
            }
          },

Am I overlooking something simple?


(Colin Goodheart-Smithe) #2

I've tried this on 1.5.2 using the commands in the below gist and could not reproduce your issue. Could you modify your second request (the one for not missing) to add in the terms aggregation and see what terms come up?


(Andrew Swan) #3

Hi, thanks for the response. I ran again with the filtered query and the aggregation together, and still no dice:

$ curl -H 'Content-type: application/json' 'http://localhost:9200/events-default@2015.06.03/_search?pretty' -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "not": {
          "missing": {
            "field": "response"
          }
        }
      }    
    }  
  },
  "aggs": {
    "group": {
      "terms": {
        "field": "response"
      }
    }
  }
}'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 15807,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "group" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ ]
    }
  }
}

You mentioned 1.5.2 and I'm using 1.5.0. Do you think it would be more worthwhile to try to create a smaller reproducible test case or to switch to 1.5.2? (I'm of course willing to do both, just wondering if one seems more promising than the other)

Thanks for the help!


(Colin Goodheart-Smithe) #4

I think the problem here is whatever analyzer you are using for the response field is not producing any tokens for the responses which have a value. So I would look at what the values of the response field actually are. Try setting size in the request to 10 so you can see a sample of what the values of response are when they are not missing.

Switching to 1.5.2 is unlikely to solve the issue, since the changes between 1.5.0 and 1.5.2 are only going to be bugfixes and IIRC there haven't been any relevant terms aggregation bug fixes for this problem. I would try the above suggestion and then work on creating a smaller reproducible test case.


(Andrew Swan) #5

Thanks for the response. As you can see in the original message, the response field is a non-analyzed string. Unless I'm misunderstanding your comment, there are definitely values there:

$ curl -H 'Content-type: application/json' 'http://localhost:9200/events-default@2015.06.03/_search?pretty' -d '{
"size": 10,
"query": { "filtered": { "filter": { "not": { "missing": { "field": "response" } } } } },
"fields": [ "response" ]
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 15807,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF0",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF1",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF2",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF3",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF4",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF5",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF6",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF7",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF8",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    }, {
      "_index" : "events-default@2015.06.03",
      "_type" : "schema-event",
      "_id" : "AU22uAllZKJmAVjVQjF9",
      "_score" : 1.0,
      "fields" : {
        "response" : [ "200" ]
      }
    } ]
  }
}

I'll work on a smaller repeatable script for reproduction.


(Andrew Swan) #6

Well I finally did manage to create a relatively compact reproduction, but I think the process of creating the reproduction pointed me to the problem. The script is here: https://gist.github.com/aswan/d5fbcf92c5e727cff4e1

The issue is that the index includes multiple types (set up via templates in the script; it could probably be shortened to just use the put mapping API), and another type in the same index maps the response field with a different field type.

Examining this led me to https://github.com/elastic/elasticsearch/issues/8870. According to the comments there, this is expected. So, perhaps this is just a documentation issue, but I am still unable to find material in the ES guide that explains the constraints on field mappings described in that issue. And I have a pile of code that needs to be majorly overhauled. Ugh.
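
For anyone who lands here with the same symptom, here is a rough sketch of how this kind of conflict could be spotted from the output of GET /<index>/_mapping. It is only an illustration, not anything from the gist: the function name, the second type name (other-event), its integer mapping, and the path field are all made up, and it assumes the 1.x response layout of index -> "mappings" -> type -> "properties". It only looks at top-level fields.

```python
from collections import defaultdict


def find_conflicting_fields(mapping_response):
    """Return {index: {field: {doc_type: field_type}}} for top-level fields
    whose "type" differs between document types in the same index."""
    conflicts = {}
    for index, body in mapping_response.items():
        seen = defaultdict(dict)  # field name -> {doc type: field type}
        for doc_type, type_mapping in body.get("mappings", {}).items():
            for field, spec in type_mapping.get("properties", {}).items():
                # Object fields carry no explicit "type"; label them "object".
                seen[field][doc_type] = spec.get("type", "object")
        bad = {f: t for f, t in seen.items() if len(set(t.values())) > 1}
        if bad:
            conflicts[index] = bad
    return conflicts


# Hypothetical sample shaped like an ES 1.x GET /<index>/_mapping response.
sample = {
    "events-default@2015.06.03": {
        "mappings": {
            "schema-event": {
                "properties": {
                    "response": {"type": "string", "index": "not_analyzed"},
                    "path": {"type": "string"},  # made-up field
                }
            },
            "other-event": {  # made-up second type
                "properties": {
                    "response": {"type": "integer"},
                    "path": {"type": "string"},
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(find_conflicting_fields(sample))
```

In practice the sample dict would be replaced with the parsed JSON from the mapping request. Any field reported here is a candidate for the behaviour described in the linked issue.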

