Upgrading from 1.4.x alters terms aggregation

Upgrading from 1.4.5 to 1.5.x (or 1.6.0) breaks terms aggregation. I tried to build a test case, but could not reproduce the issue with simple data.

Anyway, we've got tweets indexed from the Twitter streaming API. The mapping is in tweet-mapping.json, and an example of one tweet is in example-tweet.json.

With 1.4.x the following terms aggregation works as expected (returns buckets with user.id_str):

{
  "size" : 0,
  "aggs" : {
    "foobar" : {
      "terms" : {
        "field" : "user.id_str",
        "size" : 10
      }
    }
  }
}

However, the same query with 1.5.x or 1.6.0 returns buckets where the keys are clearly from the "root" id_str (i.e. the unique key of the actual tweet, and not the user id -- and of course the count is always one).

I am sorry I was unable to draft a simple test case illustrating the issue. But does anyone have any idea what could be going on? Has something changed in the aggregations API?

This is concerning indeed. I know we have bugs in 1.x (to be fixed in 2.0) when the first part of a path also matches a type name. Do you happen to have a type called user in your index? Does it change anything if you provide "tweet.user.id_str" as the field name?

Yes, providing tweet.user.id_str as the field seems to fix the issue. IMO this sounds like a very scary regression indeed... and again, I could not reproduce the issue with a simple example. :frowning:
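
For anyone wanting to try the same workaround, here is a minimal sketch of building the aggregation body with a type-qualified field. The helper name terms_agg_body and its parameters are made up for illustration; only the JSON shape and the field names come from this thread.

```python
import json


def terms_agg_body(field, doc_type=None, size=10, name="foobar"):
    """Build a terms-aggregation body; hypothetical helper, not a client API.

    If doc_type is given, the field is prefixed with it
    (e.g. "user.id_str" becomes "tweet.user.id_str").
    """
    qualified = f"{doc_type}.{field}" if doc_type else field
    return {
        "size": 0,
        "aggs": {name: {"terms": {"field": qualified, "size": size}}},
    }


# Same query as in the first post, but with the type-qualified field.
body = terms_agg_body("user.id_str", doc_type="tweet")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search request body in place of the original query.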

Hmm... I wonder if we should just downgrade to 1.4.5, or try to modify all of our queries to include the doc type as well... I guess that according to the documentation user.id_str should be enough, and the issue I described is an actual bug?

Yes, we also have a document-type user in the same index.

EDIT: But after some tests, it seems the issue remains even without the user type.

Indeed, this is an awful bug: https://github.com/elastic/elasticsearch/issues/4081. For the record, I don't think it is new to 1.5. The reason you only see it now might be that elasticsearch had to re-read its mappings because of the upgrade, which changed which mapping is returned first when there is ambiguity.

Thanks for the clarification!

It is still unclear to me what would be the most future-proof way of solving this if I keep running with 1.6.0. Should I prefix ALL my queries/aggregations/filters etc. with the document-type?

I actually always thought that the url would be part of field resolution. For example, the above-mentioned aggregation issue happens when I search by url /$MY_INDEX/tweet/_search. That is, the type is part of the url, so I don't understand why the type is also required in the query. :\

Thanks, really appreciate your help!

I guess one way of solving these issues could be just to make sure each index only stores documents of one single type...

But if you have many types you will have many shards - at least one per index :frowning:

In my case I am not so much concerned with the type clash restriction as with the fact that storing the fields in the same Lucene field regardless of their type will skew frequencies when querying a specific type, not to mention that they must be analyzed in the same way, etc.

I guess another way is to encapsulate every document in an enclosing JSON object named after its type. It would make all fields type-specific, at the cost of memory and performance as described in the referenced GitHub issues, not to mention the changes in the code and the overall ugliness of it.
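
A rough sketch of what that wrapping could look like on the client side before indexing. The helpers wrap_in_type and get_path are hypothetical, and the sample tweet values are made up; this only shows the document shape, not any Elasticsearch behaviour.

```python
def wrap_in_type(doc_type, source):
    """Nest a document under a key named after its type (hypothetical helper)."""
    return {doc_type: source}


def get_path(doc, path):
    """Follow a dotted path such as "tweet.user.id_str" through nested dicts."""
    node = doc
    for part in path.split("."):
        node = node[part]
    return node


# Made-up sample tweet, reduced to the two id_str fields discussed above.
tweet = {"id_str": "1", "user": {"id_str": "42"}}
wrapped = wrap_in_type("tweet", tweet)

# Every field path now starts with the type name.
print(get_path(wrapped, "tweet.user.id_str"))
print(get_path(wrapped, "tweet.id_str"))
```

With this shape the two id_str fields have distinct full paths inside the document itself, which is the point of the idea, but every query and every reader of the documents would have to be changed to use the longer paths.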