Hello everyone,
Short version:
Is it possible (maybe with an aggregation) to get the new values that appear each day in a specific field of my index?
If I run a terms aggregation on my field "user.location", I'd like to get the diff between the keys found today and the keys found yesterday and earlier.
For instance, if the keys from yesterday and earlier are A, B, C, and today's keys are A, D, E, I'd like to get only D and E.
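To make the expected result concrete, here is the same example written as a plain set difference in Python. It only illustrates the output I'm after, it is not something Elasticsearch does for me, and the variable names and sample values are made up.

```python
# Keys seen yesterday and earlier (hypothetical sample values).
previous_keys = {"A", "B", "C"}
# Keys returned by today's terms aggregation (hypothetical sample values).
todays_keys = {"A", "D", "E"}

# Keep only the keys that were never seen before today.
new_keys = todays_keys - previous_keys

print(sorted(new_keys))  # prints ['D', 'E']
```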
Long version:
I'm currently trying to find a way to harmonize Twitter's user.location field with Elasticsearch.
My final goal is to batch my geocoding queries, so that I can geolocate Twitter users without sending millions of requests every day!
I thought of several things:
- Analyze this field to keep only the relevant words (i.e. the cities and countries)
- Run an aggregation (like a terms aggregation) to get all possible values
- Find a way to compare today's buckets with the rest
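For the aggregation step, this is roughly the search body I have in mind, built as a Python dict so it can be passed to any client. It is a sketch under assumptions: the date field name `created_at` and the default `size` are guesses of mine, and "user.location" is assumed to be indexed so that a terms aggregation returns whole values.

```python
import json


def todays_locations_query(field="user.location", date_field="created_at", size=1000):
    """Build a search body: terms aggregation on `field`, limited to today's documents."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {
            # "now/d" is date math for the start of the current day.
            "range": {date_field: {"gte": "now/d"}}
        },
        "aggs": {
            "locations": {
                "terms": {"field": field, "size": size}
            }
        },
    }


print(json.dumps(todays_locations_query(), indent=2))
```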
This raises several questions:
- Is it possible to analyze such a field but keep the whole phrase as a key? I.e. turn "London " into "London", but also keep "Paris, FR" as a full key (because if I only keep "Paris", my "Paris, TX" tweets will end up in France ^^)
- Is there an aggregation that gives the diff between today's aggregation and the rest?
- Is it possible to get all the keys of the aggregation (i.e. something better than setting the size parameter to 1000000 ^^)?
- If not, would you have a brilliant idea to do the same thing quite easily (in terms of performance, execution time, complexity)?
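For the first question, the behaviour I'm looking for is roughly the following, written as plain Python rather than as an analyzer: clean up the whitespace but never split the value into separate words. I suppose a `keyword` tokenizer combined with a `trim` token filter would be the analyzer-side counterpart, but that is an assumption I have not tested. The function name is made up, and collapsing inner whitespace is an extra I added myself.

```python
import re


def normalize_location(raw):
    """Clean up a location string without splitting it into words."""
    # Remove leading/trailing whitespace: "London " -> "London".
    cleaned = raw.strip()
    # Collapse runs of whitespace into one space (my own assumption).
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned


for raw in ["London ", "Paris, FR", "  Paris,   TX"]:
    print(repr(normalize_location(raw)))
```

With this, "London " becomes "London" while "Paris, FR" stays a single key.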
One idea would be:
- Get today's terms aggregation on the user.location field
- Compare this list to the keys found by previous aggregations (how should I store the previous list of locations? how can I compare it quickly to today's list?)
- Store the new keys
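The three steps above could be sketched like this, assuming the previous keys are simply kept in a local JSON file. That storage choice, the file name and the function names are mine, and I don't know how well it would scale to millions of keys.

```python
import json
import tempfile
from pathlib import Path


def load_known_keys(path):
    """Return the set of keys stored by previous runs (empty on the first run)."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def record_new_keys(todays_keys, path):
    """Return the keys never seen before and add them to the stored list."""
    known = load_known_keys(path)
    new_keys = set(todays_keys) - known
    # Persist the union so the next run compares against everything seen so far.
    path.write_text(json.dumps(sorted(known | new_keys)), encoding="utf-8")
    return new_keys


# Demo in a temporary directory; the key lists stand in for the buckets
# of each day's terms aggregation.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "known_locations.json"  # hypothetical file name
    print(sorted(record_new_keys(["A", "B", "C"], store)))  # first run: all keys are new
    print(sorted(record_new_keys(["A", "D", "E"], store)))  # second run: only D and E
```

The set difference itself is cheap; the open point for me is where to keep the list of known keys between runs.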
Thanks a lot in advance