Finding new values on my index every day

(Pierrick Boutruche) #1

Hello everyone,

Short version:

Is it possible (with an aggregation, maybe) to get the new values that appear in a specific field of my index each day?
If I run a terms aggregation on my field "user.location", I'd like to get the diff between the keys found today and the keys found yesterday and earlier.

For instance, if the keys from yesterday and earlier are A, B, C, and today's keys are A, D, E, I'd like to get only D and E.
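
The diff described here is a plain set difference. A minimal sketch, assuming both key lists have already been fetched and fit in memory (the function name `new_keys` is made up for illustration):

```python
def new_keys(todays_keys, previous_keys):
    """Return the keys seen today that never appeared before, sorted for stable output."""
    return sorted(set(todays_keys) - set(previous_keys))


# Keys A, B, C were seen before; today brought A, D, E.
print(new_keys(["A", "D", "E"], ["A", "B", "C"]))  # ['D', 'E']
```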

Long version:

I'm currently trying to find a way to harmonize Twitter's user.location field with Elasticsearch.

My final goal is to be able to batch my geocoding queries, so I can geolocate Twitter users without sending millions of requests every day!

I thought of several things:

  1. Analyze this field to keep only the relevant words (i.e. the cities, countries)
  2. Run an aggregation (like a terms aggregation) to get all possible values
  3. Find a way to compare today's bucket with the rest

Several questions now come up:

  1. Is it possible to analyze such a field but keep the whole phrase as a key? I.e. turn "London " into "London", but also keep "Paris, FR" as a full key (because if I only keep "Paris", my Paris, TX tweets will end up in France ^^")
  2. Is there an aggregation that gives the diff between today's aggregation and the rest?
  3. Is it possible to get all keys from the aggregation (i.e. something better than setting the size parameter to 1000000 ^^)?
  4. If not, would you have a brilliant idea for doing the same thing fairly easily (in terms of performance, execution time, complexity)?

An idea would be

  • Get today's terms agg about user.location field
  • Compare this list to previous aggs found (how to store the previous list of locations ? how to compare it quickly to today's list ?)
  • Store the new keys
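
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: the timestamp field name `created_at`, the bucket size of 10000, the aggregation name `locations`, the JSON file used as the store, and all function names are hypothetical. It also assumes "user.location" is indexed so that a terms aggregation returns whole values. The sketch only builds the request body and processes a response, so sending the query is left to whatever client is in use.

```python
import json
from pathlib import Path


def todays_locations_query(time_field="created_at", size=10000):
    """Request body: terms aggregation on user.location, restricted to documents from today."""
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "query": {"range": {time_field: {"gte": "now/d"}}},
        "aggs": {"locations": {"terms": {"field": "user.location", "size": size}}},
    }


def bucket_keys(response):
    """Extract the bucket keys from a terms aggregation response."""
    return [bucket["key"] for bucket in response["aggregations"]["locations"]["buckets"]]


def record_new_keys(todays_keys, store_path):
    """Compare today's keys with the stored ones, persist the union, return only the new keys."""
    path = Path(store_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = sorted(set(todays_keys) - seen)
    path.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

Calling `record_new_keys(bucket_keys(response), "seen_locations.json")` once a day would return only the locations that had never been stored before. A flat file is the simplest possible store; with millions of distinct locations a keyed store would probably be a better fit.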

Thanks a lot in advance :smile:

(Mark Walkom) #2

No, because ES currently has no concept of what has changed beyond the document version.
A changes API is likely to come in a future release, but until then you will need to manage this yourself.

(Pierrick Boutruche) #3

Thanks !

I suspected it might not be possible, but it's better to ask :slight_smile:

(Alex Roytman) #4

Are you actually indexing tweets, and do you want to find which new locations have appeared since yesterday? If so, why not extract the distinct locations from each batch of tweets and put them into a separate tweet_locations index, checking first whether they already exist (you can also timestamp them for reporting)? This has a cost, but it is pay-as-you-go. You can probably process other fields of your tweets in the same manner to produce more interesting entity-centric analysis.
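
One way to sketch this suggestion is to use a hash of the location as the document ID and let the bulk API's `create` action do the "already exists" check: `create` fails with a 409 status when the ID is already present and reports 201 when the document is new. The field name `first_seen`, the normalization rule, and the function names below are assumptions made for illustration. The action metadata follows the typeless format of recent Elasticsearch versions.

```python
import hashlib
import json


def location_id(location):
    """Stable ID: collapse whitespace and lowercase, so "London " and "London" collide."""
    normalized = " ".join(location.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def bulk_create_body(locations, first_seen, index="tweet_locations"):
    """Build an NDJSON bulk body with one `create` action per distinct location."""
    lines, ordered, seen_ids = [], [], set()
    for raw in locations:
        cleaned = " ".join(raw.split())
        doc_id = location_id(raw)
        if not cleaned or doc_id in seen_ids:
            continue  # skip empty values and duplicates within the batch
        seen_ids.add(doc_id)
        ordered.append(cleaned)
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"location": cleaned, "first_seen": first_seen}))
    return "\n".join(lines) + "\n", ordered


def newly_created(bulk_response, ordered):
    """Keep the locations whose `create` action succeeded (201), i.e. the ones not seen before."""
    return [
        location
        for location, item in zip(ordered, bulk_response["items"])
        if item["create"]["status"] == 201
    ]
```

Because "Paris, FR" and "Paris, TX" hash to different IDs, they stay separate entries, while trailing spaces and case differences are folded together. The locations returned by `newly_created` are the only ones that would need to be sent to the geocoder.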
