Finding new values on my index every day

pbocto · June 13, 2015, 1:01pm

Hello everyone,

Short version:

Is it possible to get (maybe with an aggregation) the new values that appear each day in a specific field of my index?
If I run a terms aggregation on my field "user.location", I'd like to find the diff between the keys found today and the keys found yesterday and earlier.

For instance, if the keys from yesterday and earlier are A, B, C, and today's keys are A, D, E, I'd like to get only D and E.
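To make the A, B, C versus A, D, E example concrete, here is a minimal Python sketch of the diff done on the client side. The two dictionaries are hand-written stand-ins shaped like terms aggregation responses, and the aggregation name "locations" and the helper bucket_keys are made up for illustration.

```python
def bucket_keys(response, agg_name):
    """Return the set of bucket keys from a terms aggregation response."""
    return {b["key"] for b in response["aggregations"][agg_name]["buckets"]}


# Hypothetical responses: one for "yesterday and earlier", one for "today".
before = {"aggregations": {"locations": {"buckets": [
    {"key": "A", "doc_count": 12},
    {"key": "B", "doc_count": 7},
    {"key": "C", "doc_count": 3},
]}}}
today = {"aggregations": {"locations": {"buckets": [
    {"key": "A", "doc_count": 5},
    {"key": "D", "doc_count": 2},
    {"key": "E", "doc_count": 1},
]}}}

# Keys seen today that were never seen before.
new_keys = bucket_keys(today, "locations") - bucket_keys(before, "locations")
print(sorted(new_keys))  # ['D', 'E']
```

The set difference is only correct if both aggregations return all of their keys, so it depends on the size question raised further down.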

Long version:

I'm currently trying to find a way to harmonize Twitter's user.location field with Elasticsearch.

My final goal is to be able to batch my geocoding queries so I can geolocate Twitter users without sending millions of requests every day!

I thought of several things:

Analyze this field to keep only the relevant words (i.e. the cities and countries)
Run an aggregation (like a terms aggregation) to get all possible values
Find a way to compare today's buckets with the rest

Several questions now arise...

Is it possible to analyze such a field but keep the whole phrase as a key? That is, turn "London " into "London", but also keep "Paris, FR" as a full key (because if I only keep "Paris", my Paris, TX tweets will end up in France ^^)
Is there an aggregation that returns the diff between today's aggregation and the rest?
Is it possible to get all keys from the aggregation (i.e. something better than setting the size parameter to 1000000 ^^)?
If not, would you have a brilliant idea for doing the same thing fairly easily (in terms of performance, execution time, and complexity)?
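On the first question, one possible direction is a custom analyzer built from the keyword tokenizer (which emits the whole value as a single token) and the trim token filter. The sketch below writes such settings as a Python dictionary and adds a plain-Python function that mimics the effect locally. The analyzer name location_key and the function normalize_location are made up, the extra whitespace collapsing is an assumption that the analyzer itself does not perform, and how the analyzer is attached to the field depends on the Elasticsearch version.

```python
import re

# Assumed index settings: the keyword tokenizer keeps "Paris, FR" as one token,
# and the trim filter removes leading and trailing whitespace from it.
LOCATION_ANALYSIS = {
    "analysis": {
        "analyzer": {
            "location_key": {  # hypothetical analyzer name
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["trim"],
            }
        }
    }
}


def normalize_location(raw):
    """Local approximation: trim the value and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw).strip()


print(normalize_location("London "))    # London
print(normalize_location("Paris, FR"))  # Paris, FR
```

With this kind of normalization "Paris, FR" and "Paris, TX" remain distinct keys, which is the property asked for above.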

One idea would be:

Get today's terms aggregation on the user.location field
Compare this list to the previously found ones (how to store the previous list of locations? how to compare it quickly to today's list?)
Store the new keys
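The three steps above could be sketched as follows, assuming the list of known locations is small enough to live in a JSON file next to the batch job. The function record_new_locations and the file name are hypothetical, and a real setup might keep the list in a database or in Elasticsearch itself.

```python
import json
import tempfile
from pathlib import Path


def record_new_locations(todays_keys, store_path):
    """Return the keys not seen before and add them to the stored list."""
    path = Path(store_path)
    known = set(json.loads(path.read_text())) if path.exists() else set()
    new = set(todays_keys) - known
    if new:
        # Persist the union so tomorrow's run compares against everything seen so far.
        path.write_text(json.dumps(sorted(known | new)))
    return new


# Usage with a temporary directory standing in for real storage.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "known_locations.json"
    first = record_new_locations(["A", "B", "C"], store)
    second = record_new_locations(["A", "D", "E"], store)
    print(sorted(second))  # ['D', 'E']
```

Set membership makes the comparison cheap, so the cost is dominated by fetching today's keys rather than by the diff.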

Thanks a lot in advance

warkolm · June 14, 2015, 2:37am

No, because ES currently has no concept of what has changed beyond the document version.
There is likely to be a changes API coming in future releases, but until then you will need to manage this yourself.

pbocto · June 14, 2015, 1:23pm

Thanks!

I knew it might not be possible, but it's better to ask.

roytmana · June 15, 2015, 5:40pm

Are you actually indexing tweets and want to find which new locations have appeared since yesterday? If so, why not extract the distinct locations from each batch of tweets and put them into a separate tweet_locations index, checking first whether they already exist (you can also timestamp them for reporting)? This has a cost, but it is pay-as-you-go. You can probably handle other tweet fields in the same manner to produce more interesting entity-centric analysis.
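A rough Python sketch of that suggestion: build a bulk request body in which each distinct location is sent as a create action, using the location string itself as the document ID. The create action does not overwrite a document that already exists with that ID, so the existence check falls out of the request. The helper build_create_actions and the field name first_seen are assumptions, and the exact action metadata (for example whether a document type is required) depends on the Elasticsearch version.

```python
import json
from datetime import datetime, timezone


def build_create_actions(locations, index="tweet_locations", now=None):
    """Build a newline-delimited bulk body with one create action per distinct location."""
    now = now or datetime.now(timezone.utc)
    lines = []
    for loc in sorted(set(locations)):
        # Action line: using the location as _id lets existing entries be rejected.
        lines.append(json.dumps({"create": {"_index": index, "_id": loc}}))
        # Source line: timestamp the first sighting for later reporting.
        lines.append(json.dumps({"location": loc, "first_seen": now.isoformat()}))
    # The bulk format expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


body = build_create_actions(["Paris, FR", "London", "London"])
print(body)
```

The items that succeed in the bulk response would then be the locations that are new in this batch, which is the list to hand to the geocoder.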
