Way to track index changes?


(Josh Harrison) #1

I'm looking for the optimal way to track record additions or changes.
So, say I've got data coming in. This data has timestamps, but the
timestamps don't correspond with the insertion time, and there's no
expectation for them to remain ordered.
I need to dump "new" data periodically, say every 15 minutes. There may be new
data in those 15 minutes, there may not.

Two approaches occur to me.
One: Reindex and include _timestamp as a part of the process. Grab all data
since the last dump.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-timestamp-field.html
Two: Set a boolean field like "dumped" that is false on index insertion and on
all current records. My dump app would just search everything
for dumped:false, dump it, and on success, set dumped:true.

Is there another approach that's better? Is either option 1 or option 2
above preferable? The boolean option seems appealing since the dump process
doesn't need to keep track of the last time a dump occurred. Maybe
I'll do both: for cases where "Oh hey, data for time period XYZ->ABC
didn't dump properly!" it would be nice to have explicit timestamps of when
the data was present, without needing to rely on them for the dump process.
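
For illustration, here is a minimal sketch of both options side by side. It is not tied to any client library: the query bodies are plain dicts in Elasticsearch query DSL form, and the dump loop runs against an in-memory list standing in for the index. The field name `indexed_at` and the helpers `undumped_query`, `since_query`, and `dump_cycle` are assumptions made up for this example.

```python
from datetime import datetime, timezone


def undumped_query():
    """Option two: match documents whose 'dumped' flag is still false."""
    return {"query": {"term": {"dumped": False}}}


def since_query(last_dump, field="indexed_at"):
    """Option one: match documents indexed after the last dump.

    'indexed_at' is a hypothetical insertion-time field, not the data's
    own timestamp.
    """
    return {"query": {"range": {field: {"gt": last_dump.isoformat()}}}}


def dump_cycle(docs, write):
    """One dump pass over an in-memory stand-in for the index.

    'write' must raise on failure; flags are only flipped after it
    returns, so a failed dump is retried on the next pass.
    """
    pending = [d for d in docs if not d["dumped"]]
    write(pending)
    for d in pending:
        d["dumped"] = True
    return len(pending)


if __name__ == "__main__":
    docs = [{"id": 1, "dumped": False}, {"id": 2, "dumped": True}]
    dumped = []
    print(dump_cycle(docs, dumped.extend))
    print(since_query(datetime(2013, 12, 7, 8, 8, tzinfo=timezone.utc)))
```

With the flag, the dump process carries no state of its own, at the cost of one update per dumped document. With the range query, nothing is rewritten, but the last dump time has to be stored somewhere.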

Thoughts?

Thanks,
Josh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/900b0e99-d4a2-4e5f-9408-d86198cd63b2%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jason Wee) #2

Hi,

It reads like you want one index for writes and one index for reads, and
are trying to find the best way to copy data between them?

/Jason



(Josh Harrison) #3

The dump process would be outputting to JSON files based on particular fields within the data, so having a secondary index wouldn't do the job, unfortunately.


(Jörg Prante) #4

There are more alternatives:

"sliding window" - organize index names so that indexing is incremental, with
a new index every 15 minutes. By selecting the index name by timestamp, you do
not need to dump anything to get a subset.

"versioning" - use explicit versioning for detecting modifications.
Versions are long values, so they could also represent timestamps. With
scan/scroll, examine document versions to see if they match a criterion for
being "new". This approach does not scale well with index size.

"logging" - another method would be to add logging statements, maybe near
the translog, collect the logs and process them externally. This requires
some additional code in the ES core. Collecting logs may be tedious.
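
As a rough sketch of the first two alternatives: the code below derives a per-window index name from the insertion time, and encodes a timestamp as a long version number that can be compared against the last dump time. The prefix `events`, the name format, and all function names are assumptions for illustration only. Nothing here talks to Elasticsearch.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)


def window_start(ts):
    """Round a datetime down to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def index_name(insert_time, prefix="events"):
    """Sliding window: name of the index that receives a document."""
    return prefix + "-" + window_start(insert_time).strftime("%Y.%m.%d-%H%M")


def previous_index(now, prefix="events"):
    """Name of the most recently completed window, i.e. the one to dump."""
    return index_name(window_start(now) - WINDOW, prefix)


def version_from(ts):
    """Versioning: encode a timezone-aware timestamp as epoch milliseconds."""
    return int(ts.timestamp() * 1000)


def is_new(doc_version, last_dump):
    """True if a document's version is later than the last dump time."""
    return doc_version > version_from(last_dump)


if __name__ == "__main__":
    now = datetime(2013, 12, 7, 8, 8, 30, tzinfo=timezone.utc)
    print(index_name(now))      # index currently being written
    print(previous_index(now))  # completed window, ready to read
```

With window-named indices, the dump job only has to read the whole of the previous index, so no per-document flag or filter is needed. The version check, by contrast, has to visit every document, which is why it scales poorly with index size.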

Jörg



(Josh Harrison) #5

Thank you for the additional options, Jörg.



(system) #6