I'm looking for the optimal way to track record additions or changes.
So, say I've got data coming in. This data has timestamps, but the
timestamps don't correspond with the insertion time, and there's no
expectation for them to remain ordered.
I need to dump "new" data periodically, say every 15 minutes. There may be
new data in those 15 minutes, there may not.
Two approaches occur to me.
One: Reindex with _timestamp enabled, so every record carries its insertion
time. Grab all data indexed since the last dump.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-timestamp-field.html
Two: Set a boolean field like "dumped" that is false on insertion and on all
current records. My dump app would just search everything for dumped:false,
dump it, and on success, set dumped:true.
Is there another approach that's better? Is either option 1 or option 2
above preferable? The boolean option seems appealing since the dump process
doesn't need to keep track of the last time a dump occurred. Maybe I'll do
both: for cases where "Oh hey, data for time period XYZ->ABC didn't dump
properly!" it would be nice to have explicit timestamps of when the data
arrived, without needing to rely on them for the dump process.
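
A minimal sketch of the search bodies the two options imply, written as plain Python dicts so nothing here depends on a particular client. It assumes `_timestamp` is enabled in the mapping and that the flag field is called `dumped`; the function names are made up for illustration.

```python
import json


def since_last_dump_query(last_dump_ms):
    """Option one: match documents indexed after the previous dump.

    Assumes _timestamp is enabled, so each document carries its
    insertion time as epoch milliseconds.
    """
    return {"query": {"range": {"_timestamp": {"gt": last_dump_ms}}}}


def not_yet_dumped_query():
    """Option two: match documents whose (assumed) 'dumped' flag is false."""
    return {"query": {"term": {"dumped": False}}}


def mark_dumped_update():
    """Partial-document body for an update request, sent after a successful dump."""
    return {"doc": {"dumped": True}}


if __name__ == "__main__":
    # The millisecond value is only a placeholder for "time of last dump".
    print(json.dumps(since_last_dump_query(1386403200000)))
    print(json.dumps(not_yet_dumped_query()))
    print(json.dumps(mark_dumped_update()))
```

With option one the dump process has to persist `last_dump_ms` somewhere between runs; with option two it does not, but every dumped document costs one extra update.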
It reads like you want one index for writing and one index for reading? And
in between, to link them, you want to find the best way to copy the index?
/Jason
The dump process would be outputting to JSON files based on particular fields within the data, so having a secondary index wouldn't do the job, unfortunately.
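
For illustration, the "JSON files based on particular fields" step might look roughly like this. It is only a sketch: the field name passed in, the `unknown` bucket and the one-file-per-value layout are assumptions, and it takes for granted that the field values are safe to use as file names.

```python
import json
import os
from collections import defaultdict


def group_by_field(records, field):
    """Bucket records by the value of one field; records lacking it go under 'unknown'."""
    groups = defaultdict(list)
    for record in records:
        groups[str(record.get(field, "unknown"))].append(record)
    return dict(groups)


def dump_groups(groups, out_dir):
    """Write one JSON file per group and return the paths written, sorted by key."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for key, items in sorted(groups.items()):
        path = os.path.join(out_dir, key + ".json")
        with open(path, "w") as handle:
            json.dump(items, handle)
        paths.append(path)
    return paths
```

The records would be whatever the periodic search returns; only once `dump_groups` has finished without error would the dump process advance its bookkeeping (the last-dump time or the dumped flag).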
"sliding window" - organize index names to create incremental indexing each
15 minutes. By selecting the index name by timestamp, you do not need to
dump anything to get a subset.
"versioning" - use explicit versioning for detecting modifications.
Versions are long values, so they could also represent timestamps. With
scan/scroll, examine document versions to see whether they match a criterion for being
"new". This approach does not scale well with index size.
"logging" - another method would be to add logging statements, maybe near
the translog, collect the logs and process them externally. This requires
some additional code in the ES core. Collecting logs may be tedious.
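
A rough sketch of the first two ideas, assuming the index is chosen from the insertion time (wall clock at write time) rather than from the record's own timestamp. The `data-` prefix and the name format are hypothetical; `version` and `version_type=external` are the index request parameters used for explicit versioning.

```python
from datetime import datetime, timedelta, timezone

WINDOW_MINUTES = 15  # dump interval from the original question


def window_start(ts):
    """Floor a datetime to the start of its 15-minute window."""
    floored_minute = ts.minute - ts.minute % WINDOW_MINUTES
    return ts.replace(minute=floored_minute, second=0, microsecond=0)


def index_name_for(ts, prefix="data"):
    """Index name for the window containing ts (naming scheme is an assumption)."""
    return prefix + "-" + window_start(ts).strftime("%Y.%m.%d-%H%M")


def previous_window_index(now, prefix="data"):
    """Index holding the most recently completed window, i.e. the one to read."""
    previous = window_start(now) - timedelta(minutes=WINDOW_MINUTES)
    return index_name_for(previous, prefix)


def external_version_params(ts):
    """Request parameters that store epoch milliseconds as an external version."""
    return {"version": int(ts.timestamp() * 1000), "version_type": "external"}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("write to:", index_name_for(now))
    print("read from:", previous_window_index(now))
```

With names like these, each run only has to read one whole index, so no per-document bookkeeping is needed; the cost is a large number of small indices, which would need to be merged or deleted over time.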