I'm working on an application that indexes the results of API calls from
a few different services. Right now I'm not indexing much data (fewer
than 1000 documents), so when I want to include new information in the
index, I just throw out the whole thing and rebuild it from scratch. I want
to scale up the application, and because of limitations in the APIs, it
will no longer be possible to make the calls for all the documents each
time I want to update the index.
I'm trying to figure out a strategy for incremental updating with
Elasticsearch so that I only fetch new information that is not yet indexed.
Each result has an associated timestamp, so I was going to base the
updates on that, but I'm wondering how to get the maximum timestamp of
the documents in my index.
It looks like I could add an integer field to the documents that
represents the timestamp. I could then run a query that uses the timestamp
integer as a custom score and limit the results to 1 to get the maximum.
Is that the best way to get the maximum value of an integer field in
Elasticsearch?
How do most applications handle this? Do they just use an external
datastore for these kinds of values?
I'm not sure I understood your scenario well enough, and I'm not
exactly an Elasticsearch expert, but here's some information that
might help:
You can have Elasticsearch automatically add a timestamp to your
documents recording when they were indexed. You can also put that
field in your document yourself and point _timestamp to it. More details
are in the Elasticsearch documentation on the _timestamp field.
This allows you to do all sorts of cool stuff, like:
- sort documents on that field. To get the maximum value, you can do
  a "match_all" query, then sort by timestamp and set the number of
  results to 1, as you said
- do range queries (e.g. "I want all documents inserted in the last
  hour")
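The two queries mentioned above can be sketched as request bodies. This is only a rough sketch: it assumes the documents carry a sortable numeric field called `timestamp` (a made-up name; it could just as well be whatever field _timestamp points to), the helper function names are hypothetical, and the exact query syntax may differ between Elasticsearch versions. Sending the bodies to the search endpoint is left out.

```python
import json


def max_timestamp_query(field="timestamp"):
    """Body for a search that returns only the document with the largest `field`."""
    return {
        "query": {"match_all": {}},
        # Newest first, so the single hit returned carries the maximum value.
        "sort": [{field: {"order": "desc"}}],
        "size": 1,
    }


def newer_than_query(since, field="timestamp"):
    """Body for a search matching documents whose `field` is greater than `since`."""
    return {"query": {"range": {field: {"gt": since}}}}


if __name__ == "__main__":
    # The numeric value below is an arbitrary example timestamp.
    print(json.dumps(max_timestamp_query(), indent=2))
    print(json.dumps(newer_than_query(1331196540), indent=2))
```

The first body answers the "maximum timestamp" question with a plain sort, without needing a custom score. The second is the shape a "last hour" style range query would take once the lower bound has been computed.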
I don't understand the possible problem either. Why not just index the new docs?
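For what it's worth, "just index the new docs" could look roughly like the sketch below. It assumes each API result is a dict with a numeric `timestamp` key and that the largest timestamp already in the index has been looked up beforehand; the function name and the sample data are hypothetical, and the actual indexing call is left out.

```python
def select_new_results(results, last_indexed, key="timestamp"):
    """Return the results whose timestamp is strictly greater than `last_indexed`.

    `last_indexed` is the maximum timestamp already in the index,
    or None when the index is still empty (then everything is new).
    """
    if last_indexed is None:
        return list(results)
    return [r for r in results if r[key] > last_indexed]


if __name__ == "__main__":
    # Made-up API results for illustration only.
    fetched = [
        {"id": "a", "timestamp": 100},
        {"id": "b", "timestamp": 200},
        {"id": "c", "timestamp": 300},
    ]
    to_index = select_new_results(fetched, last_indexed=200)
    print([r["id"] for r in to_index])
```

One caveat with a strict greater-than comparison: a result that shares the current maximum timestamp but arrived later would be skipped. Comparing with greater-or-equal and relying on stable document IDs to overwrite duplicates is one way around that.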
On Thursday, March 8, 2012 at 8:49 AM, Radu Gheorghe wrote:
Hi Aubrey,