Incremental Indexing Strategy


(Aubrey "Trey" Rhodes) #1

Hi All,

I'm working on an application that will index the results of API calls from
a few different services. Right now I'm not indexing that much data (fewer
than 1,000 documents), so when I want to include new information in the
index, I just throw out the whole thing and rebuild it from scratch. I want
to begin scaling up the application, and because of limitations with the
APIs, it will no longer be possible to make the calls for all the
documents each time I want to update the index.

I'm trying to figure out a strategy for incremental updating using
Elasticsearch so that I only grab new information that is not yet indexed.
Each result has an associated timestamp, so I was going to use that as the
basis for updates, but I'm wondering how to get the maximum timestamp for
the documents in my index.

It looks like I could add an integer field to the documents that
represents the timestamp. I could then run a query that uses the timestamp
integer as the custom score and limit the results to 1 to get the maximum.
Is that the best way to get the maximum value of an integer field with
Elasticsearch?
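
One way to picture the "maximum timestamp" lookup described above is a search that sorts on the timestamp field in descending order and returns a single hit, instead of using a custom score. The following is only a minimal sketch under assumptions: the field name `timestamp` and the helper names are hypothetical, the functions only build the request body and read a response dictionary, and sending the request to a cluster is left out.

```python
def max_timestamp_query(field="timestamp"):
    """Build a search body that returns the one document with the
    highest value of `field` (sort descending, keep a single hit).
    The field name is an assumption, not taken from the thread."""
    return {
        "size": 1,
        "sort": [{field: {"order": "desc"}}],
    }


def extract_max_timestamp(response, field="timestamp"):
    """Read the timestamp of the top hit from a search response dict.
    Returns None when the index has no documents yet."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0]["_source"][field]


def only_new(results, last_seen, field="timestamp"):
    """Keep only the API results that are newer than `last_seen`.
    With no previous timestamp, every result counts as new."""
    if last_seen is None:
        return list(results)
    return [r for r in results if r[field] > last_seen]
```

The value returned by `extract_max_timestamp` could then be passed to `only_new` (or to the upstream API, if it accepts a "since" parameter) so that only unseen results are indexed.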

How do most applications handle this? Do they just make use of an external
datastore for these types of values?

Thanks!
Aubrey Rhodes


(Radu Gheorghe) #2

Hi Aubrey,

I'm not sure if I understood your scenario well enough, and I'm not
exactly an Elasticsearch expert, but here's some information that
might help:

On Mar 8, 4:34 am, "Aubrey "Trey" Rhodes"
aubrey.c.rho...@gmail.com wrote:



(Shay Banon) #3

I don't understand the problem either. Why not just index the new docs?
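
As a rough illustration of "just index the new docs": if each API result carries a stable identifier, it can be reused as the document ID, so sending the same result twice overwrites the existing document instead of creating a duplicate. The sketch below only builds a newline-delimited bulk request body; the index name `api_results`, the `id` field, and the function name are hypothetical, and older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json


def bulk_index_body(docs, index_name="api_results", id_field="id"):
    """Build a newline-delimited bulk body: one action line followed by
    one source line per document. Reusing the upstream ID as `_id`
    makes re-indexing the same document an overwrite."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

With this approach the application would not need to know the maximum timestamp at all, as long as it can ask each service only for recent results.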

On Thursday, March 8, 2012 at 8:49 AM, Radu Gheorghe wrote:

