Incremental Indexing Strategy

Aubrey_Trey_Rhodes · March 8, 2012, 2:34am

Hi All,

I'm working on an application that will index the results of API calls from
a few different services. Right now I'm not indexing much data (fewer
than 1000 documents), so when I want to include new information in the
index, I just throw out the whole thing and rebuild it from scratch. I want
to begin scaling up the application, and because of limitations of the
APIs, it will no longer be possible to make the calls for all the
documents each time I want to update the index.

I'm trying to figure out a strategy for incremental updating with
Elasticsearch so that I only grab new information that is not yet indexed.
Each of the results has an associated timestamp, so I was going to base the
updates on that, but I'm wondering how to get the maximum timestamp of
the documents in my index.

It looks like I could add an integer field to the documents that
represents the timestamp. I could then run a query that uses the timestamp
integer as a custom score and limit the results to 1 to get the maximum.
Is that the best way to get the maximum value of an integer field with
Elasticsearch?

How do most applications handle this? Do they just make use of an external
datastore for these types of values?

Thanks!
Aubrey Rhodes

Radu_Gheorghe1 · March 8, 2012, 6:49am

Hi Aubrey,

I'm not sure if I understood your scenario well enough, and I'm not
exactly an Elasticsearch expert, but here's some information that
might help:

You can have Elasticsearch automatically add a timestamp to your
documents recording when they were indexed. You can also put that
field in your document yourself and point _timestamp to it. More
details are in the Elasticsearch documentation.
This lets you do all sorts of useful things, like:
- sort documents on that field. To get the maximum value, you can run
  a "match_all" query, sort by timestamp in descending order, and set
  the number of results to 1, as you said
- do range queries (e.g. "all documents inserted in the last hour")
- put a TTL on documents (e.g. only keep documents from the last two
  days). TTL is also covered in the Elasticsearch documentation.
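To make the first two bullets concrete, here is a rough sketch of the request bodies involved, written as plain Python dicts that you would send to the _search endpoint with whatever client you use. The field name "timestamp" and the helper names are my own assumptions, and I'm assuming the timestamp is stored as a regular field in each document, so adjust to your mapping.

```python
# Sketch only: builds search request bodies, does not talk to a cluster.
# "timestamp" is an assumed field name, not something from the original post.
TIMESTAMP_FIELD = "timestamp"


def latest_timestamp_query(field=TIMESTAMP_FIELD):
    """Body for: match everything, newest first, return a single hit."""
    return {
        "query": {"match_all": {}},
        "sort": [{field: {"order": "desc"}}],
        "size": 1,
    }


def newer_than_query(since, field=TIMESTAMP_FIELD):
    """Body for a range query: documents with a timestamp greater than `since`."""
    return {"query": {"range": {field: {"gt": since}}}}


def extract_latest(response, field=TIMESTAMP_FIELD):
    """Pull the timestamp out of a search response, or None if the index is empty."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0]["_source"][field]
```

The value returned by extract_latest is the "maximum timestamp" you asked about, and it can be fed to your API calls as the starting point for the next fetch.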
Also, provided that you can get some sort of version number from your
API calls, you could use versioning in Elasticsearch (see the
Elasticsearch documentation on versioning).
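As an illustration of the versioning idea: with external versioning (version_type=external on the index request), a write is only accepted when the version you supply is higher than the one already stored. The snippet below is just an in-memory simulation of that rule, not Elasticsearch client code, and the class and method names are made up for the example.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is not newer than the stored one."""


class VersionedStore:
    """In-memory stand-in that mimics the external-versioning acceptance rule."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def index(self, doc_id, doc, version):
        current = self._docs.get(doc_id)
        # Accept the write only if the incoming version is strictly greater.
        if current is not None and version <= current[0]:
            raise VersionConflict(
                "doc %r: version %r is not newer than %r" % (doc_id, version, current[0])
            )
        self._docs[doc_id] = (version, doc)

    def get(self, doc_id):
        return self._docs[doc_id][1]
```

If your API results carry a timestamp, that timestamp could serve as the version, so re-sending an old result would be rejected instead of overwriting a newer one.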

kimchy · March 9, 2012, 6:42pm

I don't quite understand the problem either. Why not just index the new docs?
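Something along these lines, as a rough sketch. The two callables are placeholders for your own code: fetch_since stands for whatever API call returns results newer than a given timestamp, and index_doc for whatever sends one document to the index. The "id" and "timestamp" keys are assumed for the example.

```python
def sync_new_documents(fetch_since, index_doc, last_seen):
    """Index only documents newer than `last_seen` and return the new high-water mark.

    fetch_since(ts) -> iterable of dicts with "id" and "timestamp" keys (assumed shape)
    index_doc(doc_id, doc) -> sends one document to the index
    """
    newest = last_seen
    for doc in fetch_since(last_seen):
        # Reusing the source id means a re-fetched document overwrites itself
        # instead of creating a duplicate.
        index_doc(doc["id"], doc)
        if newest is None or doc["timestamp"] > newest:
            newest = doc["timestamp"]
    return newest
```

The returned value can be kept anywhere convenient between runs, or recovered from the index itself by sorting on the timestamp field and taking the first hit.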
