Elasticsearch with realtime ingestion

Hi. I read an article about elasticsearch data ingestion.

Clearly, I shouldn't post new docs all the time, because each new doc will force a reindex of the whole thing, right? But what about logs and Kibana? What about "real-time" applications? Isn't it OK to index new data every second? (Using the bulk API, OK, but every second anyway.) Ingesting data every 10 seconds wouldn't really be real-time. How does Kibana (for example) manage this issue?

My intention is to save data constantly in PostgreSQL and also in Elasticsearch ASAP, in order to make new data searchable.

May I encourage you to read http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/

Well. I like sending data to both systems at the same time:

  • In Postgresql
  • In Elasticsearch

Now the question is about "ASAP". What exactly does it mean? How much are you willing to pay for real "real time" instead of the default 1 s delay, or more like 2 or 3 s if you are using a BulkProcessor strategy?

It's always a question of tradeoffs.

If you want real real-time access, then you can slow down Elasticsearch and ask it to answer an index operation only after the refresh is done. This was introduced recently with the new refresh=wait_for parameter.

Again, it depends on what you have. If you want to index 1 million documents per second, I don't believe that refresh=wait_for will help. If you have 1 or 2 docs per second, then I believe it's totally doable.

But note that Elasticsearch won't give you back a response until a refresh has happened. That can slow down your "transaction" and leave the end user with the impression that saving an object now takes more time than before.
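To make the wait-for-refresh option concrete, here is a minimal sketch that only builds the HTTP request, using the Python standard library. It assumes the option is spelled `refresh=wait_for` on the index API and that the cluster exposes the typeless `/{index}/_doc/{id}` endpoint (older versions use `/{index}/{type}/{id}`). The helper name `build_index_request`, the index name `articles`, and the localhost URL are made up for illustration.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_index_request(base_url, index, doc_id, doc, wait_for_refresh=False):
    """Build (but do not send) a PUT request that indexes one document.

    Hypothetical helper: with wait_for_refresh=True the request carries
    refresh=wait_for, so Elasticsearch only answers once a refresh has
    made the document searchable.
    """
    # Assumption: typeless endpoint; adapt the path for older clusters.
    url = f"{base_url}/{index}/_doc/{doc_id}"
    if wait_for_refresh:
        url += "?" + urlencode({"refresh": "wait_for"})
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = build_index_request(
    "http://localhost:9200", "articles", 1, {"title": "hello"},
    wait_for_refresh=True,
)
```

Sending it with `urllib.request.urlopen(req)` would block until the next refresh, which is exactly the extra latency discussed here.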

It also depends on the number of tables you have to update in PostgreSQL for a single object insertion. If that already takes some time, maybe you won't ever notice that you potentially waited 1 s for Elasticsearch?

I'd give Elasticsearch a try and see how it works on your real use case. You might be surprised that it's not behaving as badly as you seem to think. (Or I misread your post.)

Elasticsearch is not a real-time application. Elasticsearch has near-real-time search, and a realtime get API.

Real-time applications operate on hard time constraints. That is, they can execute operations in a fixed amount of time. The time constraints are usually given in microseconds or milliseconds. This is totally different from Elasticsearch.

Near-real-time search means that Elasticsearch periodically makes a fresh state of the indexed documents searchable, by default once per second. It does not guarantee that a new document will reach the index within one second. Note the important difference.

Realtime get API means you can immediately retrieve a document by its ID from the main memory cache after sending it over the API for indexing.

As a side note, the Java VM is not able to operate in realtime per se, so it is impossible for a Java application to be a realtime application. The realtime capabilities are defined by the underlying operating system.

So from your description I guess you do not mean a realtime, but a reactive application.

Thanks both for your replies.

I'm aware of the 1sec delay. That's perfectly fine. "near-real-time search" is fine.

My concern is how often I can send bulk data to be indexed. You, @dadoonet, set around 5 secs in your article. What would you say about 2 secs with 10000 items each time?

Btw, have you tried the queue manager approach (like RabbitMQ)? I'm aware that it may lead to data inconsistency, which is why you save to Elasticsearch via code after the SQL transaction has finished. But I'd like to hear about experiences with queues anyway.

The BulkProcessor works like this:

  • Every 10 000 docs, it is flushed to ES, even if it's every second
  • Or every 5 seconds even if the bulk is not full with 10k docs

You can change it to flush every 2 seconds. Totally doable.

Just try it under a typical load to see how it performs on your typical production system.
