I am wondering whether the index API waits to return a response until the whole indexing process is complete.
More specifically: when I send a document to the API for indexing, does ES only check whether the request, JSON, etc. is valid, then send back the response and continue indexing without making the client wait?
The context of my question is using ES to log content clicks in a web app. My idea is to simply call an indexing service on specific routes. But I am worried that, as the indices grow over time, the response time on those routes will go up because the indexing process blocks the response.
I was thinking that I could implement some sort of pipeline that the clicked content could be sent to for logging, but I am not sure whether that is overcomplicated and unnecessary.
It blocks until we've fsynced the translog on all shard copies. The document might not be visible for search yet.
Generally folks write to disk and use a second process to stream them into ES. That helps batch the fsyncs at the cost of some delay. Filebeat is what I used to use for this sort of thing. I'm likely out of date on that one now.
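As a rough illustration of the write-to-disk half of that approach, here is a minimal Python sketch that appends each click as one JSON line to a local file, which a shipper such as Filebeat or a periodic job could pick up later. The function name, file path and field names are made up for the example and are not from any particular setup.

```python
import json
import time


def log_click(path, content_id, user_id):
    """Append one click event as a single JSON line (NDJSON) to a local file.

    The field names are hypothetical; use whatever the reports need.
    """
    event = {
        "content_id": content_id,
        "user_id": user_id,
        "timestamp": time.time(),  # seconds since the epoch
    }
    # Append mode keeps the request handler cheap: no network call to ES here.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

The web route would only call something like log_click("/var/log/app/clicks.ndjson", "article-1", "user-7"), so its response time no longer depends on how long ES takes to index.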
So far we have never used more than one shard per index and only one node (it is just a medium-scale monolithic web app).
Does this still apply in such a simple use case?
I like the idea of simply writing everything to some file and then just cleaning up using a cron job.
The visibility would not be a problem since the data is only accessed by backend users for reports now and then.
But for me it's hard to judge whether this is trying to over-optimize. Currently we only have about 50,000 log entries coming in each day. Most of the topics I read here talk about GBs or even TBs of data coming in each day.
Edit
@nik9000 I've just done some reading in the translog docs to better understand your answer.
From what I could understand, setting index.translog.durability to async would achieve what I am looking for, at the cost of losing some log entries in case of a crash. Did I understand this correctly?
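For reference, this is roughly what changing that setting could look like. It is only a sketch in Python that builds (and does not send) a request against the update index settings API; the base URL and the index name "clicks" are placeholders, and the helper name is made up.

```python
import json
import urllib.request


def build_async_translog_request(base_url, index):
    """Build (but do not send) a PUT <index>/_settings request.

    base_url and index are placeholders for this example.
    """
    body = {
        "index": {
            "translog": {
                # "request" is the default: fsync before acknowledging.
                # "async" fsyncs in the background on a timer instead, so
                # writes since the last sync can be lost in a crash.
                "durability": "async",
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Sending it would be: urllib.request.urlopen(req), against a running cluster.
req = build_async_translog_request("http://localhost:9200", "clicks")
```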
The current logging solution is horribly slow when querying data. I would also like to take advantage of ES's awesome aggregations and learn more about them.
I am trying to understand whether I might run into performance issues like the ones described. But I guess I might be over-optimizing.
Do you have any feedback on setting index.translog.durability to async in my use case?
To optimize indexing speed, have a look at the guidelines available in the docs. The most important one is probably to index using bulk requests if you are not already doing that.
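To make the bulk suggestion concrete, here is a small Python sketch that builds the newline-delimited body a bulk request expects: one action line followed by one document line per entry, with a trailing newline. The index name and document fields are invented for the example, and the action line without a type assumes a reasonably recent ES version.

```python
import json


def build_bulk_body(index, docs):
    """Build the NDJSON body for a POST /_bulk request.

    Each document becomes two lines: an "index" action, then its source.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


clicks = [{"content_id": "a"}, {"content_id": "b"}]
body = build_bulk_body("clicks", clicks)
```

The resulting string would be sent with the Content-Type header set to application/x-ndjson, so 50,000 clicks a day could go in as a handful of requests rather than one request per click.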