We are experiencing a problem with ES 5.4.2 in production. About 25% of our indexed documents only appear in search results 3 seconds or more after the indexing request returns a success response. Our cluster consists of three nodes with 59 GB of RAM, of which 31 GB is dedicated to the ES heap. Heap usage never goes above 75%, with GC dropping it back to ~25-30%. The indexing rate is relatively low, less than 10 documents per second. Our average refresh time is ~300 ms, and the refresh interval is the default 1 s.
With these numbers I would expect at most a ~1.5 s delay between indexing and search availability. What am I missing here? Is it possible to reduce this search availability delay?
I measure the delay from the moment the index operation returns until the document becomes visible in search. Here is the script that I use: https://gist.github.com/rmihael/4f37bce239c9265ec69f2dc695ffd405. The measurement is performed in the measure_single_document function.
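In short, the measurement logic is roughly this (a simplified sketch with the elasticsearch-py client, not the exact gist code; index and field names are illustrative):

```python
# Sketch of the measurement idea: index one document, then poll a search
# query until it becomes visible and report the elapsed time.
import time
import uuid

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])


def measure_single_document(index="benchmark", doc_type="doc"):
    doc_id = str(uuid.uuid4())
    es.index(index=index, doc_type=doc_type, id=doc_id, body={"marker": doc_id})
    indexed_at = time.time()

    # Poll until the document shows up in search results.
    while True:
        result = es.search(index=index, body={
            "query": {"ids": {"values": [doc_id]}}
        })
        if result["hits"]["total"] > 0:
            return time.time() - indexed_at
        time.sleep(0.05)


if __name__ == "__main__":
    delays = [measure_single_document() for _ in range(100)]
    print("max delay: %.2fs, avg delay: %.2fs"
          % (max(delays), sum(delays) / len(delays)))
```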
Typical output from the script shows exactly these multi-second delays for a fraction of the documents.
In our application, documents are indexed one by one. We could of course implement some form of buffering and batching, but that would add another delay to the whole flow. Also, our indexing rate is rather low, about 10 documents per second. Is batching worth pursuing at that rate?
One curious thing I noticed is that the search thread pool starts to spike while the benchmark is running, and I can also see some rejections in the search pool. Could that indicate the problem behind these delays? The indexing, merging, and flushing thread pools all hover near zero while the benchmark runs.
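The search pool rejections can be watched alongside the benchmark with something like the following (a sketch using the nodes stats API; field names follow the 5.x response format):

```python
# Sketch: poll search thread-pool stats on every node while the benchmark runs,
# to see whether queue depth and rejections grow over time.
import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

for _ in range(30):
    stats = es.nodes.stats(metric="thread_pool")
    for node_id, node in stats["nodes"].items():
        search = node["thread_pool"]["search"]
        print("%s active=%d queue=%d rejected=%d" % (
            node["name"], search["active"], search["queue"], search["rejected"]))
    time.sleep(1)
```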
Maybe I'm misreading your Python script, but you are not pausing between individual index operations, am I right?
about 10 documents per second
In that case, maybe you should pause for 100 ms between every single index operation so you get more accurate results?
Is batching worth pursuing with that rate?
Well, in Java we have this super nice BulkProcessor class which automatically sends a bulk request every x requests or every x time units.
So you can configure the bulk processor to flush every 1000 docs or every second and then just add index requests to the bulk processor, which will do the job.
In that case, if you index 1 doc per second, you will have to wait at most 2 seconds for it to become searchable.
If you have a peak of 1000 docs, it will be able to deal with it.
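The 5.x Python client has no direct BulkProcessor equivalent, but the same flush-every-N-docs-or-every-interval idea can be sketched on top of helpers.bulk (class and parameter names below are made up for illustration):

```python
# Rough Python analogue of the flush-by-count-or-interval idea behind the Java
# BulkProcessor: buffer index requests and send them as one bulk call every
# 1000 docs or every second, whichever comes first.
import time
import threading

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["localhost:9200"])


class SimpleBulkBuffer(object):
    def __init__(self, max_actions=1000, flush_interval=1.0):
        self.max_actions = max_actions
        self.flush_interval = flush_interval
        self.actions = []
        self.lock = threading.Lock()
        self.last_flush = time.time()

    def add(self, index, doc_type, doc):
        with self.lock:
            self.actions.append({
                "_op_type": "index",
                "_index": index,
                "_type": doc_type,
                "_source": doc,
            })
            if len(self.actions) >= self.max_actions:
                self._flush()

    def maybe_flush(self):
        # Call periodically (e.g. from a timer thread) to enforce the interval.
        with self.lock:
            if self.actions and time.time() - self.last_flush >= self.flush_interval:
                self._flush()

    def _flush(self):
        helpers.bulk(es, self.actions)
        self.actions = []
        self.last_flush = time.time()
```

A timer thread (or the indexing loop itself) would call maybe_flush() periodically, so even a lone document becomes searchable within roughly one flush interval plus one refresh.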
Maybe I'm misreading your Python script, but you are not pausing between individual index operations, am I right?
My apologies, I should have made myself clearer. Our production indexing rate is ~10 docs per second. We are seeing documents not appearing in search results fast enough, so I put together this benchmark to measure the delay. I will modify the benchmark to index 10 documents per second.
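Something like this (a sketch reusing the illustrative measure_single_document from above):

```python
# Sketch of the modified benchmark loop: pause ~100 ms between single index
# operations so the script indexes at roughly the production rate of 10 docs/s.
import time

delays = []
for _ in range(100):
    delays.append(measure_single_document())
    time.sleep(0.1)  # ~10 index operations per second

print("worst observed delay: %.2fs" % max(delays))
```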
So you can configure the bulk processor to flush every 1000 docs or every second and then just add index requests to the bulk processor, which will do the job.
In that case, if you index 1 doc per second, you will have to wait at most 2 seconds for it to become searchable.
If you have a peak of 1000 docs, it will be able to deal with it.
That would require routing all indexing operations through one, or maybe two, processors. We can definitely do that, but first I really want to understand the root of the problem and see how we can measure it better than just end-to-end latency. You mentioned that we are probably creating a massive number of small segments and that this can affect the latency we are experiencing. Here is what I get by cat-ing segments: curl /_cat/segments?v · GitHub. Many segments are not searchable. Indeed, something seems to be wrong here. Could a more aggressive merge policy help? We definitely have some free IOPS that I will happily trade for lower latency.
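The same check can be scripted from the Python client; a sketch that counts searchable vs. non-searchable segments (index name is illustrative):

```python
# Sketch: the cat segments output includes a "searchable" column, so counting
# searchable vs. non-searchable segments over time shows how far refresh lags.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

lines = es.cat.segments(index="benchmark",
                        h="segment,searchable,docs.count,size").splitlines()
searchable = sum(1 for line in lines if line.split()[1] == "true")
print("%d of %d segments searchable" % (searchable, len(lines)))
```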
That's most likely the cause of the problem.
Elasticsearch has to do some internal work (but @jimczi can tell you more about this), and that explains why it takes more and more time to make all those tiny segments searchable.
I'm pretty sure you wouldn't see this with a flat model, with no parent/child or nested documents.
Is it possible to tune some knobs to make these segments searchable sooner? I would happily trade some CPU or IOPS to bring the higher percentiles of this latency down to 2-3 seconds.
If you use parent/child, the refresh needs to rebuild a data structure that is used internally for search. This data structure, called global ordinals, needs to scan all documents in the index and is expected to take some time to build.
When using a feature like parent/child, you should not use the default refresh interval of 1 s. There is always a trade-off between near real time and complex search features, so you should lower your refresh rate based on the average refresh time you see. Setting the refresh interval to something like 30 s should also make searches faster. Otherwise, you can review your requirements and avoid the parent/child feature (if near real time is more important for your use case).
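For reference, the refresh interval is a dynamic index setting; a sketch of applying it with the Python client (index name is illustrative):

```python
# Sketch: raising the refresh interval on the affected index, as suggested above.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

es.indices.put_settings(
    index="benchmark",
    body={"index": {"refresh_interval": "30s"}},
)
```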
There is a way for us to move from parent/child to a flatter document structure. But what about nested documents? I cannot think of any trick to represent them in a flat structure. Is their impact on refresh time negligible compared to parent/child?
Before starting work to replace parent/child with a flat structure, I need to understand how nested documents affect search availability latency. I cannot find any relevant information on the internet. @jimczi, can you help me with this?
Nested fields require some warming when the searcher is refreshed, but this warming is done only on the newly added/updated documents and those that were merged after the last refresh. This should have an impact on the refresh time, but I'd expect it to be much smaller than for parent/child, except after a big merge.
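To make the comparison concrete, a sketch of what the two models look like in a 5.x mapping (index, type, and field names are made up):

```python
# Illustrative 5.x mappings for the two models being compared. Parent/child
# uses the _parent meta-field on the child type; nested documents are just a
# field with "type": "nested".
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Parent/child: refresh has to rebuild global ordinals for the join.
es.indices.create(index="pc_example", body={
    "mappings": {
        "question": {"properties": {"title": {"type": "text"}}},
        "answer": {
            "_parent": {"type": "question"},
            "properties": {"body": {"type": "text"}},
        },
    }
})

# Nested: only newly added/updated (or freshly merged) documents need warming.
es.indices.create(index="nested_example", body={
    "mappings": {
        "question": {
            "properties": {
                "title": {"type": "text"},
                "answers": {
                    "type": "nested",
                    "properties": {"body": {"type": "text"}},
                },
            }
        }
    }
})
```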