Long delay between indexing a document and its availability in search results


(Michael Korbakov) #1

Hi all!

We are experiencing a problem with ES 5.4.2 on production. About 25% of our indexed documents are only appearing in search results after 3 seconds or more after success response from indexing request. Our cluster consists of three nodes with 59GBs of RAM of which 31GBs are dedicated to ES heap. Heap usage never goes over 75% with GC dropping it down to ~25-30%. Indexing rate is relatively low, less than 10 documents per second. Our average refresh time is ~300 ms. Refresh interval is default 1 s.
With these numbers, I am expecting at most ~1.5 s delay between indexation and search availability. What am I missing here? Is it possible to reduce this search availability delay?

Thank you in advance,

-- Michael


What would cause refresh=wait_for to regularly take 2-5 seconds?
(Michael Korbakov) #2

Please, any suggestions here? I'm out of ideas now. Are there any metrics to check that could be relevant to this situation?


(David Pilato) #3

When you get the response from elasticsearch after the index operation, it should not be that much.

When you wrote:

documents are only appearing in search results after 3 seconds

Do you mean after 3 seconds after you send the request or 3 seconds after you got the response?


(Michael Korbakov) #4

Hi, David!

I measure a delay after a response from index operation and until a document is visible. Here is the script that I use: https://gist.github.com/rmihael/4f37bce239c9265ec69f2dc695ffd405. Measurement is performed in measure_single_document function.
Typical output looks like this:

   Value     Percentile TotalCount 1/(1-Percentile)

   215.0 0.000000000000          1           1.00
   831.0 0.100000000000       1044           1.11
  1279.0 0.190000000000       2027           1.23
  1663.0 0.271000000000       2807           1.37
  1983.0 0.343900000000       3486           1.52
  2431.0 0.409510000000       4343           1.69
  2687.0 0.468559000000       4792           1.88
  2943.0 0.521703100000       5238           2.09
  3327.0 0.569532790000       5876           2.32
  3583.0 0.612579511000       6304           2.58
  3839.0 0.651321559900       6649           2.87
  4095.0 0.686189403910       7002           3.19
  4351.0 0.717570463519       7355           3.54
  4607.0 0.745813417167       7655           3.93
  4863.0 0.771232075450       7973           4.37
  4863.0 0.794108867905       7973           4.86
  5119.0 0.814697981115       8238           5.40
  5375.0 0.833228183003       8495           6.00
  5631.0 0.849905364703       8731           6.66
  5631.0 0.864914828233       8731           7.40
  5887.0 0.878423345409       8943           8.23
  5887.0 0.890581010868       8943           9.14
  6143.0 0.901522909782       9125          10.15
  6143.0 0.911370618803       9125          11.28
  6399.0 0.920233556923       9290          12.54
  6399.0 0.928210201231       9290          13.93
  6655.0 0.935389181108       9450          15.48
  6655.0 0.941850262997       9450          17.20
  6911.0 0.947665236697       9576          19.11
  6911.0 0.952898713028       9576          21.23
  7167.0 0.957608841725       9686          23.59
  7167.0 0.961847957552       9686          26.21
  7167.0 0.965663161797       9686          29.12
  7423.0 0.969096845617       9769          32.36
  7423.0 0.972187161056       9769          35.95
  7423.0 0.974968444950       9769          39.95
  7679.0 0.977471600455       9842          44.39
  7679.0 0.979724440410       9842          49.32
  7679.0 0.981751996369       9842          54.80
  7679.0 0.983576796732       9842          60.89
  7935.0 0.985219117059       9904          67.65
  7935.0 0.986697205353       9904          75.17
  7935.0 0.988027484817       9904          83.52
  7935.0 0.989224736336       9904          92.81
  7935.0 0.990302262702       9904         103.12
  8191.0 0.991272036432       9950         114.57
  8191.0 0.992144832789       9950         127.30
  8191.0 0.992930349510       9950         141.45
  8191.0 0.993637314559       9950         157.17
  8191.0 0.994273583103       9950         174.63
  8191.0 0.994846224793       9950         194.03
  8703.0 0.995361602313       9998         215.59
  8703.0 0.995825442082       9998         239.55
  8703.0 0.996242897874       9998         266.16
  8703.0 0.996618608086       9998         295.74
  8703.0 0.996956747278       9998         328.60
  8703.0 0.997261072550       9998         365.11
  8703.0 0.997534965295       9998         405.67
  8703.0 0.997781468766       9998         450.75
  8703.0 0.998003321889       9998         500.83
  8703.0 0.998202989700       9998         556.48
  8703.0 0.998382690730       9998         618.31
  8703.0 0.998544421657       9998         687.01
  8703.0 0.998689979491       9998         763.35
  8703.0 0.998820981542       9998         848.16
  8703.0 0.998938883388       9998         942.40
  8703.0 0.999044995049       9998        1047.11
  8703.0 0.999140495544       9998        1163.46
  8703.0 0.999226445990       9998        1292.73
  8703.0 0.999303801391       9998        1436.37
  8703.0 0.999373421252       9998        1595.97
  8703.0 0.999436079127       9998        1773.30
  8703.0 0.999492471214       9998        1970.33
  8703.0 0.999543224093       9998        2189.26
  8703.0 0.999588901683       9998        2432.51
  8703.0 0.999630011515       9998        2702.79
  8703.0 0.999667010363       9998        3003.10
  8703.0 0.999700309327       9998        3336.77
  8703.0 0.999730278394       9998        3707.53
  8703.0 0.999757250555       9998        4119.47
  8703.0 0.999781525499       9998        4577.19
  9215.0 0.999803372950      10000        5085.77
  9215.0 1.000000000000      10000
#[Mean    =       3129.2, StdDeviation   =       1932.5]
#[Max     =       9215.0, TotalCount     =      10000.0]
#[Buckets =           18, SubBuckets     =           32]

(David Pilato) #5

Instead of indexing documents one by one, what if you use bulk API instead?
And index all documents at once?

Here I'm pretty sure you are producing a lot of segments which also need to be merged.

Can you try with bulks of 100, 1000, 10000 docs?


(Michael Korbakov) #6

In our application documents get indexed one by one. Of course, we can implement some form of buffering and batching, but that will add another delay to the whole flow. Also, our application indexing rate is rather low, about 10 documents per second. Is batching worth pursuing with that rate?
One curious thing that I noticed is that search thread pool starts to spike while running the benchmark. I also can see some rejections in search pool. Can it indicate some problem that is causing my problems? Indexing, merging and flushing thread pools are all hovering near zero while running the benchmark.


(David Pilato) #7

May be I'm wrongly reading your python script but you are not doing any pause between single index operations, am I right?

about 10 documents per second

In that case may be you need to pause between every single index operation by 100ms so you will get more accurate results?

Is batching worth pursuing with that rate?

Well. In Java we have this super nice BulkProcessor class which automatically send a bulk request every x requests or every x timeunit.

So you can define the bulk processor to flush every 1000 docs or every second and just then add index requests to the bulk processor which will do the job.

In such a case, if you have 1 doc per second, at most you will have to wait for 2 seconds to have it searchable.
If you have a peak of 1000 docs, it will be able to deal with it.


(Michael Korbakov) #8

May be I'm wrongly reading your python script but you are not doing any pause between single index operations, am I right?

My apologies, I should have made myself clear. Our production indexing rate is ~10 docs per second. We are seeing problems with documents not appearing in search results fast enough, and so I decided to make this benchmark to measure our delay. I will modify the benchmark to index 10 documents per second.

So you can define the bulk processor to flush every 1000 docs or every second and just then add index requests to the bulk processor which will do the job.
In such a case, if you have 1 doc per second, at most you will have to wait for 2 seconds to have it searchable.
If you have a peak of 1000 docs, it will be able to deal with it.

That will require routing all indexing operations through a single or maybe two processors. We definitely can do it, but I really want to understand the root of the problem before and see how can we measure it in a better way than just end-to-end latency. You mentioned that we are probably creating a massive number of small segments and that it can affect this latency that we are experiencing. That is what I get by cat-ing segments: https://gist.github.com/rmihael/bd12885808ebb51e59a23a2106d3aaf6. Many segments are not searchable. Indeed, something seems to be wrong here. Could it be helped with more aggressive merging strategy? We definitely have some free IOPS resources what I will happily trade for a lower latency.


(David Pilato) #9

Are you using nested fields or parent/child feature by any chance?


(Michael Korbakov) #10

We use both nested fields and parent/child intensively.


(David Pilato) #11

That's most likely the cause of the problem.
Elasticsearch has to do internal stuff (but @jimczi can tell more about this) and it explains it takes more and more time to make all those tiny segments searchable.

I'm pretty sure you won't see that with a flat model with no p/c and nested.


(Michael Korbakov) #12

Is it possible to tune something knobs to make these segments searchable in a shorter time? I would happily trade some CPU or IOPS to drop higher percentiles of this latency down to 2-3 seconds.


(Jimferenczi) #13

If you use parent/child the refresh needs to rebuild a data structure that is used internally for search. This data structure named global ordinals needs to scan all documents in the index and is expected to take some time to be created.
When using a feature like parent/child you should not use the default refresh time of 1s, there's always a trade off between near real time and complex search feature so you should lower your refresh rate based on the average refresh time you have. Setting the refresh to something like 30s should also make the search faster, otherwise you can also try to review your requirements and not use the parent/child feature (if near real time is more important for your use case).


(Michael Korbakov) #14

Thank you, Jimferenczi!

There is a way how we can move from parent-child to a flatter document structure. But what about nested documents? I cannot think of any trick how to represent them in a flat structure. Is their impact on refresh time neglectable compared to parent-child?


(Michael Korbakov) #15

Before starting work to replace parent-child with a flat structure, I need to understand how nested documents affect search availability latency. Cannot find any relevant information on the Internet. @jimczi, can you help me with it?


(Jimferenczi) #16

Nested fields require some warming when the searcher is refreshed but these warning is done only on the newly added/updated documents and those that were merged after the last refresh. This should have an impact on the refresh time but I'd expect it to be much less than for parent/child except after a big merge.


(system) #17

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.