I have a process that does bulk indexing, and a few other processes that do multiget requests. There is also an API that bulk-writes some documents to ES and must ensure that the data is searchable before it returns. Calls to the API are not very frequent, but it is quite important that they complete as soon as possible. To meet the requirement that the data is searchable once the API call returns, I tried three approaches, none of which worked well:
1. Adding refresh=true to bulk requests: the requests always time out if other indexing is in progress.
2. Refreshing using the refresh API: the refresh always times out (I even tried to refresh with curl and waited for up to an hour; the request never completed!).
3. Doing a search using the ids query and making sure the newly written data is read (in the hope that the automatic refresh helps): it takes ridiculously long to get the fresh data back (more than 30s with refresh_interval set to 1s).
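To make the three approaches concrete, here is a minimal sketch of the request shapes involved. It only builds the method, path and body for each call and does not send anything. The function names, the index name and the `doc` type in the usage line are hypothetical placeholders, not the actual names used in my setup.

```python
import json


def bulk_with_refresh(index, doc_type, docs):
    """Approach 1: bulk request that asks ES to refresh before responding."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # Bulk bodies are newline-delimited and must end with a newline.
    body = "\n".join(lines) + "\n"
    return "POST", f"/{index}/{doc_type}/_bulk?refresh=true", body


def explicit_refresh(index):
    """Approach 2: explicit call to the refresh API."""
    return "POST", f"/{index}/_refresh", None


def ids_search(index, doc_ids):
    """Approach 3: ids query used to check whether the documents are visible."""
    ids = list(doc_ids)
    body = json.dumps({
        "query": {"ids": {"values": ids}},
        "size": len(ids),
        "_source": False,  # only the ids of the hits are needed
    })
    return "POST", f"/{index}/_search", body
```

For example, `bulk_with_refresh("myindex", "doc", {"1": {"field": "value"}})` yields a two-line NDJSON body addressed to `/myindex/doc/_bulk?refresh=true`.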
The refresh_interval is set to 1s (initially it was 5s, but I switched it to 1s to test approach #3; it makes no real difference to the result). There are 1000 shards in the index spread over 170 nodes. As far as I can see, no process is CPU-bound or disk-bound.
How do I find out what's happening? What could block forced refreshes, and why is refresh_interval not honored?
It is ES 6.2.1
The hardware is quite good:
Intel Xeon Silver 4114 CPU (40 cores @ 2.2GHz)
3x 2TB Intel SSD
128GB RAM (Java heap is 30GB)
10Gbit network
There are so many shards because the amount of data stored is quite large: 6.5 billion documents occupying around 67TB of disk space.
We don't have monitoring deployed. What kind of measurement would be interesting to have?
Is there anything I could check right away?
For example, when I issue a refresh request it just blocks forever. Is there a way to check what is going on and why it is blocked?
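As a starting point, these are the checks I could run while a refresh is hanging. The sketch below only builds the URLs for a few standard diagnostic endpoints (task list filtered to refresh actions, hot threads, thread pool stats, pending cluster tasks). The base URL and the function name are assumptions for illustration. The `bulk` thread pool name is what I would expect on 6.2; I believe later versions call it `write`.

```python
from urllib.parse import urlencode


def diagnostic_requests(base_url):
    """Build GET URLs for endpoints that may show where a refresh is stuck."""
    checks = [
        # Running tasks, limited to refresh actions, with per-task detail.
        ("/_tasks", {"detailed": "true", "actions": "indices:admin/refresh*"}),
        # Stack samples of the busiest threads on every node.
        ("/_nodes/hot_threads", {"threads": "5"}),
        # Active, queued and rejected counts for the refresh and bulk pools.
        ("/_cat/thread_pool/refresh,bulk",
         {"v": "true", "h": "node_name,name,active,queue,rejected"}),
        # Cluster-level tasks waiting on the master node.
        ("/_cat/pending_tasks", {"v": "true"}),
    ]
    root = base_url.rstrip("/")
    return [
        f"{root}{path}?{urlencode(params, safe=':/*,')}"
        for path, params in checks
    ]


if __name__ == "__main__":
    # Hypothetical node address; each URL can be fetched with curl.
    for url in diagnostic_requests("http://localhost:9200"):
        print(url)
```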
It's mostly about updating the existing documents, but sometimes new ones are created.
The documents that are updated or created by the API have nested documents within them. But the number of documents with nested documents is quite low: there are only around 30M of them, fewer than about 8 such documents are indexed per second in total, and the API indexes only one document with nesting every 2-3 seconds.
Here is what a typical log exhibiting the issue looks like:
[2018-05-22T05:25:37.8540] bulk_result elapsed 0.0299 secs stats "create": 1
[2018-05-22T05:25:37.8541] waiting for 1 documents to refresh
[2018-05-22T05:25:39.2726] waiting for 1 documents to refresh
[2018-05-22T05:25:40.4803] waiting for 1 documents to refresh
[2018-05-22T05:25:41.8195] waiting for 1 documents to refresh
[2018-05-22T05:25:43.0987] waiting for 1 documents to refresh
[2018-05-22T05:25:44.4356] waiting for 1 documents to refresh
[2018-05-22T05:25:45.7103] waiting for 1 documents to refresh
[2018-05-22T05:25:46.9689] waiting for 1 documents to refresh
[2018-05-22T05:25:47.8549] wait for refresh timed out after 10 secs
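For context, a wait loop that would produce a log like the one above can be sketched as follows. This is a simplified illustration, not the exact code: `wait_until_searchable` and the `search_ids` callable are hypothetical names, and `search_ids` stands in for an ids query that returns the ids currently visible to search. The 10 second timeout matches the log; the poll interval is an assumption.

```python
import time


def wait_until_searchable(doc_ids, search_ids, timeout=10.0, poll_interval=1.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until every id in doc_ids is returned by search_ids, or time out.

    search_ids(ids) must return the subset of ids that search can already see.
    Returns True if all documents became visible, False on timeout.
    """
    pending = set(doc_ids)
    deadline = clock() + timeout
    while pending:
        # Drop the ids that the search already returns.
        pending -= set(search_ids(pending))
        if not pending:
            break
        if clock() >= deadline:
            print(f"wait for refresh timed out after {timeout:g} secs")
            return False
        print(f"waiting for {len(pending)} documents to refresh")
        sleep(poll_interval)
    return True
```

With refresh_interval at 1s I would expect this loop to finish after one or two iterations, which is not what the log shows.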