ES Write timeout

@Christian_Dahlqvist Can you also tell me why the performance starts decreasing over time? After about 10 hours I can see a lag of 30 minutes, and there are no read operations in between. Initially it sustains for a good while, then starts decreasing.

What I observe is that iowait, CPU, and memory are under control:

top - 09:28:48 up 7 days,  3:13,  4 users,  load average: 25.29, 26.22, 26.14
Tasks: 662 total,   1 running, 645 sleeping,   0 stopped,  16 zombie
%Cpu(s): 71.2 us,  3.6 sy,  0.0 ni, 24.4 id,  0.6 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 32742016 total,  1509504 free, 24507232 used,  6725280 buff/cache
KiB Swap:  8257532 total,     5148 free,  8252384 used.  7115648 avail Mem
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
3654 root      20   0 10.549g 1.703g   5604 S 545.7  5.5  12267:43 java
4540 499       20   0 18.043g 7.956g 140028 S 528.9 25.5   3044:51 java
 713 root      20   0  9.847g 761640   6432 S  86.2  2.3 392:26.51 java
29447 root      20   0 2120316  75080  11088 S  11.2  0.2   1419:29 kubelet
 5555 499       20   0 9954.5m  84136   2552 S   4.6  0.3 651:34.62 beam.smp
31611 501       20   0 15.939g 984944   6424 S   3.0  3.0 862:33.44 java

I have attached the hot threads output, which may point to the CPU issue. Can you let me know if anything can be tuned to achieve better utilization?
::: {metrics-datastore_1432058dd171e901d3813c97547a24a0}{8T-f4IVlQiSyAWDZO8H6fg}{172.30.0.12}{172.30.0.12:9300}{max_local_storage_nodes=1, master=false}
Hot threads at 2017-10-06T09:23:56.366Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

87.4% (436.9ms out of 500ms) cpu usage by thread 'elasticsearch[metrics-datastore_1432058dd171e901d3813c97547a24a0][bulk][T#14]'
9/10 snapshots sharing following 17 elements
org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:346)
org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:545)
org.elasticsearch.index.engine.Engine$Create.execute(Engine.java:810)

 unique snapshot
   java.util.ArrayList.grow(ArrayList.java:242)
   java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
   java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
   java.util.ArrayList.add(ArrayList.java:440)


   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:287)
   org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)

84.6% (422.8ms out of 500ms) cpu usage by thread 'elasticsearch[metrics-datastore_1432058dd171e901d3813c97547a24a0][bulk][T#1]'
2/10 snapshots sharing following 22 elements
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:425)
org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1318)
org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1297)
org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:432)
org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:375)

 2/10 snapshots sharing following 24 elements
   java.lang.ThreadLocal.get(ThreadLocal.java:144)
   org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
   org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)

 2/10 snapshots sharing following 26 elements
   org.apache.lucene.index.DefaultIndexingChain.indexDocValue(DefaultIndexingChain.java:470)
   org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:397)

 4/10 snapshots sharing following 13 elements
   org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:327)
   org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:120)

78.3% (391.5ms out of 500ms) cpu usage by thread 'elasticsearch[metrics-datastore_1432058dd171e901d3813c97547a24a0][bulk][T#9]'
3/10 snapshots sharing following 24 elements
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:273)
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:413)

 2/10 snapshots sharing following 28 elements
   sun.misc.Unsafe.park(Native Method)

 2/10 snapshots sharing following 22 elements
   org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookup(PerThreadIDAndVersionLookup.java:88)
   org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:124)

::: {metrics-master-99e749c5a62e880d4be629941c9831ff}{LLrkkKDmQ3ey9QpXZ5shAA}{172.30.0.7}{172.30.0.7:9300}{data=false, max_local_storage_nodes=1, master=true}
Hot threads at 2017-10-06T09:23:56.617Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (217.3micros out of 500ms) cpu usage by thread 'elasticsearch[metrics-master-99e749c5a62e880d4be629941c9831ff][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)
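For reference, a dump like the one above can be pulled from the hot threads API (a sketch; the host is an assumption, and the parameters match the header of the dump above):

```shell
# Fetch hot threads from a local node (hypothetical host; adjust to your cluster).
# threads=3 and interval=500ms match the header of the dump above.
curl -s --max-time 5 "http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms" \
  || echo "no node reachable on localhost:9200"
```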

It looks like it is busy processing bulk requests, which is expected. I am, however, not sure why it is slowing down. Do you have swap disabled on these hosts?

bootstrap.memory_lock: true
cat /proc/sys/vm/swappiness
0
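Note that the `top` output earlier shows the swap space almost fully used, so `vm.swappiness=0` alone does not disable swap. The host-side settings can be checked like this (a sketch; fully disabling swap is left commented out since it needs root):

```shell
# vm.swappiness: 0 discourages swapping but does not disable swap
cat /proc/sys/vm/swappiness

# Active swap devices (nothing below the header line means swap is off)
cat /proc/swaps

# To disable swap entirely (needs root; shown as a reminder only):
# swapoff -a
```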

Also, can you tell me whether this is the ideal number?

Each record has 150 fields, and there are two indices. The node has a 16-core CPU, 6.9 GB of heap plus 6.9 GB for Lucene, and one 10 TB HDD.

This is a one-node setup.

I am getting 4,000 records per second.

If this is the maximum I can get, there is no point in spending more effort here; I will move on to a cluster with 3 nodes.
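As a back-of-the-envelope check on those numbers (a sketch; the sustained rate of 4,000 records/s is taken from above):

```shell
RATE=4000                       # records per second, as reported above
PER_DAY=$((RATE * 60 * 60 * 24))
echo "${PER_DAY} records/day"   # 345600000 documents per day on this single node
```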
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
20 1 7178288 1957900 0 2360584 19 10 153 698 0 4 51 6 35 8 0
21 0 7178256 1942376 0 2371876 16 0 16 3048 22194 80958 64 7 28 1 0
18 0 7178040 1930636 0 2377212 248 0 248 10484 26036 80636 57 11 29 3 0
17 1 7178036 1922604 0 2392356 0 0 4 16676 23970 75937 83 8 9 0 0
17 0 7178020 1910772 0 2402940 8 0 16 10668 22204 67250 81 5 14 1 0
13 0 7178000 1902928 0 2410576 0 0 12 3292 22026 99563 87 4 9 0 0
12 1 7177996 1894124 0 2418100 12 0 268 20724 19864 76057 66 5 27 2 0
14 0 7177996 1880312 0 2434116 0 0 196 9784 16767 83198 49 4 46 1 0
15 0 7177988 1859668 0 2455364 0 0 0 9644 19865 84736 64 5 30 1 0
10 0 7177984 1837128 0 2475852 0 0 0 13160 19393 81430 65 5 30 0 0

I do not know what maximum throughput you can get with your data and environment.

Do we have any documentation on performance testing done on Elasticsearch as a product, based on which we could estimate? It may be too much to ask, but it is critical for me.

Indexing throughput will depend a lot on the size and complexity of your documents and mappings, as well as on the hardware used. Comparing with a benchmark someone else has performed with different data and conditions will therefore not tell you much.

If you are interested in seeing how your environment performs on a standard benchmark, you could try using Rally and one of its standard tracks. This would be comparable to the same benchmark run by others on different platforms, though most of the standard tracks use much smaller documents.
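A run against an already-running cluster might look like this (a sketch; the track name, host, and flags are examples — check the Rally documentation for the exact syntax of your Rally version):

```shell
# Install Rally on the load-driver machine (needs Python 3 and a JDK)
pip install esrally

# Benchmark an existing cluster with the standard geonames track,
# without letting Rally provision Elasticsearch itself
esrally race --track=geonames --target-hosts=localhost:9200 --pipeline=benchmark-only
```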

There is also a webinar that shows some examples.

OK, thanks, I will give it a try.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.