ElasticSearch performance trouble when indexing data

Hi.

I have about 4 TB of data I want to index in Elasticsearch. The amount of data per index will vary, depending on the source data file. These are only static files containing old data that will be ingested and made searchable in Elasticsearch.

Currently I am using a single node with 16 GB RAM (8 GB heap), 2 CPU threads, 1.8TB SSD storage, and using Logstash to parse and ship the source data.

After a few hours Elasticsearch seems to bottleneck somewhere (seems to be GC), and the ingestion rate drops to almost zero (over time). The exact same problem happened when I used 3 nodes, each with approximately 32GB RAM, but with hybrid (SSHD) drives. I therefore upgraded to SSD in hopes of solving the problem as I initially thought it was due to a I/O bottleneck.

Right now I am indexing a source data file that is 77 GB in size, with an index rate at ~1.000 docs/s, while it in the beginning was above 6.000 docs/s or closer to 10.000 docs/s. For reference, I have set shards to 16 for this index so Elasticsearch won't have to grind through a single shard of 77 GB data when doing a text-search.

Are there any recommended methods I can use to find out what is causing these problems? I am monitoring my node(s) with Kibana, and the node state is green while experiencing these issues.

What version?
Are you using Monitoring?
What do your logs show?

  1. I am using Elasticsearch version 7.12.0.
  2. Yes, I am using (self-)monitoring, but not MetricBeat.
  3. Any log files in specific? Logs seem normal, apart from gc.log which contains the following:
...
[2021-03-31T07:21:25.772+0000][4453][gc,metaspace] GC(2482) Metaspace: 121007K(126128K)->121007K(126128K) NonClass: 106031K(109668K)->106031K(109668K) Class: 14975K(16460K)->14975K(16460K)
[2021-03-31T07:21:25.772+0000][4453][gc          ] GC(2482) Pause Young (Normal) (G1 Evacuation Pause) 5791M->923M(8192M) 21.183ms
[2021-03-31T07:21:25.772+0000][4453][gc,cpu      ] GC(2482) User=0.03s Sys=0.01s Real=0.02s
[2021-03-31T07:21:25.772+0000][4453][safepoint   ] Safepoint "G1CollectForAllocation", Time since last: 84502828805 ns, Reaching safepoint: 242950 ns, At safepoint: 21256625 ns, Total: 21499575 ns
[2021-03-31T07:21:56.789+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 31015665668 ns, Reaching safepoint: 705118 ns, At safepoint: 7156 ns, Total: 712274 ns
[2021-03-31T07:23:12.849+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 76059855685 ns, Reaching safepoint: 356356 ns, At safepoint: 10900 ns, Total: 367256 ns
[2021-03-31T07:24:04.757+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 51906991207 ns, Reaching safepoint: 541435 ns, At safepoint: 11329 ns, Total: 552764 ns
[2021-03-31T07:24:05.758+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000163195 ns, Reaching safepoint: 799129 ns, At safepoint: 7684 ns, Total: 806813 ns
[2021-03-31T07:24:06.759+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000371578 ns, Reaching safepoint: 627182 ns, At safepoint: 12088 ns, Total: 639270 ns
[2021-03-31T07:24:07.760+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000825500 ns, Reaching safepoint: 679162 ns, At safepoint: 11611 ns, Total: 690773 ns
[2021-03-31T07:24:12.784+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 5022487052 ns, Reaching safepoint: 919432 ns, At safepoint: 12914 ns, Total: 932346 ns
[2021-03-31T07:24:16.786+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 4001492138 ns, Reaching safepoint: 700436 ns, At safepoint: 11986 ns, Total: 712422 ns
[2021-03-31T07:24:17.787+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000639456 ns, Reaching safepoint: 390572 ns, At safepoint: 10786 ns, Total: 401358 ns
[2021-03-31T07:24:22.502+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 4714536247 ns, Reaching safepoint: 348456 ns, At safepoint: 13587 ns, Total: 362043 ns
[2021-03-31T07:24:23.505+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1002573986 ns, Reaching safepoint: 459281 ns, At safepoint: 7274 ns, Total: 466555 ns
[2021-03-31T07:24:27.507+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 4001502426 ns, Reaching safepoint: 712166 ns, At safepoint: 9956 ns, Total: 722122 ns
[2021-03-31T07:24:30.508+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 3000586564 ns, Reaching safepoint: 472264 ns, At safepoint: 12146 ns, Total: 484410 ns
[2021-03-31T07:24:31.509+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000098506 ns, Reaching safepoint: 590711 ns, At safepoint: 10511 ns, Total: 601222 ns
[2021-03-31T07:24:37.514+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 6004311709 ns, Reaching safepoint: 784450 ns, At safepoint: 9195 ns, Total: 793645 ns
[2021-03-31T07:24:38.516+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1001205051 ns, Reaching safepoint: 667332 ns, At safepoint: 10415 ns, Total: 677747 ns
[2021-03-31T07:24:42.688+0000][4453][gc,start    ] GC(2483) Pause Young (Normal) (G1 Evacuation Pause)
[2021-03-31T07:24:42.688+0000][4453][gc,task     ] GC(2483) Using 2 workers of 2 for evacuation
[2021-03-31T07:24:42.688+0000][4453][gc,age      ] GC(2483) Desired survivor size 322961408 bytes, new threshold 15 (max threshold 15)
[2021-03-31T07:24:42.740+0000][4453][gc,age      ] GC(2483) Age table with threshold 15 (max threshold 15)
[2021-03-31T07:24:42.740+0000][4453][gc,age      ] GC(2483) - age   1:   41746824 bytes,   41746824 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   2:    3600264 bytes,   45347088 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   3:    3654784 bytes,   49001872 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   4:    1156200 bytes,   50158072 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   5:     546032 bytes,   50704104 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   6:     717064 bytes,   51421168 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   7:    9364616 bytes,   60785784 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   8:     423152 bytes,   61208936 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age   9:     612616 bytes,   61821552 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  10:     423392 bytes,   62244944 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  11:     335208 bytes,   62580152 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  12:     458056 bytes,   63038208 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  13:     224616 bytes,   63262824 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  14:     405144 bytes,   63667968 total
[2021-03-31T07:24:42.741+0000][4453][gc,age      ] GC(2483) - age  15:     258472 bytes,   63926440 total
[2021-03-31T07:24:42.741+0000][4453][gc,phases   ] GC(2483)   Pre Evacuate Collection Set: 0.2ms
[2021-03-31T07:24:42.741+0000][4453][gc,phases   ] GC(2483)   Merge Heap Roots: 0.3ms
[2021-03-31T07:24:42.741+0000][4453][gc,phases   ] GC(2483)   Evacuate Collection Set: 46.0ms
[2021-03-31T07:24:42.741+0000][4453][gc,phases   ] GC(2483)   Post Evacuate Collection Set: 5.9ms
[2021-03-31T07:24:42.741+0000][4453][gc,phases   ] GC(2483)   Other: 0.3ms
[2021-03-31T07:24:42.741+0000][4453][gc,heap     ] GC(2483) Eden regions: 1217->0(1212)
[2021-03-31T07:24:42.741+0000][4453][gc,heap     ] GC(2483) Survivor regions: 11->16(154)
[2021-03-31T07:24:42.741+0000][4453][gc,heap     ] GC(2483) Old regions: 218->219
[2021-03-31T07:24:42.741+0000][4453][gc,heap     ] GC(2483) Archive regions: 2->2
[2021-03-31T07:24:42.741+0000][4453][gc,heap     ] GC(2483) Humongous regions: 2->2
[2021-03-31T07:24:42.741+0000][4453][gc,metaspace] GC(2483) Metaspace: 121094K(126128K)->121094K(126128K) NonClass: 106101K(109668K)->106101K(109668K) Class: 14992K(16460K)->14992K(16460K)
[2021-03-31T07:24:42.741+0000][4453][gc          ] GC(2483) Pause Young (Normal) (G1 Evacuation Pause) 5791M->948M(8192M) 52.910ms
[2021-03-31T07:24:42.741+0000][4453][gc,cpu      ] GC(2483) User=0.07s Sys=0.02s Real=0.05s
[2021-03-31T07:24:42.741+0000][4453][safepoint   ] Safepoint "G1CollectForAllocation", Time since last: 4171501229 ns, Reaching safepoint: 552756 ns, At safepoint: 52985751 ns, Total: 53538507 ns
[2021-03-31T07:24:46.748+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 4006708199 ns, Reaching safepoint: 380828 ns, At safepoint: 7383 ns, Total: 388211 ns
[2021-03-31T07:24:47.751+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1002995420 ns, Reaching safepoint: 245741 ns, At safepoint: 16788 ns, Total: 262529 ns
[2021-03-31T07:24:56.765+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 9014120074 ns, Reaching safepoint: 180083 ns, At safepoint: 5529 ns, Total: 185612 ns
[2021-03-31T07:25:18.779+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 22012888636 ns, Reaching safepoint: 477756 ns, At safepoint: 9624 ns, Total: 487380 ns
[2021-03-31T07:25:29.786+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 11006685064 ns, Reaching safepoint: 729974 ns, At safepoint: 12549 ns, Total: 742523 ns
[2021-03-31T07:25:40.799+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 11011886240 ns, Reaching safepoint: 669179 ns, At safepoint: 12056 ns, Total: 681235 ns
[2021-03-31T07:25:51.816+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 11015193759 ns, Reaching safepoint: 2258752 ns, At safepoint: 11974 ns, Total: 2270726 ns
[2021-03-31T07:26:02.837+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 11019963768 ns, Reaching safepoint: 644594 ns, At safepoint: 11216 ns, Total: 655810 ns
[2021-03-31T07:26:24.857+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 22019644037 ns, Reaching safepoint: 884874 ns, At safepoint: 13252 ns, Total: 898126 ns
[2021-03-31T07:26:28.341+0000][4453][gc,start    ] GC(2484) Pause Young (Normal) (G1 Evacuation Pause)
[2021-03-31T07:26:28.341+0000][4453][gc,task     ] GC(2484) Using 2 workers of 2 for evacuation
[2021-03-31T07:26:28.341+0000][4453][gc,age      ] GC(2484) Desired survivor size 322961408 bytes, new threshold 15 (max threshold 15)
[2021-03-31T07:26:28.377+0000][4453][gc,age      ] GC(2484) Age table with threshold 15 (max threshold 15)
[2021-03-31T07:26:28.377+0000][4453][gc,age      ] GC(2484) - age   1:   22928696 bytes,   22928696 total
...

Here are also screenshots from Kibana monitoring:

I can see some dips, but nothing that shows it approaches zero?

That is correct. It has however dropped from around 7.000 docs/s and down to 1000 docs/s (yesterday), and now to around 500 docs/s (today). The last time I had logstash running for a week, and it dropped down to almost 0 (7 docs/s) indexing rate.

Is it supposed to be like this, or is something wrong? Is it possible to make it stable at 7.000 docs/s like in the beginning? For reference, I have attached a screenshot of monitoring the last 24 hours showing the decline.

Thanks!
What do the GC and CPU charts show at that time?

Just to not cause any confusion: The last dip where indexing rate dropped from 500 to 0 was because of a disrupton by me connecting to my server running logstash.

This made me think that, could it possibly be that Logstash is at fault here, and not Elasticsearch?

Resource usage for the last 24h:

Since the decline is constant, I have taken a new excerpt of the gc.log:

[2021-03-29T13:00:28.322+0000][4453][gc,start    ] GC(10) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[2021-03-29T13:00:28.322+0000][4453][gc,task     ] GC(10) Using 2 workers of 2 for evacuation
[2021-03-29T13:00:28.322+0000][4453][gc,age      ] GC(10) Desired survivor size 27262976 bytes, new threshold 15 (max threshold 15)
[2021-03-29T13:00:28.349+0000][4453][gc,age      ] GC(10) Age table with threshold 15 (max threshold 15)
[2021-03-29T13:00:28.349+0000][4453][gc,age      ] GC(10) - age   1:   18247048 bytes,   18247048 total
[2021-03-29T13:00:28.349+0000][4453][gc,age      ] GC(10) - age   2:    8488952 bytes,   26736000 total
[2021-03-29T13:00:28.349+0000][4453][gc,phases   ] GC(10)   Pre Evacuate Collection Set: 0.2ms
[2021-03-29T13:00:28.349+0000][4453][gc,phases   ] GC(10)   Merge Heap Roots: 0.1ms
[2021-03-29T13:00:28.349+0000][4453][gc,phases   ] GC(10)   Evacuate Collection Set: 25.4ms
[2021-03-29T13:00:28.349+0000][4453][gc,phases   ] GC(10)   Post Evacuate Collection Set: 1.4ms
[2021-03-29T13:00:28.349+0000][4453][gc,phases   ] GC(10)   Other: 0.2ms
[2021-03-29T13:00:28.349+0000][4453][gc,heap     ] GC(10) Eden regions: 99->0(95)
[2021-03-29T13:00:28.349+0000][4453][gc,heap     ] GC(10) Survivor regions: 3->7(13)
[2021-03-29T13:00:28.349+0000][4453][gc,heap     ] GC(10) Old regions: 14->14
[2021-03-29T13:00:28.349+0000][4453][gc,heap     ] GC(10) Archive regions: 2->2
[2021-03-29T13:00:28.349+0000][4453][gc,heap     ] GC(10) Humongous regions: 2->2
[2021-03-29T13:00:28.349+0000][4453][gc,metaspace] GC(10) Metaspace: 93718K(98608K)->93718K(98608K) NonClass: 80978K(84580K)->80978K(84580K) Class: 12739K(14028K)->12739K(14028K)
[2021-03-29T13:00:28.349+0000][4453][gc          ] GC(10) Pause Young (Concurrent Start) (G1 Evacuation Pause) 474M->93M(8192M) 27.536ms
[2021-03-29T13:00:28.349+0000][4453][gc,cpu      ] GC(10) User=0.05s Sys=0.00s Real=0.02s
[2021-03-29T13:00:28.349+0000][4453][safepoint   ] Safepoint "G1CollectForAllocation", Time since last: 584343059 ns, Reaching safepoint: 192601 ns, At safepoint: 27598857 ns, Total: 27791458 ns
[2021-03-29T13:00:28.349+0000][4453][gc          ] GC(11) Concurrent Cycle
[2021-03-29T13:00:28.349+0000][4453][gc,marking  ] GC(11) Concurrent Car Claimed Marks
[2021-03-29T13:00:28.349+0000][4453][gc,marking  ] GC(11) Concurrent Clear Claimed Marks 0.072ms
[2021-03-29T13:00:28.349+0000][4453][gc,marking  ] GC(11) Concurrent Scan Root Regions
[2021-03-29T13:00:28.376+0000][4453][gc,marking  ] GC(11) Concurrent Scan Root Regions 27.019ms
[2021-03-29T13:00:28.376+0000][4453][gc,marking  ] GC(11) Concurrent Mark (523.406s)
[2021-03-29T13:00:28.376+0000][4453][gc,marking  ] GC(11) Concurrent Mark From Roots
[2021-03-29T13:00:28.376+0000][4453][gc,task     ] GC(11) Using 1 workers of 1 for marking
[2021-03-29T13:00:28.494+0000][4453][gc,marking  ] GC(11) Concurrent Mark From Roots 117.692ms
[2021-03-29T13:00:28.494+0000][4453][gc,marking  ] GC(11) Concurrent Preclean
[2021-03-29T13:00:28.494+0000][4453][gc,marking  ] GC(11) Concurrent Preclean 0.095ms
[2021-03-29T13:00:28.494+0000][4453][gc,marking  ] GC(11) Concurrent Mark (523.406s, 523.524s) 117.832ms
[2021-03-29T13:00:28.494+0000][4453][gc,start    ] GC(11) Pause Remark
[2021-03-29T13:00:28.498+0000][4453][gc          ] GC(11) Pause Remark 105M->105M(8192M) 4.160ms
[2021-03-29T13:00:28.498+0000][4453][gc,cpu      ] GC(11) User=0.01s Sys=0.00s Real=0.00s
[2021-03-29T13:00:28.499+0000][4453][safepoint   ] Safepoint "G1Concurrent", Time since last: 145013651 ns, Reaching safepoint: 112037 ns, At safepoint: 4236346 ns, Total: 4348383 ns
[2021-03-29T13:00:28.499+0000][4453][gc,marking  ] GC(11) Concurrent Rebuild Remembered Sets
[2021-03-29T13:00:28.548+0000][4453][gc,marking  ] GC(11) Concurrent Rebuild Remembered Sets 49.529ms
[2021-03-29T13:00:28.548+0000][4453][gc,start    ] GC(11) Pause Cleanup
[2021-03-29T13:00:28.549+0000][4453][gc          ] GC(11) Pause Cleanup 105M->105M(8192M) 0.239ms
[2021-03-29T13:00:28.549+0000][4453][gc,cpu      ] GC(11) User=0.00s Sys=0.00s Real=0.00s
[2021-03-29T13:00:28.549+0000][4453][safepoint   ] Safepoint "G1Concurrent", Time since last: 49648777 ns, Reaching safepoint: 303783 ns, At safepoint: 280670 ns, Total: 584453 ns
[2021-03-29T13:00:28.549+0000][4453][gc,marking  ] GC(11) Concurrent Cleanup for Next Mark
[2021-03-29T13:00:28.558+0000][4453][gc,marking  ] GC(11) Concurrent Cleanup for Next Mark 9.200ms
[2021-03-29T13:00:28.558+0000][4453][gc          ] GC(11) Concurrent Cycle 208.850ms
[2021-03-29T13:00:29.549+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000117144 ns, Reaching safepoint: 101291 ns, At safepoint: 7687 ns, Total: 108978 ns
[2021-03-29T13:00:30.549+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000128221 ns, Reaching safepoint: 96023 ns, At safepoint: 4341 ns, Total: 100364 ns
[2021-03-29T13:00:31.262+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 712617697 ns, Reaching safepoint: 152530 ns, At safepoint: 3016 ns, Total: 155546 ns
[2021-03-29T13:00:31.557+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 294541868 ns, Reaching safepoint: 235063 ns, At safepoint: 4340 ns, Total: 239403 ns
[2021-03-29T13:00:31.848+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 291074464 ns, Reaching safepoint: 261530 ns, At safepoint: 15002 ns, Total: 276532 ns
[2021-03-29T13:00:32.429+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 580502902 ns, Reaching safepoint: 253730 ns, At safepoint: 13846 ns, Total: 267576 ns
[2021-03-29T13:00:32.622+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 192603819 ns, Reaching safepoint: 206647 ns, At safepoint: 4427 ns, Total: 211074 ns
[2021-03-29T13:00:32.950+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 328093089 ns, Reaching safepoint: 239085 ns, At safepoint: 11953 ns, Total: 251038 ns
[2021-03-29T13:00:33.356+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 405525178 ns, Reaching safepoint: 298447 ns, At safepoint: 10766 ns, Total: 309213 ns
[2021-03-29T13:00:33.706+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 349973687 ns, Reaching safepoint: 129744 ns, At safepoint: 5533 ns, Total: 135277 ns
[2021-03-29T13:00:34.709+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1003279168 ns, Reaching safepoint: 145472 ns, At safepoint: 4090 ns, Total: 149562 ns
[2021-03-29T13:00:35.025+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 315357798 ns, Reaching safepoint: 151470 ns, At safepoint: 5516 ns, Total: 156986 ns
[2021-03-29T13:00:35.543+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 517606601 ns, Reaching safepoint: 319444 ns, At safepoint: 16871 ns, Total: 336315 ns
[2021-03-29T13:00:36.352+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 808676406 ns, Reaching safepoint: 152365 ns, At safepoint: 10883 ns, Total: 163248 ns
[2021-03-29T13:00:37.353+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000747661 ns, Reaching safepoint: 133825 ns, At safepoint: 1992 ns, Total: 135817 ns
[2021-03-29T13:00:38.353+0000][4453][safepoint   ] Safepoint "Cleanup", Time since last: 1000191379 ns, Reaching safepoint: 232132 ns, At safepoint: 4277 ns, Total: 236409 ns
[2021-03-29T13:00:39.604+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 1251336447 ns, Reaching safepoint: 161366 ns, At safepoint: 13150 ns, Total: 174516 ns
[2021-03-29T13:00:39.910+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 304846125 ns, Reaching safepoint: 249663 ns, At safepoint: 7954 ns, Total: 257617 ns
[2021-03-29T13:00:40.612+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 702128749 ns, Reaching safepoint: 266270 ns, At safepoint: 12526 ns, Total: 278796 ns
[2021-03-29T13:00:41.032+0000][4453][gc,start    ] GC(12) Pause Young (Normal) (G1 Evacuation Pause)
[2021-03-29T13:00:41.032+0000][4453][gc,task     ] GC(12) Using 2 workers of 2 for evacuation
[2021-03-29T13:00:41.032+0000][4453][gc,age      ] GC(12) Desired survivor size 27262976 bytes, new threshold 15 (max threshold 15)
[2021-03-29T13:00:41.067+0000][4453][gc,age      ] GC(12) Age table with threshold 15 (max threshold 15)
[2021-03-29T13:00:41.067+0000][4453][gc,age      ] GC(12) - age   1:   18040896 bytes,   18040896 total
[2021-03-29T13:00:41.067+0000][4453][gc,age      ] GC(12) - age   2:   17721776 bytes,   35762672 total
[2021-03-29T13:00:41.067+0000][4453][gc,age      ] GC(12) - age   3:    8474520 bytes,   44237192 total
[2021-03-29T13:00:41.067+0000][4453][gc,phases   ] GC(12)   Pre Evacuate Collection Set: 1.8ms
[2021-03-29T13:00:41.067+0000][4453][gc,phases   ] GC(12)   Merge Heap Roots: 0.2ms
[2021-03-29T13:00:41.067+0000][4453][gc,phases   ] GC(12)   Evacuate Collection Set: 30.8ms
[2021-03-29T13:00:41.067+0000][4453][gc,phases   ] GC(12)   Post Evacuate Collection Set: 2.4ms
[2021-03-29T13:00:41.067+0000][4453][gc,phases   ] GC(12)   Other: 0.2ms
[2021-03-29T13:00:41.068+0000][4453][gc,heap     ] GC(12) Eden regions: 95->0(91)
[2021-03-29T13:00:41.068+0000][4453][gc,heap     ] GC(12) Survivor regions: 7->11(13)
[2021-03-29T13:00:41.068+0000][4453][gc,heap     ] GC(12) Old regions: 14->14
[2021-03-29T13:00:41.068+0000][4453][gc,heap     ] GC(12) Archive regions: 2->2
[2021-03-29T13:00:41.068+0000][4453][gc,heap     ] GC(12) Humongous regions: 2->2
[2021-03-29T13:00:41.068+0000][4453][gc,metaspace] GC(12) Metaspace: 100146K(105008K)->100146K(105008K) NonClass: 86856K(90468K)->86856K(90468K) Class: 13289K(14540K)->13289K(14540K)
[2021-03-29T13:00:41.068+0000][4453][gc          ] GC(12) Pause Young (Normal) (G1 Evacuation Pause) 473M->110M(8192M) 35.430ms
[2021-03-29T13:00:41.068+0000][4453][gc,cpu      ] GC(12) User=0.05s Sys=0.01s Real=0.03s
[2021-03-29T13:00:41.068+0000][4453][safepoint   ] Safepoint "G1CollectForAllocation", Time since last: 413497131 ns, Reaching safepoint: 6580750 ns, At safepoint: 35500491 ns, Total: 42081241 ns
[2021-03-29T13:00:41.210+0000][4453][safepoint   ] Safepoint "ICBufferFull", Time since last: 142169687 ns, Reaching safepoint: 241097 ns, At safepoint: 6226 ns, Total: 247323 ns

How are you indexing into Elasticsearch? What is your bulk size? How large are your documents? Are you assigning document IDs yourself or letting Elasticsearch set them for you?

16 shards for 77GB of data seems excessive.

I am using Logstash to index local files from one server (server #1), to my elastic node (server #2). I run logstash with logstash -f logstash.conf. This configuration simply reads files from a folder as input, does some grok matching to extract the fields I want, and sends them to the following output:

output {
    elasticsearch {
        hosts => ["10.30.1.39:9200"] #(server #2)
        ilm_enabled => false
        template => "template-idx.json"
        template_name => ["projectname-idx"]
        template_overwrite => true
        document_id => "%{[@metadata][fingerprint]}"
        index => "projectname-idx-%{[@metadata][foldername]}-%{[@metadata][filename]}"
        action => "update"
        doc_as_upsert => true
    }
}

As you can see in the output, I specify the document ID here, based on the log entry fingerprint. This is done to avoid duplicates, with action => update and doc_as_upsert => true.

An example document looks like this, and I believe each document is between 250-350 bytes in size.

{
    "_index": "projectname-idx",
    "_type": "_doc",
    "_id": "<sha256-generated-fingerprint-that-is-quite-long>",
    "_score": 1,
    "_source": {
      "email": "username@domain.com",
      "domain": "domain.com",
      "extra": "Any notes on whatever you feel like"
    }
}

I have not specified any bulk size (or any pipeline settings) in my configuration file. I am therefore assuming the default values are used. See Logstash output:

Using bundled JDK: /usr/share/logstash/jdk
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.ext.openssl.SecurityHelper (file:/tmp/jruby-4852/jruby17673992353221350751jopenssl.jar) to field java.security.MessageDigest.provider
WARNING: Please consider reporting this to the maintainers of org.jruby.ext.openssl.SecurityHelper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[INFO ] 2021-03-29 16:14:25.729 [main] runner - Starting Logstash {"logstash.version"=>"7.10.2", "jruby.version"=>"jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server VM 11.0.8+10 on 11.0.8+10 +indy +jit [linux-x86_64]"}
[WARN ] 2021-03-29 16:14:26.136 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2021-03-29 16:14:28.048 [Converge PipelineAction::Create<main>] Reflections - Reflections took 27 ms to scan 1 urls, producing 23 keys and 47 values
[WARN ] 2021-03-29 16:14:28.491 [Converge PipelineAction::Create<main>] elasticsearch - Relying on default value of `pipeline.ecs_compatibility`, which may change in a future major release of Logstash. To avoid unexpected changes when upgrading Logstash, please explicitly declare your desired ECS Compatibility mode.
[INFO ] 2021-03-29 16:14:28.888 [[main]-pipeline-manager] elasticsearch - Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://10.30.1.39:9200/]}}
[WARN ] 2021-03-29 16:14:29.031 [[main]-pipeline-manager] elasticsearch - Restored connection to ES instance {:url=>"http://10.30.1.39:9200/"}
[INFO ] 2021-03-29 16:14:29.156 [[main]-pipeline-manager] elasticsearch - ES Output version determined {:es_version=>7}
[WARN ] 2021-03-29 16:14:29.158 [[main]-pipeline-manager] elasticsearch - Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>7}
[INFO ] 2021-03-29 16:14:29.208 [[main]-pipeline-manager] elasticsearch - New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//10.30.1.39:9200"]}
[INFO ] 2021-03-29 16:14:29.260 [Ruby-0-Thread-5: :1] elasticsearch - Using mapping template from {:path=>"template-idx.json"}
[INFO ] 2021-03-29 16:14:29.270 [Ruby-0-Thread-5: :1] elasticsearch - Attempting to install template {:manage_template=>{"index_patterns"=>["projectname-idx-*"], "order"=>0, "settings"=>{"number_of_shards"=>16, "number_of_replicas"=>0, "refresh_interval"=>"10s"}, "mappings"=>{"properties"=>{"username"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword"}}}, "email"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword"}}}, "extra"=>{"type"=>"text", "fields"=>{"keyword"=>{"type"=>"keyword"}}}}}}}
[INFO ] 2021-03-29 16:14:29.289 [Ruby-0-Thread-5: :1] elasticsearch - Installing elasticsearch template to _template/projectname-idx
[INFO ] 2021-03-29 16:14:29.472 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>500, "pipeline.sources"=>["/home/user/normalizer/logstash/logstash-idx.conf"], :thread=>"#<Thread:0x57d9a8f run>"}
[INFO ] 2021-03-29 16:14:30.253 [[main]-pipeline-manager] javapipeline - Pipeline Java execution initialization time {"seconds"=>0.78}
[INFO ] 2021-03-29 16:14:30.408 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2021-03-29 16:14:30.455 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2021-03-29 16:14:30.464 [[main]<file] observingread - START, creating Discoverer, Watch with file and sincedb collections
[INFO ] 2021-03-29 16:14:30.662 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}

Setting your own ID will make every indexing operation a potential update (even if you did not explicitly update) which will force Elasticsearch to check if the document exists before indexing or updating it. This adds a read, which tend to get slower and slower the larger the shards get. I suspect this is why you are seeing a slowdown over time. You may want to have a look at this blog post which discussed this problem.

If you can allow Elasticsearch to set the document ID and not update you should see much better and even indexing throughput. If that is not the case I would recommend reading the blog post and trying adjusting your ID. You may also want to monitor disk I/O and iowait to see if this is limiting performance as indexing slows down.

Thank you, Christian!

I will adjust my configuration and try out your suggestions.

At the same time I will improve my logstash configuration a little. Are there any settings that allow me to keep state in Logstash in the event that the service is restarted, so I do not have to reindex data that already has been indexed?

Edit: if it is only sincedb_path, then I already got it :slight_smile:

Edit #2: I updated my Logstash config and removed document_id, doc_as_upsert => true and action => update, and that solved my problem. Thank you once again!