We are doing indexing on 2 indices with one shard and a replica each.
Avg document size is 15 KB. Total documents = 35 million.
ES version 7.1.1
2 data nodes, 3 master nodes and 2 coordinating nodes.
refresh_interval is disabled.
Indexing is done using bulk requests (BulkProcessor).
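Roughly, the BulkProcessor wiring looks like the sketch below with the 7.x high-level REST client; the host, index name and tuning values are placeholders, not the exact configuration:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.xcontent.XContentType;

import java.util.concurrent.TimeUnit;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // Rejections (429) show up here and are a sign of back pressure.
                    System.err.println(response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        };

        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
                .setBulkActions(1000)                                // flush every 1000 docs
                .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // or every ~10 MB
                .setConcurrentRequests(2)                            // in-flight bulk requests
                .setBackoffPolicy(BackoffPolicy.exponentialBackoff(
                        TimeValue.timeValueMillis(100), 3))
                .build();

        // The ~15 KB documents are added here; this is just a placeholder payload.
        bulkProcessor.add(new IndexRequest("index-a")
                .source("{\"field\":\"value\"}", XContentType.JSON));

        bulkProcessor.awaitClose(30, TimeUnit.SECONDS);
        client.close();
    }
}
```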
Query
The indexing performance seems too slow. On profiling I saw that there are about 400 Lucene merge threads for each index doing nothing. Only one thread among them is active and trying to merge segments.
The write threads seem to steadily write the data.
Any pointers on what is happening and how I can achieve faster indexing?
What is the specification of your cluster? What type of storage are you using? Have you optimized your mappings? Have you followed these guidelines? What do CPU usage and disk I/O look like while you are indexing? What indexing throughput are you seeing?
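For example, the usual guideline tweaks for a heavy bulk load are to drop replicas and disable refresh while loading and restore them afterwards. A rough sketch with the high-level client (the index name and restored values are placeholders) would look something like this:

```java
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class IndexTuning {
    // Before the bulk load: no refreshes, no replication work.
    static void prepareForBulkLoad(RestHighLevelClient client) throws Exception {
        UpdateSettingsRequest request = new UpdateSettingsRequest("index-a")
                .settings(Settings.builder()
                        .put("index.refresh_interval", "-1")
                        .put("index.number_of_replicas", 0)
                        .build());
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }

    // After the bulk load: restore refresh and replicas.
    static void restoreAfterBulkLoad(RestHighLevelClient client) throws Exception {
        UpdateSettingsRequest request = new UpdateSettingsRequest("index-a")
                .settings(Settings.builder()
                        .put("index.refresh_interval", "1s")
                        .put("index.number_of_replicas", 1)
                        .build());
        client.indices().putSettings(request, RequestOptions.DEFAULT);
    }
}
```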
--EDIT Was writing while you guys continued, sorry for stuff now out of context--
Node specs means hardware specs: what is the hardware profile of your nodes? CPU/core count, RAM size, JVM heap setting, etc.
Are the disks SSD?
How many fields do you have in each of your documents?
Do the field names vary per document? (Different fields in different documents, or do all documents have the same schema?)
How many total fields are there in the indices? This is visible in your index mapping or by looking at the index pattern in Kibana.
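If it helps, one rough way to get that count with the high-level client is to fetch the mapping and walk it, counting fields and sub-fields recursively (the index names here are placeholders):

```java
import java.util.Map;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.GetMappingsRequest;
import org.elasticsearch.client.indices.GetMappingsResponse;

public class FieldCounter {
    // Count fields in a "properties" map, descending into object fields.
    static int countFields(Map<String, Object> properties) {
        int count = 0;
        for (Object value : properties.values()) {
            count++;
            if (value instanceof Map) {
                Object nested = ((Map<?, ?>) value).get("properties");
                if (nested instanceof Map) {
                    count += countFields((Map<String, Object>) nested);
                }
            }
        }
        return count;
    }

    static void printFieldCounts(RestHighLevelClient client) throws Exception {
        GetMappingsResponse response = client.indices().getMapping(
                new GetMappingsRequest().indices("index-a", "index-b"),
                RequestOptions.DEFAULT);
        response.mappings().forEach((index, metadata) -> {
            Map<String, Object> source = metadata.getSourceAsMap();
            Map<String, Object> properties = (Map<String, Object>) source.get("properties");
            System.out.println(index + ": " + (properties == null ? 0 : countFields(properties)));
        });
    }
}
```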
I've had problems in the past with merge threads when I was suffering from field count explosion. I'm not guru enough to know if there is a true technical relationship there, but that's what I was seeing: way too many fields in the same index led to apparent Lucene merge thread problems and what ES calls index throttling.
I don't remember if the threads were idle, though, and if I remember correctly there was CPU contention in my case.
2000/sec is OK, but are you receiving back pressure while you're indexing? What's to say that your bottleneck is not in the client(s) and that they are simply not even trying to push faster?
If you are not finding any contention on your ES nodes (CPU, RAM, disk I/O), what makes you think that adding more client(s) also trying to send bulk requests would not result in more docs/sec?
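One quick way to see whether the cluster itself is pushing back is to look at the write thread pool's queue depth and rejection counts, for example with the low-level client (the host is a placeholder):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class BackPressureCheck {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/_cat/thread_pool/write");
            request.addParameter("v", "true");
            request.addParameter("h", "node_name,active,queue,rejected");
            Response response = client.performRequest(request);
            // Non-zero "rejected" means bulk requests are being throttled (429s),
            // i.e. the cluster, not the client, is the bottleneck.
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```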
You could also try to provide more technical details like dumps of the info you're referencing. I don't actually know what to think of your:
On profiling I saw that there are about 400 Lucene merge threads for each index doing nothing. Only one thread among them is active and trying to merge segments.
The write threads seem to steadily write the data.
Because it's a narration instead of the properly formatted output of a command, posted in something like a gist. When looking for technical help, you have a better chance if you post technical questions with technical data; still not a guarantee. Not sharing hardware specs is another example: a Pentium 166 with 2 spinning disks from 1994 in RAID-0 is what I picture you're running, since you didn't say. This goes deep: a document example, the index mapping, the field count, what you tried and the results obtained... "I added clients and couldn't get past 2000 doc/s, I added shards and ..., I added nodes and ..."
Thanks for your suggestion regarding the Lucene threads.
I understand that the more technical details you mention, the more things can be looked into by whoever ends up answering.
But there is another angle to it. If you give a problem statement that is not too big, some ES experts have a unique ability to ask questions targeting the correct technical area because of their experience. That leads to faster answers, and time is not wasted.
Excuse me, but I would like to differ with your approach of giving a plethora of technical details in the initial problem statement, some of which might not be of any use.
Nevertheless, that is a separate topic, out of context as you have rightly mentioned. Period.