Hi!
We have a very old Elasticsearch cluster v1.6.0 (we will hopefully upgrade soon) with 20 nodes: 3 master nodes and 17 data nodes.
As a B2B service, each index represents a different business (customer).
This cluster is heavily indexed and barely read.
Every day Spark writes to this cluster and replaces almost all indices with fresh new ones.
As soon as a new index is ready, we switch the customer alias to point at the fresh index and delete the old one.
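For reference, the alias switch is a single _aliases call along these lines (index and alias names here are placeholders), with both actions in one request so the swap is atomic:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "customer_a_old", "alias": "customer_a" } },
    { "add":    { "index": "customer_a_new", "alias": "customer_a" } }
  ]
}
```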
We are using org.elasticsearch.spark.rdd.api.java.JavaEsSpark to saveToEs.
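As far as I understand, elasticsearch-hadoop batches writes internally, and the bulk size per task is controlled by settings along these lines (the values shown are the documented defaults; worth double-checking against the elasticsearch-hadoop version we run):

```
es.batch.size.bytes = 1mb      # max bulk request size per task, in bytes
es.batch.size.entries = 1000   # max number of documents per bulk (0 = unlimited)
```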
I saw that the load on the data nodes reaches 100% during indexing.
I suspect that we are either using bulk requests with a very high number of (relatively small) documents, or not using bulk at all, so I enabled the index slowlog (this is legacy code that I'm trying to understand now).
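For context, I enabled it with thresholds roughly like these (the exact values I used may differ; these are the standard index.indexing.slowlog settings):

```
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.level: info
```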
I saw a lot of slow log such as:
[2022-02-21 09:42:56,942][INFO ][index.indexing.slowlog.index] [elasticsearch-data-node-1.use.<my-domain>.com] [9877738_all_27652427755079532m][0] took[5.4s], took_millis[5402], type[user], id[AX8bqYwqNMC-u1yj5V_o], routing[], source[{ <document_content_here> }]
Questions
- From the log above, is there a chance that we are not using bulk at all? We have another Elasticsearch cluster (v6.3.0) whose slowlog entries look like:
[2022-01-31T22:06:44,253][WARN ][index.search.slowlog.fetch] [data-worker-fae1695] [s_9877418_1643664046760][0] took[597.4ms], took_millis[597], total_hits[318742], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[2], source[{"size":400,"query":{ <query_comes_here> },"_source":{ <something_comes_here> },"sort":[{"_doc":{"order":"asc"}}]}],
In the log above I can clearly see the size 400, but I'm not sure whether I should see it for a "simple" indexing operation (the log above is a QUERY_THEN_FETCH search). In other words, should I always see the bulk size in an indexing slowlog entry? Does the first log imply that we are not indexing in bulk?
- What is the recommended/optimal indexing duration? What are the tradeoffs/implications of a slow indexing time? (Assuming we can deal with long search times, and we don't have many serving queries.)
- What is the optimal number of shards in our case? Most of our indices use only 1 shard (and one replica once indexing finishes), but a few indices contain 40(!) shards. I assume that is slowing us down, since writing to such an index involves most of the cluster. Does that make sense?
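For the single-shard case, our index setup is roughly the following (a sketch; the index name is a placeholder): create the index with 1 shard and no replicas, then add the replica only after indexing finishes:

```
PUT /customer_a_new
{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }

PUT /customer_a_new/_settings
{ "index": { "number_of_replicas": 1 } }
```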
Thanks,
Itay