Bulk load via action.bulk API dropping events when overloaded?


(Franfabrizio) #1

I'm new to Elasticsearch and I've been experimenting with doing a bulk load of data from one of our custom apps. I'm doing this on an Ubuntu VM (4GB mem) on my laptop.

I'm using fluentd's elasticsearch plugin (which uses the bulk API) to load the data from CSV into ES. I previously tried with logstash but fluentd was performing significantly faster and also was a little more robust with some of the errors I was hitting.

There are exactly 100,000 records of the format:

1,"2011-10-11 12:21:01",ds1,1790_cPop,"Census of the United States, 1790: Population Data for US",ds1.dt1,"Total Population","1790_cPop: Total Population"

This data goes back to 2011 and I've got fluentd configured to create a new index for each day. I have data for every single day from then until present, so when a run is done, I should have something like 1,300 indexes (one for each day since late 2011), each with some documents in it.

WIth default fluentd-elasticsearch-plugin settings, when the load was done ES would have all of the indexes, but something like 90k documents - it had dropped 10% of them - and so some of the indexes had 0 documents in them. I figured this out because Kibana kept complaining about failed shards that it couldn't search (because they did not have the @timestamp it was looking for, since they were empty). I repeated this many times, and always there would be a significant number of dropped documents, but not the same number or same days each time.

When I finally thought to add some more conservative buffering settings to fluentd (specifically, reducing chunk sizes to 2MB), I was able to consistently get all 100k documents into ES. Larger chunk sizes (e.g. 4MB) would drop documents. I also observed a fair number of timeouts where fluentd would have to retry communicating with ES with the larger chunk sizes.

Now, I'm obviously using an underpowered ES environment, so I'm not surprised ES was having trouble keeping up. The real issue is that it took me forever to figure out that this was the problem, because the elasticsearch log wasn't hinting at any of these issues. I'm confused as to why these events were being dropped silently. Was this simply the case of a buffer somewhere overfilling such that messages were partially dropped before ES ever saw them? Is there any way to adjust the logs to make it easier to discover when ES rejects an event?

Basically, I am a newbie wanting to understand more about why this was happening and why my fix worked. Thanks!


(Harlin) #2

This is usually caused by the threadpool.bulk.queue_size being too small. It defaults to 50 and if more than 50 bulks get queues up it will just drop them. If this is happening you should see this error message in your logs:

RemoteTransportException[[][inet[/192.168.76.1:9300]][bulk/shard]];
nested: EsRejectedExecutionException[rejected execution (queue capacity
50)

try adding this to your elasticsearch.yml:

threadpool.bulk.queue_size: 1500


(Christian Dahlqvist) #3

Each index and shard in Elasticsearch carries with it a certain amount of overhead, and having 1300 indices and possibly an even larger number of shards on a node that size is likely to cause problems. I would recommend trying to index into a single index and see if you still see the same behaviour. If you really want to experiment with time-based indices, I would recommend using a yearly index instead of daily for these volumes of data.


(Mark Walkom) #4

Simply increasing threadpools is a great way to ignore the problem :wink:


(Franfabrizio) #5

With logging set on DEBUG I do not see any of those exceptions in the logs (in fact, I see no exceptions at all, even though I can verify documents are being dropped). However, increasing the threadpool queue_size does seem to reliably fix the issue. Do I need a different log setting to expose those exceptions in the log?


(Harlin) #6

Nope, those exceptions should be exposed by default. It is definitely weird that documents are being dropped and you aren't seeing anything in the logs, especially if the queue size is the issue.


(system) #7