Performance Help

@tgdesrochers But that does not align well with your previous observation:

If you have a large amount of logs, it could be handled by more nodes. Nodes do not time out just because of mapping or indexing; reporting a mapping timeout of 30s is just a coincidence. Check whether your file system / disk subsystem is slowing down the indexing.

 "status"=>500, "error"=>{"type"=>"timeout_exception", "reason"=>"Failed to acknowledge mapping update within [30s]"}}}, :level=>:warn

Can I revisit this error?

Searching /var/log/logstash/logstash.log on all of my Logstash nodes, the only error I am seeing is the one above. Is there a way to increase the mapping timeout? I realize this may just be a band-aid, but it may help.
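For what it's worth, the 30s in that message looks like the dynamic mapping update timeout (indices.mapping.dynamic_timeout in 2.x, if I remember right). Raising it really is just a band-aid, but here is a minimal sketch of how it could be tried, assuming the setting is dynamically updatable in your version (otherwise the same key would go into elasticsearch.yml and need a restart):

```python
# Sketch: raise the dynamic mapping update timeout via the cluster settings API.
# Assumes indices.mapping.dynamic_timeout is accepted as a transient cluster
# setting on your version; if not, put the same key in elasticsearch.yml instead.
import requests

ES_URL = "http://localhost:9200"  # placeholder: any node in the cluster

resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    json={"transient": {"indices.mapping.dynamic_timeout": "90s"}},
)
resp.raise_for_status()
print(resp.json())
```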

I don't know why I keep getting this error, but it is causing loss of logs and loss of data, which isn't acceptable in my environment. I am happy to check anything to make sure there isn't some other issue with the ES nodes.
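Since a mapping update is a cluster state change, one thing worth checking is whether cluster state tasks are queuing up on the master. A sketch (ES_URL is a placeholder for any node in the cluster):

```python
# Sketch: check whether cluster state tasks are backing up, which would explain
# slow mapping-update acknowledgements.
import requests

ES_URL = "http://localhost:9200"

health = requests.get(f"{ES_URL}/_cluster/health").json()
print("status:", health["status"],
      "| pending tasks:", health["number_of_pending_tasks"])

# Mapping updates wait in the cluster state queue; long time_in_queue is a red flag.
pending = requests.get(f"{ES_URL}/_cluster/pending_tasks").json()
for task in pending.get("tasks", [])[:20]:
    print(task["time_in_queue"], task["priority"], task["source"])
```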

I have 8 Logstash nodes pulling from Kafka and pushing to 12 ES data nodes. At peak I am indexing 14,000 records per second. I don't see any problem with the I/O speeds of my data nodes; they all have 2 TB SSD-backed drives with very fast read/write. The logs are small but constant, coming from a Bro IDS.
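To put a number behind "I/O looks fine", bulk thread pool rejections on the data nodes are a quick proxy for indexing back-pressure. A sketch using the _cat API (column names follow the 2.x defaults and may differ on other versions):

```python
# Sketch: look for bulk thread pool rejections on the data nodes.
import requests

ES_URL = "http://localhost:9200"

resp = requests.get(
    f"{ES_URL}/_cat/thread_pool",
    params={"v": "true", "h": "host,bulk.active,bulk.queue,bulk.rejected"},
)
# A growing "rejected" count means ES is pushing back on the Logstash bulk requests.
print(resp.text)
```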

Thanks in advance

Have you tried feeding Logstash data through a file input instead of reading from Kafka in order to see if that makes a difference?
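If wiring a file input into one of the Logstash nodes is awkward, a rougher stand-in (not the same test, but it takes both Kafka and Logstash out of the picture) is to replay a few thousand lines straight at the _bulk API and see whether the mapping timeouts still show up. Paths, index name, and type below are placeholders:

```python
# Sketch: replay a slice of a log file directly into ES via _bulk, bypassing
# Kafka and Logstash entirely.
import json
from itertools import islice

import requests

ES_URL = "http://localhost:9200"
INDEX = "replay-test"
DOC_TYPE = "logs"
LOG_FILE = "/path/to/sample.log"

def bulk_body(lines):
    # One action/metadata line plus one source line per document.
    out = []
    for line in lines:
        out.append(json.dumps({"index": {"_index": INDEX, "_type": DOC_TYPE}}))
        out.append(json.dumps({"message": line.rstrip("\n")}))
    return "\n".join(out) + "\n"

with open(LOG_FILE) as fh:
    batch = list(islice(fh, 1000))

resp = requests.post(f"{ES_URL}/_bulk", data=bulk_body(batch))
resp.raise_for_status()
print("bulk errors:", resp.json()["errors"])
```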

FYI, I have 33 Logstash nodes ingesting from Kafka, feeding 31 ES nodes. All ES nodes are on spinning 2x1TB drives, and at peak I get 120K records/s. Records are around 1 KB or larger and come from ATS. So it is certainly doable on spinning drives.

I do see occasional errors like yours, around the time when new indices are created.

Mine also appear around the time new indices are created, or when the first record of a given type is seen in a new index. But it causes data loss, and I really need to make sure I get all records.

Plus I'd like to know the root cause of the issue and fix it.
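One common mitigation for timeouts that cluster around index creation (admittedly a workaround, not the root-cause fix) is to pre-define mappings in an index template so each new daily index doesn't trigger a burst of dynamic mapping updates for every Bro log type. A sketch with placeholder names and fields, not the actual Bro schema:

```python
# Sketch: register an index template so new indices start out with mappings in
# place, cutting down on dynamic mapping updates at index rollover.
# Template name, index pattern, type name, and fields are placeholders.
import requests

ES_URL = "http://localhost:9200"

template = {
    "template": "logstash-bro-*",   # 2.x-style index pattern for the template
    "mappings": {
        "conn": {                   # one mapping per Bro log type you index
            "properties": {
                "ts":        {"type": "date"},
                "id_orig_h": {"type": "ip"},
                "id_resp_h": {"type": "ip"},
                "proto":     {"type": "string", "index": "not_analyzed"},
            }
        }
    },
}

resp = requests.put(f"{ES_URL}/_template/bro-logs", json=template)
resp.raise_for_status()
print(resp.json())
```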

I have not yet tried copying a file to a Kafka node and feeding it directly into the ES cluster. I can't do it from the sensor because the sensor can't talk to the ES cluster for a variety of reasons. I will try the file input from a Kafka node when I can.

The errors seem related to the cluster state taking a long time to update and/or propagate. As you are on Elasticsearch 2.2, which supports delta cluster state updates, I would expect cluster state updates to be reasonably quick. What is the specification of your dedicated master nodes? Do you have any client nodes? Do you see evidence of long garbage collection in the logs?
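On the GC question, one quick way to check without trawling every node's log is to pull old-gen collection counts and times from the nodes stats API. A sketch (field names as I recall them from the 2.x JVM stats; worth double-checking against your version):

```python
# Sketch: print old-gen GC totals per node. Frequent or long old-gen collections
# on the master or data nodes would line up with slow cluster state acks.
import requests

ES_URL = "http://localhost:9200"

stats = requests.get(f"{ES_URL}/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    old_gen = node["jvm"]["gc"]["collectors"]["old"]
    print(node["name"],
          "| old-gen collections:", old_gen["collection_count"],
          "| total time (ms):", old_gen["collection_time_in_millis"])
```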

Sorry for the HUGE delay in responding, but other duties pulled me away.

I am still seeing the issue.

My data node specs are:
12 data nodes
16 cores
32 GB RAM
6 TB of disk on SSD-backed storage

I have extra RAM and cores I can throw at this if needed.

I have 2 client nodes that Kibana points to. Should I build more and have my Logstash nodes point at the client nodes instead of directly at the data nodes?

These are all VMs, not bare-metal servers. This is all being done in a corporate "cloud" environment, and the bare metal isn't available to me.