If you have 24GB of RAM on the machine, try increasing the heap size in Logstash to at least 4GB or so, just to make sure nothing is going on there. You are welcome to set it even higher as well.
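For example, assuming a Logstash 5.x install where the heap is set in config/jvm.options (older releases used the LS_HEAP_SIZE environment variable instead), something like:

# config/jvm.options -- give Logstash a 4GB heap; keep -Xms and -Xmx equal
-Xms4g
-Xmx4g

Restart Logstash after changing it.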
What does CPU/Load on the Logstash machines look like? Any pressure there or is it pretty low like the elasticsearch instance is?
Also, you said that you are using Filebeat to pipe in the data. Is Filebeat installed on one machine, or do you have it installed on multiple machines, all pointing to the same Logstash instance? If it's just one, then it's possible that Filebeat simply can't offload the data fast enough. (I have not used Filebeat much, so I don't know what its capabilities are.)
There are a few other tests you can run to try to narrow down the problem.
1. Try running the node stats command to see if one of your filters is just super slow (see the example after this list).
2. Comment out all filters and outputs, then have Logstash output to something quick like a local file or /dev/null. This tests how quickly Filebeat can deliver messages to Logstash while Logstash does the minimal amount of work possible. If throughput here is still slow, your problem is not Elasticsearch and is most likely Filebeat or your network. (A minimal config sketch is below.)
3. If the logs are being sent from just one server, copy the log files from that one server to multiple other servers, install Filebeat on those servers, and have all of them send data into Logstash at the same time. If Logstash is the problem, throughput will remain basically unchanged. However, if you now have 3 servers sending in data and Logstash throughput triples, then you know Logstash probably isn't the issue.
4. If throughput is good from the item 2 test, keep the filters commented out but add Elasticsearch back to the output, and see if you can deliver to Elasticsearch quickly. Then add your filters back one at a time and measure throughput each time. The elasticsearch output has workers and flush_size options you can try messing with (see below).
5. Try increasing the workers from 12 to 24. Workers are often idle, so you can get away with more workers than you have CPU cores.
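For item 1, the node stats are exposed by the Logstash monitoring API; assuming a 5.x install listening on the default monitoring port 9600 (the exact path differs slightly in later versions), something like:

curl -XGET 'localhost:9600/_node/stats/pipeline?pretty'

The output reports per-plugin event counts and duration_in_millis, which should make a disproportionately slow filter stand out.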
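For item 2, a minimal sketch of what the stripped-down pipeline could look like (the beats port and output path are just placeholders; adjust them to your setup):

input {
  beats {
    port => 5044
  }
}
# no filter block at all -- we only want to measure raw ingest speed
output {
  file {
    path => "/tmp/ls_throughput_test.log"
  }
}

If the null output plugin is installed, you could also swap the file output for null {} to discard events entirely.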
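For items 4 and 5, a rough sketch with placeholder values; treat the numbers as starting points to experiment with, not recommendations:

output {
  elasticsearch {
    hosts      => ["localhost:9200"]
    workers    => 4       # output worker threads
    flush_size => 1000    # max events buffered before a bulk request is sent
  }
}

The overall pipeline worker count is a separate knob: it is the pipeline.workers setting in config/logstash.yml (or the -w flag), e.g. pipeline.workers: 24.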
So, I have one server where I installed Filebeat, Logstash, Elasticsearch, and Kibana.
In this case, is it better not to use Filebeat and just use Logstash? (I had read some things saying that to offload Logstash it's better to use Filebeat to ship logs.)
Point 3 isn't possible because I have only one server. (My logs are on NFS.)
I don't know how to proceed with your tests, but I'll give it a try!
Ok, I was under the impression that you had separate machines for Logstash, Elasticsearch, and Filebeat.
Filebeat is often recommended over Logstash because it is simpler and lighter weight, so it works well when you need to install it on hundreds of production machines where you want as little overhead as possible. However, in your case that doesn't really apply. You already need Logstash installed, so you might as well drop Filebeat for a bit and try using the file input in Logstash to see if that helps at all; a sketch follows.
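A minimal sketch of what that could look like, assuming your logs sit somewhere like /path/to/nfs/logs (adjust the path to your NFS mount):

input {
  file {
    path           => "/path/to/nfs/logs/*.log"
    start_position => "beginning"   # read existing files from the top
    sincedb_path   => "/dev/null"   # forget read positions between restarts -- handy for repeatable benchmarks
  }
}

The sincedb_path => "/dev/null" trick is only for benchmarking; leave it out in production or you will re-ingest everything on every restart.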
You mentioned in your first post that you need to get to 300,000 lines per minute, so it sounds like this could be good enough.
I have read some complaints about the file input (which may also apply to Filebeat). Basically, the file input is more or less single threaded: you can't have multiple cores/threads reading from the same file, so you are capped in how much throughput you can get out of it. You might be able to use the search function to see if anyone has found a way to make it go faster.
Those settings are good for testing. What they mean is that if you restart Logstash, it will start at the beginning of the file instead of where it left off previously. So this is generally not something you want for production as any restarts could mean lots of duplicate data sent to Elasticsearch. But it works great for testing as you can re-use the same log file over and over while you benchmark different settings.
I think start_position => beginning is not only for pre-production; reading a file from the beginning is also useful.
Everything is working now: my conf and my template (well, it still analyzes fields when I told it not to analyze them, but that's not important for the moment).
But do you know this error in Logstash (the logs are uploaded correctly even so)?
[2017-04-14T11:10:38,203][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean for property [index] but got [not_analyzed]
[2017-04-14T11:10:38,204][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean for property [index] but got [not_analyzed]
[2017-04-14T11:11:11,659][WARN ][o.e.d.i.q.QueryParseContext] query malformed, empty clause found at [1:143]
[2017-04-14T11:13:16,229][WARN ][o.e.d.i.q.QueryParseContext] query malformed, empty clause found at [1:143]
[2017-04-14T11:17:59,065][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean for property [index] but got [not_analyzed]
[2017-04-14T11:18:27,091][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean for property [index] but got [not_analyzed]
[2017-04-14T11:21:18,981][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean for property [index] but got [not_analyzed]
Is it possible to change the unit (bytes => gigabytes) when the logs are uploaded? ^^
start_position => beginning can work well at the beginning if you have several days' worth of log files and you want all of them pulled into Elasticsearch. However, once everything is up to date it doesn't really do anything different. The same goes if you don't care about past data and only care about data going forward.
The error looks like something coming from Elasticsearch, but I am not familiar with it. Seems like an issue with your template maybe.
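One guess, in case your template still uses the pre-5.x string syntax: in Elasticsearch 5.x the index property is expected to be true/false and not_analyzed strings became the keyword type, so a field mapped like this (myfield is just a placeholder name):

"myfield": {
  "type":  "string",
  "index": "not_analyzed"
}

would need to become:

"myfield": {
  "type": "keyword"
}

That would explain the "Expected a boolean for property [index]" warnings, but treat it as a guess since I haven't hit this exact one myself.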
So your data file is 400 MB, but your index is 1 GB?
That could be normal for a few reasons.
If you have 1 primary and one replica shard, then that means every record is being stored twice, doubling your storage requirements.
Oftentimes fields are added during this process: grok and mutate filters, header information, and other enrichment often add fields to the data.
Strings normally store the analyzed version and the non-analyzed version. So a string is essentially stored twice. You can disable one or the other but you may give up some functionality.
Since this is stored as JSON, all of the JSON value pairs get stored for each record. Think of a CSV file for example. Only the very top row has the headers, while the rest of the body just has the values. So if you have 1 million records, the headers still only show up once. That's fairly efficient. JSON on the other hand stores the equivalent "header" on every single record. This can increase the storage requirements.
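One quick way to see how much of that 1 GB is replication versus actual data is the cat indices API; for example (the host is a placeholder):

curl -XGET 'localhost:9200/_cat/indices?v&h=index,pri,rep,store.size,pri.store.size'

store.size includes replicas while pri.store.size counts only the primaries, so comparing the two tells you how much is just the replica copy.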