I wasn't sure whether this question should go here or on the Elasticsearch forum; since it's about Logstash configuration I decided to post it here.
What I'm doing is loading data from S3 files containing a huge amount of JSON logs into ES. I want to ask about Logstash's 'date' filter, how it affects the number of indices, and whether that really matters.
When I run this transfer (using the 's3' input) without the 'date' filter, everything finishes really fast and only one index is created in ES, for the current day: 'logstash-2015.06.04'.
The second time I ran it with the 'date' filter, multiple indices were created and the load on the machine skyrocketed, which even caused data loss from time to time because ES became unresponsive.
The 'date' filter I'm using is:
date {
  match => [ "server_time", "UNIX" ]
}
Is it expected that adding the 'date' filter makes the load this heavy and creates so many indices?
What's the difference between having one index and multiple indices created in ES?
Thanks!!
Some info:
I'm running on an EC2 r3.xlarge machine with 32g RAM.
Actually, I'm not creating anything manually. When I add the 'date' filter those indices just get created.
And yes, hundreds of them are created for fewer than 50K events. In Marvel they all appear red and ES becomes unresponsive.
I tried different flush_sizes. With a really small value (e.g. 10) it works fine but the process is extremely slow. With higher values (>= 500) the same scenario I described before happens.
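For context, flush_size sits on the elasticsearch output; mine looks roughly like this (the host and the exact value here are placeholders, not my real config):

output {
  elasticsearch {
    host       => "localhost"   # placeholder host
    flush_size => 500           # one of the batch sizes I tried (from 10 up to 500+)
  }
}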
What are some of the index names? If you're seeing lots of indices created, it implies that each one has a unique date, perhaps in the future or past. Is it possible that instead of UNIX (whole seconds since the epoch) that you have UNIX_MS or nanoseconds or something like that? It would definitely make the date filter project indices into the distant future if you were sending higher precision timestamps to a "whole second" parser.
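If server_time does turn out to be milliseconds, the change would be something along these lines (same field name as in your filter, just a different pattern):

date {
  match => [ "server_time", "UNIX_MS" ]   # UNIX_MS = milliseconds since the epoch
}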
OK, it's been running for around 10 hours now (I guess it's going to take a few more days) and it looks better than before. Indices are generated for actual dates and the machine is more responsive, although it's still under very heavy load.
On my 4-CPU machine this load average is pretty heavy. When I ran the same thing without the 'date' filter the load was a lot lower (probably because only one index was created).
Can you explain why the 'date' filter creates an index per day?
Is it a better practice than having just one index created?
If yes, how does it affect ES performance?
Thanks for your help so far.
UPDATE:
Around the time I wrote the last message, things started to break. The machine became unresponsive and data insertion into ES slowed down. It's still like that, and it's getting slower and slower...
Ideas?
Time series data should be logically grouped into common blocks of time, e.g. days, weeks, months, etc. This simplifies searching and data retention.
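As for why you get one index per day: if you're using the default elasticsearch output, its index setting interpolates the event's @timestamp, which your date filter now sets from server_time. Roughly (host is a placeholder):

output {
  elasticsearch {
    host  => "localhost"                 # placeholder host
    index => "logstash-%{+YYYY.MM.dd}"   # the default: one index per calendar day of @timestamp
  }
}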
As far as your performance issues go, how much memory have you allocated for heap in Elasticsearch? Logstash?
How many nodes do you have? How many events per second?
I can tell by looking at your "JVM Heap Usage" chart that your cluster is overwhelmed. That solid line pegged at 75% usage indicates that your system (which I am guessing only has one node) is constantly performing garbage collections and is never getting ahead. That line should look like a sawtooth pattern, but it's just a flat line pegged at 70%+. Your indexing load seems to indicate the need to grow your cluster to a few more nodes, and/or bigger heap set aside for Elasticsearch.
Events per second: I'm loading historical events from S3. There are loads of JSON events in the files, so I guess they're being delivered from S3 pretty fast.
I can increase the heap size, but Marvel shows that it never really got to 16g. I seriously don't know what more to do...
Now I see a lot of errors in logstash like this:
{:timestamp=>"2015-06-09T04:25:06.156000+0000", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n Plugin: <LogStash::Inputs::S3 bucket=>"***", access_key_id=>"***", secret_access_key=>"***", region=>"us-east-1", prefix=>"***", sincedb_path=>"/var/lib/logstash/.sincedb_s3", temporary_directory=>"/var/lib/logstash/logstash">\n Error: Invalid UTF-32 character 0x7b226e61(above 10ffff) at char #199, byte #799)", :level=>:error}
Not sure if it's related... it's probably what you said. I'll try adding another node to the cluster. Is it possible to just add memory?
It is possible to add memory. The maximum recommended heap for an Elasticsearch node is 30.5G. We recommend that this number be no more than 1/2 of the available system memory, which implies a 64G server to use a 30.5G heap.
Not sure what the character error is.
[EDIT] Changed 31G to 30.5G based on most recent information.
Hey, I'm seeing the same "Invalid UTF-32" problem that refaelos described.
My scenario: a Python script is appending to a log file, while LS is configured to read fresh lines from it. The file sits on an NFS mountpoint, which means both the Python script and LS are actually interacting with it remotely.
This error occurs randomly, and I'd really appreciate any tips on what's happening.
As for memory, LS runs on a machine without ES installed and there's still more than 50% of memory free, so I presume that's not the cause.