Hello ES community,
First of all, disclaimer: my communications in this forum are solely mine, and do not represent the views of my employer.
This is my first post, and it follows a significant amount of research on my part to understand the moving parts of the Elastic portfolio.
My organisation is in the business of high-speed logging, in the sense of generating logs at high rates. We manufacture high-speed firewall appliances that can generate anywhere from 50 to 500,000 logs per second (and even beyond that). Naturally, this is a rather wide range of use cases to address performance-wise, and my concern here is to address the higher end of that range first. We currently use a SQL approach, with nodes capable of sustaining an insertion rate of 60k events per second (EPS), and we frequently deal with stores of up to 40-50TB per node. None of it is clustered in the strict sense of the word, as compared with ES. Querying such large datasets can, as you can imagine, be a performance challenge, so I have no illusions: to address this I would likely need multiple units in an ES cluster rather than one big unit.
Hardware-wise, what I can work with in a benchmark test are units equipped with two modern CPUs, such as the Intel E5-2600 series, providing anywhere from 12 to 16 cores per node, accompanied by 64GB of RAM. I have both SSD and non-SSD versions, with 16TB of storage. Those are the devices I have on hand to test with at the moment.
Document wise, a sample string of what we are indexing looks like this:
date=2016-05-03 time=10:37:46 devname=MN140D devid=FG140P3G13800156 logid=1059028704 type=utm subtype=app-ctrl eventtype=app-ctrl-all level=information vd=root appid=15893 user="" srcip=192.168.83.4 srcport=53861 srcintf="LAN" dstip=18.104.22.168 dstport=80 dstintf="wan1" profiletype="applist" proto=6 service="HTTP" policyid=1 sessionid=651420 applist="default" appcat="Web.Client" app="HTTP.BROWSER" action=pass hostname="www.ask.com" url="/web?q=elasticsearch&search=&qsrc=0&o=0&l=dir" msg="Web.Client: HTTP.BROWSER," apprisk=medium
I am currently parsing this with Logstash, using the grok, mutate and kv filters, and inserting into ES on my test VM (not my benchmark setup), and things appear to work quite well.
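To make the parsing step concrete, here is a rough Python sketch of the key=value extraction that the sample line calls for. It is only an illustration of the idea, not my actual Logstash config: the regex, the helper name parse_line, the choice of integer fields, and the "@timestamp" merge are all assumptions, and the sample is a trimmed subset of the line above.

```python
import re
from datetime import datetime

# Trimmed copy of the sample log line above (a subset of its fields).
SAMPLE = (
    'date=2016-05-03 time=10:37:46 devname=MN140D type=utm subtype=app-ctrl '
    'user="" srcip=192.168.83.4 srcport=53861 srcintf="LAN" dstport=80 '
    'hostname="www.ask.com" url="/web?q=elasticsearch&search=&qsrc=0&o=0&l=dir" '
    'msg="Web.Client: HTTP.BROWSER," apprisk=medium'
)

# key=value, where the value is either double-quoted (and may contain spaces)
# or a bare run of non-whitespace characters.
KV_PATTERN = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))')

# Assumption: these fields are always numeric and should be stored as integers.
INT_FIELDS = {"srcport", "dstport"}


def parse_line(line):
    """Turn one key=value log line into a dict (hypothetical helper)."""
    fields = {}
    for match in KV_PATTERN.finditer(line):
        key, quoted, bare = match.groups()
        value = quoted if quoted is not None else bare
        fields[key] = int(value) if key in INT_FIELDS else value
    # Merge the separate date and time fields into one ISO 8601 timestamp.
    # The log line carries no timezone, so this stays a naive timestamp.
    stamp = datetime.strptime(
        fields["date"] + " " + fields["time"], "%Y-%m-%d %H:%M:%S"
    )
    fields["@timestamp"] = stamp.isoformat()
    return fields


doc = parse_line(SAMPLE)
print(doc["hostname"], doc["srcport"], doc["@timestamp"])
# prints: www.ask.com 53861 2016-05-03T10:37:46
```

Quoted values are matched first so that the "=" and "&" characters inside the url field are not mistaken for further key=value pairs.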
Performance-wise, I want to achieve at least 60k EPS per node, and I want to be capable of running aggregation queries over what I think will be anywhere between 10 and 50TB of data in an acceptable timeframe (read: less than 30 seconds).
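As a back-of-envelope check on what these numbers imply, here is a small Python sketch. The 60k EPS rate and the 16TB of storage come from this post; the 600 bytes per event is a round placeholder (the sample line is a bit over 500 characters), and index_overhead=1.0 simply assumes the on-disk index is the same size as the raw data, which is almost certainly not true and would need to be measured.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_tb(eps, avg_event_bytes, index_overhead=1.0):
    """Data written per day, in decimal TB, at a sustained event rate.

    index_overhead is an assumed ratio of on-disk index size to raw size.
    """
    return eps * SECONDS_PER_DAY * avg_event_bytes * index_overhead / 1e12


def retention_days(storage_tb, eps, avg_event_bytes, index_overhead=1.0):
    """How many days of data fit on one node at that rate."""
    return storage_tb / daily_volume_tb(eps, avg_event_bytes, index_overhead)


# 60k EPS and 16TB are from the post; 600 bytes/event is a placeholder.
print(f"{daily_volume_tb(60_000, 600):.2f} TB/day")      # prints: 3.11 TB/day
print(f"{retention_days(16, 60_000, 600):.1f} days")     # prints: 5.1 days
```

Under those assumptions a single node at a sustained 60k EPS would fill 16TB in roughly five days, so a 10 to 50TB dataset would cover only a few days to a couple of weeks of peak-rate traffic.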
Now, the questions:
Am I insane? If yes, please elaborate
Logstash would ideally not be needed if we logged directly to ES using JSON, but in the interim I have to find a way to get Logstash to scale to rather ridiculous levels, i.e. ideally about 120k EPS. What can I do to optimize Logstash's config, runtime parameters, etc. to achieve as high a throughput as I can? I have done some reading on this, but perhaps, given my document sample above, this community can suggest some other tips. I have read that there are ways to run multiple Logstash instances behind external load balancing in order to achieve higher EPS rates, but those discussions became relatively confusing once they involved AMQP frameworks...
As this use case is about log management, we can stand to wait some time for queries to run (< 30 seconds), but we can't really accept unreasonable delays (> 120 seconds). A lot of the requests will be on recent data (reviewing logs, just like browsing them in Kibana Discover), and the time-based indexes that Logstash already outputs are helping me there, but we frequently have to run aggregation queries that span much larger parts of the dataset, as mentioned previously, easily to the tune of tens of TB. I am going to perform my own testing on this, but I am looking for guidance as to just how much index storage I can reasonably expect the above hardware to handle on each node. I have no idea what reasonable performance can be derived from various storage strategies.
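For reference, this is roughly what I have in mind when I talk about logging JSON directly to ES with time-based indexes: a Python sketch that builds a bulk API request body, routing each document to a daily index named the way Logstash does by default (logstash-YYYY.MM.dd). The helper name build_bulk_body and the document fields are made up for illustration, and nothing is actually sent to a cluster here. Depending on the ES version, the action line may also need a "_type".

```python
import json


def build_bulk_body(docs, index_prefix="logstash-"):
    """Build a newline-delimited bulk request body (hypothetical helper).

    Each document is preceded by an action line naming a daily index
    derived from the document's own "@timestamp" field.
    """
    lines = []
    for doc in docs:
        # "2016-05-03T10:37:46" -> "2016.05.03"
        day = doc["@timestamp"][:10].replace("-", ".")
        action = {"index": {"_index": index_prefix + day}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"@timestamp": "2016-05-03T10:37:46", "action": "pass", "dstport": 80},
    {"@timestamp": "2016-05-04T00:00:01", "action": "pass", "dstport": 80},
]
print(build_bulk_body(docs), end="")
```

The resulting string would be the payload of a POST to the _bulk endpoint. Batching many events into one request like this, rather than indexing one document per request, is the part I expect to matter for throughput.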
Sorry, no TLDR on this post
I am a technologist, but only an amateur when it comes to "big data" strategies, so sorry if the above sounds newbie-ish.