Architecture decision confusion

Hi all,

We are currently trying to decide which architecture to use for a new ITOA application that analyzes system and application logs.

Initially we decided on a pure ELK stack, but as we researched further, we decided to bring the Hadoop ecosystem into the design to increase our analytical capabilities and opportunities.

Here is what we think:
Beats->Logstash->HDFS->Spark<->ES->Kibana

Logstash will forward the raw log data (perhaps with some pre-filtering) to HDFS (on Isilon). Logs will be queryable from Spark, and we'll apply MLlib and other extra features as well. Then we will forward the indexed data to ES and visualize it in Kibana.
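For the Logstash→HDFS leg of the pipeline above, one option is the logstash-output-webhdfs plugin. A minimal sketch, assuming Beats input and WebHDFS enabled on the NameNode (or the Isilon HDFS endpoint); the hostname, port, path, and user are placeholders, not recommendations:

```
input {
  beats {
    port => 5044
  }
}

filter {
  # Optional light pre-filtering before archiving the raw event.
  mutate {
    remove_field => ["@version"]
  }
}

output {
  # Requires the logstash-output-webhdfs plugin.
  webhdfs {
    host => "namenode.example.com"   # placeholder: NameNode / Isilon endpoint
    port => 50070
    path => "/logs/%{+YYYY-MM-dd}/logstash-%{+HH}.log"
    user => "hdfs"
  }
}
```

This keeps the raw events on HDFS for Spark while leaving room for a second output (e.g. directly to ES) if you want both paths.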
Here are our questions:

  1. Do you recommend keeping the ES indices on HDFS too, or should we write them to NFS? I heard a Spark contributor state that putting ES indices on HDFS is not the best thing to do. Is this still valid?

  2. Is the scenario of Logstash filtering and indexing (grok) and sending to ES comparable to the scenario of putting raw data on HDFS, indexing it with Spark, and sending it to ES? Can and should we replicate all of Logstash's filter and index capabilities in Spark and send the results to ES the way Logstash does? Is that approach better and faster for ES indexing, as described in the es-hadoop question, or is there another Spark integration scenario we are not aware of? How does ES-Hadoop reduce the amount of indexing that ES itself has to perform when indexing directly from Spark data structures into ES, and how does that differ from Logstash's grok filtering and direct indexing into ES?

  3. What is the advantage of backing up ES to HDFS?

  4. Should we separate the Spark and ES servers, or can they reside on the same machines?

I hope that you can provide some insight into this architecture and guide us. As you may notice, we are new and a bit confused about the concepts, and any guidance will be very much appreciated.

Thank you.

HDFS is not the best place to store live ES indices. HDFS is a great long-term, high-volume block storage system, but it lacks many of the intricate optimizations that a classical local filesystem provides. Lucene takes great advantage of these optimizations to deliver high-speed indexing and searching of data. When serving indices from HDFS, these operations take a considerable performance hit, depending on the use case.

It depends on what you're trying to do. Logstash is geared much more toward pure ingestion mechanics, whereas Spark is geared more toward processing. If you need to perform heavy processing on your data before loading it, Spark is a great option.
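To make the trade-off concrete: a grok pattern is essentially a regex with named capture groups, and that is the kind of field extraction you would reimplement in Spark. A minimal sketch in plain Python, with the log format and field names chosen for illustration only:

```python
import re

# A grok pattern like %{COMBINEDAPACHELOG} boils down to a regex with
# named capture groups; this is a simplified access-log version.
LOG_PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    """Extract structured fields from a raw access-log line.

    Returns a dict suitable for indexing into Elasticsearch,
    or None if the line does not match the expected format.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["status"] = int(doc["status"])
    doc["bytes"] = int(doc["bytes"])
    return doc

# In Spark, this function would be applied per record, e.g.:
#   parsed = raw_rdd.map(parse_line).filter(lambda d: d is not None)
sample = '10.0.0.1 - - [12/Mar/2016:19:20:30 +0000] "GET /index.html HTTP/1.1" 200 5120'
print(parse_line(sample))
```

The difference is where the work happens: Logstash does this per event at ingest time, while Spark can do it in bulk over the raw archive on HDFS and then hand pre-structured documents to ES-Hadoop for writing.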

HDFS is a great place to store index snapshots, since snapshots normally require lots of space and are not frequently accessed except when creating or restoring them. There are also a number of plugins that support other massive storage solutions such as S3, Azure, and so on.
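For reference, snapshotting to HDFS goes through the repository-hdfs plugin: you register an HDFS-backed snapshot repository and then snapshot into it like any other repository. A sketch of the registration request; the repository name, URI, and path are placeholders:

```
PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/snapshots"
  }
}
```

After that, snapshots are created and restored with the standard `_snapshot` API, with the heavy data landing on HDFS instead of local disk.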

It depends on your hardware. If you can comfortably host both services on each node, then go for it. If not, it won't hurt anything to run them on separate nodes.

Thank you very much. I'll share my observations as the project progresses.
