Elasticsearch Architecture Problem

Hi guys. The picture below shows the existing data pipeline of a certain client.

Their MAIN PROBLEMS are:

  1. Not all of the data is being ingested into Node 2 and Node 3
  2. Slow retrieval of data

Do you have any idea what might be the cause of the problem?

  1. Is it hardware related? If they need to upgrade, what are the ideal specs per node?
  2. Will adding another node help solve the problem?
  3. Would implementing the architecture in the 2nd or 3rd picture below solve the problem?
  4. Also, they want to apply machine learning to their data. What can you suggest for them?

Other Details:

Hey @josephmanalo

Please note that fortunately we are not all guys here :slight_smile:

Do you have any Elasticsearch monitoring activated so you can perhaps better understand the cause of this?

It sounds like you are using HDDs on some nodes and SSDs on others. That's an issue IMO.

I also wonder what you are using Logstash for.

Maybe you also have too many shards per node.
What is the output of:

```
GET /_cat/health?v
GET /_cat/indices?v
```

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

Elasticsearch assumes all nodes in the cluster are equal, so having different types of hardware can cause a problem. Are you sure your cluster has formed properly and that you have set discovery.zen.minimum_master_nodes correctly according to these guidelines?
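For reference, here is a minimal `elasticsearch.yml` sketch of that setting, assuming a cluster with three master-eligible nodes (the node count is an assumption; adjust the quorum to your own cluster):

```yaml
# elasticsearch.yml on each master-eligible node
# Quorum = (number of master-eligible nodes / 2) + 1
# With 3 master-eligible nodes, the quorum is 2
discovery.zen.minimum_master_nodes: 2
```

Setting this below the quorum risks a split brain; setting it above can prevent the cluster from electing a master at all.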

Hi! Thanks for the reply.
I'll check out what you've sent.

Here's the result of the following:

We are using Logstash to append a new column to the data and push it to its corresponding node.
We tried solving the problem by dividing the data: some modules are stored in Node 2 and other modules in Node 3.
But that still doesn't solve the problem.

Hi. Thanks for the suggestion. I'll look into this and update you. Thanks!

You seem to have a lot of shards being generated daily given the small data volumes. I would recommend reducing this significantly, e.g. by changing to a single primary shard per index, switching to weekly or monthly indices, or simply consolidating the data into fewer indices.
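A single primary shard can be set via an index template. A minimal sketch (the template name and index pattern here are assumptions; substitute your own):

```
PUT _template/logs_single_shard
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

This only affects newly created indices; existing indices would need to be reindexed or consolidated to reduce their shard count.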

Please don't post images of text as they are hardly readable and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

I'm not sure I understood. Anyway, maybe look at the ingest node feature, which might be enough to replace your Logstash pipeline.
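For example, adding a field at index time can be done with an ingest pipeline using the `set` processor. A minimal sketch (the pipeline name, field name, and index name are assumptions, standing in for whatever the Logstash step currently appends):

```
PUT _ingest/pipeline/add_module_field
{
  "description": "Append a module field, similar to the Logstash step",
  "processors": [
    { "set": { "field": "module", "value": "module_a" } }
  ]
}

PUT my-index/_doc/1?pipeline=add_module_field
{
  "message": "sample document"
}
```

The indexed document would then carry the extra `module` field without Logstash in the path.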

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.