I have been trying to load data into Elasticsearch in real time using Python. I am new to Elasticsearch. I used a single node on a single server. Elasticsearch was not able to keep up with the real-time data coming into the server, and a pretty large backlog built up because it couldn't match the incoming throughput.
I went to an Elastic event and was told I need to create a cluster. I began researching clusters and discovered there are different types of nodes:
Master nodes
Data nodes
Client nodes
Ingest nodes
Which type(s) of node should I add to help with the throughput problem? Any ideas on how to begin designing my cluster?
I assume the amount of memory is a factor. To what degree, and how do I calculate how much I need?
Thank you in advance. Any help would be appreciated.
Are you sure your client was pushing Elasticsearch hard enough? It's very common for throughput problems to be on the client side, nothing to do with Elasticsearch config. Make sure you're sending lots of large bulk requests in parallel, for example with the bulk helpers in the Python client (see the sketch below).
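Here is a minimal sketch of what that looks like with the official elasticsearch-py client. The host URL, index name, and generate_docs() source are placeholders you'd swap for your own setup; the point is that parallel_bulk batches documents and sends them from several threads instead of indexing one document per request.

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

# Assumed local single node; replace with your cluster address.
es = Elasticsearch("http://localhost:9200")

def generate_docs():
    # Placeholder: replace with your real-time source.
    # Yield one bulk action per document.
    for i in range(100_000):
        yield {"_index": "my-index", "_source": {"event_id": i, "msg": "example"}}

# parallel_bulk chunks the stream and indexes the chunks from multiple
# threads; the returned generator must be consumed for requests to run.
for ok, info in parallel_bulk(es, generate_docs(), thread_count=4, chunk_size=1000):
    if not ok:
        print("Failed:", info)
```

Tuning thread_count and chunk_size to your document size and hardware usually matters far more than any cluster-topology change at this stage.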
If you've confirmed that Elasticsearch really is the bottleneck, then the limiting factor is probably either I/O bandwidth or CPU count, and you can address both by scaling your one node up to have more power (to some extent) or by adding nodes (almost arbitrarily far). I wouldn't worry about different node types; just use the default, which is for every node to do everything. You can refine that later if you want, but the default is going to be the simplest way to increase performance.