Case studies of successful ES clusters in production


(John Ouellette) #1

Using ES in production? Are you indexing huge quantities of data per second or do you have a steady but critical trickle of data coming in? Are you running complex queries with tonnes of aggregations or just browsing through documents with Kibana? If you have a stable ES cluster that you've built up and tuned for your applications and are happy with it in production, chances are that your configuration and 'secret sauce' in the tuning and resources of your cluster are very different than someone else's with a similar application but a different indexing load, different query types, or different resources to dedicate to ES. Would you be willing to share some insight into your cluster configuration for the benefit of other users of ES, both new and experienced?

I am a relatively new user of ES. I have found the number of knobs, levers, bells, and whistles available in ES to be somewhat daunting, and I have often bumped into problems with my cluster that I am sure are due to a poor choice for one or two of those levers... I am hoping that enough people will respond to the questions below that we can have a good overview of working production configurations, and their use-cases, so that new users will have a good sample of configurations to build their own cluster with. If you spent hours, and buckets of blood and sweat, tuning your ES cluster, wouldn't you like to share your masterpiece with the community? This might also help the ES people in improving their documentation (not saying the documentation is bad, but documentation can always stand some improving :slight_smile: ).

If this has already been done, please let me know and point me to the results! If you have any additions or changes to suggest to the questions below, please provide some feedback!

Questions:

  • If there was an overriding reason -- such as improved performance or reliability or just a hard lesson learned -- for making a specific configuration decision below, please provide some details.

General usage:

  1. What version of ES are you using?
  2. How would you describe the use you have put ES to? Logging of services or applications? Application-specific document store? User activity analytics? System performance metrics? Others?
  3. What sort of indexing rates do you have? Number of documents per second on average?
     • Are there better metrics describing the rate of data ingestion of your cluster? If so, what are they?
  4. Is the bulk of the indexing primarily on one index, or several?
  5. What sort of queries run on the cluster? Are queries primarily against the most recent data indexed (e.g. to trigger responses to certain events or event rates)? Do searches hit new and old data alike, hitting lots of documents and indexes? Do you use lots of complex aggregations?
  6. Any other details you'd like to provide?
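
If you are not sure how to estimate the average indexing rate asked about above, here is a rough sketch. It assumes you can take two readings of a cumulative indexed-document counter some known number of seconds apart (for example, the `indexing.index_total` value reported by the index stats API). The function name and the sample numbers are made up for illustration.

```python
def docs_per_second(count_start, count_end, elapsed_seconds):
    """Average indexing rate between two samples of a cumulative document counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        # A cumulative counter should never shrink; it may have been reset.
        raise ValueError("counter went backwards between samples")
    return (count_end - count_start) / elapsed_seconds


# Hypothetical readings taken 60 seconds apart.
rate = docs_per_second(1_200_000, 1_500_000, 60)
print(f"{rate:.0f} docs/s")
```

A longer sampling window smooths out bursts, so it may be worth reporting both a peak and an average figure.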

Data volume:

  1. How many primary shards are there in your cluster?
  2. How many indexes?
  3. How many documents are there in your largest index? Largest shard? How big are these in terms of gigabytes?
  4. What is the total volume of data in your cluster (documents, gigabytes)?
  5. Any other metrics you'd like to share?

Physical Configuration:

  1. How many data nodes are there in your cluster? What sort of resources do they have (RAM, cpu cores, disk space dedicated to ES data)? What sort of disks and RAID do you use?
  2. How many master nodes do you have in your cluster? What sort of resources do they have (RAM, cpu cores)?
  3. Are your data and master nodes dedicated to that specific purpose, or do you combine data and master nodes? Do those nodes provide any other function, other than being members of your ES cluster?
  4. Do you use query-only (client-mode) ES nodes?
  5. What sort of network do you have among these nodes?
  6. Any other details you think are relevant or interesting or novel?

ES Configuration:

  1. How many shards does each index have?
  2. How many replicas does each index have?
  3. Was there a performance or reliability reason for choosing the number of shards per index? If so, what led you to this number?
  4. Are there any ES configuration parameters that you have had to modify from the defaults in order to improve indexing, querying, or recovery performance? Please list those parameters, along with a few words on why you modified those parameters and arrived at the specific settings you are using.
  5. What options do you run the ES java instances with (either on the command-line or sysconfig file, not the elasticsearch.yml file)?
  6. Do you snapshot your cluster or back it up in some way? If so, any lessons learned there?
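
When answering the shard and replica questions, it may help to report the total number of shard copies the cluster has to hold. A minimal sketch of that arithmetic, assuming every index uses the same settings (each primary shard has `replicas` additional copies); the function name and the example figures are hypothetical.

```python
def total_shard_copies(primaries_per_index, replicas_per_index, index_count=1):
    """Total shard copies (primaries plus replicas) across identically configured indexes."""
    if primaries_per_index < 1 or replicas_per_index < 0 or index_count < 0:
        raise ValueError("invalid shard, replica, or index count")
    return index_count * primaries_per_index * (1 + replicas_per_index)


# Hypothetical cluster: 30 daily indexes, 5 primaries each, 1 replica each.
print(total_shard_copies(5, 1, index_count=30))
```

Dividing that total by the number of data nodes gives a rough shards-per-node figure, which is a handy number to include alongside your answers.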

General Wisdom:

  • Are there any details of your cluster that were not captured in the above questions that you'd like to share?
  • Have you had any experiences with ES that taught you some valuable lessons which you'd like to share?
  • Any other words of advice about ES, or pitfalls to avoid?

Thanks!
John Ouellette


(Isabel Drost-Fromm) #2

There is a list of case studies available here: https://www.elastic.co/use-cases - though they don't cover your questions exactly, they might be helpful anyway.

Isabel


(John Ouellette) #3

Thanks Isabel -- unfortunately, while those cover the areas in which ES is
being used successfully, there isn't enough detail to provide a starting
point for a user to configure their own cluster if they have a similar use
case or resources.

John


(Nik Everett) #4

How about


and

?

If you are willing to dig around in there you can get access to just about everything wikimedia uses for on site search and logstash. There is certainly enough information floating around on the internet to answer your questions for wikimedia.

You can always reach out to the organizations in the case study list.

I think your questions are wonderful, btw.


(John Ouellette) #5

Thanks Nik -- I might try to go through that detail and see if I can get the answers to my questions from it. If so, I'll post the results here.

I had a few more questions to add, but ran into the 5000 character limit :slight_smile: I think I've got enough there to capture the essentials, though.

John

