Elasticsearch querying is terribly slow

Hello,

I have been testing Elasticsearch & Kibana for some time. I need a cluster that can process 40k events per second and search across the last week of data in under a minute. My cluster consists of 9 data nodes, 3 master nodes and one client node. I installed X-Pack today so I can use the profiler and the monitoring features.

The cluster is meant to store and analyze Netflow and IPFIX data. I currently ingest around 5,000 events per second, which should be really easy. However, searching this data is terribly slow. I was hoping the monitoring could help me pinpoint what is going wrong, and something immediately raised a red flag for me.

I don't have 209 GB of available memory for no reason! Why does Elasticsearch not use more of it? I have tried all the basic tweaks and optimizations for Elasticsearch clusters that I could find on the Internet, but I am probably still missing something. Logstash was also far slower for me than the throughput others report.

Any ideas on why my Elasticsearch cluster uses barely 10% of my total available memory?

Thanks!

Are you using daily indices? How many primary shards and replicas does each index have?

I believe the memory shown there is the heap usage. What is your heap size set to on your nodes?
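If it helps, something like this should show the configured and used heap per node (a rough sketch with Python and requests; adjust the host to wherever your cluster listens):

```python
import requests

# Ask the cat nodes API for per-node heap columns
# (localhost:9200 is an assumption about where the cluster is reachable).
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,heap.max,heap.current,heap.percent"},
)
print(resp.text)
```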

I am using daily indices, and each of them has 3 primary shards and 2 replicas. I chose these numbers because I have 3 physical servers, with 3 data nodes on each of them. Each data node has 2x 1TB disks in RAID0 and 40GB of RAM, of which 20GB is allocated to the Elasticsearch heap. Memory locking is enabled, as most tutorials recommend. :slight_smile:

My monitoring instance also tells me that the load on data nodes never exceeds 10%!

I changed the number of shards to 9 and kept the number of replicas at 2. Now the heap usage is around 50GB. I realized that shards use resources, so maybe more shards will help use all the available resources. How many shards and replicas should I use for my daily indices, given that my disks are in RAID0 and Elasticsearch will have to compensate for a disk failure? Or is it a better idea not to use daily indices at all?
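For reference, the shard and replica counts can be applied up front with an index template, roughly like this (a sketch; the `netflow-*` naming and exact numbers are just examples of what I am experimenting with):

```python
import requests

# Legacy index template for the daily indices (sketch; on newer versions
# the "template" key is "index_patterns" instead).
template = {
    "template": "netflow-*",
    "settings": {
        "number_of_shards": 9,
        "number_of_replicas": 2,
    },
}
resp = requests.put("http://localhost:9200/_template/netflow", json=template)
print(resp.json())
```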

Thanks!

The rollover and shrink APIs were designed to make daily indices unnecessary in many use cases. Whether you still need daily indices depends on how long you keep the data.
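A minimal sketch of how rollover could fit a flow like this: write through an alias, and periodically ask Elasticsearch to roll it over once the current index is old or big enough (the alias, index names and thresholds below are made-up examples):

```python
import requests

ES = "http://localhost:9200"

# Bootstrap: create the first index and point a write alias at it.
requests.put(f"{ES}/netflow-000001", json={"aliases": {"netflow-write": {}}})

# Periodically (e.g. from cron), roll the alias over to a fresh index
# once the current one crosses either condition.
resp = requests.post(
    f"{ES}/netflow-write/_rollover",
    json={"conditions": {"max_age": "1d", "max_docs": 500000000}},
)
print(resp.json())
```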

With 1 replica you'll be able to withstand a single disk failure while Elasticsearch recovers the shards that were lost. You'll want another replica if you need to survive more failures than that, I suppose.

I tried some new things. I changed the number of shards to 27, since some of my indices are over 400GB in size. After this change, the heap usage increased to around 80GB. I will have to collect some data over the upcoming days to find out whether this actually helps. I also removed one replica, so I am down to 1 now.

I also tried _forcemerge'ing a few indices from days that have already passed. The data in those indices will not change anymore, so the number of segments can be reduced to improve search performance.
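The call looks roughly like this (the index name is just an example, and I only run it on indices that are no longer written to):

```python
import requests

# Force-merge a past day's index down to a single segment.
# Only safe on read-only indices; index name is an example.
resp = requests.post(
    "http://localhost:9200/netflow-2017.06.01/_forcemerge",
    params={"max_num_segments": 1},
)
print(resp.json())
```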

Today I performed a simple '*' Discover query over the last 7 days, and it takes so long to complete that it times out instead. The query would return at most 2 billion records. Why does Kibana take so long to complete such a simple query? And why does Kibana query all the indices that are present in Elasticsearch?
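The equivalent query sent straight to Elasticsearch would look roughly like this (a sketch; the index pattern and the @timestamp field name are assumptions about my setup):

```python
import requests

# Roughly the same "*" query over the last 7 days, sent directly to
# Elasticsearch; size 0 just counts matches without fetching documents.
query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-7d/d", "lte": "now"}}}
            ]
        }
    },
}
resp = requests.post("http://localhost:9200/netflow-*/_search", json=query)
print(resp.json()["hits"]["total"])
```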

Another side note: my data nodes run on three different physical machines. These machines are connected over 10GbE, but they are not located in the same datacenter. Does the slightly increased latency matter much?

Thanks!

Any thoughts? I don't know what I could be doing wrong!

Thanks!

If you have 400GB indices with over a billion records each, I'm not surprised to see a search take a minute or two. Is that 400GB per index? So your use case is Netflow and you're using daily indices, which means each index is only written to on the day its data is generated. Once an index is no longer being written to and search becomes its main purpose, you could use the shrink API to lower its number of shards. It may also be worth spending some time on the mapping to improve performance. You do not always have to max out your heap to be efficient.
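A shrink of an older daily index would look roughly like this (a sketch; index and node names are examples, the index must be made read-only and fully allocated to one node first, and the target shard count has to divide the source count evenly, e.g. 27 -> 3):

```python
import requests

ES = "http://localhost:9200"
source, target = "netflow-2017.06.01", "netflow-2017.06.01-shrunk"  # example names

# 1. Block writes and require a copy of every shard on a single node
#    (the node name is an assumption).
requests.put(f"{ES}/{source}/_settings", json={
    "index.blocks.write": True,
    "index.routing.allocation.require._name": "data-node-1",
})

# 2. Shrink into a new index with fewer primary shards.
resp = requests.post(f"{ES}/{source}/_shrink/{target}", json={
    "settings": {"index.number_of_shards": 3, "index.number_of_replicas": 1}
})
print(resp.json())
```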

Maybe you should try searching the merged indices to check whether the search speed is fast enough?

In my situation, I created three indices: the first stores incoming data (new data), the second is a temporary index for data that has not yet been packed into the final index (temp-old data), and the third is the final index (old data).

Every midnight, I move old data from the first index to the second and merge both indices. After that, if the data in the second index exceeds a threshold, it is moved to the third index, and that final index is force-merged down to 1 segment. (In case the move operation takes more than one day, I create another index to store temp-old data.)
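A rough sketch of that nightly job, using the reindex and force merge APIs (the index names and the threshold are just examples, and the step that deletes the copied documents from the source is left out):

```python
import requests

ES = "http://localhost:9200"

# 1. Copy documents from the "new data" index into the temp index
#    (index names are illustrative).
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "flows-new"},
    "dest": {"index": "flows-temp"},
})

# 2. If the temp index has grown past a threshold, move its contents on
#    to the final index and merge that down to a single segment.
stats = requests.get(f"{ES}/flows-temp/_stats/store").json()
size_bytes = stats["_all"]["primaries"]["store"]["size_in_bytes"]
if size_bytes > 50 * 1024 ** 3:  # 50GB threshold, just an example
    requests.post(f"{ES}/_reindex", json={
        "source": {"index": "flows-temp"},
        "dest": {"index": "flows-old"},
    })
    requests.post(f"{ES}/flows-old/_forcemerge", params={"max_num_segments": 1})
```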

Daily indices are useful when all your queries only hit the most recent days, but if your queries span all of the days, such a setup is still slow. Imagine you have 100 days' worth of data...

BTW: I just cannot understand why Lucene stores all the data in compound segments (what's more, the maximum size of such a segment is 5GB, so can such an index handle big data?) until a force merge is requested by the user.

I took a good look at my mapping and disabled the _all field, since full-text search is not useful for our Netflow use case. I also took a look at the shard request cache: I put "index.requests.cache.enable": true in the index settings, and I changed the request cache size on each node to 3% of the heap. Once I have collected some more data, I will check whether this actually helps!
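Concretely, the changes look roughly like this (a sketch; the template follows my earlier naming, the _default_ mapping is the 5.x way of applying _all to every type, and the per-node cache size goes into elasticsearch.yml rather than through the API):

```python
import requests

# Template changes: disable _all and enable the shard request cache.
template = {
    "template": "netflow-*",
    "settings": {"index.requests.cache.enable": True},
    "mappings": {"_default_": {"_all": {"enabled": False}}},
}
resp = requests.put("http://localhost:9200/_template/netflow", json=template)
print(resp.json())

# The per-node cache size is a node setting, set in elasticsearch.yml:
#   indices.requests.cache.size: 3%
```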

I understand your idea @chenjinyuan87, thanks!

I will post an update when I have new findings.

Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.