I am working on an application where I need to take a huge amount of log data and store it in Elasticsearch. The data contains information about which product feature a customer has used, the time-stamp, the customer name, and the customer location.
I am also using Kibana to query and generate reports from this data.
Currently, I have arranged the data in the following manner:
Storing this data in multiple indices, where each index holds the usage data for one specific feature and customer.
A separate index for customer and location (a rough indexing sketch is shown below).
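For illustration, here is a minimal sketch of what one usage event could look like when indexed with the Python `elasticsearch` client (7.x-style calls); the index name, field names, and values are assumptions based on the description above, not my actual schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hypothetical per-feature/per-customer index; one usage event per document.
es.index(
    index="usage-feature_x-customer_a",
    body={
        "customer": "customer_a",
        "feature": "feature_x",
        "location": "Berlin",
        "timestamp": "2016-05-01T12:00:00Z",
    },
)
```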
I want to minimize the search time required for aggregations and queries, which can be of the following kinds (a sketch of one such aggregation follows the list):
Feature vs Customer
Customer vs Location
All features vs Customer
Feature vs Customer vs Time-stamp
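As a concrete example of the last kind, a Feature vs Customer vs Time-stamp report can be expressed as nested aggregations, roughly like the sketch below. The field names (`customer`, `feature`, `timestamp`) and the index pattern are assumptions based on the data described above, and the call again assumes the 7.x-style Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

query = {
    "size": 0,  # only the aggregation results are needed, not individual hits
    "aggs": {
        "by_customer": {
            # ".keyword" assumes default dynamic mapping; drop it if the
            # fields are already mapped as keyword
            "terms": {"field": "customer.keyword"},
            "aggs": {
                "by_feature": {
                    "terms": {"field": "feature.keyword"},
                    "aggs": {
                        "over_time": {
                            # "calendar_interval" is the 7.x+ name; older
                            # versions use "interval"
                            "date_histogram": {
                                "field": "timestamp",
                                "calendar_interval": "day",
                            }
                        }
                    },
                }
            },
        }
    },
}

response = es.search(index="usage-*", body=query)
print(response["aggregations"]["by_customer"]["buckets"])
```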
Am I doing this right, or should I restructure the data further to reduce the search time for these aggregations?
The structure of the data is less important to speed than the amount of data to be searched. The main factors are:
Number of data nodes
Heap size for both data nodes and client nodes (think caching)
Disk speed and spare memory - on Linux, disk reads get cached, so the more spare memory you have, the faster your queries will be
Number of shards - as with the number of data nodes, the more you can break your data down (within reason), the faster it can be searched
So yes, you are on a good track by breaking down the indices, but don't go too far, as each extra index carries its own overhead and caching, which will start impacting your performance.
Think of it sort of like this: if you have 100 GB of data on 10 data nodes with 10 shards each, then each shard only has to search about 1 GB of data, and the shards are searched in parallel.
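If you control the index layout, the primary shard count is set when the index is created. A minimal sketch, assuming the same Python client as above and a purely hypothetical index name and shard count:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Shard count is fixed when the index is created. Following the rough sizing
# above (100 GB split into ~1 GB shards across 10 data nodes) would mean on
# the order of 100 primary shards -- tune this for your own data, since very
# small shards bring their own overhead, as noted above.
es.indices.create(
    index="usage-2016",  # hypothetical index name
    body={
        "settings": {
            "number_of_shards": 100,
            "number_of_replicas": 1,
        }
    },
)
```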
Hope this helps. (Unfortunately there is no easy answer, as every use case, hardware setup, and configuration can make a big difference.)