I am new to Elasticsearch. I am trying to create a reporting application with Elasticsearch as the data store. Below are a few questions I need answers for. I apologize in advance, as most of the questions may be basic.
Is there a better way to index data from a delimited file than Logstash? (There is a bulk-API sketch after this list of questions.)
Logstash takes a long time to load my data (about 10,000 documents per minute). I learnt that I need to change the number of workers and the JVM heap size to improve performance. Is there anything else I can do to improve loading performance?
I understood that the Logstash worker setting is based on the number of processor cores. So if I have a 4-core processor, do I have to specify 4? What happens if I have more or fewer workers?
What is the recommended JVM heap size?
For indexing the data, how should I choose the number of clusters, nodes, shards, etc., and how do I configure them?
How should I handle the cache to improve performance?
How do aggregations work? Do they only fetch the data, or do they create buckets and store the data?
I came across the X-Pack plugin and am planning to use it. Are there any suggestions or alternatives for this?
Is there any reporting tool (Kibana is for visualizations) where the data is shown in grids with a pager?
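For the first question, here is a minimal sketch of loading a delimited file directly with the official Python client's bulk helper instead of Logstash. The host, index name, file name, delimiter, and column layout are assumptions for illustration, not values from this thread; on a 5.x cluster each action would also need a `_type`.

```python
# Sketch: bulk-load a delimited file with the official Python client
# instead of Logstash. Host, index name, file name and delimiter are
# assumptions for illustration, not values from this thread.
import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def read_rows(path):
    # Stream the file row by row so large files never sit in memory.
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="|"):
            yield {"_index": "reports", "_source": row}

# helpers.bulk batches the generator into _bulk requests for you.
success, errors = bulk(es, read_rows("report_data.csv"), chunk_size=5000)
print(f"indexed {success} documents, {len(errors)} errors")
```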
Let Elasticsearch manage the cache for you. Don't touch any setting unless you really understand what you are doing and what problem you think you are trying to solve.
It does not store any aggregation result. Everything is executed on live data; it's not like a batch job running behind the scenes.
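To illustrate: the buckets are built per request and returned in the response body, and nothing extra is written to the index. A minimal sketch with the Python client, where the index and field names are assumptions:

```python
# Sketch: a terms aggregation is computed at search time and its buckets
# come back in the response; nothing is stored. Index and field names
# are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="reports",
    body={
        "size": 0,  # we only want the buckets, not the hits
        "aggs": {"by_status": {"terms": {"field": "status.keyword"}}},
    },
)
for bucket in resp["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```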
Suggestion to replace which feature of X-Pack?
Not sure what grids with a pager are, but X-Pack includes PDF generation. For now it's one visualisation per report, so no fancy layout yet, but that will come in the future AFAIK.
If I have a 4-core processor, do I have to specify 4? How many workers should I have for an ideal run on a 4-core processor?
Grids is a tabular data view where the data is shown in small chunks based on the page number and the number of records per page. I might export the data to Excel/CSV/PDF. Is there any way I can achieve this with Elastic?
Also, the index size seems to be pretty huge: I got a 2.5 GB index for a text file of less than 1 GB.
For reference, we load 10 million records per day and we maintain a retention period of 90 days.
Well, first try with the defaults, then adjust if needed. Again, you'd better ask the Logstash experts.
I might export the data to Excel/CSV/PDF. Is there any way I can achieve this with Elastic?
If you want to export a result set to CSV, that's coming in X-Pack 6.0; 6.0.0-alpha2 already has this feature.
I think there may be some open-source solutions, but I don't remember them off the top of my head, and I also don't know whether they are maintained.
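In the meantime, one rough workaround (a sketch, not a supported feature; the index name, query, and column list are assumptions) is to scroll through the result set yourself and write the CSV:

```python
# Sketch: export a result set to CSV by scrolling through every matching
# document. Index name, query and column list are assumptions.
import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
columns = ["timestamp", "status", "amount"]

with open("export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    # helpers.scan wraps the scroll API and streams all hits of the query.
    for hit in scan(es, index="reports", query={"query": {"match_all": {}}}):
        writer.writerow(hit["_source"])
```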
I got a 2.5 GB index for a text file of less than 1 GB.
Start by disabling the _all field if you don't need it. Then adjust your mapping: the default one that comes with Logstash allows both full-text search and aggregations on text fields, but you might want only one of those features per field.
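As a rough sketch of that kind of explicit mapping, in the 5.x/6.x style this thread is about (the index, type, and field names are assumptions; _all was removed in later versions):

```python
# Sketch: create the index up front with _all disabled and explicit field
# types, instead of relying on the default Logstash template.
# Index, type and field names are assumptions; _all only applies to 5.x/6.x.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="reports",
    body={
        "mappings": {
            "doc": {
                "_all": {"enabled": False},         # drop the catch-all field
                "properties": {
                    "status": {"type": "keyword"},   # aggregations/sorting only
                    "message": {"type": "text"},     # full-text search only
                    "amount": {"type": "float"},
                    "timestamp": {"type": "date"},
                },
            }
        }
    },
)
```

As far as I recall, the default Logstash template indexes every string as both `text` and a `.keyword` sub-field, so picking just one type per field can noticeably shrink the index.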
If I have a 4-core processor, do I have to specify 4? How many workers should I have for an ideal run on a 4-core processor?
One worker per core is a good starting point. Increasing the number of workers may help with any filters you have running, but it may not help input speed much. From what I understand, the file input doesn't scale well with more workers; something about having multiple workers trying to read from the same file.