Elasticsearch as Reporting Data Store

Hi,

I am new to Elasticsearch. I am trying to create a reporting application with Elasticsearch as the data store. Below are a few questions that I need answers to. I apologize in advance, as most of the questions may be basic.

  1. Is there a better way than Logstash to index data from a delimited file?
  2. Logstash takes a long time to load my data (10,000 documents per minute). I learned that I need to change the workers and the JVM heap size to improve performance. Is there anything else I can do to improve loading performance?
  3. I understand that the Logstash worker setting is based on the number of processor cores. So if I have a 4-core processor, do I have to specify 4? What happens if I have more or fewer workers?
  4. What is the recommended JVM heap size?
  5. For indexing the data, how should I choose the number of clusters, nodes, shards, etc.? Also, how do I set them?
  6. How should we handle the cache to improve performance?
  7. How do aggregations work? Do they only fetch the data, or do they create buckets and store the data?
  8. I came across the X-Pack plugin and am planning to use it. Are there any suggestions or alternatives for this?
  9. Is there any reporting tool (Kibana is for visualizations) where the data is shown in grids with a pager?

Thanks in Advance,
Gowtham

  1. It depends. What input format do you have?
  2. I think it depends on your Logstash configuration file. Maybe you are doing heavy transformations? You should share it here.
  3. I don't know. I think it uses the ideal number of workers based on what it detects. Maybe ask this in the #logstash channel instead?
  4. It depends on what you are doing, I guess. You should install X-Pack, monitor the Logstash instance, and see how much pressure it is under.
  5. There are plenty of discussions about sizing. In short, watch: https://www.elastic.co/fr/elasticon/conf/2016/sf/quantitative-cluster-sizing
  6. Let Elasticsearch manage the cache for you. Don't touch any setting unless you really understand what you are doing and what problem you are trying to solve.
  7. It does not store any aggregation result. Everything is executed on live data. It's not like a batch job running behind the scenes.
  8. Which X-Pack feature are you looking to replace?
  9. Not sure what grids with a pager are, but X-Pack includes PDF generation. For now it's one visualisation at a time, so no fancy layout yet. But it will come in the future, AFAIK.
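
To make answer 7 a bit more concrete: an aggregation is just part of a search request, and its buckets are computed when that request runs. Here is a minimal Python sketch that only builds such a request body. The field name `region` is hypothetical and used purely for illustration; nothing is sent to a cluster.

```python
import json


def terms_aggregation_body(field, bucket_count=10):
    """Build a search body that asks only for per-value document counts."""
    return {
        "size": 0,  # return no hits, only the aggregation buckets
        "aggs": {
            "by_" + field: {
                "terms": {"field": field, "size": bucket_count}
            }
        },
    }


# "region" is a hypothetical field name, not one from this thread.
print(json.dumps(terms_aggregation_body("region"), indent=2))
```

Each time a body like this is sent to the search endpoint, the buckets are recomputed from the documents currently in the index. Note that a terms aggregation is normally run on a keyword (not analyzed) field.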

Thanks a lot for your reply.

  1. The input format is a pipe-delimited file.

The Logstash config is below.

    input {
      file {
        path => "D:/aaa/sample.txt"
        type => "test"
        start_position => "beginning"
        sincedb_path => "D:/aaa/bbb/null"
      }
    }
    filter {
      csv {
        separator => "|"
        columns => ["1","2","3","4","5","6","7","8","9"]
      }
    }
    output {
      elasticsearch {
        action => "index"
        hosts => [ "localhost:9200" ]
        index => "sampleindex"
        workers => 1
      }
      stdout {}
    }

  3. If I have a 4-core processor, do I have to specify 4? How many workers should I have for an ideal run with a 4-core processor?

A grid is a tabular data view in which the data is shown in small chunks based on the page number and the number of records per page. I might export the data to Excel/CSV/PDF. Is there any way I can achieve this with Elastic?

Also, the index size seems to be pretty large. I got a 2.5 GB index for a text file of less than 1 GB.

For reference, we load 10 million records per day and we maintain a retention period of 90 days.
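
As a rough sanity check on these figures, here is a small Python sketch. It assumes the 10,000 documents per minute rate from question 2 holds for the whole daily load, which is an assumption and not a measurement.

```python
DOCS_PER_DAY = 10_000_000   # from the post: 10 million records per day
RETENTION_DAYS = 90         # from the post: 90-day retention
DOCS_PER_MINUTE = 10_000    # from the post: current Logstash load rate


def load_hours_per_day(docs_per_day, docs_per_minute):
    """Hours needed to load one day's records at a constant rate."""
    return docs_per_day / docs_per_minute / 60


def retained_docs(docs_per_day, retention_days):
    """Documents held in the cluster once the retention window is full."""
    return docs_per_day * retention_days


print(round(load_hours_per_day(DOCS_PER_DAY, DOCS_PER_MINUTE), 1))  # 16.7
print(retained_docs(DOCS_PER_DAY, RETENTION_DAYS))                  # 900000000
```

At the current rate, one day's load would take close to 17 hours, and the cluster would hold about 900 million documents once the 90-day window is full.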

Thanks,
Gowtham

> The logstash config...

Remove stdout {} or replace it with the following, which prints one dot per event instead of the whole event:

    output {
      stdout { codec => dots }
    }

> Do I have to specify 4?

Well, first try with the defaults, then adjust if needed. Again, you'd better ask the experts in #logstash.

> I might export the data to Excel/CSV/PDF. Is there any way I can achieve this with Elastic?

If you want to export a result set to CSV, that is coming in X-Pack 6.0; 6.0.0-alpha2 already has this feature.
I think there may be some open source solutions, but I can't remember any off the top of my head, and I don't know whether they are maintained.
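
For the paged grid itself, the search API's `from` and `size` parameters map directly onto page number and records per page. Here is a minimal Python sketch that only builds the request body; the function name `page_request` is made up for illustration.

```python
def page_request(page, page_size, query=None):
    """Build a search body for a 1-based page number and a page size."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "query": query if query is not None else {"match_all": {}},
    }


# Third page of a grid showing 25 records per page.
print(page_request(3, 25))
```

By default, `from` + `size` cannot go past `index.max_result_window` (10,000), so this suits a user paging through the first results. A full export is usually done with the scroll API instead.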

> I got a 2.5 GB index for a text file of less than 1 GB.

Start by disabling the _all field if you don't need it. Then adjust your mapping: the default one that comes with Logstash lets you do both full-text search and aggregations on text fields, but you might want only one of those features.
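
As a sketch of what such a mapping could look like, the Python snippet below builds an index-creation body that disables `_all` and maps every column as `keyword` (aggregations and exact matches only, no full-text search). It assumes an Elasticsearch 5.x cluster and that the documents end up with the mapping type `test` (taken from `type => "test"` in the config above). Both are assumptions, and later versions change how `_all` and mapping types work.

```python
import json


def reporting_index_body(doc_type, keyword_fields):
    """Build an index-creation body: _all disabled, every field a keyword."""
    return {
        "mappings": {
            doc_type: {
                "_all": {"enabled": False},  # assumes the 5.x mapping format
                "properties": {
                    name: {"type": "keyword"} for name in keyword_fields
                },
            }
        }
    }


# Column names "1".."9" as declared in the csv filter of the posted config.
columns = [str(n) for n in range(1, 10)]
body = reporting_index_body("test", columns)
print(json.dumps(body, indent=2))
```

Fields that do need full-text search would be mapped as `text` instead. Mapping a field as both is what makes the default mapping larger.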

> If I have a 4-core processor, do I have to specify 4? How many workers should I have for an ideal run with a 4-core processor?

One worker per core is a good starting point. Increasing the number of workers may help with any filters you have running, but it may not help the input speed much. From what I understand, the file input doesn't scale well with more workers, since multiple workers would be trying to read from the same file.
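
Coming back to question 1 (loading a delimited file without Logstash): one option is to parse the file in a script and send it through the bulk API. Here is a minimal Python sketch that only builds the newline-delimited bulk body. The index name and type are taken from the posted config, the helper names are made up, and sending the body to the `_bulk` endpoint is left out.

```python
import csv
import io
import json


def bulk_lines(rows, index, doc_type, columns):
    """Yield bulk API lines: one action line, then one source line per row."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for row in rows:
        yield action
        yield json.dumps(dict(zip(columns, row)))


def bulk_body(text, index, doc_type, columns, delimiter="|"):
    """Turn delimited text into a bulk request body (ends with a newline)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = (row for row in reader if row)  # skip empty lines
    return "\n".join(bulk_lines(rows, index, doc_type, columns)) + "\n"


# Two made-up sample rows with three columns each.
sample = "a|b|c\nd|e|f\n"
body = bulk_body(sample, "sampleindex", "test", ["1", "2", "3"])
print(body)
```

In practice the file would be read in batches (a few thousand rows per request is a common starting point) and each batch posted separately. The `_type` entry in the action line matches 5.x-era clusters and is an assumption about the version in use.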
