I need to import around 400GB of a JSON log file (about 300 million records) in order to run some searches and visualizations using Kibana.
I did some research to understand the best configuration for a dataset this large, but I would like some help.
I am planning to use only one server for this.
Thanks @A_B.
I have a Logstash conf file ready to start the import, but I wanted to understand the cluster/node setup better before starting, in order to avoid any crashes during the import (as it involves millions of records).
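For context, the conf file follows roughly this shape (the file path, host, and index name below are placeholders, not my actual values):

```conf
input {
  file {
    path => "/data/logs/big.json"    # placeholder path
    start_position => "beginning"
    sincedb_path => "/dev/null"      # reread from the start on each run
    codec => "json"                  # one JSON document per line
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs"
  }
}
```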
It will take some experimentation and tuning to be sure that your setup has the performance characteristics you need. You might like to try importing increasingly large subsets of your log file first, to get a feel for the performance and to make sure that your mappings are set up suitably for the searches you want to perform.
If the index will eventually be 400GB then this article suggests you will want to split it into around 10-20 shards. However if you do not need all the fields to be indexed then you might find your index becomes much smaller than the source data, so you will be able to work with correspondingly fewer shards.
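Since the shard count is fixed when an index is created, you would set it explicitly up front, along these lines (the index name and the count of 15 are illustrative, not a recommendation):

```
PUT /logs
{
  "settings": {
    "number_of_shards": 15,
    "number_of_replicas": 0
  }
}
```

Setting replicas to 0 during a one-off bulk import avoids the cost of writing every document twice; you can raise `number_of_replicas` after the import finishes.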
If your searches will be filtering on time ranges (common for searches of log data) then you might want to consider splitting the data into time-based indices rather than putting it into a single index with lots of shards. Using time-based indices will allow Elasticsearch to completely avoid searching any shards which it knows in advance do not contain any documents that match the time range specified in the search.
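With Logstash, time-based indices are just a matter of putting a date pattern in the output's index name (a sketch; the `logs-` prefix and daily granularity are assumptions, and the records need a parseable timestamp mapped to `@timestamp`):

```conf
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"   # one index per day, derived from the event's @timestamp
  }
}
```

Searches can then target a wildcard pattern such as `logs-*`, and a time-range filter lets Elasticsearch skip the daily indices that fall outside the range entirely.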