Indexing best practice

Elitlogik · November 24, 2020, 7:30pm

Just finished my first ELK deployment, on an HP EliteDesk i5 with 8 GB RAM and a 120 GB SSD, bought for $90.

I built it with CentOS 8 Minimal, Docker, Docker Compose, Cockpit, Portainer and deviantony/docker-elk, plus a few tweaks to add Beats and HTTPS. I figured out that the default Java heap of 256 MB wasn’t enough and raised it to 6 GB.
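For the heap setting, Elastic's general guidance is to give Elasticsearch no more than about half of the machine's RAM, leaving the rest for the filesystem cache and, in a docker-elk style stack, for Logstash and Kibana on the same box. A minimal sketch of that rule of thumb follows; the helper name `suggested_heap_mb` is made up for illustration, and the result is a starting point rather than a tuned value.

```python
def suggested_heap_mb(total_ram_mb: int) -> int:
    """Return half of total RAM in MB (the common 50% rule of thumb)."""
    return total_ram_mb // 2


total_ram_mb = 8 * 1024  # the 8 GB machine from the post
heap_mb = suggested_heap_mb(total_ram_mb)

# Xms and Xmx are set to the same value so the heap does not resize at runtime.
es_java_opts = f"-Xms{heap_mb}m -Xmx{heap_mb}m"
print(f"ES_JAVA_OPTS={es_java_opts}")
```

The printed value can be used for the `ES_JAVA_OPTS` environment variable of the Elasticsearch service in the Compose file.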

Now I have roughly 5 events/second, 200 fields and 5 indexes.
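To put that rate in perspective: 5 events/second is 432,000 events per day, or roughly 3 million per week. The sketch below turns that into a rough disk estimate. The per-event size is purely an assumption (about 1 KiB once indexed), so measure the real index size before trusting the result; the calculation also ignores the OS, Docker images and replicas.

```python
# Back-of-the-envelope ingest volume for the rate mentioned above.
EVENTS_PER_SECOND = 5        # from the post
ASSUMED_EVENT_BYTES = 1024   # assumption: ~1 KiB per indexed event
DISK_BYTES = 120 * 10**9     # the 120 GB SSD from the post

SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
events_per_week = events_per_day * 7
bytes_per_day = events_per_day * ASSUMED_EVENT_BYTES
days_to_fill_disk = DISK_BYTES / bytes_per_day

print(f"events/day:  {events_per_day:,}")
print(f"events/week: {events_per_week:,}")
print(f"approx. MB/day at the assumed size: {bytes_per_day / 10**6:.0f}")
print(f"approx. days until the disk is full: {days_to_fill_disk:.0f}")
```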

Where can I read up on index best practices? The dashboards must be able to easily show last week’s events. It is OK for older events to be slower.

Can you give me some tips? I’m scared that my build will crash in one week with the default config.

Thank you very much for your support!

Grim · November 24, 2020, 8:38pm

Hi,
I guess the "best practices" depend on your use case. For example, if you are indexing sequential log data/events, you should take a closer look at Data streams. If you frequently update or delete documents, you should use "normal" indices.
No matter which you use, I suggest you have a look at Index Lifecycle Policies; you have several options there to keep the data on your cluster nice and tidy. If you have a multi node deployment, you can also reduce the shard count after rolling over the indices, which can save you some disk space.

Hope this helps

Elitlogik · November 24, 2020, 11:00pm

Thank you very much for your reply!

I will only store log data: a growing number of timestamped rows/entries. I never thought Elastic could be used for Word documents.

So... storing that in indexes is not the way to go, and I should use Data streams instead?

I have a single node deployment, so do I have to learn about shards or not?

I will look into Lifecycle. But... if you recommend that I use Data streams, do I have to use Lifecycle, or is that taken care of by some other logic?

Thank you very much for your support! There’s loads to learn.

Grim · November 25, 2020, 9:52am

Well, if you only store log data I would suggest using Data streams. They also make Index Lifecycle Management (ILM) easier in my opinion (Data streams are backed by "hidden indices").
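As a rough illustration of what setting up a data stream involves: a data stream is created from a composable index template that contains a `data_stream` object, and every document written to it needs an `@timestamp` field. The sketch below only builds and prints the request so it can be pasted into Kibana Dev Tools. The template name, index pattern and ILM policy name are hypothetical placeholders, not anything from this thread.

```python
import json

# Hypothetical names, chosen only for this example.
TEMPLATE_NAME = "myapp-logs-template"

index_template = {
    # Any write to a name matching this pattern creates/uses a data stream.
    "index_patterns": ["myapp-logs-*"],
    # The presence of this (empty) object is what enables data streams.
    "data_stream": {},
    # Assumption: a priority high enough to win over any overlapping template.
    "priority": 200,
    "template": {
        "settings": {
            # Hypothetical ILM policy; it would have to exist already.
            "index.lifecycle.name": "myapp-logs-policy",
        }
    },
}

print(f"PUT _index_template/{TEMPLATE_NAME}")
print(json.dumps(index_template, indent=2))
```

With such a template in place, indexing a document into a name like `myapp-logs-default` would create the data stream and its first hidden backing index automatically.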

There are multiple reasons why you should use ILM:

automatically delete logs which are too old (save space)
automatically manage index and shard sizes (keep the search and index performance of your cluster as high as possible)
automatically spread your data across multiple indices (speed up search queries)
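The points above could be sketched as a small ILM policy with a hot phase that rolls over and a delete phase that removes old data. This is only an assumed starting point: the policy name and all thresholds (7 days, 10 GB, 30 days) are placeholders to be adjusted to the disk size and retention you actually need.

```python
import json

# Hypothetical policy name for this example.
POLICY_NAME = "myapp-logs-policy"

ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Start a new backing index when either limit is reached,
                    # which keeps index and shard sizes bounded.
                    "rollover": {"max_age": "7d", "max_size": "10gb"}
                }
            },
            "delete": {
                # Counted from the time the index was rolled over.
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(f"PUT _ilm/policy/{POLICY_NAME}")
print(json.dumps(ilm_policy, indent=2))
```

Because rollover splits the data by time, a dashboard showing the last week only has to search the newest indices, while older ones sit untouched until they are deleted.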

About shard sizes and shard count here is a good blog post which you should have a look at.

If you have a single node cluster, I guess using only primary shards (no replicas) would be best, since you won't be able to set up your cluster to survive a node failure anyway.
In general you should have a look at multi node deployments, since they are one of the biggest advantages Elasticsearch has (if your log data is important).
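In practice, "only primary shards" means setting `number_of_replicas` to 0. With the default of one replica, a single node has nowhere to put the replica shards, so they stay unassigned and the cluster health shows yellow. A minimal sketch of the settings request, using a hypothetical index name:

```python
import json

# Hypothetical index name; a wildcard pattern or data stream name also works.
TARGET = "my-index"

single_node_settings = {
    "index": {
        # No replica copies: there is no second node to hold them.
        "number_of_replicas": 0
    }
}

print(f"PUT {TARGET}/_settings")
print(json.dumps(single_node_settings, indent=2))
```

The same setting can go into an index template so that newly created indices start without replicas.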

Another thing I find very useful when indexing log data is the use of Ingest Pipelines (your node must be configured to be an Ingest node to use this feature). It enables you to do transformation and enrichment of your incoming data.
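To give an idea of what such a pipeline looks like, here is a minimal sketch that parses a timestamp, normalizes a field and adds a static tag. The pipeline name and the field names (`timestamp`, `level`, `environment`) are invented for the example; the processors used (`date`, `lowercase`, `set`, `remove`) are standard ingest processors.

```python
import json

# Hypothetical pipeline name for this example.
PIPELINE_NAME = "myapp-logs-pipeline"

pipeline = {
    "description": "Example: normalize incoming log events",
    "processors": [
        # Parse the source timestamp into @timestamp.
        {"date": {"field": "timestamp", "formats": ["ISO8601"],
                  "target_field": "@timestamp"}},
        # Normalize the log level, skipping events without one.
        {"lowercase": {"field": "level", "ignore_missing": True}},
        # Enrich every event with a static value.
        {"set": {"field": "environment", "value": "homelab"}},
        # Drop the original field once it has been parsed.
        {"remove": {"field": "timestamp"}},
    ],
}

print(f"PUT _ingest/pipeline/{PIPELINE_NAME}")
print(json.dumps(pipeline, indent=2))
```

A pipeline like this is applied either per request with the `pipeline` query parameter or for a whole index through the `index.default_pipeline` setting.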

The Elastic cosmos is quite large, so yes, there sure is a lot to learn.

