We are running a 3-node cluster to index logs from a firewall.
The nodes are physical machines (20-core CPUs, 16 GB RAM, SSD storage).
Each day's logs are stored in an individual index. The storage used per index works out to approximately 800 to 1000 MB per million log lines.
The index mapping consists of some 170 fields which have been mapped dynamically. Two fields are numbers; this was done using the mutate filter in Logstash.
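For context, a mutate filter along these lines converts string fields to numbers at ingest time (the field names below are placeholders, not from our actual config):

```
filter {
  mutate {
    # Convert the two numeric fields from strings to integers;
    # "src_port" and "dst_port" are placeholder field names.
    convert => {
      "src_port" => "integer"
      "dst_port" => "integer"
    }
  }
}
```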
Is there any way to optimize storage so that the present size of 800-1000 MB per million log lines can be reduced, without reducing the read/write performance of the cluster?
If text fields are dynamically mapped they are generally mapped as both text and keyword, which takes up more space. For many types of data both are not required, so creating an index template to change this can save a lot of space. This, together with other tips, is described in the docs.
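As a sketch, a dynamic template like the following (the template name, index pattern, and ignore_above value are illustrative) maps dynamically detected strings as keyword only, skipping the text sub-field:

```
PUT _index_template/firewall-logs
{
  "index_patterns": ["firewall-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keyword": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 1024
            }
          }
        }
      ]
    }
  }
}
```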
So we've modified the index template to mostly use keyword. Numeric fields have been set to use numeric types.
Is unsigned_long a valid type for numeric fields? It is mentioned as a valid type in the docs, but it doesn't show up in the Field Type > Numeric Type menu.
Thanks a ton! Will post back about the space savings once there's enough data for a comparison.
You might also be interested in the analyze disk usage API which can break down your disk usage by field, and also the field usage stats API which can tell you which fields you're actually using in your searches.
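For reference, both are simple requests against an index (the index name below is illustrative; the disk usage analysis requires run_expensive_tasks=true because it is costly to compute):

```
POST /firewall-2023.01.01/_disk_usage?run_expensive_tasks=true

GET /firewall-2023.01.01/_field_usage_stats
```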
I would normally expect that removing unused fields would improve your write performance, sometimes substantially, and avoiding dynamic mappings is also a good move for performance.
The size per event is now between 850 and 900 bytes on average, i.e. ~800 MB per million log lines.
We converted all dynamically mapped string fields to keyword; fields with numeric values are now byte/short/integer/long. The dashboards will need quite some work before they start working again, given that IP addresses are no longer text types with an IP.keyword sub-field.
Is an unsigned integer type not available as a data type? unsigned_long is mentioned in the documentation, but it does not appear in Kibana as an available data type.
As for IP addresses, we've used the fields parameter to map them both as ip and as keyword so that the visualization feature works. No pretty graphics for non-string types? It is also not possible to filter out non-string values from the GUI using the '-'/Filter out option.
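A multi-field mapping along these lines keeps both interpretations of the same value (the field name is a placeholder):

```
"source_ip": {
  "type": "ip",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}
```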
We did try gathering data from the APIs; it turns out the _source field takes up about two thirds of the index size. The data types presently in use are compatible with synthetic _source. How much can we expect the disk usage to go down by if synthetic _source is used? Is there a read or write performance trade-off involved?
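For reference, synthetic _source is enabled per index in the mappings, along these lines (the index name is illustrative, and this requires a recent Elasticsearch version that supports synthetic source):

```
PUT firewall-2023.01.02
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    }
  }
}
```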