ES data redundancy VS Kibana visualizations

Hi everyone, I would love to hear some tips on how to avoid data redundancy in ES and Kibana.

We are using ES and Kibana to collect and visualize videogame analytics. The game client indexes an "event" - a document with lots of fields like "EventTimestamp", "EventName", "UserID", "SessionID", "ScreenResolution" etc. - whenever the player performs a significant action such as launching the game, loading the gameplay, dying, returning to the menu etc. All events until quitting the game are considered one "session", marked by a unique "SessionID".

The problem is that there are dozens of fields (like "GPU_Name") whose values are constant throughout the whole session, but we still send them with every event in the session. That significantly increases the size of every individual event document and makes the index grow quickly. It feels wasteful to spend storage and memory on that much redundant information.

But we are sending the information with every individual event to be able to leverage it in Kibana. If I want to visualize, for example, the number of crashes by "GPU_Name", it's very straightforward if I have the GPU field on the "GameCrash" event document, but as far as I know it is very difficult or impossible to do if the GPU field is only on the "SessionStart" event document sent half an hour prior.

So, my question is: Should we just settle for redundant fields taking up our storage space? Or is there some trick to structure our data differently to avoid duplicate information across the events in a session, but still be able to use it in Kibana's visualizations?

Welcome to our community! :smiley:

There are no tricks available, no.
Are you using best_compression on your indices?
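For anyone unsure what that refers to: `best_compression` is a value of the `index.codec` index setting, which is static and so is normally set when the index is created. A minimal sketch of the request body, built as a plain Python dict so it can be sent with whatever client you use; the index name `game-events` and the helper name are made up for illustration.

```python
import json

def build_index_settings(codec: str = "best_compression") -> dict:
    """Build the body for an index-creation request (PUT /<index>)."""
    return {
        "settings": {
            "index": {
                # Higher compression ratio for stored fields (including _source),
                # at the cost of slower stored-field access.
                "codec": codec
            }
        }
    }

INDEX_NAME = "game-events"  # hypothetical index name
settings_body = build_index_settings()

print(f"PUT /{INDEX_NAME}")
print(json.dumps(settings_body, indent=2))
```

How much this saves depends on the data, so it is worth comparing index sizes with and without it on a sample of real events.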

Hi @MartinKolar, welcome to the community.

Also remember that a proper mapping will help reduce storage for keyword-like fields such as GPU_Name. If you explicitly set the datatype to keyword, storage will be optimized; if you leave it to the default dynamic mapping, the value will be indexed as both text and keyword, which uses more storage.
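A rough sketch of what such an explicit mapping could look like, again as a Python dict for the index-creation body. Which fields belong in the list is an assumption on my part (GPU_Name and EventName are taken from the question); adjust it to your own schema.

```python
import json

# Assumption: these fields hold exact-match values that are only filtered and
# aggregated on, never full-text searched.
KEYWORD_FIELDS = ["GPU_Name", "EventName"]

def build_keyword_mapping(field_names) -> dict:
    """Map each given field as `keyword` only, instead of the dynamic text + keyword pair."""
    return {
        "mappings": {
            "properties": {name: {"type": "keyword"} for name in field_names}
        }
    }

mapping_body = build_keyword_mapping(KEYWORD_FIELDS)
print(json.dumps(mapping_body, indent=2))
```

With this in place, GPU_Name is aggregated on directly rather than through a `GPU_Name.keyword` sub-field.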

And this may feel counterintuitive, but remember that for keyword fields Elasticsearch uses an inverted index, so technically each distinct GPU_Name value is stored once, and the index then records which documents it belongs to...
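To make the inverted-index idea concrete, here is a toy model in Python. It is only an illustration of the concept, not how Lucene lays data out on disk, and the GPU names and events are invented.

```python
from collections import defaultdict

# Invented sample events; the list position stands in for the document ID.
events = [
    {"EventName": "SessionStart", "GPU_Name": "GPU A"},
    {"EventName": "GameCrash", "GPU_Name": "GPU A"},
    {"EventName": "SessionStart", "GPU_Name": "GPU B"},
    {"EventName": "GameCrash", "GPU_Name": "GPU A"},
]

def build_inverted_index(docs, field):
    """Map each distinct value of `field` to the list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        postings[doc[field]].append(doc_id)
    return dict(postings)

gpu_index = build_inverted_index(events, "GPU_Name")
print(gpu_index)
```

Each distinct value appears once as a key, and only the small document IDs repeat. Note that this applies to the index structure; the original JSON kept in _source still contains the value in every document, which is where compression comes in.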

Like you said, there are also other benefits to indexing that field on every event: it lets you run important aggregations across that field together with other dimensions, like "How many crashes for GPU_Name = <gpu_name>?"
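As a sketch, the "crashes by GPU" question from the original post could be expressed as a search body along these lines. It assumes EventName and GPU_Name are both mapped as keyword; the aggregation name `crashes_by_gpu` and the helper name are made up.

```python
import json

def build_crashes_by_gpu_query(top_n: int = 10) -> dict:
    """Search body counting "GameCrash" events per GPU_Name value."""
    return {
        "size": 0,  # return only the aggregation buckets, no individual hits
        "query": {"term": {"EventName": "GameCrash"}},
        "aggs": {
            "crashes_by_gpu": {
                "terms": {"field": "GPU_Name", "size": top_n}
            }
        },
    }

crash_query = build_crashes_by_gpu_query()
print(json.dumps(crash_query, indent=2))
```

This only works because GPU_Name is present on the crash document itself, which is the same reason the equivalent Kibana visualization is straightforward.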

You can also choose not to store the original document, i.e. disable _source, which will reduce storage (but I would be careful with that; there are some serious downsides).

There is also a setting to prune some fields from _source, but as the docs say, that is an "expert" setting, and it still has ramifications.
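For completeness, a sketch of what the two _source options look like in a mapping, as Python dicts. The helper names are made up, and excluding GPU_Name is just an example, not a recommendation.

```python
import json

def mapping_without_source() -> dict:
    """Mapping that disables _source entirely.

    Caveat: without _source, features that need the original document
    (for example the update and reindex APIs) no longer work.
    """
    return {"mappings": {"_source": {"enabled": False}}}

def mapping_with_pruned_source(excluded_fields) -> dict:
    """Mapping that drops selected fields from the stored _source.

    Excluded fields are still indexed, so they can be searched and aggregated,
    but they are not returned in _source and are lost on reindex.
    """
    return {"mappings": {"_source": {"excludes": list(excluded_fields)}}}

no_source = mapping_without_source()
pruned_source = mapping_with_pruned_source(["GPU_Name"])  # example field only

print(json.dumps(no_source, indent=2))
print(json.dumps(pruned_source, indent=2))
```

Either one is hard to undo once data has been indexed, so try it on a throwaway index first.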

All of that may save very little. I would just get started, since Elasticsearch is pretty efficient, and then see whether storage really turns out to be an issue or not.
