Sparse index


I have an ES index with approx. 10 million documents. It runs on a three-node cluster with 6 GB of Java heap per node.

When I use this index in Kibana I see a considerable slowdown: a count aggregation over all documents, for example, takes up to 20 seconds.

There is no I/O bottleneck, but I see 100% CPU usage on all nodes for the duration of the query.

The data is really sparse. We have approx. 8000 different fields, but only 20-30 are used per document.
Documents share 8 common fields, and there are some 100 different document "types", defined by which combination of the remaining "sparse" fields a document uses (not types in the ES sense).
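
To make the notion of a document "type" concrete, here is a minimal Python sketch that derives a signature from the sparse (non-common) fields present in each document and counts how many documents share it. The field names and the `COMMON_FIELDS` set are invented for illustration (the real index has 8 common fields), and `type_signature` and `count_types` are hypothetical helpers, not part of any ES client.

```python
from collections import Counter

# Hypothetical common fields; the real index has 8 of them.
COMMON_FIELDS = frozenset({"timestamp", "host"})


def type_signature(doc, common_fields=COMMON_FIELDS):
    """Return the sorted tuple of sparse (non-common) field names in a document."""
    return tuple(sorted(k for k in doc if k not in common_fields))


def count_types(docs, common_fields=COMMON_FIELDS):
    """Count how many documents share each combination of sparse fields."""
    return Counter(type_signature(d, common_fields) for d in docs)


# Made-up sample documents: two share the same sparse fields, one differs.
docs = [
    {"timestamp": 1, "host": "a", "cpu": 0.5, "mem": 10},
    {"timestamp": 2, "host": "b", "mem": 12, "cpu": 0.7},
    {"timestamp": 3, "host": "a", "disk": 80},
]
type_counts = count_types(docs)
```

Counts like these show which "types" dominate the data and which form the long tail.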

I have already disabled norms for these fields, but document_ids are required.

I thought about creating some 100 different indices (one index per document "type") and using a myindex-* pattern in Kibana, but I guess this will be even slower!

We already tried creating an index with the 8 common fields plus 2 fields called "name" and "value". But this is pretty unwieldy when plotting the data, as you don't get a nice dropdown in Kibana (with 8000 rows in our case) to select the field you are interested in. Furthermore, the relation between different values belonging to the same "document" is lost.
Using different queries for different series in the same chart in Kibana is complicated and unwieldy as well.
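
For illustration, the "name"/"value" layout described above could be produced roughly like this. It is only a sketch under assumed field names: `COMMON_FIELDS` and `to_name_value_docs` are hypothetical. Each sparse field becomes its own record, which is why values from the same source document are no longer tied together except through the repeated common fields.

```python
# Hypothetical common fields; the real index has 8 of them.
COMMON_FIELDS = ("timestamp", "host")


def to_name_value_docs(doc, common_fields=COMMON_FIELDS):
    """Split one sparse document into one name/value record per sparse field."""
    common = {k: doc[k] for k in common_fields if k in doc}
    return [
        {**common, "name": k, "value": v}
        for k, v in doc.items()
        if k not in common_fields
    ]


# Made-up sample document with two sparse fields.
source = {"timestamp": 1, "host": "a", "cpu": 0.5, "mem": 10}
flattened = to_name_value_docs(source)
```

Here one source document turns into two records, one for `cpu` and one for `mem`.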

What's the general suggestion for indexing or preprocessing such "sparse" data? The index-per-"type" approach?

I would move the types that have the most documents to their own indices, and keep the long tail of types that each have only a small number of documents in a shared index.
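
A rough sketch of how that split could be planned, assuming you already have a document count per type. The `min_docs` threshold, the type names, the counts, and the `plan_indices` helper are all made up for illustration; only the `myindex` prefix comes from the question.

```python
def plan_indices(type_counts, min_docs, prefix="myindex", shared="shared"):
    """Map each document type to a dedicated index or to the shared index.

    Types with at least `min_docs` documents get their own index;
    everything else goes to a single shared index.
    """
    return {
        doc_type: f"{prefix}-{doc_type}" if count >= min_docs else f"{prefix}-{shared}"
        for doc_type, count in type_counts.items()
    }


# Made-up counts: two large types and a long tail of small ones.
type_counts = {"typea": 4_000_000, "typeb": 3_500_000, "typec": 1200, "typed": 300}
plan = plan_indices(type_counts, min_docs=1_000_000)
```

With these numbers, `typea` and `typeb` get dedicated indices while `typec` and `typed` land in `myindex-shared`. All resulting names still match a `myindex-*` pattern in Kibana.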

That's a good idea, thanks! :)
