Dear community,
I am dealing with a huge amount of data (on the order of TBs) and a lot of nodes. I have two data sources, let's call them general input and events. Since the general input is somewhat predictable in size and arrival time, I am going for monthly indices. The events are completely unpredictable: they can range from a few dozen to tens of thousands of documents. Also, there can be a lot of events in one month and no events in another. The total number of events is limited to a few hundred.
For this case I thought of one index per event.
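To illustrate the naming scheme I have in mind, a rough sketch (the index name prefixes and the event id format are just placeholders, nothing is decided yet):

```python
from datetime import date

def general_index(day: date) -> str:
    """Monthly index for the predictable general input."""
    return f"general-{day:%Y.%m}"

def event_index(event_id: str) -> str:
    """One dedicated index per event."""
    return f"event-{event_id}"

print(general_index(date(2016, 3, 14)))  # general-2016.03
print(event_index("fire-drill-042"))     # event-fire-drill-042
```

So the general input rolls over monthly regardless of volume, while each event gets its own index no matter how big or small it turns out to be.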
And in addition: Both the general input and the events need to support hundreds of different types. Indexing time does not need to be super fast. The queries will be complex and mostly analytical. I will need per-event and cross-event searches.
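For the cross-event analytics I am picturing queries along these lines: target a single event index for per-event searches, or a wildcard pattern for cross-event ones. This is only a sketch under my own assumptions; the `event_id` and `@timestamp` fields and the index pattern are made-up names, not a settled mapping:

```python
# Hypothetical cross-event analytical query: a terms aggregation over an
# assumed "event_id" field with a weekly date histogram per event.
body = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30d"}}},
    "aggs": {
        "per_event": {
            "terms": {"field": "event_id", "size": 100},
            "aggs": {
                "docs_over_time": {
                    "date_histogram": {"field": "@timestamp",
                                       "interval": "week"}
                }
            },
        }
    },
}

# Per-event search would target "event-<id>"; cross-event search would
# target the pattern "event-*", which Elasticsearch expands to every
# matching index.
```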
Are my ideas a good start, or is this index design completely unsuited for my use case?
What would be the largest drawback of having the event indices time based? Some indices will be empty, others will be pretty large. That's not really a problem.
Are you talking about tens of thousands of documents per month?
I am curious about the unpredictability of the events. The "few dozen to tens of thousands" range is pretty big.
The biggest drawback would be a performance loss in per-event searches, which occur quite often. Documents are added to events over a possibly long timespan; I should have mentioned that. Although a new event occurs at a specific point in time and the initial documents are added to the index then, other documents related to this event can be added, say, two months later.
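One workaround I have been considering: if the event indices were time based after all, an alias per event could still give me a single search target per event. Roughly like this (the names and the shape of the helper are made up for illustration; the body matches what Elasticsearch's `_aliases` endpoint expects, as far as I understand it):

```python
# Sketch: with monthly event indices, one alias per event ties together all
# monthly slices that received documents for that event.
def alias_actions(event_id, months):
    """Build an _aliases request body linking monthly slices to one alias."""
    return {
        "actions": [
            {"add": {"index": f"events-{m}", "alias": f"event-{event_id}"}}
            for m in months
        ]
    }

# An event that got its initial documents in January and more in March:
actions = alias_actions("fire-drill-042", ["2016.01", "2016.03"])
print(actions)
```

A per-event search would then hit the alias `event-fire-drill-042` and only touch the monthly slices that actually contain documents for that event.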
I am well aware of the huge range, but this is the nature of my data.
I think I'm lost without knowing a lot more about your domain. Sorry.
It's tricky since you have a lot of data, but you should just try out some designs and see which one works best.
I already did some tests with this design, but with a data load of only 10 to 20 GB and on just two nodes. It works fine, but I am not sure it will scale well. I also think a lot about how the number of different types will influence overall performance. My tests covered around 20 types, but in production there will be more than 100.
To clarify the domain: the general input is logs. But I also need to search disk images and other media snapshots, sometimes cross-searching with the logs, sometimes cross-searching across images, sometimes only logs or only a single image.
Thanks so far, Patrick. If anybody else has good ideas, I would be happy to read them.