Hosting/indexing strategy for long-term data storage (if possible, single node)

I have the following requirements, which appear to be not so typical:
Indexing up to ~5000 documents per second (small docs, (semi-)structured, needs to be reliable!)
Exporting / querying big data sets of up to 5*10^6 documents (does not have to be fast)
Persisting data for many years (time-based indices are not an option)
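For context, here is a minimal sketch of how that ingest/export path could look with the Python client (endpoint, index name, and document shape are placeholders I made up, not from our setup):

```python
from elasticsearch import Elasticsearch, helpers

# Endpoint, index name, and document fields are assumptions for this sketch.
es = Elasticsearch("http://localhost:9200")

def generate_actions(batch):
    """Yield bulk actions for a batch of small, semi-structured documents."""
    for doc in batch:
        yield {"_index": "source-a", "_source": doc}

# Bulk indexing: at ~5000 docs/s, per-document index calls are too slow;
# the bulk helper groups documents into chunks (here 1000 docs per request).
batch = [{"ts": "2024-01-01T00:00:00Z", "value": i} for i in range(10_000)]
ok, errors = helpers.bulk(es, generate_actions(batch), chunk_size=1000, raise_on_error=False)
print(f"indexed={ok}, errors={len(errors)}")

# Exporting large result sets (millions of docs) without deep pagination:
# helpers.scan wraps the scroll API and streams all matching documents.
for hit in helpers.scan(es, index="source-a", query={"query": {"match_all": {}}}):
    pass  # write hit["_source"] to the export target
```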

Currently we have the following strategy:

  • single node (vertically scalable)
  • no replica
  • 1 index per data source / origin (typically 100 MB to 1 GB)
  • 1 shard per index
  • 1 index schema
  • no ILM, but we close indices after a while (writing to a specific index ends after a few weeks; queries must still be possible for months/years) and reopen them on demand - see the sketch after this list
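A sketch of what this per-origin setup looks like via the Python client (index name and mapping are placeholders; this just mirrors the description above, it is not a recommendation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # single node, endpoint assumed

# One index per data source/origin, one primary shard, no replicas,
# one shared schema (the mapping shown here is a placeholder).
es.indices.create(
    index="origin-0042",
    settings={"number_of_shards": 1, "number_of_replicas": 0},
    mappings={"properties": {"ts": {"type": "date"}, "payload": {"type": "keyword"}}},
)

# After writes stop (a few weeks in), close the index to free resources...
es.indices.close(index="origin-0042")

# ...and reopen it on demand when old data has to be queried again.
es.indices.open(index="origin-0042")
```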

With this approach we eventually end up with hundreds of indices, most of them closed (no indexing any more) and up to about 100 of them open. This also means having hundreds of shards on a single node.
Currently queries focus on a single index / a subset of an index's data, but we will need support for queries over multiple indices (which is possible using the search API).
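Those multi-index queries could look like this (index names are placeholders); the search API accepts a comma-separated list or a wildcard pattern:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Query a subset of indices explicitly...
resp = es.search(
    index="origin-0042,origin-0043",
    query={"range": {"ts": {"gte": "2023-01-01"}}},
    size=100,
)

# ...or all matching open indices via a wildcard.
resp = es.search(index="origin-*", query={"match_all": {}}, size=0)
print(resp["hits"]["total"])
```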

I am wondering if there might be a better strategy overall, as I know that indices should be organized by data structure rather than data origin and that we shouldn't have too many shards on a single node. Also, shards are commonly recommended to be in the 10-50 GB range, not ~1 GB.
On the other hand, I don't think creating a huge (!) single index makes sense when we only ever query small subsets of the data. And specifying the number of shards up front basically eliminates vertical scaling, as we cannot simply change the heap size, right?
So far the current strategy does not perform too badly. Running a single node with about 500 GB of data still works, but we can expect more query load in the future, so I suppose we should come up with a better strategy.
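To keep an eye on shard count and shard size on the single node (the numbers above suggest hundreds of ~1 GB shards), we poll something like the cat shards API; a sketch, assuming the same Python client as above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List shards with their size; on a single node every primary lives here,
# so this is effectively a per-index shard-size report.
shards = es.cat.shards(format="json", h="index,shard,prirep,docs,store", bytes="gb")
small = [s for s in shards if s["store"] and float(s["store"]) < 10]
print(f"{len(shards)} shards total, {len(small)} below the often-cited 10 GB mark")
```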

You can't vertically scale the heap indefinitely; it must stay under 32 GB (or the FUD* of the 64-bit heap appears). You need to somehow estimate where your heap will be months/years in the future.

  • Fear, uncertainty and doubt. I've never seen anyone actually document how a huge system would run with a heap > 32 GB. Maybe I just missed it.

Up to 32 GB it would be scalable, though.
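If it helps, whether a node is still below that boundary can be checked from the nodes info API; a minimal sketch (field names as I remember them, worth verifying against your version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The JVM section of the nodes info API reports the configured max heap and
# (in recent versions) whether compressed ordinary object pointers are in use.
info = es.nodes.info(metric="jvm")
for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024**3
    compressed = jvm.get("using_compressed_ordinary_object_pointers", "unknown")
    print(f"{node.get('name', node_id)}: heap_max={heap_gb:.1f} GB, compressed oops={compressed}")
```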

Of course we would consider scaling out horizontally - or what exactly is your point?


Once you lose compressed OOPs, you need (from memory) about 40 GB of heap to hold the same amount of data as you do just below the compressed-OOPs threshold. It's a large, inefficient jump.

You can go to larger heaps, but that also means longer GC runs, and even with G1GC it can be problematic.
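If larger heaps do become necessary, GC pressure can at least be watched via the nodes stats API; a rough sketch, assuming the same Python client and endpoint as above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Cumulative GC counts and times per collector since node start; sampling this
# periodically and diffing shows how much time the node spends in GC.
stats = es.nodes.stats(metric="jvm")
for node_id, node in stats["nodes"].items():
    for name, gc in node["jvm"]["gc"]["collectors"].items():
        print(f"{node['name']} {name}: count={gc['collection_count']}, "
              f"time={gc['collection_time_in_millis']} ms")
```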

ObjectOrientedProgramming? What does that mean?
In fact, my question was not about scaling the heap above 32 GB.
I want to know if there are fundamental issues with having 5000+ (mostly closed) indices in Elasticsearch on a single node, and whether it would make more sense to choose another index strategy.
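For what it's worth, counting open vs. closed indices (and watching that number grow over time) is straightforward; a sketch, again with the assumed Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# cat indices reports a status column (open/close); include closed indices
# explicitly via expand_wildcards.
indices = es.cat.indices(format="json", h="index,status", expand_wildcards="all")
open_count = sum(1 for i in indices if i["status"] == "open")
closed_count = sum(1 for i in indices if i["status"] == "close")
print(f"open={open_count}, closed={closed_count}, total={len(indices)}")
```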

Given we are talking about the JVM: CompressedOops - OpenJDK Wiki

java - Trick behind JVM's compressed Oops - Stack Overflow is a good explainer.
