Huge documents - are these to blame for our Young GC problem?

I'll admit I probably already know the answer to this, but I'm looking for confirmation, I guess.

We're seeing excessive Young GC "problems". Our setup is more or less the following:

3 data nodes (24 GB RAM, 12 GB heap)
3 master nodes (8 GB RAM, 4 GB heap)

I realize that a lot of young GC is normal, but we see the heap jump from 5% to 75% in maybe 30-60 seconds, resulting in GC runs of up to 10 seconds in some cases. 10-second GC runs every few minutes aren't very nice, and when it happens across all 3 nodes we get many timeouts.

I would say we have a very moderate influx of data, maybe 10-30 documents per second.

However, and I'll have to confess, we have some very large documents with a lot of fields - so many that we've had to bump the default number of allowed fields, which leads me to think we're doing something very bad.
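For reference, the bump I mean is the `index.mapping.total_fields.limit` setting (default 1000). Roughly something like this - just a sketch, assuming Elasticsearch 7+ at localhost:9200 and a placeholder index name:

```python
# Sketch: raise the per-index field-count limit above the default of 1000.
# Assumes Elasticsearch 7+ at localhost:9200; "my-index" is a placeholder.
import requests

resp = requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index.mapping.total_fields.limit": 2000},
)
resp.raise_for_status()
print(resp.json())
```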

Is it normal for very large documents (in terms of fields) to cause this sort of behaviour?

If so, how do people deal with these issues? Split the data into multiple indices?

How many fields do you have now?

When you add documents with never-before-seen fields, the index's mapping definition has to change, which requires coordination with the master node to revise the schema, and the updated mapping then needs to be disseminated to all other nodes.
Clearly this adds more overhead than a straight write of a document's contents on a data node. Declaring index fields up-front, or avoiding the need to introduce new fields in your JSON, will help matters.
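Declaring fields up-front looks roughly like the sketch below - this assumes Elasticsearch 7+ at localhost:9200, and the index name and field names are placeholders. Setting `"dynamic": "strict"` makes the index reject documents that would introduce unmapped fields, so no mapping updates happen at index time:

```python
# Sketch: create an index with an explicit mapping and dynamic mapping
# set to "strict", so documents with unmapped fields are rejected instead
# of triggering a mapping update on the master node.
# Assumes Elasticsearch 7+ at localhost:9200; names are placeholders.
import requests

index_body = {
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "timestamp": {"type": "date"},
            "message": {"type": "text"},
            "status": {"type": "keyword"},
        },
    }
}

resp = requests.put("http://localhost:9200/my-index", json=index_body)
resp.raise_for_status()
```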

My guess is that our mapping consists of between 6,000 and 8,000 fields, with a single document having around 1,200 to 1,500 fields. I know this is most likely way too much.
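If an exact number is more useful than a guess, something like this quick sketch counts the leaf fields in a mapping - again assuming Elasticsearch 7+ at localhost:9200 and a placeholder index name:

```python
# Sketch: count leaf fields in an index's mapping instead of guessing.
# Assumes Elasticsearch 7+ at localhost:9200; "my-index" is a placeholder.
import requests

def count_fields(properties: dict) -> int:
    """Recursively count leaf fields, descending into object/nested fields."""
    total = 0
    for field in properties.values():
        if "properties" in field:
            # object or nested field: recurse into its sub-fields
            total += count_fields(field["properties"])
        else:
            total += 1
        # multi-fields (e.g. keyword sub-fields) also count
        total += len(field.get("fields", {}))
    return total

mapping = requests.get("http://localhost:9200/my-index/_mapping").json()
props = mapping["my-index"]["mappings"].get("properties", {})
print(count_fields(props))
```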

I get your point here, but it does look like we're seeing the same GC pattern even after all of our documents have been indexed, without adding any new fields.
