Field count v. performance

Is indexing speed affected by the total number of fields in an index?

I understand there are costs from a few areas:

  • cluster state size
  • disk efficiency
  • Lucene resources

And the "Mappings limit settings" doc says many fields can result in "performance degradations and memory issues", but someone at my company is specifically suggesting that we reduce our number of fields in order to improve our indexing speed. I suspect they are confusing fields with the tokens that go into the inverted index?

The link you provided is very old (2017) and, as mentioned in it, the handling of large numbers of fields has improved over time.

If you have a very large number of fields it will affect the amount of memory used, but it is also important to distinguish between large static mappings and mappings that are continuously growing through dynamic mapping. If the mappings are large and static, the cluster state does not need to be continuously updated, which reduces the impact. Every time a new field is added through dynamic mapping, the mappings and cluster state need to be updated and propagated, which adds a lot more overhead.
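To make the static case concrete, here is a minimal sketch of a create-index body with an explicit mapping and dynamic mapping turned off. The field names and the helper name static_index_body are made up for illustration. "dynamic": "strict" rejects documents containing unmapped fields, and index.mapping.total_fields.limit is the setting described in the "Mappings limit settings" doc. The sketch only builds and prints the JSON body and does not talk to a cluster.

```python
import json


def static_index_body(field_names, total_fields_limit=1000):
    """Build a create-index body with a fixed, explicit mapping.

    field_names and total_fields_limit are illustrative inputs,
    not values taken from this thread.
    """
    return {
        "settings": {
            # Upper bound on the number of mapped fields in the index.
            "index.mapping.total_fields.limit": total_fields_limit,
        },
        "mappings": {
            # "strict" rejects documents that contain unmapped fields,
            # so the mapping (and cluster state) never grows at index time.
            "dynamic": "strict",
            "properties": {name: {"type": "keyword"} for name in field_names},
        },
    }


body = static_index_body(["host", "service", "status"])
print(json.dumps(body, indent=2))
```

With "dynamic": false instead of "strict", unmapped fields would be accepted but left unindexed, which also keeps the mapping from growing.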

When it comes to indexing speed, the size of the documents, and as a result the amount of work that needs to be done per document, is very important. If the large field count is a result of larger documents with a lot of added data, it will have a larger impact on indexing performance than smaller documents that contain only a small subset of all the defined fields.

I would therefore recommend you benchmark the different options to see what the impact is on your use case based on the actual data you have.

Thanks for your reply, Christian.

We've got a couple dozen dynamically mapped indices, each with 2000 fields max, but only a couple are growing at any given time. It seems like the cluster is handling state propagation well. My concern is more indexing performance.

Our average document size has remained the same despite a growing count of fields.

Would you expect two documents with same amount of data to index at roughly the same speed independent of field count? For example,

{"field1":"token1 token2 token3 token4 token5 token6"}
{"field1":"token1 token2 token3","field2":"token4 token5 token6"}

(Given both fields already exist...) I'm betting the vast majority of the indexing work comes from the number of tokens, not the number of fields?

There must be some processing to manage the distinct fields. I might try some benchmarking to see what percentage of the total that is.

Tests I have done in the past have primarily been with documents of different size, which matters. I would assume your mappings could have an impact, e.g. if you are using complex analysers or have multiple subfields, as that results in more work. I am not sure anyone can give you a definitive answer as it will depend on a number of factors, so your best bet is to benchmark it yourself.

Thanks again, Christian.

I can appreciate the hedging given the complexity of ES and variety of configurations out there.

I think what we're saying here is, for indexing performance, all things equal, document size matters more than field count, probably by a large margin.

Just how much margin is a question. The best way to know for your particular situation is to benchmark, varying the factors you're comparing.

I might start by testing one hundred tokens in one field versus one hundred tokens spread across 100 fields.
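A rough sketch of how the two document shapes for that test could be generated, assuming the same token naming as the example earlier in the thread. make_docs and to_bulk are hypothetical helpers, and the index name is a placeholder. The output is a bulk-API-style NDJSON body (an action line followed by a source line per document) that could be fed to whatever benchmark harness is used.

```python
import json


def make_docs(tokens_per_doc=100, field_count=100):
    """Return (single_field_doc, spread_doc) holding the same tokens."""
    if field_count < 1 or tokens_per_doc % field_count != 0:
        raise ValueError("tokens_per_doc must be a multiple of field_count")
    tokens = [f"token{i}" for i in range(1, tokens_per_doc + 1)]
    # Shape 1: every token in one field.
    single = {"field1": " ".join(tokens)}
    # Shape 2: the same tokens split evenly across field1..fieldN.
    per_field = tokens_per_doc // field_count
    spread = {
        f"field{i + 1}": " ".join(tokens[i * per_field:(i + 1) * per_field])
        for i in range(field_count)
    }
    return single, spread


def to_bulk(docs, index_name):
    """Serialize docs as an NDJSON bulk body (action line + source line)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # A bulk body must end with a newline.
    return "\n".join(lines) + "\n"


single, spread = make_docs(tokens_per_doc=100, field_count=100)
bulk_body = to_bulk([single] * 3, "bench-single-field")  # placeholder index name
print(len(single["field1"].split()), len(spread))
```

Pre-creating all the fields in the mapping before the run keeps dynamic-mapping updates out of the measurement, so only the per-field indexing cost is compared.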

If I remember correctly Filebeat used to deploy (not sure if it still does or not) with a very large template covering all fields for all modules and the field limit was increased as a result. Each document would naturally only contain a subset of the fields. If the size of static mappings (not dynamic) was a major issue with respect to indexing performance I do not think they would have done that.

I recently gave a talk about this during our engineering all-hands session. The test was conducted using esrally: indexing 100% of fields vs. 30%, and the overall impact on resource utilization during ingestion.

Awesome to see actual benchmarking on this topic already, that's lucky coincidence! Thanks!

What do you mean by "100% Field Indexing" and 30%? Does that mean each document uses 100% of all the mapped fields in an index v. 30%?

What are the relative data sizes of the 100% and 30% documents?

That is correct: index 100% of the data vs. 30%. I use the logs track in esrally here: rally-tracks/elastic/logs at master · elastic/rally-tracks · GitHub. I believe approximately 15 log sources are generated. Each log source's mapping is here: rally-tracks/elastic/logs/templates/composable at master · elastic/rally-tracks · GitHub

I'm not clear on what "index 100% of data vs 30%" means, I'm sorry. I'll try running my own benchmarks (with Rally) to see if that can help make it clearer for me.

Oh, by the way, Jess from the Elastic Stack Slack pointed me to this Opster article on reducing field counts which mentions these benefits:

reducing index size

I'm guessing there's negligible increase in index size due to a bunch of unused fields? Are we talking about the size of the inverted index table itself? I think that would be peanuts versus 2 billion documents? (Is there a way to look at the size of these things, these inverted index table resources? Maybe /<index>/_stats or /<index>/_segments?)
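On looking at per-field sizes: besides _stats and _segments, recent Elasticsearch versions have an analyze index disk usage API (POST /<index>/_disk_usage?run_expensive_tasks=true) that reports storage per field. The sketch below ranks fields by inverted index size from such a response. The response layout it expects is my assumption of that API's shape, and the sample numbers are invented purely to exercise the function, so check both against a real response from your version.

```python
def inverted_index_bytes(disk_usage_response, index_name):
    """Rank fields by inverted index size, largest first.

    Assumes a response shaped like
    {index_name: {"fields": {field: {"inverted_index": {"total_in_bytes": N}}}}}.
    """
    fields = disk_usage_response[index_name]["fields"]
    sizes = {
        name: stats.get("inverted_index", {}).get("total_in_bytes", 0)
        for name, stats in fields.items()
    }
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


# Invented sample in the assumed shape; not real measurements.
sample = {
    "my-index": {
        "fields": {
            "message": {"inverted_index": {"total_in_bytes": 5000}},
            "host": {"inverted_index": {"total_in_bytes": 120}},
            "unused_field": {"inverted_index": {"total_in_bytes": 0}},
        }
    }
}

ranked = inverted_index_bytes(sample, "my-index")
for name, size in ranked:
    print(f"{name}: {size} bytes")
```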

improving query performance

I think what Opster means by this is that for queries that don't specify a field, all searchable fields will be searched. If we've got lots of fields, this probably adds substantial burden versus, say, searching just message.
I did a few queries just now and it can be as much as a 3.3x speed-up to search message explicitly.
This actually looks like a substantial benefit to reducing field count, though it'll depend on how much we can reduce it. There is a reason we're indexing lots of fields -- many of them are valuable to search or filter or aggregate against distinctly. The other way to get this value is to let our users know they get better performance being specific about the fields they want to search.
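For anyone wanting to show users the difference, here is a small sketch of the two request bodies being compared, assuming the searches go through a query_string query. Without default_field, query_string falls back to the index's index.query.default_field setting, which defaults to all eligible fields. The search text and helper names are placeholders.

```python
import json


def all_fields_query(text):
    """query_string with no field: falls back to index.query.default_field."""
    return {"query": {"query_string": {"query": text}}}


def single_field_query(text, field="message"):
    """Same query restricted to one field via default_field."""
    return {"query": {"query_string": {"query": text, "default_field": field}}}


broad = all_fields_query("connection timeout")
narrow = single_field_query("connection timeout")
print(json.dumps(broad))
print(json.dumps(narrow))
```

Sending both bodies to the same index and comparing the "took" value in each response is a simple way to reproduce the kind of speed-up mentioned above on your own data.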

(Thanks, Jess!)

Hi @rsk0,

Glad you found the article helpful!

Best,
Jess