And the "Mappings limit settings" doc says many fields can result in "performance degradations and memory issues", but
someone at my company is suggesting that we reduce our number of fields in order to improve our indexing speed. I'm wondering whether they're confusing fields with the tokens stored in the inverted index?
The link you provided is very old (2017), and as mentioned there, the handling of large numbers of fields has improved over time.
If you have a very large number of fields it will affect the amount of memory used, but it is also important to distinguish between the case where you have large static mappings and the case where you are continuously adding to the mappings through dynamic mapping. If the mappings are large and static the cluster state does not need to be continuously updated, which reduces the impact. Every time a new field is added through dynamic mapping the mappings and cluster state need to be updated and propagated, which adds a lot more overhead.
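To make the dynamic-mapping part concrete, here is the kind of thing I mean (just a sketch, not from your setup; the index name and the local URL are made up):

```python
# Sketch: watch the mapping grow as dynamic mapping picks up a new field.
# Assumes a cluster at localhost:9200 and a made-up index name "dyn-demo".
import requests

ES = "http://localhost:9200"

# Index a document containing a field the index has never seen before.
requests.post(f"{ES}/dyn-demo/_doc", json={"brand_new_field": "hello"})

# The field now appears in the mapping. Adding it required a mapping update,
# which is part of the cluster state and has to be propagated to all nodes.
mapping = requests.get(f"{ES}/dyn-demo/_mapping").json()
props = mapping["dyn-demo"]["mappings"]["properties"]
print(len(props), "fields mapped:", sorted(props))
```

With a large static template none of that per-document mapping churn happens, which is the difference I am describing.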
When it comes to indexing speed, the size of the documents, and as a result the amount of work that needs to be done per document, is very important. If the large field count is the result of larger documents with a lot of added data, it will have a larger impact on indexing performance than smaller documents containing only a small subset of all the defined fields.
I would therefore recommend you benchmark the different options to see what the impact is on your use case based on the actual data you have.
We've got a couple dozen dynamically mapped indices, each with 2000 fields max, but only a couple are growing at any given time. It seems like the cluster is handling state propagation well. My concern is more indexing performance.
Our average document size has remained the same despite a growing count of fields.
Would you expect two documents with the same amount of data to index at roughly the same speed, independent of field count? For example:
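(Sketching what I mean with made-up field names and values -- the same handful of tokens, once in a single field and once spread across several fields:)

```python
# Hypothetical documents, purely for illustration.
doc_one_field = {
    "message": "user logged in from 10.0.0.1 via api success"
}
doc_many_fields = {
    "event": "login",
    "user": "user",
    "source_ip": "10.0.0.1",
    "channel": "api",
    "outcome": "success",
}
```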
Tests I have done in the past have primarily been with documents of different sizes, which matters. I would assume your mappings could have an impact, e.g. if you are using complex analysers or have multiple subfields, as that results in more work. I am not sure anyone can give you a definitive answer as it will depend on a number of factors, so your best bet is to benchmark it yourself.
I can appreciate the hedging given the complexity of ES and variety of configurations out there.
I think what we're saying here is, for indexing performance, all things equal, document size matters more than field count, probably by a large margin.
Just how much margin is a question. The best way to know for your particular situation is to benchmark, varying the factors you're comparing.
I might start by testing one hundred tokens in one field versus one hundred tokens spread across 100 fields.
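Something quick and dirty along these lines, perhaps (not a proper Rally benchmark, just a back-of-the-envelope sketch; the URL, index names, and document contents are all invented):

```python
# Rough timing comparison: 100 tokens in one field vs the same 100 tokens
# spread across 100 fields. Assumes a local cluster at localhost:9200.
import json
import time

import requests

ES = "http://localhost:9200"
TOKENS = [f"token{i}" for i in range(100)]


def bulk_index(index, docs):
    """Bulk-index a list of dicts and return the elapsed wall-clock seconds."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"
    start = time.time()
    requests.post(
        f"{ES}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    return time.time() - start


# Shape A: all tokens in one field. Shape B: one token per field, 100 fields.
one_field = [{"message": " ".join(TOKENS)} for _ in range(5000)]
many_fields = [
    {f"field_{i}": tok for i, tok in enumerate(TOKENS)} for _ in range(5000)
]

print("1 field   :", round(bulk_index("bench-one-field", one_field), 2), "s")
print("100 fields:", round(bulk_index("bench-many-fields", many_fields), 2), "s")
```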
If I remember correctly, Filebeat used to deploy (not sure if it still does or not) with a very large template covering all fields for all modules, and the field limit was increased as a result. Each document would naturally only contain a subset of the fields. If the size of static mappings (not dynamic) was a major issue with respect to indexing performance, I do not think they would have done that.
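For reference, the setting involved is index.mapping.total_fields.limit, which defaults to 1000. Raising it looks something like this (sketch only; the index name and URL are placeholders):

```python
# Raise the per-index field limit on an existing index.
import requests

ES = "http://localhost:9200"
requests.put(
    f"{ES}/my-index/_settings",
    json={"index.mapping.total_fields.limit": 2500},
)
```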
I recently gave a talk/presentation about this during our engineering all-hands session. The test was conducted using esrally: indexing 100% of fields vs 30% of them, and measuring the overall impact on resource utilization during ingestion.
I'm not clear on what "index 100% of data vs 30%" means, I'm sorry. I'll try running my own benchmarks (with Rally) to see if that can help make it clearer for me.
I'm guessing there's a negligible increase in index size due to a bunch of unused fields? Are we talking about the size of the inverted index itself? I think that would be peanuts compared to 2 billion documents? (Is there a way to look at the size of these things, these inverted index resources? Maybe /<index>/_stats or /<index>/_segments?)
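For my own notes, the kind of lookups I have in mind (placeholder index name and URL; the exact response fields may differ between versions):

```python
# Inspect on-disk size via _stats and per-segment sizes via _segments.
import requests

ES = "http://localhost:9200"
IDX = "my-index"

stats = requests.get(f"{ES}/{IDX}/_stats").json()
primaries = stats["indices"][IDX]["primaries"]
print("primary store size (bytes):", primaries["store"]["size_in_bytes"])

segments = requests.get(f"{ES}/{IDX}/_segments").json()
for shard_id, copies in segments["indices"][IDX]["shards"].items():
    for copy in copies:
        for name, seg in copy["segments"].items():
            print("shard", shard_id, "segment", name, seg["size_in_bytes"], "bytes")
```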
"improving query performance"
I think what Opster means by this is that for queries that don't specify a field, all searchable fields will be searched. If we've got lots of fields, this probably adds substantial burden versus, say, searching just message.
I did a few queries just now and it can be as much as a 3.3x speed-up to search message explicitly.
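Roughly what I compared (placeholders for the index name, URL, and search term; the "took" numbers bounce around between runs):

```python
# Compare a query that names no field (so it runs against the fields covered
# by index.query.default_field, all eligible fields by default) with one
# scoped to just "message".
import requests

ES = "http://localhost:9200"
IDX = "my-index"

all_fields = requests.post(
    f"{ES}/{IDX}/_search",
    json={"query": {"query_string": {"query": "timeout"}}},
).json()
message_only = requests.post(
    f"{ES}/{IDX}/_search",
    json={"query": {"match": {"message": "timeout"}}},
).json()

print("no field specified:", all_fields["took"], "ms")
print("message only      :", message_only["took"], "ms")
```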
This actually looks like a substantial benefit to reducing field count, though it'll depend on how much we can reduce it. There is a reason we're indexing lots of fields: many of them are valuable to search, filter, or aggregate against individually. The other way to get this value is to let our users know they get better performance by being specific about the fields they want to search.