Very large number of fields in Index leading to slow index rate

anand.d · May 17, 2017, 1:25pm

Hi all,

I have to store 60 million+ documents in Elasticsearch. I am using BulkProcessor from the Java API with the Transport Client on Elasticsearch 5.3.1. When indexing a fixed set of fields (around 200), I was able to achieve a speed of 10k documents/second.
The challenge I am facing is that my data contains a very large number of unique dynamic fields (around 3-4 lakh, i.e. 300,000-400,000). This is what I have done so far:

Used different bulk sizes (from 100 to 1000).
Used a varying number of threads (2 to 15).
Increased "indices.memory.index_buffer_size" from 10% to 25% of the Java heap.
Increased "thread_pool.bulk.queue_size" to 5000.
I am using a refresh_interval of 30 sec.
JVM heap on each node is 18 GB out of 32 GB.
I have kept the number of replicas at 1. The reason is that if the JVM heap overshoots and nodes go down, bringing the cluster back up will not be a problem, since the data is replicated on other nodes.
I am clearing the cache every 10 minutes.
I tried setting "indices.store.throttle.type":"none" so that segmentation and merging won't happen at index time, but I am still not seeing any increase in index speed.

With the dynamic fields I am only able to see an index speed of 100 documents/second. Could someone guide me further on this?

Christian_Dahlqvist · May 17, 2017, 1:45pm

Every new field that is encountered needs to get a mapping defined dynamically and be added to the cluster state, which then needs to be propagated across the cluster. This is single threaded, meaning that adding a large number of dynamic fields will be slow. Having that many dynamic fields will be problematic, which is why Elasticsearch, at least from 5.x, limits the number of fields an index can hold. This is often referred to as mapping explosion, and should be avoided. Why do you have so many unique fields? Can the data be modelled in some other, more efficient way?
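
To get a feel for how close a data set is to that field limit, one can count the distinct field names in a sample of documents before indexing. The following is only a rough sketch: the sample documents and helper names are made up for illustration, and it counts leaf fields only, whereas Elasticsearch also counts object mappings, so the result is best read as a lower bound. The value 1000 is the default of index.mapping.total_fields.limit in 5.x.

```python
from typing import Any, Dict, Iterable, Set

# Default value of index.mapping.total_fields.limit in Elasticsearch 5.x.
DEFAULT_TOTAL_FIELDS_LIMIT = 1000


def field_paths(doc: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Return the dotted paths of all leaf fields in one document."""
    paths: Set[str] = set()
    for name, value in doc.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # Nested objects contribute their own leaf fields.
            paths |= field_paths(value, prefix=path + ".")
        else:
            paths.add(path)
    return paths


def count_distinct_fields(docs: Iterable[Dict[str, Any]]) -> int:
    """Count distinct leaf field paths across a sample of documents."""
    seen: Set[str] = set()
    for doc in docs:
        seen |= field_paths(doc)
    return len(seen)


# Hypothetical sample documents, not taken from the poster's data.
sample_docs = [
    {"title": "Blue shirt", "color": "blue", "size": "M"},
    {"title": "32 inch TV", "dimension": "32", "specs": {"hdmi_ports": 3}},
    {"title": "Oak table", "material": "oak", "color": "brown"},
]

distinct = count_distinct_fields(sample_docs)
print(distinct, "distinct fields; default limit is", DEFAULT_TOTAL_FIELDS_LIMIT)
```

Running this over a representative sample shows early on whether the data will hit the limit by a small margin or by orders of magnitude.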

anand.d · May 17, 2017, 2:10pm

Hi Christian,

Thanks for the quick reply. I have a lot of products, like those on Amazon. If I take the attributes across different product categories (clothes, shoes, furniture, electronics), the count easily crosses 1000. I also want to store the fields both as text and as keyword, so that I can run aggregations on them for analytics. Can you advise on how to store such data in Elasticsearch?

Christian_Dahlqvist · May 17, 2017, 2:21pm

How do you want to query/aggregate against these sparse fields? How are you going to use the data?

anand.d · May 17, 2017, 2:25pm

Aggregations can be based on any arbitrary field. For example, if I need to get all blue products on Amazon, I will search for "color":"blue", or if I want all 32" TVs, I will apply a filter such as "dimension":"32".
Please let me know if I am not making myself clear.

Christian_Dahlqvist · May 17, 2017, 2:28pm

Unfortunately I do not have any good suggestions, so I will have to leave it to the rest of the community.

anand.d · May 17, 2017, 2:30pm

Thanks anyway.

Mark_Harwood · May 17, 2017, 2:49pm

I am aware of one user who ran into this problem when running a large multi-tenancy solution where thousands of their customers shared a physical index.
Rather than allowing each customer to invent unique field names, they opted for a physical index with fixed banks of fields with reserved field names, e.g. string1, string2, float1, float2. They then had a layer in the application tier that translated between logical and physical names, e.g. customer X's logical field "dimension" is mapped to physical field "int7", while customer Z may use that same physical field to store price info.

This is less than ideal because it creates extra code in the application tier and only works if queries don't span types.
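
A translation layer of the kind described above could look roughly like the sketch below. This is not the implementation that user ran: the class name FieldBank, the slot count, and the tenant names are all hypothetical, and the physical type is simply guessed from the first value seen for each logical field.

```python
from typing import Any, Dict, Tuple


class FieldBank:
    """Maps each tenant's logical field names onto a fixed bank of physical fields."""

    def __init__(self, slots_per_type: int = 50) -> None:
        self.slots_per_type = slots_per_type
        # (tenant, logical name) -> physical name such as "int7"
        self._assigned: Dict[Tuple[str, str], str] = {}
        # (tenant, type prefix) -> number of slots already handed out
        self._used: Dict[Tuple[str, str], int] = {}

    @staticmethod
    def _prefix(value: Any) -> str:
        """Pick the bank (int, float or string) from the Python type of the value."""
        if isinstance(value, int) and not isinstance(value, bool):
            return "int"
        if isinstance(value, float):
            return "float"
        return "string"

    def physical_name(self, tenant: str, logical: str, value: Any) -> str:
        """Return the physical field for a tenant's logical field, allocating if new."""
        key = (tenant, logical)
        if key not in self._assigned:
            prefix = self._prefix(value)
            used = self._used.get((tenant, prefix), 0)
            if used >= self.slots_per_type:
                raise ValueError(f"no free {prefix} slots left for {tenant}")
            self._used[(tenant, prefix)] = used + 1
            self._assigned[key] = f"{prefix}{used + 1}"
        return self._assigned[key]

    def to_physical(self, tenant: str, doc: Dict[str, Any]) -> Dict[str, Any]:
        """Rewrite a logical document into the shared physical field names."""
        return {self.physical_name(tenant, k, v): v for k, v in doc.items()}


bank = FieldBank()
doc_x = bank.to_physical("customer_x", {"dimension": 32, "color": "blue"})
doc_z = bank.to_physical("customer_z", {"price": 1999, "brand": "acme"})
print(doc_x)
print(doc_z)
```

Both tenants end up writing to the same physical field names, so the index mapping stays at a fixed size no matter how many logical names the tenants invent. Queries need the same translation applied in reverse, which is the extra application code mentioned above.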

anand.d · May 17, 2017, 2:55pm

I don't think that will be possible for our requirement, since we want anyone to be able to use Kibana to perform aggregations on any field they want. So I just wanted to check on feasibility: has anyone succeeded in storing a very large number of fields in Elasticsearch?

anand.d · May 18, 2017, 5:04am

After reading different forums, it seems the only logical way of doing this is to index a limited number of chosen fields instead of all fields. Normalisation of similar fields is also a challenge.
Does anyone have any thoughts on this?

anand.d · May 18, 2017, 9:44am

I am proceeding by keeping a selected set of columns for aggregation in Kibana and putting the rest of the dynamic fields into an array under one column. It works well for searching across 60+ million products.
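
For anyone wanting to try the same approach, the reshaping step might look something like the sketch below. The details are assumptions, since the post does not give them: the list of selected fields, the catch-all field name "attributes", and the "name=value" string encoding are all hypothetical choices.

```python
from typing import Any, Dict, Iterable

# Hypothetical choice of fields kept as real columns for Kibana aggregations.
SELECTED_FIELDS = frozenset({"title", "category", "color", "brand", "price"})


def reshape(product: Dict[str, Any],
            selected: Iterable[str] = SELECTED_FIELDS) -> Dict[str, Any]:
    """Keep selected fields as-is and fold every other field into one array."""
    selected = set(selected)
    doc = {k: v for k, v in product.items() if k in selected}
    # Each remaining attribute becomes one "name=value" string in the array.
    extras = [f"{k}={v}" for k, v in sorted(product.items()) if k not in selected]
    if extras:
        doc["attributes"] = extras
    return doc


product = {
    "title": "32 inch TV",
    "category": "electronics",
    "color": "black",
    "dimension": "32",
    "hdmi_ports": 3,
}
print(reshape(product))
```

With this shape the mapping holds only the selected fields plus the single catch-all field, so new attribute names no longer trigger mapping updates. The trade-off is that the folded attributes can be searched but not aggregated individually in Kibana.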

system · June 15, 2017, 9:44am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
