Very large number of fields in index leading to slow indexing rate


(anand dubey) #1

Hi all,

I have to store 60 million+ documents in Elasticsearch. I am using BulkProcessor in the Java API with the Transport Client in Elasticsearch 5.3.1. When indexing a fixed set of fields (around 200), I am able to achieve a speed of 10k documents/second.
The challenge I am facing is that my data contains many unique dynamic fields (around 3-4 lakh, i.e. 300,000-400,000). I have tried the following so far:

  1. Used different bulk sizes. (from 100-1000).
  2. Used varying number of threads (2 to 15).
  3. Increased "indices.memory.index_buffer_size" from 10% to 25% of java heap memory.
  4. Increased "thread_pool.bulk.queue_size" to 5000.
  5. I am using a refresh_interval of 30 seconds.
  6. JVM heap memory for each node is 18 GB out of 32 GB.
  7. I have kept the number of replicas at 1. The reason is that if the JVM heap overshoots and nodes go down, bringing the cluster back up will not be an issue because the data is replicated on other nodes.
  8. I am clearing the cache every 10 minutes.
  9. I tried setting "indices.store.throttle.type":"none" so that segment merging would not be throttled at index time, but I still do not see any increase in indexing speed.

With the dynamic fields, I am only able to see an indexing speed of about 100 documents/second. Can someone guide me further on this?


(Christian Dahlqvist) #2

Every new field that is encountered needs to get a mapping defined dynamically and be added to the cluster state, which then needs to be propagated across the cluster. This is single threaded, meaning that adding a large number of dynamic fields will be slow. Having that many dynamic fields will be problematic, which is why Elasticsearch, at least from 5.x, limits the number of fields an index can hold. This is often referred to as mapping explosion, and should be avoided. Why do you have so many unique fields? Can the data be modelled in some other, more efficient way?
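
To get a feel for how close a data set comes to the field limit described above, one option is to count the distinct field names in a sample of documents before indexing. The sketch below is only an illustration: the class and method names are hypothetical, it looks at top-level fields only, and it assumes the default `index.mapping.total_fields.limit` of 1000. Multi-fields (for example text plus keyword) and nested objects also count toward the limit, so the real number will be higher than what this reports.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldCountCheck {
    // Assumed default of index.mapping.total_fields.limit in Elasticsearch 5.x.
    static final int DEFAULT_FIELD_LIMIT = 1000;

    // Collects the distinct top-level field names across a sample of documents.
    static Set<String> distinctFields(List<Map<String, Object>> docs) {
        Set<String> names = new TreeSet<>();
        for (Map<String, Object> doc : docs) {
            names.addAll(doc.keySet());
        }
        return names;
    }

    // True if the sample alone already has more distinct fields than the limit.
    static boolean exceedsLimit(List<Map<String, Object>> docs, int limit) {
        return distinctFields(docs).size() > limit;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> sample = new ArrayList<>();

        Map<String, Object> shirt = new HashMap<>();
        shirt.put("color", "blue");
        shirt.put("size", "M");
        sample.add(shirt);

        Map<String, Object> tv = new HashMap<>();
        tv.put("color", "black");
        tv.put("dimension", "32");
        sample.add(tv);

        System.out.println("Distinct fields: " + distinctFields(sample));
        System.out.println("Exceeds limit: " + exceedsLimit(sample, DEFAULT_FIELD_LIMIT));
    }
}
```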


(anand dubey) #3

Hi Christian,

Thanks for the quick reply. I have a lot of products, similar to those on Amazon. If I take the attributes across different product categories (clothes, shoes, furniture, electronics), the total easily crosses 1000. I also want to store the fields both as text and as keyword, so that I can aggregate on them for analytics. Can you advise on how to store such data in Elasticsearch?


(Christian Dahlqvist) #4

How do you want to query/aggregate against these sparse fields? How are you going to use the data?


(anand dubey) #5

Aggregation can be based on any arbitrary field. For example, if I need to get all blue products on Amazon, I will search for "color":"blue", or if I want all 32" TVs, I will apply a filter such as "dimension":"32".
Please let me know if I am not making myself clear.


(Christian Dahlqvist) #6

Unfortunately I do not have any good suggestions, so I will have to leave it to the rest of the community.


(anand dubey) #7

Thanks anyway.


(Mark Harwood) #8

I am aware of one user who ran into this problem when running a large multi-tenancy solution where thousands of their customers shared a physical index.
Rather than allowing each customer to invent unique field names, they opted for a physical index with fixed banks of fields with reserved field names, e.g. string1, string2, float1, float2. They then had a layer in the application tier that translated between logical and physical names, e.g. customer X's logical field "dimension" is mapped to physical field "int7", while customer Z may use that same field to store price info.

This is less than ideal because it creates extra code in the application tier and only works if queries don't span types.
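
A minimal sketch of what such a translation layer might look like, purely as an illustration. The class name `FieldBankTranslator`, its methods, and the slot-assignment policy (next free slot per tenant and type) are all assumptions, not the implementation that user actually built. A real version would also need to persist the assignments.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldBankTranslator {
    // tenant -> (logical field name -> physical field name)
    private final Map<String, Map<String, String>> assignments = new HashMap<>();
    // tenant -> (type prefix such as "int" or "string" -> slots used so far)
    private final Map<String, Map<String, Integer>> usedSlots = new HashMap<>();
    // Size of each bank, e.g. 10 means int1..int10, string1..string10.
    private final int slotsPerType;

    public FieldBankTranslator(int slotsPerType) {
        this.slotsPerType = slotsPerType;
    }

    // Returns the physical field for a tenant's logical field, assigning the
    // next free slot in the bank for that type on first use. An existing
    // assignment is returned as-is, regardless of the type prefix passed in.
    public String physicalField(String tenant, String logicalName, String typePrefix) {
        Map<String, String> tenantFields =
                assignments.computeIfAbsent(tenant, t -> new HashMap<>());
        String existing = tenantFields.get(logicalName);
        if (existing != null) {
            return existing;
        }
        Map<String, Integer> counters =
                usedSlots.computeIfAbsent(tenant, t -> new HashMap<>());
        int slot = counters.getOrDefault(typePrefix, 0) + 1;
        if (slot > slotsPerType) {
            throw new IllegalStateException(
                    "No free " + typePrefix + " fields left for tenant " + tenant);
        }
        counters.put(typePrefix, slot);
        String physical = typePrefix + slot;
        tenantFields.put(logicalName, physical);
        return physical;
    }

    public static void main(String[] args) {
        FieldBankTranslator translator = new FieldBankTranslator(10);
        // Two tenants can reuse the same physical field for different meanings.
        System.out.println(translator.physicalField("customerX", "dimension", "int"));
        System.out.println(translator.physicalField("customerZ", "price", "int"));
    }
}
```

The physical index then only ever sees the fixed bank of field names, so its mapping stays the same size no matter how many logical names the tenants invent.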


(anand dubey) #9

I don't think that will be possible for our requirement, since we want anyone to be able to use Kibana to perform aggregations on any field they want. So I just wanted to check whether anyone has succeeded in storing a very large number of fields in Elasticsearch, and whether there is a feasible solution to this problem.


(anand dubey) #10

After reading different forums, it seems the only logical way of doing this is to decide on a limited number of fields to be indexed instead of indexing all fields. Normalisation of similar fields is also a challenge.
Does anyone have any thoughts on this?


(anand dubey) #11

I am proceeding by keeping a selected number of columns for aggregation in Kibana and putting the rest of the dynamic fields as an array under one column. It works well for searching across 60+ million products.
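
For anyone trying the same approach, here is a rough sketch of how a product could be reshaped before indexing. The class name `AttributeFolder`, the catch-all field name `attributes`, and the "key:value" string encoding are assumptions for illustration; the exact format of the array is not prescribed by anything above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AttributeFolder {
    // Keeps the selected fields as real fields and folds every other field
    // into a single array of "key:value" strings under catchAllField.
    static Map<String, Object> fold(Map<String, Object> product,
                                    Set<String> keep,
                                    String catchAllField) {
        Map<String, Object> out = new LinkedHashMap<>();
        List<String> rest = new ArrayList<>();
        for (Map.Entry<String, Object> e : product.entrySet()) {
            if (keep.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            } else {
                rest.add(e.getKey() + ":" + e.getValue());
            }
        }
        out.put(catchAllField, rest);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> product = new LinkedHashMap<>();
        product.put("color", "blue");
        product.put("dimension", "32");
        product.put("fabric", "cotton");

        // Only "color" stays a first-class field; the rest go into "attributes".
        Set<String> keep = new HashSet<>(Arrays.asList("color"));
        System.out.println(fold(product, keep, "attributes"));
    }
}
```

With this shape the mapping contains only the selected fields plus one catch-all field, so new attribute names no longer trigger mapping updates.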

