I have to store some 60 million+ documents in Elasticsearch. I am using the BulkProcessor in the Java API with the Transport Client in Elasticsearch 5.3.1. With a fixed number of fields (around 200) I am able to achieve an indexing speed of 10k documents/second.
The challenge I am facing is that my data contains a very large number of unique, dynamically mapped fields (around 300,000 to 400,000). Here is what I have done so far:
Used different bulk sizes, from 100 to 1,000 (see the BulkProcessor sketch below).
Used a varying number of concurrent threads (2 to 15).
Increased "indices.memory.index_buffer_size" from 10% to 25% of the JVM heap.
Increased "thread_pool.bulk.queue_size" to 5,000.
I am using a refresh_interval of 30s.
The JVM heap for each node is 18 GB out of 32 GB of RAM.
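Concretely, the BulkProcessor is set up roughly like this (a sketch of my configuration; the cluster name, host and listener are simplified placeholders):

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

import java.net.InetAddress;

public class BulkIndexer {

    public static BulkProcessor buildProcessor() throws Exception {
        // Transport Client against the 5.3.1 cluster (cluster name and host are placeholders)
        TransportClient client = new PreBuiltTransportClient(
                Settings.builder().put("cluster.name", "my-cluster").build())
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("es-node-1"), 9300));

        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) { }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        })
        .setBulkActions(1000)                               // tried values from 100 to 1000
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))
        .setConcurrentRequests(8)                           // tried 2 to 15 concurrent requests
        .setFlushInterval(TimeValue.timeValueSeconds(5))
        .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
        .build();
    }
}
```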
I have kept the number of replicas at 1. The reason is that if the JVM heap overshoots and a node goes down, there will be no problem bringing the cluster back up, since the data is replicated on other nodes.
I am clearing the caches every 10 minutes.
I tried setting "indices.store.throttle.type": "none" so that merging would not be throttled during indexing, but I am still not seeing any increase in indexing speed.
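The dynamic index-level settings are applied roughly like this (a sketch; "products" stands in for my index name, and node-level settings such as the index buffer size and bulk queue size go in elasticsearch.yml instead):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class IndexSettingsHelper {

    // Per-index dynamic settings described above; node-level settings
    // (indices.memory.index_buffer_size, thread_pool.bulk.queue_size)
    // are configured in elasticsearch.yml, not here.
    public static void applySettings(Client client) {
        client.admin().indices().prepareUpdateSettings("products")
                .setSettings(Settings.builder()
                        .put("index.refresh_interval", "30s")  // refresh every 30 seconds
                        .put("index.number_of_replicas", 1)    // keep one replica per shard
                        .build())
                .get();
    }
}
```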
With the dynamic fields, however, I am only seeing an indexing speed of around 100 documents/second. I need someone to guide me further on this.
Every new field that is encountered needs a mapping defined dynamically and added to the cluster state, which then needs to be propagated across the cluster. Cluster state updates are single threaded, which means that adding a large number of dynamic fields will be slow. Having that many fields will be problematic, which is why Elasticsearch, at least from 5.x, limits the number of fields an index can hold. This is often referred to as mapping explosion and should be avoided. Why do you have so many unique fields? Can the data be modelled in some other, more efficient way?
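For reference, the 5.x limit is the index.mapping.total_fields.limit setting, which defaults to 1000. It can be raised, roughly as sketched below, but that only postpones the cluster-state problem rather than solving it:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class FieldLimit {

    // Raise the per-index field limit (default 1000). This does not make
    // dynamic mapping updates any cheaper; every new field still triggers
    // a single-threaded cluster state update.
    public static void raiseFieldLimit(Client client, String indexName, int limit) {
        client.admin().indices().prepareUpdateSettings(indexName)
                .setSettings(Settings.builder()
                        .put("index.mapping.total_fields.limit", limit)
                        .build())
                .get();
    }
}
```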
Thanks for the quick reply. I have a lot of products, like those on Amazon, and if I take the attributes across different product categories (clothes/shoes/furniture/electronics) the number of fields easily crosses 1,000. I also want to store the fields both as text and keyword so that I can run aggregations on them for analytics. Can you advise on how to store such data in Elasticsearch?
Aggregations can be based on any arbitrary field. For example, if I need all blue products, I will search for "color": "blue", and if I want all 32'' TVs, I will apply a filter such as "dimension": "32".
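For what it's worth, the kind of mapping I have in mind is a dynamic template like the sketch below, so that every new string field is indexed as text with a keyword sub-field for aggregations (the index, type and sub-field names are just examples; I understand 5.x already does something similar by default for new strings):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

import java.io.IOException;

public class ProductMapping {

    // Create the index with a dynamic template: every new string field becomes
    // a "text" field with a "raw" keyword sub-field usable for aggregations.
    public static void createIndex(Client client) throws IOException {
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .startArray("dynamic_templates")
                    .startObject()
                        .startObject("strings_as_text_and_keyword")
                            .field("match_mapping_type", "string")
                            .startObject("mapping")
                                .field("type", "text")
                                .startObject("fields")
                                    .startObject("raw")
                                        .field("type", "keyword")
                                        .field("ignore_above", 256)
                                    .endObject()
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject()
                .endArray()
            .endObject();

        client.admin().indices().prepareCreate("products")
                .addMapping("product", mapping)
                .get();
    }
}
```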
Please let me know if I am not making myself clear.
I am aware of one user who ran into this problem when running a large multi-tenancy solution where thousands of their customers shared a physical index.
Rather than allowing each customer to invent unique field names, they opted for a physical index with fixed banks of fields with reserved names, e.g. string1, string2, float1, float2. They then had a layer in the application tier that did the translation: e.g. customer X's logical field "dimension" is mapped to the physical field "int7", while customer Z may use that same field to store price info.
Less than ideal because it creates extra code in the application tier and only works if queries don't span types.
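Very roughly, that translation layer amounts to something like the sketch below (the class and method names are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldTranslator {

    // Per-tenant mapping from the customer's logical field name to one of the
    // reserved physical fields (string1..stringN, int1..intN, float1..floatN).
    private final Map<String, Map<String, String>> logicalToPhysical = new HashMap<>();

    public void register(String tenant, String logicalField, String physicalField) {
        logicalToPhysical
                .computeIfAbsent(tenant, t -> new HashMap<>())
                .put(logicalField, physicalField);
    }

    // Rewrite a document's field names before it is handed to the bulk indexer.
    public Map<String, Object> toPhysical(String tenant, Map<String, Object> logicalDoc) {
        Map<String, String> mapping = logicalToPhysical.getOrDefault(tenant, new HashMap<>());
        Map<String, Object> physicalDoc = new LinkedHashMap<>();
        logicalDoc.forEach((field, value) ->
                physicalDoc.put(mapping.getOrDefault(field, field), value));
        return physicalDoc;
    }
}
```

So customer X's "dimension" would be registered against "int7", and the same lookup is applied in reverse when building queries and rendering results.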
I don't think that will be possible for our requirement, since we want anyone to be able to use Kibana to perform aggregations on any field they want. So I just wanted to check on the feasibility of, or a solution to, this problem, in case anyone has succeeded in storing a very large number of fields in Elasticsearch.
After reading different forums, it seems the only logical way of doing this is to decide on a limited number of fields to be indexed instead of indexing everything. Normalising similar fields is also a challenge.
Does anyone have any thoughts on this?
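One option I am considering is to map only the fields we actually aggregate on and set dynamic to false for the rest, so that unmapped attributes stay in _source without creating new mappings (a sketch; the field names are just examples of what we might keep):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

import java.io.IOException;

public class LimitedMapping {

    // Map only the fields used for aggregations; "dynamic": false keeps any
    // other attribute in _source (still returned with hits) without adding it
    // to the mapping or the cluster state.
    public static void createIndex(Client client) throws IOException {
        XContentBuilder mapping = XContentFactory.jsonBuilder()
            .startObject()
                .field("dynamic", false)
                .startObject("properties")
                    .startObject("category").field("type", "keyword").endObject()
                    .startObject("brand").field("type", "keyword").endObject()
                    .startObject("color").field("type", "keyword").endObject()
                    .startObject("price").field("type", "double").endObject()
                .endObject()
            .endObject();

        client.admin().indices().prepareCreate("products")
                .addMapping("product", mapping)
                .get();
    }
}
```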
I am proceeding with keeping a selected number of fields for aggregation in Kibana, and the rest of the dynamic fields I am simply putting as an array under one field. It works well for searching across 60+ million products.
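In case it helps anyone else, each product document now looks roughly like the sketch below when I build it (the field names, the attribute encoding and the index/type names are just examples from my setup):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ProductDocBuilder {

    // A fixed set of fields is kept for Kibana aggregations; every other
    // dynamic attribute is flattened into a single "attributes" array of
    // "name:value" strings, so no new mappings are created per attribute.
    public static void index(BulkProcessor bulkProcessor, String id) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "32 inch LED TV");
        doc.put("category", "electronics");
        doc.put("brand", "SomeBrand");
        doc.put("color", "black");
        doc.put("price", 299.99);
        doc.put("attributes", Arrays.asList(
                "dimension:32", "resolution:1080p", "hdmi_ports:2"));

        bulkProcessor.add(new IndexRequest("products", "product", id).source(doc));
    }
}
```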