I have an Elasticsearch cluster with two nodes and about 53 indices. There are two large indices: one with 1.5M documents [size: 93.93MB] and another with 2.3M documents [size: 354.61MB]. Not a huge amount of data.
I'm not seeing any issues with search or with _bulk PUT operations.
UNLOAD
But when I take a backup of an index using elasticdump it is insanely slow: it takes about 6 hours to finish unloading the documents.
Initially I thought it was because of the write load. The index receives new documents every 2 minutes, so the unload never finishes; it keeps exporting documents as new ones come in.
Then I stopped the scripts that POST the new docs. It is still slow; the unload takes hours and hours to complete.
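For context, the unload is a plain elasticdump run roughly like the following (host, index name, and output path are placeholders, not my real setup):

```
# Dump one index to a local JSON file.
# --limit is the per-batch document count (elasticdump's default is 100);
# a larger batch is one of the knobs I could still experiment with.
elasticdump \
  --input=http://localhost:9200/my_big_index \
  --output=/backups/my_big_index_data.json \
  --type=data \
  --limit=5000
```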
DELETE
In addition to the unload, I run scheduled maintenance on my indices every day, deleting all documents older than 60 days [using the delete_by_query API]. The smaller indices finish in a few seconds to a few minutes, but the big one with 1.5M docs is dead slow; it has been running for about 4 hours and is still going.
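The cleanup is essentially a range delete on a date field, something like this (index and field names are placeholders for whatever the documents actually use):

```
# Delete everything older than 60 days from one index.
curl -XPOST "http://localhost:9200/my_big_index/_delete_by_query" \
  -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-60d"
      }
    }
  }
}'
```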
Am I doing something wrong here? Is it recommended to store millions of documents in one index?
Please advise. It is taking a big toll on performance.
Okay, thank you. I like the idea of having the indices by date, thus limiting the data in each index. But the problem is that if I use the bulk API to create indices dynamically, the "string" fields in the mapping get created as "analyzed" by default, which means extra storage space is used for the analysis.
So, as the initial setup, I create an index and mappings specifying "not_analyzed" and then send the data to it.
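Concretely, today's manual setup looks something like this (index, type, and field names are placeholders; this is the pre-5.x "string"/"not_analyzed" mapping style my cluster uses):

```
# Create the index up front with explicit not_analyzed string fields,
# instead of letting the bulk API create an analyzed mapping dynamically.
curl -XPUT "http://localhost:9200/logs-2016.01.01" \
  -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "event": {
      "properties": {
        "host":    { "type": "string", "index": "not_analyzed" },
        "message": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```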
I don't want any of the "terms" in my mapping to be analyzed. How do I create one index per day with all of the fields as "not_analyzed"?