I have just uploaded 4 years of data (about 4 GB) into an AWS Elasticsearch cluster.
I have indexed the documents by day.
Since I stuck with the default of 5 primary shards per index, the cluster has now grown to more than 7,300 primary shards.
I had also stuck with Elasticsearch's dynamic mapping, but I want to define my own index template now.
I have thought of different options to reduce the shard count:

1. Reindexing the data one index at a time
   - I would need to write a script to reindex each one (a rough `_reindex` sketch is shown after this list).
2. Shrinking each index
   - I don't think I can change the mapping of the index this way (a rough shrink sketch is also shown after this list).
   - I still have to go through the indices one by one.
3. Bulk indexing
   - I don't think it would be a good idea to add an index settings/mapping line to each document record (there are roughly 2 million documents).
4. Uploading the data again
   - 4.a Using Logstash with Elasticsearch as both input and output
   - 4.b Uploading from scratch
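For option 1, this is roughly what I imagine each reindex call would look like; the index names `logs-2017.06.01` and `logs-2017` are just examples, not my real ones:

```
POST _reindex
{
  "source": { "index": "logs-2017.06.01" },
  "dest":   { "index": "logs-2017" }
}
```

The destination index would have to be created (or matched by the new template) with the desired shard count and mapping before reindexing into it.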
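For option 2, my understanding is that the index first has to be made read-only with a copy of every shard on one node, and that `_shrink` can only reduce the shard count, not change the mapping. Roughly (the index and node names are just examples):

```
PUT logs-2017.06.01/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "node-1"
}

POST logs-2017.06.01/_shrink/logs-2017.06.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0
  }
}
```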
That is indeed far too many shards. If you only have 4GB of data, you should be fine using a yearly index with 1 primary shard. It may actually be easier and more efficient to delete it all and index it again from scratch using the correct template rather than trying to reindex that many indices.
You might get away with monthly indices as well, but it all depends on your cluster spec and data volume. Having lots of very small shards is very inefficient.
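If you do reload from scratch, something along these lines as an index template would give every new yearly index a single primary shard. The index pattern, field names and replica count below are only placeholders, and the exact syntax depends on your version (5.x uses a `template` key instead of `index_patterns`, and 6.x needs the mapping wrapped in a type name):

```
PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "message":    { "type": "text" }
    }
  }
}
```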
It is about 2-3 GB per year.
We are a growing company, and our growth is more than 20% per year.
We plan to keep the data for at least 5-10 years.
It depends on team demand, but they will expect queries to cover more than 5 years of data, mostly for visualizations.
I would recommend going for a yearly index with 1 or 2 primary shards initially. If you realise later on that the shards are getting too big, you can easily adjust it then and switch to a higher number of primary shards or to monthly/quarterly indices.
Thank you very much for your advice.
And I don't think it would be a problem to query the indices using a wildcard if we change the index pattern to monthly/quarterly later on.
Sorry to poke you again: how can I write the index name in the Logstash Elasticsearch output so that it corresponds to a quarterly index pattern?
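For example, I assume a query like this would keep working whether the indices end up being yearly, monthly or quarterly, as long as they share the same prefix (the `logs-` prefix and the timestamp field are just examples):

```
GET logs-*/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "now-5y" } }
  }
}
```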
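As far as I can tell there is no built-in quarter token in the `%{+...}` date format of the index option, so the best I have come up with is computing a quarter label in a ruby filter and referencing it from the output (assuming the Logstash 5.x+ ruby event API; the hosts and the index prefix are placeholders). Is something like this reasonable?

```
filter {
  ruby {
    # store a label like "2018-q3" in @metadata so it is not indexed with the event
    code => "
      t = event.get('@timestamp').time
      event.set('[@metadata][quarter]', format('%d-q%d', t.year, ((t.month - 1) / 3) + 1))
    "
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{[@metadata][quarter]}"
  }
}
```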