I have a dataset of 43,536,000 fields (I can't make it smaller; I have already removed fields).
With the default Elasticsearch limits (1,000 shards per node and 1,000 fields per index) I would have to build a 44-node cluster: 43,536,000 fields at 1,000 fields per index means roughly 43,536 indices, so at least 43,536 shards, and at 1,000 shards per node that is about 44 nodes.
I could raise the number of fields per index or the number of shards per node, but everywhere I read that this is not recommended.
Am I missing something? I see lots of success stories with large datasets.
How would one structure Elasticsearch to accommodate such a large dataset?
Fields are included in mappings, and mappings are included in the cluster state, which is distributed among the nodes. Too many fields will therefore lead to heavy traffic between nodes. Why do you need so many fields? Maybe you can restructure the documents as nested documents. Here is a helpful document: https://www.elastic.co/cn/blog/found-beginner-troubleshooting#keyvalue-woes
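To illustrate the key/value idea from that article, here is a minimal sketch (the index name my-index and the field names attrs, key and value are placeholders, not anything from your data). Instead of one mapping field per attribute name, each attribute becomes a nested key/value pair, so the mapping stays small no matter how many distinct attribute names exist:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "attrs": {
        "type": "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "text" }
        }
      }
    }
  }
}

PUT my-index/_doc/1
{
  "attrs": [
    { "key": "someId_123_title", "value": "example title" },
    { "key": "someId_123_body",  "value": "example body text" }
  ]
}
```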
So my original data is XML converted to JSON; here is an example. I flattened the data to this because the documents had a lot of mapping errors (within the same document).
All the strings in this document need to be searchable. So far I've done this by creating a large text field that collects all the strings via the copy_to functionality.
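Roughly what that looks like in a mapping (just a sketch; all_text and the source field names are placeholders, not the actual mapping):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "all_text": { "type": "text" },
      "title":    { "type": "text", "copy_to": "all_text" },
      "author":   { "type": "text", "copy_to": "all_text" }
    }
  }
}
```

The values of title and author are copied into all_text at index time, so a single match query on all_text searches everything.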
"fields" as in "fields within 1 document here is an example of 1 small document but some documents have 20.000 fields.
My original data is heavily nested XML; here is an example of one document/XML.
My first step was to convert the XML to JSON, but the converted JSON had a lot of mapping errors. It would be impossible (time-wise) to go through all 945 XML files to fix the mapping errors, so I flattened the JSON (this is how I ended up with so many fields).
All the strings and dates need to be searchable. I tested with the index: false, enabled: false and store: false settings, but to use the highlight and prefix suggestion functionalities I have to use the default settings, if I'm not mistaken.
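For reference, a sketch of the kind of settings I mean (hypothetical field names, not my real mapping); a text field with index: false can no longer be queried or highlighted, while an object with enabled: false is stored but not parsed at all:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "searchable_text":   { "type": "text" },
      "unsearchable_text": { "type": "text", "index": false },
      "raw_payload":       { "type": "object", "enabled": false }
    }
  }
}
```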
Can you provide an example? Would it not be possible to create a new field containing the identifier so you can filter on this and thereby dramatically reduce the number of fields? Having entity IDs in field names is generally bad practice.
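For instance (a hypothetical sketch, not your actual data), instead of baking the identifier into the field name, make it a value and filter on it; this assumes invoice_id is mapped as a keyword:

```
PUT my-index/_doc/1
{ "invoice_id": "4711", "amount": 100 }

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "invoice_id": "4711" } } ]
    }
  }
}
```

With this shape, adding new identifiers adds documents, not new mapping fields.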
I do not understand from the example what is driving the creation of 50 million fields. The example you provided looks quite simple. Can you please elaborate?