Structure Elasticsearch for 50 million fields

Hi!

I have a dataset with 43,536,000 fields (I can't make it smaller; I have already removed fields).
With the default Elasticsearch limits (1,000 shards per node and 1,000 fields per index) that comes to roughly 43,536 indices, and at one shard per index I would have to build about 44 nodes in my Elasticsearch cluster.
I can raise the number of fields per index or the number of shards per node, but I read everywhere that this is not recommended.
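
For reference, raising those limits would look roughly like this (a sketch with illustrative values, using the standard index.mapping.total_fields.limit and cluster.max_shards_per_node settings), but this is exactly what everyone warns against:

PUT my-index/_settings
{
  "index.mapping.total_fields.limit": 2000
}

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}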

Am I missing something? I see lots of success stories with large datasets.

How would one structure Elasticsearch to accommodate such a large dataset?

Fields are part of the mappings, and mappings are part of the cluster state, which is distributed among the nodes. Too many fields therefore leads to a lot of traffic between nodes. Why do you need so many fields? Maybe you can restructure your documents using nested documents. Here is a helpful article: https://www.elastic.co/cn/blog/found-beginner-troubleshooting#keyvalue-woes
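
As a rough sketch of the key/value idea from that article (the index and field names laws-kv, attributes, key and value are just illustrative), every XML path becomes the value of a single key field inside a nested object, so the mapping stays the same size no matter how many distinct paths exist:

PUT laws-kv
{
  "mappings": {
    "properties": {
      "attributes": {
        "type": "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "text" }
        }
      }
    }
  }
}

You would then search with a nested query that filters on attributes.key and runs a match on attributes.value.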

What do you mean by "fields"? Do you really mean fields or documents? Do you have a sample document to share?

My original data is XML converted to JSON; here is an example. I flattened the data into this form because the document produced a lot of mapping errors (within the same document).

All the strings in this document need to be searchable. So far I've done this by creating one large text field that gathers all the strings via the copy_to functionality:

PUT test
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties" : {
      "full_text": {
        "type": "text",
        "store": true
      },
      "suggest" : {
        "type" : "completion"
      }
    },
    "dynamic_templates": [
      {
        "strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword",
            "copy_to": [ "full_text", "suggest" ]
          }
        }
      }
    ]
  }
}
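
To illustrate how I use it (a simplified sketch; the field names and values are made up, not my real data): every string picked up by the dynamic template is stored as a keyword and copied into full_text and suggest, and I search against full_text.

PUT test/_doc/1
{
  "titel": "nieuwe regeling",
  "onderwerp": "voorbeeld wetgeving"
}

GET test/_search
{
  "query": {
    "match": { "full_text": "regeling" }
  }
}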

"fields" as in "fields within 1 document here is an example of 1 small document but some documents have 20.000 fields.

My original data is heavily nested XML; here is an example of one document/XML file.

My first step was to convert the XML to JSON, but the converted JSON produced a lot of mapping errors. It would be impossible (time-wise) to go through all 945 XML files to fix the mapping errors, so I flattened the JSON (this is how I ended up with so many fields).

All the strings and dates need to be searchable. I tested with the index: false, enabled: false and store: false settings, but to use the highlight and prefix-suggestion functionality I have to keep the default settings, if I'm not mistaken.
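
These are roughly the kinds of queries that need to keep working (a sketch against the mapping above; the search terms are just examples), which is why I think I'm stuck with the default index and store settings:

GET test/_search
{
  "query": {
    "match": { "full_text": "regeling" }
  },
  "highlight": {
    "fields": { "full_text": {} }
  }
}

GET test/_search
{
  "suggest": {
    "law_suggest": {
      "prefix": "nieu",
      "completion": { "field": "suggest" }
    }
  }
}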

What is causing that number of fields? Are you using some type of entity identifier as a field name?

Yes, flattening the data was the only workaround I found to tackle all the mapping errors I had. This flattening is what creates all the extra fields.

Can you provide an example? Would it not be possible to create a new field containing the identifier so you can filter on it, and thereby dramatically reduce the number of fields? Having entity IDs in field names is generally a bad practice.

Here is the original JSON, and here is the same document but flattened.

And maybe? What would that look like if, let's say, you have this data?

{
    "toestand": {
        "@bwb-id": "BWBR0018883",
        "@inwerkingtreding": "2006-01-14",
        "wetgeving": {
            "@dtdversie": "2.0",
            "intitule": {
                "@bwb-ng-variabel-deel": "/Intitule",
                "meta-data": {
                    "brondata": {
                        "oorspronkelijk": {
                            "publicatie": {
                                "@effect": "nieuwe-regeling",
                                "uitgiftedatum": {
                                    "@isodatum": "2006-01-12",
                                    "#text": "12-01-2006"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
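
Flattened (exact separator aside), this document ends up with one field per path, roughly like below, and every distinct path becomes its own field in the mapping:

{
  "toestand.@bwb-id": "BWBR0018883",
  "toestand.@inwerkingtreding": "2006-01-14",
  "toestand.wetgeving.@dtdversie": "2.0",
  "toestand.wetgeving.intitule.@bwb-ng-variabel-deel": "/Intitule",
  "toestand.wetgeving.intitule.meta-data.brondata.oorspronkelijk.publicatie.@effect": "nieuwe-regeling",
  "toestand.wetgeving.intitule.meta-data.brondata.oorspronkelijk.publicatie.uitgiftedatum.@isodatum": "2006-01-12",
  "toestand.wetgeving.intitule.meta-data.brondata.oorspronkelijk.publicatie.uitgiftedatum.#text": "12-01-2006"
}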

Can you explain? I do not fully understand what you mean by "create a new field containing the identifier". Which identifier do you mean?

Any suggestions? @dadoonet
How would you index a dataset like this?

I do not understand from the example what is driving the creation of 50 million fields. The example you provided looks quite simple. Can you please elaborate?

Ah of course,
So I am indexing laws; one law has around 3,800 fields (just like this example, but longer).

Each law has around 50 versions, which means 50 x 3,800 fields per law, and I have 27 laws.
