I have a dataset of 43,536,000 fields (I can't make it smaller; I have already removed fields).
With the default Elasticsearch limits (1,000 shards per node and 1,000 fields per index) I would have to build a 44-node cluster: 43,536,000 fields at 1,000 fields per index means roughly 43,536 indices, so at least 43,536 shards, and at 1,000 shards per node that is about 44 nodes.
I could raise the number of fields per index or the number of shards per node, but everywhere I read that this is not recommended.
Am I missing something? I see lots of success stories with large datasets.
How would one structure Elasticsearch to accommodate such a large dataset?
Fields are included in mappings, and mappings are included in the cluster state, which is distributed among the nodes. Too many fields will therefore lead to heavy traffic between nodes. Why do you need so many fields? Maybe you can restructure the documents as nested documents. Here is a helpful document: https://www.elastic.co/cn/blog/found-beginner-troubleshooting#keyvalue-woes
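To illustrate the key/value idea from that article, here is a minimal sketch (the index name my-index and the field names attrs, key and value are placeholders, not anything from your data). Instead of one mapping field per attribute name, each attribute becomes a nested key/value pair, so the mapping stays small no matter how many distinct attribute names exist:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "attrs": {
        "type": "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "text" }
        }
      }
    }
  }
}

PUT my-index/_doc/1
{
  "attrs": [
    { "key": "someId_123_title", "value": "example title" },
    { "key": "someId_123_body",  "value": "example body text" }
  ]
}
```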
So my original data is XML converted to JSON; here is an example. I flattened the data to this because the documents had a lot of mapping errors (within the same document).
All the strings in this document need to be searchable. So far I've done this by creating a large text field that collects all the strings via the copy_to functionality.
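Roughly what that looks like in a mapping (just a sketch; all_text and the source field names are placeholders, not the actual mapping):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "all_text": { "type": "text" },
      "title":    { "type": "text", "copy_to": "all_text" },
      "author":   { "type": "text", "copy_to": "all_text" }
    }
  }
}
```

The values of title and author are copied into all_text at index time, so a single match query on all_text searches everything.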
"fields" as in "fields within 1 document here is an example of 1 small document but some documents have 20.000 fields.
My original data is heavily nested XML; here is an example of one document/XML.
My first step was to convert the XML to JSON, but the converted JSON had a lot of mapping errors. It would be impossible (time-wise) to go through all 945 XML files to fix the mapping errors, so I flattened the JSON (this is how I ended up with so many fields).
All the strings and dates need to be searchable. I tested with the index: false, enabled: false and store: false settings, but to use the highlight and prefix suggestion functionalities I have to use the default settings, if I'm not mistaken.
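For reference, a sketch of the kind of settings I mean (hypothetical field names, not my real mapping); a text field with index: false can no longer be queried or highlighted, while an object with enabled: false is stored but not parsed at all:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "searchable_text":   { "type": "text" },
      "unsearchable_text": { "type": "text", "index": false },
      "raw_payload":       { "type": "object", "enabled": false }
    }
  }
}
```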
Can you provide an example? Would it not be possible to create a new field containing the identifier so you can filter on this and thereby dramatically reduce the number of fields? Having entity IDs in field names is generally bad practice.
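For instance (a hypothetical sketch, not your actual data), instead of baking the identifier into the field name, make it a value and filter on it; this assumes invoice_id is mapped as a keyword:

```
PUT my-index/_doc/1
{ "invoice_id": "4711", "amount": 100 }

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "invoice_id": "4711" } } ]
    }
  }
}
```

With this shape, adding new identifiers adds documents, not new mapping fields.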
I do not understand from the example what is driving the creation of 50 million fields. The example you provided looks quite simple. Can you please elaborate?