Field with different types on same index

Currently, we are indexing information in Elasticsearch for each business that is our customer (B2B). This is causing us to have too many indices, some with very small shards (500 MB).

Due to the nature of the data, we can get a lot of type collisions on fields. For example, Company_A has the following:

{
    "id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
    "notes" : {
        "created_at": "2022/01/01",
        "value": "This is some random note"
    } ,
    "first_name": "Some name",
    "integration": "CRM_A"
}

While Company_B will have:

{
    "id": 1,
    "notes":  "This is a random note text",
    "name": "Some random name",
    "integration": "CRM_A"
}

What we were thinking is to put them in a single index per "integration". In that case, though, they would have different types for the field called notes: one has an object, the other has text. This also happens with a lot of other fields, so there are many cases of the same field name with different types.

Is there any way to solve this with Elasticsearch?
If there is, we would finally be able to have shards with 20-30 GB of data, as is recommended, instead of TONS of small shards that force us to use far more memory to keep all the mappings and everything in place.

One thing we considered was to add the type to the name of each field, and strip it again before returning the response, but that might get really confusing.
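A minimal sketch of that renaming idea, written in Python purely for illustration. The separator `__`, the tag names, and the function names are all hypothetical, and only top-level fields are handled; nested objects would need the same treatment recursively.

```python
SEP = "__"  # hypothetical separator; pick one that never appears at the end of a real field name


def type_tag(value):
    """Return a short tag for the JSON type of a value (tags are made up for this sketch)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "bool"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "str"
    if isinstance(value, dict):
        return "obj"
    if isinstance(value, list):
        # Simplification: arrays get their own tag regardless of element type.
        return "list"
    return "null"


def add_type_suffixes(doc):
    """Rename every top-level field to '<name>__<tag>' before indexing."""
    return {f"{key}{SEP}{type_tag(value)}": value for key, value in doc.items()}


def strip_type_suffixes(doc):
    """Undo add_type_suffixes before returning a document to the caller."""
    return {key.rsplit(SEP, 1)[0]: value for key, value in doc.items()}


company_a = {"id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
             "notes": {"created_at": "2022/01/01", "value": "This is some random note"}}
company_b = {"id": 1, "notes": "This is a random note text"}

stored_a = add_type_suffixes(company_a)  # keys: id__str, notes__obj
stored_b = add_type_suffixes(company_b)  # keys: id__long, notes__str
```

With this, `notes__obj` and `notes__str` become two separate fields in the mapping, so they no longer collide, at the cost of every query having to know which suffixed name to target.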

The short answer is that those two schemas cannot live in a single index the way you have them.

As you know, in one schema notes is an object, and in the other notes is just a concrete field.

The way to solve this is to choose one of the schemas as the standard and use an ingest pipeline to convert the fields to it.

In this case, it's probably easiest to move the concrete value into the notes.value field.

My biggest issue is that this happens with lots of different fields (some of the documents have more than 30 collisions :/)

Hmmmm that is unfortunate... no real simple answer there...

Unless you put each set under a top-level field... then you won't have collisions...
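A rough sketch of what that could look like on the client side, in Python for illustration only. The key names `company_a` / `company_b`, the function names, and the choice to keep `integration` at the root are assumptions, not something stated in this thread.

```python
def wrap_for_index(company_key, doc, shared_fields=("integration",)):
    """Nest a company's own fields under one top-level key.

    Fields listed in shared_fields are assumed to have the same type for
    every company, so they stay at the root of the document.
    """
    shared = {k: doc[k] for k in shared_fields if k in doc}
    own = {k: v for k, v in doc.items() if k not in shared_fields}
    return {**shared, company_key: own}


def unwrap_from_index(company_key, stored):
    """Flatten a stored document back into the shape the company sent."""
    doc = {k: v for k, v in stored.items() if k != company_key}
    doc.update(stored.get(company_key, {}))
    return doc


doc_a = {"id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
         "notes": {"created_at": "2022/01/01", "value": "This is some random note"},
         "first_name": "Some name",
         "integration": "CRM_A"}
doc_b = {"id": 1,
         "notes": "This is a random note text",
         "name": "Some random name",
         "integration": "CRM_A"}

wrapped_a = wrap_for_index("company_a", doc_a)  # notes lives at company_a.notes (object)
wrapped_b = wrap_for_index("company_b", doc_b)  # notes lives at company_b.notes (text)
```

The trade-off is that every company adds its own set of fields to the mapping, so with many companies the index can run into the `index.mapping.total_fields.limit` setting (1000 by default).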

Just in case you are interested, here is a sample ingest pipeline / simulate with various combinations

## Ingest Pipeline Concrete / Object
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "rename": {
          "if": "ctx?.notes instanceof String", 
          "field": "notes",
          "target_field": "notes.value",
          "ignore_failure": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
        "notes": {
          "created_at": "2022/01/01",
          "value": "This is some random note"
        },
        "first_name": "Some name",
        "integration": "CRM_A"
      }
    },
    {
      "_source": {
        "id": 1,
        "notes": "This is a random note text",
        "name": "Some random name",
        "integration": "CRM_A"
      }
    },
    {
      "_source": {
        "id": 1,
        "name": "Some random name",
        "integration": "CRM_A"
      }
    }
  ]
}

Results

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "integration": "CRM_A",
          "notes": {
            "created_at": "2022/01/01",
            "value": "This is some random note"
          },
          "id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
          "first_name": "Some name"
        },
        "_ingest": {
          "timestamp": "2022-09-23T23:23:13.461958276Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "name": "Some random name",
          "integration": "CRM_A",
          "id": 1,
          "notes": {
            "value": "This is a random note text"
          }
        },
        "_ingest": {
          "timestamp": "2022-09-23T23:23:13.461985402Z"
        }
      }
    },
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "name": "Some random name",
          "integration": "CRM_A",
          "id": 1
        },
        "_ingest": {
          "timestamp": "2022-09-23T23:23:13.461991898Z"
        }
      }
    }
  ]
}
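The simulated pipeline above handles a single field. Since the collisions reportedly affect many fields, the same rule could also be applied to a list of fields before the documents are sent. The Python sketch below is only an illustration of that idea: the function name, the `object_fields` list, and the `value` leaf name are assumptions, and it mirrors the pipeline's condition by wrapping only string values.

```python
def normalize_scalars(doc, object_fields, leaf="value"):
    """Wrap string values of the listed fields as {leaf: value}.

    Mirrors the rename processor's condition (`notes instanceof String`):
    objects are left alone and missing fields are skipped.
    """
    out = dict(doc)  # shallow copy so the caller's document is not modified
    for field in object_fields:
        value = out.get(field)
        if isinstance(value, str):
            out[field] = {leaf: value}
    return out


docs = [
    {"id": "8f22efb2-6a2a-4cb7-9d0b-01ca0d6cff2e",
     "notes": {"created_at": "2022/01/01", "value": "This is some random note"},
     "first_name": "Some name", "integration": "CRM_A"},
    {"id": 1, "notes": "This is a random note text",
     "name": "Some random name", "integration": "CRM_A"},
    {"id": 1, "name": "Some random name", "integration": "CRM_A"},
]

# "notes" is the only colliding field shown in this thread; more names would go in this list.
normalized = [normalize_scalars(d, ["notes"]) for d in docs]
```

Inside Elasticsearch itself, the equivalent would be one rename processor per colliding field, each with its own `if` condition like the one shown above.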

Wow, thanks a lot stephenb! Gonna take a stab at it
