Changing mapping

Hi everyone,

My company is finally switching from ES 1.7 to ES 5.5 (in fact, the new version is for new projects and the old one stays for the projects that already exist, but anyway)!

"Old" Mapping
{
  "properties": {
    // some fields

    "tasks": {
      "properties": {
        "formKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "assignee": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "endTime": {
          "type": "date"
        },
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "startTime": {
          "type": "date"
        },
        "duration": {
          "type": "long"
        },
        "category": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "priority": {
          "type": "integer"
        },
        "name": {
          "type": "string"
        },
        "taskDefinitionKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "owner": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "dueDate": {
          "type": "date"
        },
        "deleteReason": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    },
    "variables": {
      "properties": {
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "attachmentValue": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "storedType": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "longValue": {
          "type": "long"
        },
        "name": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "executionId": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "binaryValue": {
          "type": "binary"
        },
        "stringValue": {
          "type": "string"
        },
        "rawValue": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "doubleValue": {
          "type": "double"
        },
        "dateValue": {
          "type": "date"
        },
        "booleanValue": {
          "type": "boolean"
        },
        "originalType": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    }
  }
}
"New" Mapping
{
  "properties": {
    // same fields until tasks

    "tasks": {  // What can I do with tasks? I thought about putting them in a separate mapping, as parent/child, but I don't know if that would be a good idea
      "properties": {
        "formKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "assignee": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "endTime": {
          "type": "date"
        },
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "startTime": {
          "type": "date"
        },
        "duration": {
          "type": "long"
        },
        "category": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "priority": {
          "type": "integer"
        },
        "name": {
          "type": "string"
        },
        "taskDefinitionKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "owner": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "dueDate": {
          "type": "date"
        },
        "deleteReason": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    },
    "variables": {  // Now variables has a simpler structure. I will use dynamic_templates here
      "properties": {
        "protocol": {
            "type": "keyword"
        },

        // other fields, following the same idea...

        "eps": {
            "type": "keyword"
        }
      },
      "type": "nested"  // I kept the nested type to keep them organized
    }
  }
}
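
The "tasks" block in the new mapping still uses the ES 1.x "string" type. In ES 5.x that type was split into "keyword" (exact values) and "text" (analyzed full text). A minimal sketch of the conversion, assuming only the two string variants that appear in this thread; `upgrade_field` is a hypothetical helper, not part of any Elasticsearch client:

```python
import json


def upgrade_field(field):
    """Convert one ES 1.x field definition to an ES 5.x style definition.

    Hypothetical helper: it only handles the cases shown in this thread
    ("string" with or without "index": "not_analyzed").
    """
    if field.get("type") != "string":
        # Not a string field: copy it and recurse into any sub-properties
        # (this covers "nested" and object fields).
        upgraded = dict(field)
        if "properties" in field:
            upgraded["properties"] = {
                name: upgrade_field(sub)
                for name, sub in field["properties"].items()
            }
        return upgraded
    if field.get("index") == "not_analyzed":
        # Exact-value string. doc_values are enabled by default for keyword,
        # so the explicit "doc_values": true can be dropped.
        return {"type": "keyword"}
    # Analyzed string (full-text search).
    return {"type": "text"}


# A shortened copy of the "tasks" block from the old mapping.
old_tasks = {
    "type": "nested",
    "properties": {
        "assignee": {"index": "not_analyzed", "doc_values": True, "type": "string"},
        "description": {"type": "string"},
        "endTime": {"type": "date"},
    },
}

new_tasks = upgrade_field(old_tasks)
print(json.dumps(new_tasks, indent=2))
```

Fields such as "endTime" and the "nested" type itself pass through unchanged; only the string fields are rewritten.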

About the old mapping:

  • We have problems filtering the contents of variables and tasks.
    Example: to get the protocol variable, we need to query "variables.name = protocol" AND "variables.stringValue = protocol_value". If we try to use _source include with "variables.name", it does not work, because every "variables.name" is returned. The same applies to tasks.

  • We have no relationship between "processA" and "subProcessA" in a case like "one processA has many subProcessA"
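
For reference, the "variables.name = protocol AND variables.stringValue = protocol_value" lookup against the old mapping can be sketched as a nested query. This only builds the request body as a Python dict; `nested_variable_query` is a hypothetical helper name. The `inner_hits` option is an addition not mentioned in the post: it asks Elasticsearch to return the nested objects that matched, separately from `_source`.

```python
import json


def nested_variable_query(name, value):
    """Build a search body for one variable in the old (name/value) mapping.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    return {
        # Skip the full document source; the matching nested objects
        # come back under inner_hits instead.
        "_source": False,
        "query": {
            "nested": {
                "path": "variables",
                "query": {
                    "bool": {
                        "must": [
                            # "variables.name" is not_analyzed, so an exact
                            # term query is appropriate.
                            {"term": {"variables.name": name}},
                            # "variables.stringValue" is an analyzed string,
                            # so use match and require all tokens.
                            {
                                "match": {
                                    "variables.stringValue": {
                                        "query": value,
                                        "operator": "and",
                                    }
                                }
                            },
                        ]
                    }
                },
                # Return only the nested "variables" entries that matched.
                "inner_hits": {},
            }
        },
    }


body = nested_variable_query("protocol", "2106087-2017")
print(json.dumps(body, indent=2))
```

Both conditions sit inside the same `nested` clause, so they must hold for the same "variables" entry rather than for any two entries of the document.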

About the new mapping:

  • Variables will be more direct, something like "protocol = protocol_value". This way, _source include will work properly.

    • Also, I will use dynamic_templates for new variables.
  • I will use parent/child for "processA" and "subProcessA" (both with a similar mapping).

  • But what can I do with the tasks?
    The structure is fixed; there is not much to add to or remove from its fields. But as it stands, the same problem I described for variables would remain. And if I create a type for tasks, it would be very repetitive: I would have processA tasks and subProcessA tasks. And those are just two cases. I would still have processB, subProcessB and tasks for both, and there are more processes beyond that.
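
The dynamic template idea for new variables could look roughly like the sketch below, written as a Python dict for an ES 5.x type mapping. The template name "variables_as_keyword" and the helper `process_mapping` are made up for illustration; `path_match`, `match_mapping_type` and `mapping` are the standard dynamic template keys.

```python
import json


def process_mapping(known_variables):
    """Build a type mapping with explicit known variables plus a dynamic
    template for variables that are not known in advance.

    Hypothetical helper; the template name is an arbitrary label.
    """
    return {
        "dynamic_templates": [
            {
                "variables_as_keyword": {
                    # Apply only to new fields under "variables".
                    "path_match": "variables.*",
                    # Only JSON strings; booleans and numbers keep the
                    # default dynamic mapping.
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        "properties": {
            "variables": {
                "type": "nested",
                # Variables known up front are mapped explicitly.
                "properties": {
                    name: {"type": "keyword"} for name in known_variables
                },
            }
        },
    }


mapping = process_mapping(["protocol", "eps"])
print(json.dumps(mapping, indent=2))
```

With this in place, a new string variable indexed under "variables" would be mapped as keyword without a manual mapping change, while "protocol" and "eps" stay explicitly defined.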

Do you have any suggestions?

Thanks

Hello Vinicius,

Before we get to the tasks question, I would like to discuss your protocol structure change a little further. What goes into protocol? Is it some sort of ID that can grow unbounded, or is it a known, fixed list of values?

Cheers

protocol is an ID, but it is only one of the possible variables. Others are "eps" (a subarea of a company), some booleans (for BPM control), some strings with internal meaning for the area...

A little more about protocol: it has a different format for each macro-area. For example, for telecom it is a string 'XXXXXXX-YEAR', with N Xs.

So you should avoid using protocol = protocol_value; otherwise every new protocol ID will become a mapped field and you will have serious performance issues.

I recommend that you keep the old way and use nested objects or parent/child.

But there is only one protocol per document. It's an ID, but it's not the document ID (the document ID is the process ID).

Here is an example of the problem.
First, a result with the old (current) mapping:

It's not possible to filter only the "protocol", because it's not a field (it's a value)

Now the new mapping:

  • Query: [screenshot not preserved: "Screenshot from 2017-08-10 10-30-21"]

It's easier to filter/query

And the result only has the "variables.protocol" field.

I'm using dynamic mapping and dynamic template to do this.

OK, now I understand it better. I had understood it differently, sorry.

How many types of variables exist in total?

N types. It depends on the project/area.

That is why I'm thinking about dynamic templates...

If there is an unbounded number of variable types, then the old way is better. Otherwise, every time you index a document with different variables, new fields will be created, and the bigger the mapping gets, the worse the cluster performs.

For instance, if you index one document like:

{
  "variables": {
      "protocolo":  "2106087-2017"
  }
}

And another one like:

{
  "variables": {
      "another_var":  "another_value"
  }
}

Then the index mapping will contain both a "protocolo" field and an "another_var" field, so this is a mapping that can potentially grow unbounded. Once that mapping exceeds 500 fields, cluster performance will degrade and you will need to scale up the servers (instead of scaling out) to effectively solve the issue.

Bottom line: if you don't know in advance that the number of variable types can stay low (like 50-100 variable types), then you should stick with the old mapping and use nested objects or parent/child.
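
To make the growth concrete, here is a small simulation in plain Python (no Elasticsearch involved) that collects the field paths dynamic mapping would have to create for a set of documents. `leaf_fields` and `mapped_fields` are hypothetical helpers, and the sketch only walks nested objects, not arrays of objects.

```python
def leaf_fields(doc, prefix=""):
    """Return the dotted paths of all leaf fields in one document."""
    fields = set()
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            # Object field: descend and prefix the child names.
            fields |= leaf_fields(value, path + ".")
        else:
            fields.add(path)
    return fields


def mapped_fields(docs):
    """Union of leaf fields over all documents, i.e. the fields a
    dynamically mapped index would end up containing."""
    seen = set()
    for doc in docs:
        seen |= leaf_fields(doc)
    return seen


# New style: the variable name is the field name.
new_style = [
    {"variables": {"protocolo": "2106087-2017"}},
    {"variables": {"another_var": "another_value"}},
]
print(sorted(mapped_fields(new_style)))
# → ['variables.another_var', 'variables.protocolo']

# Old style: the variable name is a value, so the field set stays fixed.
old_style = [
    {"variables": {"name": "protocolo", "stringValue": "2106087-2017"}},
    {"variables": {"name": "another_var", "stringValue": "another_value"}},
]
print(sorted(mapped_fields(old_style)))
# → ['variables.name', 'variables.stringValue']
```

In the new style the field count grows with every distinct variable name, while in the old style it stays at two no matter how many variables exist. ES 5.x also caps the number of fields per index through the `index.mapping.total_fields.limit` setting, which defaults to 1000.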

Thanks Thiago!

I just used the 'protocol' as a universal example.
By default, some variables are known (most of them, within a project), like protocol, eps, shouldCreate, etc. They will be present in the mapping (not dynamic).

I want to use dynamic mapping and template for new variables, so the developer does not need to first change the mapping and then run the process with the new variable. I want to automate this.

Right now, the project with the most variables has 86, none of them dynamic.

So if you keep the number of variable types as low as 86, then it's all good and dynamic templates are the way to go :wink:
