Changing mapping


(Vinícius Pereira) #1

Hi everyone,

My company is finally switching from ES 1.7 to ES 5.5. (In fact, the new cluster is for new projects and the old one stays for the projects that already exist.)

"Old" Mapping
{
  "properties": {
    // some fields

    "tasks": {
      "properties": {
        "formKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "assignee": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "endTime": {
          "type": "date"
        },
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "startTime": {
          "type": "date"
        },
        "duration": {
          "type": "long"
        },
        "category": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "priority": {
          "type": "integer"
        },
        "name": {
          "type": "string"
        },
        "taskDefinitionKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "owner": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "dueDate": {
          "type": "date"
        },
        "deleteReason": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    },
    "variables": {
      "properties": {
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "attachmentValue": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "storedType": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "longValue": {
          "type": "long"
        },
        "name": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "executionId": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "binaryValue": {
          "type": "binary"
        },
        "stringValue": {
          "type": "string"
        },
        "rawValue": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "doubleValue": {
          "type": "double"
        },
        "dateValue": {
          "type": "date"
        },
        "booleanValue": {
          "type": "boolean"
        },
        "originalType": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    }
  }
}
"New" Mapping
{
  "properties": {
    // same fields until tasks

    "tasks": {  // What can I do with tasks? I thought to put them in another mapping, as parent/child...  but I don't know if it would be good
      "properties": {
        "formKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "assignee": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "endTime": {
          "type": "date"
        },
        "id": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "startTime": {
          "type": "date"
        },
        "duration": {
          "type": "long"
        },
        "category": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "priority": {
          "type": "integer"
        },
        "name": {
          "type": "string"
        },
        "taskDefinitionKey": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "owner": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        },
        "dueDate": {
          "type": "date"
        },
        "deleteReason": {
          "index": "not_analyzed",
          "doc_values": true,
          "type": "string"
        }
      },
      "type": "nested"
    },
    "variables": {  // Now variables has a more simple structure. I will use dynamic_template here
      "properties": {
        "protocol": {
            "type": "keyword"
        },

        // other fields, following the same idea...

        "eps": {
            "type": "keyword"
        }
      },
      "type": "nested"  // I kept the nested type to keep them organized
    }
  }
}

About the old mapping:

  • We have problems filtering the contents of variables and tasks
    Example: to get a protocol variable, we need to match "variables.name = protocol" AND "variables.stringValue = protocol_value". Using _source include with "variables.name" does not help, because every "variables.name" in the document is returned, not just the matching one. The same applies to tasks.

  • We have no relationship between "processA" and "subProcessA" in a case like "one processA has many subProcessA"
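
The lookup described in the first bullet can be sketched as a request body. This is only a minimal sketch in Python that assembles the JSON and does not talk to a cluster. `build_old_style_query` is a hypothetical helper, and `match_phrase` is an assumption for `stringValue`, chosen because that field is an analyzed string in the old mapping.

```python
import json


def build_old_style_query(var_name, var_value):
    """Build a nested query for one variable in the old name/value layout.

    Hypothetical helper: it only assembles the request body.
    """
    return {
        # Source filtering trims fields, not array elements, so every
        # element of "variables" still comes back with these two fields.
        "_source": ["variables.name", "variables.stringValue"],
        "query": {
            "nested": {
                "path": "variables",
                "query": {
                    "bool": {
                        "must": [
                            # "name" is not_analyzed, so an exact term works.
                            {"term": {"variables.name": var_name}},
                            # "stringValue" is analyzed (assumption: phrase match).
                            {"match_phrase": {"variables.stringValue": var_value}},
                        ]
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_old_style_query("protocol", "2106087-2017"), indent=2))
```

The query selects the right documents, but the response still carries the whole "variables" array, which is the filtering problem described above.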

About the new mapping:

  • Variables will be more direct. Something like "protocol = protocol_value". This way, _source include will work properly

    • Also, I will use dynamic_template for new variables
  • I will use parent/child for "processA" and "subProcessA" (both with a similar mapping)

  • But what can I do with the tasks?
    The structure is fixed; there is not much to add to or remove from its fields. But as it stands, the same problem I described for the variables would remain. And if I create a type for tasks, it would be very repetitive: I would have processA tasks and subProcessA tasks, and that is just these two cases. I would still have processB, subProcessB and tasks for both, and there are more processes beyond those.
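
For the "processA" / "subProcessA" relationship in the list above, a parent/child mapping in ES 5.x could look roughly like the sketch below. It is written in Python only to build the JSON body. `build_parent_child_mappings` and the sample properties are hypothetical; the `_parent` mapping field is the ES 5.x mechanism for linking a child type to a parent type.

```python
import json


def build_parent_child_mappings(parent_type, child_type, shared_properties):
    """Return an ES 5.x style "mappings" body where child_type points at parent_type.

    Hypothetical helper: both types reuse the same properties, since the
    two mappings are described as similar.
    """
    return {
        "mappings": {
            parent_type: {"properties": dict(shared_properties)},
            child_type: {
                # Links every child document to a parent of this type.
                "_parent": {"type": parent_type},
                "properties": dict(shared_properties),
            },
        }
    }


if __name__ == "__main__":
    # Placeholder properties, for illustration only.
    props = {"variables": {"properties": {"protocol": {"type": "keyword"}}}}
    body = build_parent_child_mappings("processA", "subProcessA", props)
    print(json.dumps(body, indent=2))
```

Child documents are then indexed with the `parent` URL parameter so they land on the same shard as their parent, and `has_child` / `has_parent` queries can join the two.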

Do you have any suggestion?

thanks


(Thiago Souza) #2

Hello Vinicius,

Before we get to the tasks question, I would like to discuss your protocol structure change a little further. What goes into protocol? Is it some sort of ID that can grow unbounded, or is it a known, fixed list of values?

Cheers


(Vinícius Pereira) #3

protocol is an ID, but it is only one of the possible variables. Others are "eps" (a subarea of a company), some booleans (for BPM control), some strings with internal meaning for the area...

A little more about protocol: it has a different format for each macro-area. For example, for telecom it is a string 'XXXXXXX-YEAR', with N Xs.


(Thiago Souza) #4

So you should avoid using protocol = protocol_value. Otherwise every new protocol ID will become a mapped field and you will have serious performance issues.

I recommend that you keep the old way and use nested objects or parent/child


(Vinícius Pereira) #5

But there is only one protocol per document. It's an ID, but it's not the document ID (that one is the process ID).

Here is an example of the problem.
First, a result with the old (current) mapping:

It's not possible to filter only the "protocol", because it's not a field (it's a value)

Now the new mapping:

  • Query
    [Screenshot from 2017-08-10 10-30-21]

It's easier to filter/query

And the result only has the "variables.protocol" field.

I'm using dynamic mapping and dynamic template to do this.
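
A dynamic template along these lines could look like the following. This is a minimal sketch that builds the mapping body in Python, under the assumption that every new string variable should become a `keyword`. `build_variables_mapping` and the template name are hypothetical; `path_match`, `match_mapping_type` and `mapping` are the standard dynamic template keys.

```python
import json


def build_variables_mapping(known_variables):
    """Return a type mapping with explicit known variables plus one dynamic template.

    Hypothetical helper: known variables are mapped up front, and any new
    string field under "variables" is mapped as keyword when first indexed.
    """
    return {
        "dynamic_templates": [
            {
                "new_variables_as_keyword": {  # hypothetical template name
                    "path_match": "variables.*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "keyword"},
                }
            }
        ],
        "properties": {
            "variables": {
                "type": "nested",
                "properties": {
                    name: {"type": "keyword"} for name in known_variables
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_variables_mapping(["protocol", "eps"]), indent=2))
```

Booleans, numbers and dates are not matched by this template and would fall back to the default dynamic mapping rules.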


(Thiago Souza) #6

OK, now I understand it better. I had understood it differently, sorry.

How many types of variables exist in total?


(Vinícius Pereira) #7

N types. It depends on the project/area.

That is why I'm thinking about dynamic template...


(Thiago Souza) #8

If there is an unbounded number of variable types, then the old way is better. Otherwise, every time you index a document with different variables, new fields will be created, and the bigger the mapping, the worse the performance.

For instance, if you index one document like:

{
  "variables": {
      "protocolo":  "2106087-2017"
  }
}

And another one like:

{
  "variables": {
      "another_var":  "another_value"
  }
}

Then the mapping will contain both a "protocolo" field and an "another_var" field, so it can potentially grow without bound. Once the mapping reaches 500+ fields, cluster performance will degrade and you will need to scale up the servers (instead of scaling out) to effectively solve the issue.

Bottom line: if you don't know in advance that the number of variable types will stay low (like 50-100 variable types), then you should stick with the old mapping and use nested objects or parent/child.
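
To make the growth concrete, here is a small simulation in plain Python. It does not use Elasticsearch at all; it only counts the distinct keys under "variables", which is roughly what dynamic mapping would turn into fields. The function names are hypothetical, and the default budget of 100 is taken from the 50-100 range above.

```python
def mapped_variable_fields(documents):
    """Collect every distinct key under "variables" across the documents.

    With the flat layout, each distinct key becomes one mapped field.
    """
    fields = set()
    for doc in documents:
        fields.update(doc.get("variables", {}).keys())
    return fields


def exceeds_budget(fields, budget=100):
    """Return True when the number of variable fields passes the budget."""
    return len(fields) > budget


if __name__ == "__main__":
    docs = [
        {"variables": {"protocolo": "2106087-2017"}},
        {"variables": {"another_var": "another_value"}},
    ]
    fields = mapped_variable_fields(docs)
    print(sorted(fields))          # ['another_var', 'protocolo']
    print(exceeds_budget(fields))  # False
```

Running a check like this over a sample of real documents gives a quick estimate of how many fields the flat layout would create before committing to it.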


(Vinícius Pereira) #9

Thanks Thiago!

I just used the 'protocol' as a universal example.
By default, some variables are known (most of them, in any given project), like protocol, eps, shouldCreate, etc. They will be present in the mapping (not dynamic).

I want to use dynamic mapping and template for new variables, so the developer does not need to first change the mapping and then run the process with the new variable. I want to automate this.

Right now, the project with the most variables has 86 of them, all non-dynamic.


(Thiago Souza) #10

So if you keep the number of variable types as low as 86, then it's all good and dynamic templates are the way to go :wink:

