Scalability: do types carry any built-in routing or do I add type to doc to route on it?


(Simon Reavely) #1

I am creating a bunch of types (1000s) in elasticsearch that roughly map to classes or tables. So my query looks like this: http://localhost:9200/myindex//_search.
My type might look like this:
{
"properties": {
"message": {
"type": "string"
}
}
}
From reading the documentation, types are distributed to all nodes in the cluster and this query (http://localhost:9200/myindex//_search) would hit all nodes independent of whether docs for that type were on that node.

I'm concluding that unless I separately add a type field on the type AND use that field in a routing operation, I can't limit the queries to just the nodes holding records for that doc type
...is that right?

...so my type now has to look like this...
{
"properties": {
"message": {
"type": "string"
},
"docType": {
"type": "string"
}
}
}
...and I filter on "docType" with a range query????


(Christian Dahlqvist) #2

Don't use 1000s of types as this will result in potentially very large mappings and a lot of cluster state updates. Instead add the type as a field to the document and filter on it. If you primarily query for one of these 'types' at a time and want to make sure that all records belonging to a 'type' are held in the same shard, you can use routing. Be aware that this can lead to uneven shard distribution if you have few keys or some that are much larger than others.
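As a rough sketch of the advice above (not a definitive implementation), the following Python builds the index and search requests for a single mapping type that carries the former type name in a docType field and uses that value as the routing key. The type name doc, the helper names, and the sample values are assumptions made for illustration. The bool query with a filter clause is the Elasticsearch 2.x form (1.x would use a filtered query instead), and the term filter only matches exactly if docType is mapped as a not_analyzed string.

```python
import json
from urllib.parse import quote


def index_request(index, doc_type_value, doc_id, doc):
    """Build (method, path, body) for indexing one document.

    The former type name is stored in a docType field and is also used
    as the routing key, so all documents of one 'type' share a shard.
    The single mapping type name 'doc' is a hypothetical placeholder.
    """
    body = dict(doc, docType=doc_type_value)
    path = "/{}/doc/{}?routing={}".format(
        index, quote(str(doc_id), safe=""), quote(doc_type_value, safe="")
    )
    return "PUT", path, json.dumps(body, sort_keys=True)


def search_request(index, doc_type_value, query=None):
    """Build (method, path, body) for a search limited to one docType.

    The routing parameter sends the search only to the shard that the
    routing key hashes to; the term filter is still needed because
    other routing keys can land on the same shard.
    """
    body = {
        "query": {
            "bool": {
                "must": query or {"match_all": {}},
                "filter": {"term": {"docType": doc_type_value}},
            }
        }
    }
    path = "/{}/_search?routing={}".format(index, quote(doc_type_value, safe=""))
    return "GET", path, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    print(index_request("myindex", "invoice", 1, {"message": "hello"}))
    print(search_request("myindex", "invoice"))
```

Because every document with the same routing key goes to one shard, a handful of very large 'types' would concentrate data on a few shards, which is the uneven distribution mentioned above.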


(Simon Reavely) #3

Yeah, I thought about that, but if the types have different fields, the only way I can support that is to somehow merge the document types, and then I can't support anything other than string when the field names overlap.

The only alternative I came up with was to create a generic type e.g.
{
"properties": {
"string-field-1": {
"type": "string"
},
"string-field-n": {
"type": "string"
},
"date-field-1": {
"type": "date"
},
"date-field-n": {
"type": "date"
}
}
}
...then I use an external mapping, kept outside the index, from each type to the fields reused on the index.

I hope (still need to benchmark) that as long as the types don't change frequently the pressure on the cluster will be manageable.

Thoughts???

S.


(Christian Dahlqvist) #4

In Elasticsearch 1.x it was possible for a field to be mapped differently across types in an index, but as that caused lots of problems, this is no longer possible from Elasticsearch 2.0 onwards.

One way to avoid mapping conflicts could be to rename your fields with a suffix that indicates the mapping type and use this together with dynamic mappings. You could then have a postcode_s field that is a string while postcode_i is an integer etc.
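A minimal sketch of the suffix idea, under stated assumptions: only the _s and _i suffixes come from the post above, while _b, _f and _dt, the helper name suffix_fields, and the mapping type name doc are made up for illustration. The dynamic_templates body uses the match and mapping keys so that any new field ending in a given suffix gets the corresponding mapping.

```python
from datetime import date, datetime

# Suffix per Python type. bool is checked before int because bool is a
# subclass of int in Python. Only _s and _i appear in the thread; the
# other suffixes are hypothetical.
SUFFIXES = [
    (bool, "_b"),
    (int, "_i"),
    (float, "_f"),
    ((datetime, date), "_dt"),
    (str, "_s"),
]


def suffix_fields(doc):
    """Rename each field with a suffix derived from its value's type."""
    out = {}
    for name, value in doc.items():
        for py_type, suffix in SUFFIXES:
            if isinstance(value, py_type):
                # Dates are sent as ISO 8601 strings.
                out[name + suffix] = value.isoformat() if suffix == "_dt" else value
                break
        else:
            raise TypeError("no suffix defined for field %r" % name)
    return out


# Index creation body: one dynamic template per suffix. The mapping type
# name 'doc' is a placeholder, not something from the thread.
INDEX_BODY = {
    "mappings": {
        "doc": {
            "dynamic_templates": [
                {"strings": {"match": "*_s", "mapping": {"type": "string", "index": "not_analyzed"}}},
                {"integers": {"match": "*_i", "mapping": {"type": "integer"}}},
                {"floats": {"match": "*_f", "mapping": {"type": "double"}}},
                {"booleans": {"match": "*_b", "mapping": {"type": "boolean"}}},
                {"dates": {"match": "*_dt", "mapping": {"type": "date"}}},
            ]
        }
    }
}

if __name__ == "__main__":
    print(suffix_fields({"postcode": "SW1A 1AA"}))
    print(suffix_fields({"postcode": 90210}))
```

With this convention, two tenants can both use a logical field called postcode with different data types, and the index sees two distinct, consistently mapped fields.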


(Simon Reavely) #5

Thanks!

Is this a balance/tradeoff or do you think one approach is a clear winner over the other?

In my particular case I'm trying to build a sort of multi-tenant/self-service system in which different users would specify their doc types before using them (or other people's, if permitted).

Specifically, any thoughts on the walls that each hits?
("Each" being 1) relying on multiple doc types, 2) using dynamic mappings on a single doc type, 3) putting together a key/value schema (string1, string n, date1, daten) that is typed on the client side)

One concern I have with using discrete fields on the same doc type is that we'd eventually end up with a single doc type mapping that is very large, depending on how many properties/fields are added to it.

Cheers,
Simon


(Christian Dahlqvist) #6

Having a large number of doc types in order to allow different mappings for the fields underneath them will restrict you to Elasticsearch 1.x. As all types and fields go into the mappings, these can get quite large, and Elasticsearch 1.x does not support delta cluster state updates, which means the entire cluster state needs to be propagated for every change. You could also encounter issues, as there is a reason this 'flexibility' was removed in more recent versions.

Selecting between the other 2 options is hard. You will most likely have large mappings and sparse fields with the suffix-based naming convention, but at least Elasticsearch 2.x and later handle large mappings better through delta cluster state updates. The last option is more compact but pushes more logic to your applications. Reusing fields for different types of data could affect relevancy calculations and, depending on what your design looks like, it could be difficult to search across multiple types.
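To make the extra application logic of the last option concrete, here is a small sketch of a client-side mapper that assigns each tenant type's logical fields to generic typed slots. It is an illustration under assumptions, not a recommended design: the class name SlotMapper, its methods, the docType field, and the sample schema are all hypothetical, and the slot names follow the string-field-1 / date-field-1 pattern from earlier in the thread.

```python
class SlotMapper:
    """Maps each tenant type's logical fields onto generic typed slots."""

    def __init__(self, slots_per_kind):
        self.slots_per_kind = slots_per_kind
        self.schemas = {}  # type name -> {logical field: slot name}

    def register(self, type_name, fields):
        """Assign slots for a type. fields maps logical name -> kind,
        where kind is e.g. 'string' or 'date'."""
        used = {}
        schema = {}
        for field, kind in fields.items():
            n = used.get(kind, 0) + 1
            if n > self.slots_per_kind:
                raise ValueError("no free %s slot for field %r" % (kind, field))
            used[kind] = n
            schema[field] = "{}-field-{}".format(kind, n)
        self.schemas[type_name] = schema
        return schema

    def to_stored(self, type_name, doc):
        """Translate a logical document into the generic stored form."""
        schema = self.schemas[type_name]
        stored = {schema[name]: value for name, value in doc.items()}
        stored["docType"] = type_name
        return stored

    def from_stored(self, type_name, stored):
        """Translate a stored document back to its logical field names."""
        reverse = {slot: name for name, slot in self.schemas[type_name].items()}
        return {reverse[k]: v for k, v in stored.items() if k in reverse}


if __name__ == "__main__":
    mapper = SlotMapper(slots_per_kind=2)
    mapper.register("invoice", {"customer": "string", "issued": "date"})
    print(mapper.to_stored("invoice", {"customer": "ACME", "issued": "2016-01-02"}))
```

Every query, sort and aggregation has to go through the same translation, and the same slot (say string-field-1) holds unrelated data for different types, which is where the relevancy and cross-type search concerns come from.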


(Simon Reavely) #7

Interesting...

but sorry, I'm confused by "having a large number of doc types in order to allow different mappings for fields underneath these will restrict you to Elasticsearch 1.x". Maybe it's due to my novice status on ES. Is there a difference between doc types and mappings? I thought that creating different mappings was basically how you achieved doc types in ES 2.5? As in:
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-type
...so I didn't get how this confines option 1 to ES 1.x - we had a prototype on ES 2.5...there is a definite hole in my understanding here on doc type vs mappings and I wanted to make sure we were talking about the same thing.

Cheers,
Simon
p.s. This was an interesting read: How to scale an ES deployment to millions of tenants with different data schemas


(Christian Dahlqvist) #8

Sorry if I was not clear. In Elasticsearch 1.x it was possible for fields to be mapped differently for different types within a single index. You could e.g. have severity be a string under type1 and an integer under type2. This did cause problems, and in version 2.x and above the field mapping must be consistent across all types in an index. If you therefore define severity as a string in type1, it has to also be a string for all other types.
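As an illustration of the constraint described above, this sketch scans a set of per-type mappings and reports fields whose declared type differs between types, which is the situation 2.x and later reject. The function name and the input shape (type name mapped to a properties dict) are assumptions for the example, and it only compares the top-level type of each field, not analyzers or other settings.

```python
def find_mapping_conflicts(mappings):
    """Return fields that are declared with different types across types.

    mappings: {type name: {"properties": {field: {"type": ...}}}}
    Result:   {field: {declared type: [type names using it]}}
    """
    seen = {}
    for type_name, mapping in mappings.items():
        for field, spec in mapping.get("properties", {}).items():
            seen.setdefault(field, {}).setdefault(spec.get("type"), []).append(type_name)
    # A field conflicts when more than one distinct type was declared for it.
    return {field: by_type for field, by_type in seen.items() if len(by_type) > 1}


if __name__ == "__main__":
    mappings = {
        "type1": {"properties": {"severity": {"type": "string"}}},
        "type2": {"properties": {"severity": {"type": "integer"}}},
    }
    print(find_mapping_conflicts(mappings))
```

Running a check like this against a 1.x index's mappings before an upgrade would show which fields need to be renamed or reindexed.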

If you therefore were relying on this behaviour of Elasticsearch 1.x, you would have issues upgrading to 2.x and later versions.

