Same field name with different data types in the same index


(Fabiocatalao) #1

Hi!

I'm starting to work with Elasticsearch and I'm having trouble designing a schema that allows the same field name to hold different data types. I’m planning to save my documents at index_name/customer_id/_id.

My customers will be able to define custom fields that are replicated to Elasticsearch. This could produce data type collisions when someone creates a field with the same name as an existing field but a different data type. A simple example is a CSV import where birthdaydate is defined as a string by one customer and as a date by another.

I’m thinking about adding a suffix to the field name, such as birthdaydate_s or birthdaydate_d. Do you think that is an acceptable solution? I've found a solution that recommends adding multiple sub-fields according to data type, but I think it will dramatically increase index size.
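To make the suffix idea concrete, here is a minimal sketch of what I have in mind. Only the _s and _d suffixes come from my description above; the other suffixes and the helper names (suffixed_name, to_document) are assumptions for illustration, not an existing API.

```python
from datetime import date, datetime

# Hypothetical suffix table: "_s" and "_d" are the suffixes mentioned above,
# the remaining entries are assumptions made for this sketch.
SUFFIXES = {
    bool: "_b",
    int: "_i",
    float: "_f",
    str: "_s",
    date: "_d",
    datetime: "_d",
}


def suffixed_name(field, value):
    """Return the field name with a suffix derived from the value's Python type."""
    try:
        return field + SUFFIXES[type(value)]
    except KeyError:
        raise TypeError(
            f"unsupported type for field {field!r}: {type(value).__name__}"
        )


def to_document(custom_fields):
    """Rename every custom field so that each name carries exactly one data type."""
    return {
        suffixed_name(name, value): value
        for name, value in custom_fields.items()
    }


# Two customers using the same logical name with different types
# end up with two distinct field names, so the mapping cannot conflict.
doc_a = to_document({"birthdaydate": "1990-05-17"})
doc_b = to_document({"birthdaydate": date(1990, 5, 17)})
```

The renaming would happen on the application side before indexing, and queries would need the same translation.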

Thanks,
Fábio


(Isabel Drost-Fromm) #2

Can you elaborate on why you think this would dramatically increase the index size?

Also it would be awesome if you could share a link to the resources where you found the recommended approach you sketched.

Isabel


(Fabiocatalao) #3

My solution will lead to a lot of sparsity, as different customers could have many different custom fields. A solution that adds a separate sub-field for each data type encountered would make this problem worse. However, adding a suffix will also lead to a lot of sparsity in the data... I only thought of this option because I previously worked on a platform that had no sub-fields, and suffixes were the solution chosen there.

Do you think that using sub-fields is the most common and effective way to solve problems like mine?


(Isabel Drost-Fromm) #4

I'm actually not sure what is most common for your problem.

Using sub-fields (multi-fields) makes sense when a field should be indexed in multiple ways, e.g. using more than one analyzer. Re-reading your original post, though, this is different from your use case: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html

The information on how to resolve conflicting mappings here
https://www.elastic.co/blog/great-mapping-refactoring#conflicting-mappings might make more sense in your case.

Also depending on how many customers you expect, you might want to read
https://www.elastic.co/guide/en/elasticsearch/guide/current/user-based.html and for contrast https://www.elastic.co/blog/found-multi-tenancy first.

Hope this helps,
Isabel


(Fabiocatalao) #5

Thank you for your reply. I've read some of the posts you cited, and the third one gave me a new perspective on some issues! :)

The sparsity problems I was afraid of have already been fixed. They were mostly related to a Lucene issue: https://issues.apache.org/jira/browse/LUCENE-6863 .

I will run some tests on the use of sub-fields and report back later.


(Mark Harwood) #6

I recently saw what sounds like a similar scenario: a business with lots of customers, each of whom could pick their own field names while sharing a common index.
Their solution was to have a reserved bank of Elasticsearch fields (e.g. intField1, intField2.... stringField1...) in the mapping, with each customer's field choices logically mapped to these physical fields, e.g.:

customer 1's "myPageViews" field == intField1
customer 2's "widgetSales" field == intField1
customer 2's "widgetSKU" field == stringField1

Then each customer query went through a custom mapping layer to translate their logical request into the physical Elasticsearch query.
This helped ensure there was a sensible limit on the number of unique fields in the Elasticsearch mapping and that there were no naming/type conflicts.
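
A rough sketch of what such a mapping layer might look like, in Python. This is not the code that business used: the FieldBank class, its method names, the slot limit and the customer_id filter field are all assumptions for illustration. Only the intFieldN/stringFieldN naming and the example assignments come from the description above.

```python
class FieldBank:
    """Maps each customer's logical field names onto a fixed pool of physical fields."""

    def __init__(self, slots_per_type=50):
        # Hypothetical cap on how many physical fields exist per type prefix.
        self.slots_per_type = slots_per_type
        # (customer_id, logical_name) -> physical field name
        self._assigned = {}
        # (customer_id, type_prefix) -> number of slots already handed out
        self._used = {}

    def physical_name(self, customer_id, logical_name, type_prefix):
        """Return the physical field for a logical field, assigning one if needed."""
        key = (customer_id, logical_name)
        if key in self._assigned:
            return self._assigned[key]
        used = self._used.get((customer_id, type_prefix), 0)
        if used >= self.slots_per_type:
            raise ValueError(
                f"customer {customer_id} has no free {type_prefix} fields left"
            )
        name = f"{type_prefix}Field{used + 1}"
        self._assigned[key] = name
        self._used[(customer_id, type_prefix)] = used + 1
        return name

    def translate_term_query(self, customer_id, logical_name, value):
        """Rewrite a logical term lookup into a physical query body.

        Assumes every document carries a "customer_id" field to filter on.
        """
        physical = self._assigned[(customer_id, logical_name)]
        return {
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"customer_id": customer_id}},
                        {"term": {physical: value}},
                    ]
                }
            }
        }


bank = FieldBank()
bank.physical_name(1, "myPageViews", "int")     # customer 1 -> intField1
bank.physical_name(2, "widgetSales", "int")     # customer 2 -> intField1
bank.physical_name(2, "widgetSKU", "string")    # customer 2 -> stringField1
query = bank.translate_term_query(2, "widgetSKU", "ABC-123")
```

In a real system the assignments would have to be persisted somewhere durable, since losing them would make the indexed data unreadable.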

Cheers
Mark

