I'm starting to work with Elasticsearch and I'm having trouble designing a schema that allows different data types under the same field name. I'm planning to save my documents at index_name/customer_id/_id.
My customers will be able to define custom fields that will be replicated to Elasticsearch. This could produce data type collisions when someone creates a field with the same name as an existing field but a different data type. A simple example is a CSV import where birthdaydate is defined as a string by one customer and as a date by another.
I'm thinking about adding a suffix to the field name, such as birthdaydate_s or birthdaydate_d. Do you think that is an acceptable solution? I've found a recommendation to add multiple sub-fields, one per data type, but I think that would dramatically increase index size.
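To make the suffix idea concrete, here is a minimal sketch in Python of what I have in mind. The suffix table and the function name `suffix_fields` are only my assumptions for illustration; the real set of types would depend on what customers are allowed to define.

```python
from datetime import date

# Hypothetical suffix table: Python type -> field-name suffix.
SUFFIXES = {str: "_s", bool: "_b", int: "_i", float: "_f", date: "_d"}


def suffix_fields(doc):
    """Return a copy of doc with a type suffix appended to each field name."""
    out = {}
    for name, value in doc.items():
        # Exact type lookup first, so bool is not mistaken for int.
        suffix = SUFFIXES.get(type(value))
        # datetime is a subclass of date, so treat it as a date too.
        if suffix is None and isinstance(value, date):
            suffix = "_d"
        if suffix is None:
            raise TypeError(f"unsupported type for field {name!r}")
        # Dates are serialized as ISO 8601 strings for the JSON document.
        out[name + suffix] = value.isoformat() if isinstance(value, date) else value
    return out


# The same logical field ends up under two different physical names.
as_string = suffix_fields({"birthdaydate": "31/01/1990"})
as_date = suffix_fields({"birthdaydate": date(1990, 1, 31)})
```

With this, one customer's document would carry birthdaydate_s and another's birthdaydate_d, so the two types never meet in the same mapping entry.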
My solution will lead to a lot of sparsity, as different customers could have many different custom fields. A solution that adds a subfield for each data type found would make this problem worse. However, adding a suffix will also lead to a lot of sparsity in the data... I only thought of this option because I previously worked on a platform that had no subfields, where it was the chosen solution.
Do you think that the use of subfields is the most common and effective way to solve problems like mine?
I saw what sounds like a similar scenario recently with a business that had lots of customers, each of whom could pick their own field names but who all shared a common index.
Their solution was to have a reserved bank of Elasticsearch fields (e.g. intField1, intField2, ... stringField1, ...) in the mapping, and each customer would have their field choices logically mapped to these physical fields, e.g.:
customer 1's "myPageViews" field == intField1
customer 2's "widgetSales" field == intField1
customer 2's "widgetSKU" field == stringField1
Then each customer query went through a custom mapping layer to translate their logical request into the physical Elasticsearch query.
This helped ensure there was a sensible limit on the number of unique fields in the Elasticsearch mapping and that there were no naming/type conflicts.
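A rough sketch of what such a translation layer might look like, in Python. The table contents come from the example above; the function names, the customer keys and the `customer_id` filter field are my own assumptions for illustration, not their actual implementation.

```python
# Hypothetical per-customer table: logical field name -> physical field
# from the reserved bank in the shared mapping.
FIELD_MAP = {
    "customer1": {"myPageViews": "intField1"},
    "customer2": {"widgetSales": "intField1", "widgetSKU": "stringField1"},
}


def to_physical(customer_id, logical_field):
    """Translate a customer's logical field name into the physical field."""
    try:
        return FIELD_MAP[customer_id][logical_field]
    except KeyError:
        raise KeyError(
            f"unknown field {logical_field!r} for customer {customer_id!r}"
        ) from None


def term_query(customer_id, logical_field, value):
    """Build a term query body restricted to one customer's documents.

    Assumes each document stores its owner in a `customer_id` field.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"customer_id": customer_id}},
                    {"term": {to_physical(customer_id, logical_field): value}},
                ]
            }
        }
    }


# Customer 2 asks for widgetSKU == "ABC-1"; the query targets stringField1.
body = term_query("customer2", "widgetSKU", "ABC-1")
```

The customer filter matters here because intField1 means something different for each customer, so a query must never mix their documents.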