Schemaless Support for Elasticsearch Queries

Hello,

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.

Consider this example document:

{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}

All fields except customData adhere to a schema. customData is always a JSON object, but the fields and values within that object can vary dramatically from resource to resource. There is no guarantee that any given field name, value, or even value type within customData is the same across any two resources, because users can edit these fields however they wish.

What is the best way to support search for this?

We thought a solution would be to simply not define any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with the approach. However, after running multiple tests, we haven't been able to get it to work.

Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.

Since this is not currently working for us, we've thought of a couple of alternative solutions:

1- Reindexing: this would be costly, as we would need to reindex every index that contains the document, and do so every time a user updates a property with a different value type. That is really bad for performance, so it is likely not a real option.
2- Use a multi-match query: we would do this by appending a random string to the customData field name every time the customData object changes. For example, this is what the document being indexed would look like:

{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
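
To make the renaming step concrete, here is a minimal sketch in Python of how we would rewrite a document just before indexing it. The helper name with_suffixed_custom_data is made up for illustration, and generating the "random string" with a fresh UUID on every change is an assumption on our side, not something ES requires:

```python
import copy
import uuid


def with_suffixed_custom_data(doc):
    """Return a copy of doc with customData moved under a uniquely suffixed key.

    Hypothetical helper: the name and the UUID-based suffix are assumptions.
    """
    renamed = copy.deepcopy(doc)
    if "customData" in renamed:
        # A new suffix on every change means ES sees a brand-new field name.
        key = "customData_" + str(uuid.uuid4())
        renamed[key] = renamed.pop("customData")
    return renamed


doc = {
    "givenName": "Joe",
    "username": "joe",
    "email": "joe@mailinator.com",
    "customData": {
        "favoriteColor": "red",
        "someObject": {"someKey": "someValue"},
    },
}

# The original doc is left untouched; only the copy sent to ES is renamed.
indexed = with_suffixed_custom_data(doc)
```

The copy is deliberate: our REST layer would keep serving the plain customData field, and only the version sent to ES would carry the suffix.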

This means ES would create a new mapping for each 'random' field, and we would use a phrase multi-match query with a "starts with" wildcard on the field names when performing the queries. For example:

curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query" : "red",
      "type" :  "phrase",
      "fields" : ["customData_*.favoriteColor"]
    }
  }
}'
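
Since users can search on any path inside customData, we would build that request body programmatically. A rough sketch in Python, assuming the same customData_ prefix as above (the function name custom_data_query is hypothetical):

```python
import json


def custom_data_query(path, value):
    """Build a phrase multi_match body for `path` under any customData_* field.

    Hypothetical helper; it only assembles the JSON body, it does not call ES.
    """
    return {
        "query": {
            "multi_match": {
                "query": value,
                "type": "phrase",
                # The wildcard covers every randomly suffixed customData field.
                "fields": ["customData_*." + path],
            }
        }
    }


# Serialized body, ready to be sent as the payload of the _search request.
body = json.dumps(custom_data_query("someObject.someKey", "someValue"))
```

Calling custom_data_query("favoriteColor", "red") produces the same body as the curl example above.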

This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?

This also just feels like a hack and something that should be handled by ES natively. Am I missing something?

Any suggestions about any of this would be much appreciated.

Thanks,
Elder