Hello all!
I have a rather complex requirement that I'd love some advice on. I am new to Elasticsearch, but I have spent a few weeks studying the docs, some case study blogs, and experimenting.
First, my documents' content consists mostly of 4 types of tokens:
"attribute x": value
"attribute x": {"gte": 123, "lte": 456}
"object a": ["attribute x": value]
"object a": ["attribute x": {"gte": 123, "lte": 456}]
... where "object a" and "attribute x" are integers obtained from a controlling RDB. "Value" may be an integer, keyword, date range, etc. and is also obtained from the controlling RDB.
I chose to use a controlling RDB due to the hierarchical / join nature involved.
Without going into unneeded detail, these tokens are carefully constructed to mathematically represent quantized bucketed values and what would typically be text tags. For example, instead of using a tag word "mahogany", I would have an "attribute #": value that represents "mahogany" as a class of "wood", thus enabling a query to mathematically find similar types of wood to mahogany.
To this end, in Elasticsearch some indexes would contain fields that contain an array of values, i.e. "attribute #": [value]. Other indexes would contain up to about 100 dynamic attributes.
After studying the pros and cons of parent-child and nested documents as well as the flattened field names model, I've come up with the following:
- For numeric single "attribute #" : value cases where queries work on an exact match, convert "attribute #" : value to a composite value within a mapped field. For example:
Attribute # 100 with value 5 becomes (100 << 16) | 5 => "attribs": [6553605]
This way, I can easily query this "attribs" field to get documents that have a value of 5 for Attribute # 100.
- For all other cases, flatten (object #) & attribute # to a field name. For example (including a type code for dynamic mapping):
type_code || Object ID || Attribute ID => "_i4_Jw2_8g": 101
I have tested and Elasticsearch has no problem with such field names. I know this goes against convention (but this is a special case).
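The two encodings above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: values are assumed to fit in 16 bits (which the shift by 16 implies), attribute IDs are capped so the packed number fits a signed 32-bit Elasticsearch integer, and the base-62 rendering of IDs is hypothetical, since the post does not say how "Jw2" and "8g" are derived. All function names are made up for the example.

```python
import string

VALUE_BITS = 16
VALUE_MASK = (1 << VALUE_BITS) - 1   # 0xFFFF: largest value that fits the low 16 bits
MAX_ATTR_ID = (1 << 15) - 1          # assumption: keep the result within a signed 32-bit integer


def pack_attrib(attr_id, value):
    """Combine an attribute ID and a small value into one integer."""
    if not 0 <= attr_id <= MAX_ATTR_ID:
        raise ValueError("attribute ID out of range")
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in 16 bits")
    return (attr_id << VALUE_BITS) | value


def unpack_attrib(packed):
    """Recover (attr_id, value) from a packed integer."""
    return packed >> VALUE_BITS, packed & VALUE_MASK


# Hypothetical ID encoding: digits, then a-z, then A-Z.
BASE62 = string.digits + string.ascii_letters
# Type codes taken from the dynamic templates in the mapping (_i4_*, _tx_*, _dr_*).
TYPE_CODES = {"i4", "tx", "dr"}


def to_base62(n):
    """Render a non-negative integer as a short base-62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


def flat_field_name(type_code, attr_id, object_id=None):
    """Build a flattened field name: _<type>_<object>_<attribute>, object optional."""
    if type_code not in TYPE_CODES:
        raise ValueError("unknown type code: %r" % type_code)
    parts = [type_code]
    if object_id is not None:
        parts.append(to_base62(object_id))
    parts.append(to_base62(attr_id))
    return "_" + "_".join(parts)
```

With these definitions, pack_attrib(100, 5) gives 6553605, matching the example above, and flat_field_name("i4", 5, object_id=62) gives "_i4_10_5", which the "_i4_*" dynamic template would pick up.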
The mappings in part would look something like this:
PUT my_index
{
  "mappings": {
    "my_type": {
      "_field_names": {
        "enabled": false
      },
      "properties": {
        "anchor": {
          "type": "integer",
          "doc_values": false
        },
        "name": {
          "type": "text",
          "norms": false,
          "doc_values": false
        }
      },
      "dynamic_templates": [
        {
          "int4": {
            "match_mapping_type": "long",
            "match": "_i4_*",
            "mapping": {
              "type": "integer",
              "norms": false,
              "doc_values": false
            }
          }
        },
        {
          "string": {
            "match_mapping_type": "string",
            "match": "_tx_*",
            "mapping": {
              "type": "keyword",
              "norms": false,
              "doc_values": false
            }
          }
        },
        {
          "date_range": {
            "match_mapping_type": "*",
            "match": "_dr_*",
            "mapping": {
              "type": "date_range",
              "norms": false,
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}
PUT my_index/my_type/1
{
  "anchor": 101,
  "_dr_my_date": {
    "gt": "2017-01-01",
    "lte": "2018-01-01"
  },
  "_i4_o1_a1": 11928,
  "_i4_o1_a2": 900000000
}
GET my_index/my_type/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "_dr_my_date": {
              "gte": "2017-01-01",
              "lte": "2017-01-02",
              "boost": 2
            }
          }
        },
        {
          "term": {
            "_i4_o1_a1": 11928
          }
        }
      ]
    }
  }
}
Please note my use of "doc_values", "norms", and "_field_names".
Here are my questions:
- I don't fully understand "norms". I thought they record text length for use in scoring. How come I can add them to integer and date types in the dynamic mapping (but not in the static mapping)?
- Are my approaches optimal for my use case? I concluded it was better to flatten field names than to use nested documents, since there could be 10 or more nested documents per document. The only downside is a little more code in the app layer.
- I won't be using the exists query, scoring, sorting, or running aggregates on dynamic fields, hence my disabling of "doc_values", "norms", and "_field_names".
- I want to write a custom ranking algorithm that compares values in the query against values in these dynamic fields. However, I'd have to enable "doc_values", which is not advisable for dynamic fields. Further, I haven't seen any method of comparing values in a query against values in a document in order to determine a rank (Euclidean distance is a good illustration). Therefore, would I be better off doing this ranking at the application level?
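To make the application-level option concrete, here is a minimal Python sketch of re-ranking by Euclidean distance. It assumes the hits come back in the usual search response shape (a list of dicts, each with a "_source"), that the compared fields are numeric, and that a document missing any compared field should sort last; the function names and the missing-field rule are illustrative choices, not anything Elasticsearch prescribes. Because it reads from "_source", it does not depend on "doc_values" being enabled.

```python
import math


def euclidean_distance(query_values, doc_source):
    """Distance between the query's target values and a document's field values.

    query_values: dict of field name -> numeric target.
    doc_source:   the hit's "_source" dict.
    Assumption: a document lacking any compared field is treated as infinitely far.
    """
    total = 0.0
    for field, target in query_values.items():
        actual = doc_source.get(field)
        if actual is None:
            return math.inf
        total += (actual - target) ** 2
    return math.sqrt(total)


def rerank_hits(hits, query_values):
    """Return the hits ordered from closest to farthest."""
    return sorted(
        hits,
        key=lambda hit: euclidean_distance(query_values, hit.get("_source", {})),
    )
```

In practice the filter query would narrow the candidates first, and rerank_hits(response["hits"]["hits"], {"_i4_o1_a1": 11928, "_i4_o1_a2": 900000000}) would then order whatever page of results was fetched.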
Thanks much!