We are building a search feature for our existing product. The document structure looks like this:
{
  "_index" : "apr-2018-feed",
  "_type" : "product",
  "_id" : "8102637039f76d20a0adc1257c14ee08",
  "_source" : {
    "id" : "8102637039f76d20a0adc1257c14ee08",
    "field1" : "value",
    "field2" : 6495,
    "field3" : "",
    "field4" : "value",
    "field5" : 23922,
    "dateField" : "2018-03-02",
    "valueField" : 10000000,
    "clusters" : [{
        "clusterId" : 4919,
        "clusterName" : "XYZ",
        "innerClusters" : [{
            "innerClusterId" : 118760075,
            "field1" : "value",
            "field2" : 6495,
            "field3" : "",
            "field4" : "value",
            "field5" : 23922,
            "attributeStore1" : [{
                "name" : "attr1",
                "value" : "attrVal"
              }, {
                "name" : "attr2",
                "value" : "attrVal"
              }, {
                "name" : "attr3",
                "value" : "attrVal"
              }, {
                "name" : "attr4",
                "value" : "attrVal"
              }
            ],
            "attributeStore2" : [{
                "name" : "attr5",
                "value" : "attrVal"
              }, {
                "name" : "attr6",
                "value" : "attrVal"
              }, {
                "name" : "attr7",
                "value" : "attrVal"
              }, {
                "name" : "attr8",
                "value" : "attrVal"
              }
            ]
          }, {
            "innerClusterId" : 118760076,
            "field1" : "value",
            "field2" : 6495,
            "field3" : "",
            "field4" : "value",
            "field5" : 23922,
            "attributeStore1" : [{
                "name" : "attr1",
                "value" : "attrVal"
              }, {
                "name" : "attr2",
                "value" : "attrVal"
              }, {
                "name" : "attr3",
                "value" : "attrVal"
              }, {
                "name" : "attr4",
                "value" : "attrVal"
              }
            ],
            "attributeStore2" : [{
                "name" : "attr5",
                "value" : "attrVal"
              }, {
                "name" : "attr6",
                "value" : "attrVal"
              }, {
                "name" : "attr7",
                "value" : "attrVal"
              }, {
                "name" : "attr8",
                "value" : "attrVal"
              }
            ]
          }
        ]
      }
    ]
  }
}
This is the document structure that I am using.
Document
|__ Clusters
    |__ InnerClusters
        |__ AttrStore1
        |__ AttrStore2
We cluster documents based on document similarity.
We have around 17 million grouped/clustered documents, and the index size is 105.3 GB. The total document count reported by ES is 298.8 million.
We have configured 2 m5.large data nodes (2 vCPUs each, 4 vCPUs in total) with SSD storage. The index has 4 shards (1 shard per CPU core) and 0 replicas, and its segments are merged (1 segment per shard).
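To put the figures above in proportion, here is a rough back-of-the-envelope sketch that uses only the numbers quoted in this post. The ratio of total to top-level documents would be consistent with nested fields being counted as separate documents, but the mapping is not shown here, so treat that reading as an assumption.

```python
# Figures quoted in the question.
index_size_gb = 105.3
shard_count = 4
top_level_docs = 17_000_000   # grouped/clustered documents
total_docs = 298_800_000      # total document count reported by ES

# Average amount of data and documents each shard has to serve.
gb_per_shard = index_size_gb / shard_count
docs_per_shard = total_docs / shard_count

# How many ES-level documents exist per top-level document.
# Assumption: the gap comes from nested objects; the mapping is not shown.
docs_per_top_level = total_docs / top_level_docs

print(f"{gb_per_shard:.1f} GB per shard")
print(f"{docs_per_shard / 1e6:.1f} million docs per shard")
print(f"{docs_per_top_level:.1f} ES docs per top-level document")
```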
ES configuration
bootstrap.memory_lock: true
indices.requests.cache.size: 30%
thread_pool.search.size: 50
Heap
-Xms4g
-Xmx4g
We also ran a match_all query, which takes around 5 seconds with the cache cleared.
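For reference, the measurement can be reproduced with something like the sketch below. It is only an outline under assumptions: ES_URL is a placeholder for the node address, the helper names (post, summarize, timed_match_all) are made up for illustration, and the 500 ms budget is the target we are aiming for. It clears the index caches, runs match_all, and compares the "took" value from the search response with the budget.

```python
import json
import time
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: placeholder node address
INDEX = "apr-2018-feed"           # index name from the sample document


def post(path, body=None):
    """POST an optional JSON body to Elasticsearch and decode the reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else b""
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(response, wall_ms, budget_ms=500):
    """Compare the server-side 'took' time (ms) with the latency budget."""
    took = response["took"]
    return {
        "took_ms": took,
        "wall_ms": round(wall_ms),
        "within_budget": took <= budget_ms,
    }


def timed_match_all():
    """Clear the index caches, then time a single match_all search."""
    post(f"/{INDEX}/_cache/clear")
    start = time.perf_counter()
    response = post(f"/{INDEX}/_search", {"query": {"match_all": {}}})
    wall_ms = (time.perf_counter() - start) * 1000
    return summarize(response, wall_ms)
```

Calling timed_match_all() against the cluster returns a small dict; comparing took_ms with wall_ms helps separate time spent inside Elasticsearch from network and client overhead.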
We also tried larger instances with a total of 16 vCPUs and 120 GB of RAM for Elasticsearch, but the performance was similar.
How should we store the documents so that we can query them in under 500 ms?