Hi,
I'm building a new cluster to allow search on our internal chat system.
Given the volume of data to index and the budget for it, storage efficiency is key here.
Message structure:
channel_id: integer
user_id: integer
message_id: GUID
message: text
timestamp: currently a date object, can be converted to a Unix timestamp
features: JSON
profile: JSON
reply_parent_message_id: GUID
role: JSON
score: integer between 0 and 100
The use cases will be searches filtered by timestamp, channel_id, and user_id, full-text search on the message field, and filtering on features, profile, or role containing a specific keyword.
The rest is only for display in search results.
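To make the use cases concrete, here is a rough sketch of the kind of search body I expect to send to the _search endpoint. The search term, the IDs, the role keyword, and the time range are placeholder values, not real data.

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "deployment" } }
      ],
      "filter": [
        { "term": { "channel_id": 42 } },
        { "term": { "user_id": 1001 } },
        { "match": { "role": "admin" } },
        {
          "range": {
            "timestamp": {
              "gte": "2024-01-01T00:00:00Z",
              "lt": "2024-02-01T00:00:00Z"
            }
          }
        }
      ]
    }
  }
}
```

Only the match on message needs relevance scoring; everything else sits in the filter clause, so it only narrows the result set.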
In terms of aggregations, I need histograms of the message count per unit of time per channel_id between two timestamps, and an average of the score field over all messages for one channel_id between two timestamps.
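As a sketch of what I have in mind, the request body below combines both aggregations for a single channel. It assumes Elasticsearch 7.2 or later (for calendar_interval); the channel ID, the time range, the daily interval, and the aggregation names messages_over_time and average_score are placeholders I picked for illustration.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "channel_id": 42 } },
        {
          "range": {
            "timestamp": {
              "gte": "2024-01-01T00:00:00Z",
              "lt": "2024-02-01T00:00:00Z"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "messages_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      }
    },
    "average_score": {
      "avg": { "field": "score" }
    }
  }
}
```

For several channels at once, I assume the histogram could be nested under a terms aggregation on channel_id instead of filtering on a single value.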
So I'm thinking of building my index template mappings the following way:
{
  "mappings": {
    "properties": {
      "channel_id": {
        "type": "integer",
        "index": true
      },
      "user_id": {
        "type": "integer",
        "index": true
      },
      "message_id": {
        "type": "keyword",
        "norms": false,
        "index_options": "freqs"
      },
      "message": {
        "type": "text"
      },
      "timestamp": {
        "type": "date"
      },
      "features": {
        "type": "text"
      },
      "profile": {
        "type": "text"
      },
      "reply_parent_message_id": {
        "type": "keyword",
        "norms": false,
        "index_options": "freqs"
      },
      "role": {
        "type": "text"
      },
      "score": {
        "type": "byte",
        "index": false
      }
    }
  }
}
My main questions are about the GUIDs and the message itself.
My understanding is that, since I will only store the GUIDs and never filter or search on them, this mapping is the way to use the least space.
For the message, I will only search for specific keywords and return the full message, so I assume text is the right type.
Could you please confirm whether my assumptions are correct and my mapping is OK?