Group search results based on certain fields and return n groups

RovoMe · July 10, 2015, 1:52pm

I'm fairly new to Elasticsearch and currently have to refactor a basic setup that indexes messages along with their assigned states. A message can have any number of attached states.

Our current approach has a mapping like this:

{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 100,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "message": {
      "_source": {
        "enabled": "true"
      },
      "_all": {
        "enabled": "true",
        "type": "string",
        "index": "not_analyzed",
        "index_analyzer": "simple",
        "search_analyzer": "simple"
      },
      "properties": {
        "lid": {
          "type": "string"
        },
        "mDatTim": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "test": {
          "type": "boolean",
          "index": "no"
        },
        "rN": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "sN": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "ts": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "suggest": {
          "type": "completion",
          "payloads": false
        }
      }
    },
    "status": {
      "_source": {
        "enabled": "true"
      },
      "_all": {
        "enabled": "true",
        "type": "string",
        "index": "not_analyzed",
        "index_analyzer": "simple",
        "search_analyzer": "simple"
      },
      "properties": {
        "test": {
          "type": "boolean",
          "index": "no"
        },
        "recvAt": {
          "type": "long",
          "index": "not_analyzed"
        },
        "size": {
          "type": "long",
          "index": "no",
          "include_in_all": false
        },
        "noDocs": {
          "type": "long",
          "index": "not_analyzed",
          "include_in_all": false
        },
        "origSize": {
          "type": "long",
          "index": "not_analyzed",
          "include_in_all": false
        },
        "rN": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "port": {
          "type": "long",
          "index": "not_analyzed"
        },
        "sN": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "uN": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "usrAgt": {
          "type": "string",
          "index": "analyzed",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer"
        },
        "suggest": {
          "type": "completion",
          "payloads": false
        }
      }
    }
  }
}

which defines two types: message and status. We used queries like the one below to retrieve all entries that contain the query term either fully or partially, depending on the field: the name fields (sN, rN, uN) should support partial matches, while the remaining fields should only return results if the term matches in full:

{
  "size" : 20,
  "query" : {
    "match" : {
      "_all" : {
        "query" : "1436001",
        "operator" : "AND"
      }
    }
  },
  "sort" : [
    { "mId" : "asc" },
    { "_type" : "asc" },
    "_score"
  ]
}

This returns messages or states that contain all of the provided query terms (in this example there is just a single term) in one of their fields. However, we'd like to build groups consisting of a message and its states, and return 20 groups instead of 20 individual message or status entities. This should prevent a page from being cut off in the middle of entries that logically belong together, while keeping pagination intact.
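To make the desired result shape concrete, here is a minimal Python sketch of the grouping done on the client side. It is only an illustration of what I mean by "20 groups", not something we run today. It assumes that every hit (message or status) carries the shared `mId` field used in the sort clause above inside its `_source`; the function name `group_hits` and the sample hits are made up for this example.

```python
from collections import OrderedDict


def group_hits(hits, key="mId", n_groups=20):
    """Bundle search hits that share the same key and keep the first n_groups bundles.

    `hits` is a list of dicts shaped like the entries in the `hits.hits` array
    of a search response. The key field (`mId`) is an assumption.
    """
    groups = OrderedDict()
    for hit in hits:
        group_id = hit["_source"].get(key)
        # Insertion order is preserved, so groups appear in the order
        # in which their first hit was returned.
        groups.setdefault(group_id, []).append(hit)
    return list(groups.items())[:n_groups]


# Hypothetical hits: one message with two states, one message with one state.
sample_hits = [
    {"_type": "message", "_source": {"mId": "1436001"}},
    {"_type": "status", "_source": {"mId": "1436001"}},
    {"_type": "status", "_source": {"mId": "1436001"}},
    {"_type": "message", "_source": {"mId": "1436002"}},
    {"_type": "status", "_source": {"mId": "1436002"}},
]

for group_id, members in group_hits(sample_hits, n_groups=20):
    print(group_id, [m["_type"] for m in members])
```

The drawback of doing this outside of ES is that the number of hits to fetch per page is not known up front, which is why I'm looking for a way to have ES build the groups itself.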

Is this possible in ES without having to change the actual mapping? Currently it seems that states would have to be nested inside of their message entries, but messages and states are persisted at different times. Also, a lookup on a field should work regardless of the type (message or status).
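Since the lookup has to behave the same for both types, it may help to spell out what the shared nGram_analyzer does to the name fields. The following Python sketch is a simplified model, not the actual Lucene implementation: it splits on whitespace, lowercases, and emits every substring between `min_gram` and `max_gram` characters long. The asciifolding step is left out, and the function name `ngram_terms` is made up for this example.

```python
def ngram_terms(text, min_gram=3, max_gram=100):
    """Simplified model of whitespace tokenizer + lowercase + nGram filter."""
    terms = set()
    for token in text.lower().split():
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            for start in range(len(token) - size + 1):
                terms.add(token[start:start + size])
    return terms


indexed = ngram_terms("John Smith")

# The search side only lowercases the term (whitespace_analyzer),
# so a partial term matches if it is one of the indexed grams.
print("mit" in indexed)   # partial term of at least 3 characters
print("sm" in indexed)    # shorter than min_gram, so never indexed
```

In this model a search term shorter than three characters can never match a name field, because no gram of that length is produced at index time.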

Any tips on improving the mapping further are also welcome.