Designing an Index with high Cardinality for a field distributed across multiple shards

Hello,

I am designing an index for logs from a search API. Each log entry records the type and count of searches as well as the type and count of results served in the search response.

Sample Information in the Log:
{
  "SearchId": "12345678",
  "SearchLocation": "India",
  "Results": [
    { "position": 1, "result_accuracy": "high" },
    { "position": 2, "result_accuracy": "medium" }
  ]
}

For this information, I want to create an index to visualize the following in Kibana:

Use Cases:

  1. Count of Searches.
  2. Count of Results with medium accuracy at position 1.
  3. Count of Searches with medium accuracy results at position 1.

Approach

  1. An object field for "results" won't work: objects in an array are flattened, so the position of one result can be matched with the result_accuracy of another, which gives false counts for use cases 2 and 3.
  2. Nested and parent-child field types for results won't work either, because Kibana visualizations do not support them yet.
  3. One possible solution is to flatten the results and copy the search-level information into each result doc, like this:

{
  "resultId": "87654",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 1,
  "result_accuracy": "high"
}
{
  "resultId": "234",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 2,
  "result_accuracy": "medium"
}
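
The flattening step could be done before indexing, roughly as in the Python sketch below. The function name flatten_search and the way resultId is derived are assumptions made for illustration; the post does not say how resultId is actually produced.

```python
from typing import Any


def flatten_search(search: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one search log entry into one document per result."""
    docs = []
    for result in search.get("Results", []):
        docs.append({
            # Hypothetical ID scheme (SearchId + position), not from the original post.
            "resultId": f'{search["SearchId"]}-{result["position"]}',
            # Search-level fields are copied into every result document.
            "SearchId": search["SearchId"],
            "SearchLocation": search["SearchLocation"],
            "position": result["position"],
            "result_accuracy": result["result_accuracy"],
        })
    return docs


search_log = {
    "SearchId": "12345678",
    "SearchLocation": "India",
    "Results": [
        {"position": 1, "result_accuracy": "high"},
        {"position": 2, "result_accuracy": "medium"},
    ],
}

flat_docs = flatten_search(search_log)
for doc in flat_docs:
    print(doc)
```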

How Use Cases will be solved:

  1. Count of Searches with some filtering criteria --> Count Unique of SearchId
  2. Count of Results with medium accuracy at position 1 --> Count of docs with filtering on result_accuracy and position
  3. Count of Searches with medium accuracy results at position 1 --> Count Unique of SearchId with some filters
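
To make the three counts concrete, here is a small in-memory Python sketch over flattened docs. It only illustrates the counting logic (exact counts, no Elasticsearch involved); the second search with SearchId "99" is made-up sample data.

```python
# Flattened result docs; SearchId "99" is invented for the example.
docs = [
    {"SearchId": "12345678", "position": 1, "result_accuracy": "high"},
    {"SearchId": "12345678", "position": 2, "result_accuracy": "medium"},
    {"SearchId": "99", "position": 1, "result_accuracy": "medium"},
    {"SearchId": "99", "position": 2, "result_accuracy": "medium"},
]


def is_medium_at_position_1(doc: dict) -> bool:
    """Filter shared by use cases 2 and 3."""
    return doc["position"] == 1 and doc["result_accuracy"] == "medium"


# Use case 1: count of searches = unique count of SearchId.
search_count = len({d["SearchId"] for d in docs})

# Use case 2: count of results = count of docs passing the filter.
result_count = sum(1 for d in docs if is_medium_at_position_1(d))

# Use case 3: count of searches = unique count of SearchId among filtered docs.
searches_with_match = len({d["SearchId"] for d in docs if is_medium_at_position_1(d)})

print(search_count, result_count, searches_with_match)
```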

The SearchId field will have high cardinality: at 2 million searches per day and 10 results per search, that is 20 million docs per day with 2 million unique SearchIds.

Challenge 1
The cardinality aggregation returns an approximate unique count, so there will be some percentage error.
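
One knob that affects this error is the cardinality aggregation's precision_threshold parameter. The Python sketch below builds a request body for use case 3; the helper name is hypothetical, and it assumes SearchId and result_accuracy are mapped as keyword fields. With around 2 million unique SearchIds the result would still be approximate, since 40000 is the maximum supported threshold.

```python
import json


def unique_search_count_request(position: int, accuracy: str,
                                precision_threshold: int = 40000) -> dict:
    """Build a search body that counts unique SearchIds among filtered result docs."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"position": position}},
                    {"term": {"result_accuracy": accuracy}},
                ]
            }
        },
        "aggs": {
            "unique_searches": {
                "cardinality": {
                    "field": "SearchId",
                    # Counts below this value are expected to be close to exact;
                    # values above 40000 behave the same as 40000.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


request_body = unique_search_count_request(1, "medium")
print(json.dumps(request_body, indent=2))
```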

Challenge 2
When these result documents are distributed across multiple shards, my concern is that the number of unique searches will be inflated, because unique counts will be calculated per shard.

Any help with, or validation of, this design would be appreciated.

hey @Jatin_Garg1 :

I believe using the nested field type would help address points 2 and 3, something like:

{
  "mappings": {
    "properties": {
      "search_id": { "type": "unsigned_long" },
      "search_location": { "type": "keyword" },
      "results": {
        "type": "nested",
        "properties": {
          "position": { "type": "integer" },
          "accuracy": { "type": "keyword" }
        }
      }
    }
  }
}

That way, you'll be able to filter using a nested query:

{
  "query": {
    "nested": {
      "path": "results",
      "query": {
        "bool": {
          "filter": [
            { "match": { "results.position": 1 } },
            { "match": { "results.accuracy": "medium" } }
          ]
        }
      }
    }
  }
}

That query matches any document whose results array contains at least one element that satisfies both conditions.

I hope that helps!

Hi @Carlos_D,

The end goal of the index is to visualize these in Kibana.
Nested fields aren't supported in Kibana yet.

It sounds to me like you want to search and analyse 2 different things - searches and results. One way to do this without requiring parent-child or nested mappings (not supported by Kibana) might be to create 2 separate indices, one for searches and one for results.

The searches index might look something like this:

{
  "SearchId": "12345678",
  "SearchLocation": "India",
  "Results": ["1 high", "2 medium"]
}

Here I concatenated each result into a single string, which would be analyzed using a whitespace analyzer as well as mapped as a keyword field. You can use different types of queries depending on what you are looking for.
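
As a rough illustration, the concatenated Results values could be built like this in Python before indexing. The helper name to_search_doc and the example term query are assumptions; the query presumes the keyword mapping of the Results field.

```python
def to_search_doc(search: dict) -> dict:
    """Build one searches-index document with each result as a 'position accuracy' string."""
    return {
        "SearchId": search["SearchId"],
        "SearchLocation": search["SearchLocation"],
        "Results": [
            f'{r["position"]} {r["result_accuracy"]}' for r in search["Results"]
        ],
    }


search_log = {
    "SearchId": "12345678",
    "SearchLocation": "India",
    "Results": [
        {"position": 1, "result_accuracy": "high"},
        {"position": 2, "result_accuracy": "medium"},
    ],
}

search_doc = to_search_doc(search_log)

# Use case 3 then becomes an exact match on one concatenated value
# (assumes Results is queried through its keyword mapping).
medium_at_1_query = {"query": {"term": {"Results": "1 medium"}}}

print(search_doc)
```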

In the results index you would index each result separately with the search information denormalised as in your example:

{
  "resultId": "87654",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 1,
  "result_accuracy": "high"
}

Use cases 1 and 3 would use the searches index while use case 2 would run against the results index.

Oh I see, sorry about that! I thought the question was about index design and missed the Kibana requirement :frowning_face:

Some alternatives for this have been discussed here - basically using a flattened structure, or using a custom Vega visualization over a nested query.