Designing an Index with high Cardinality for a field distributed across multiple shards

Jatin_Garg1 · February 10, 2025, 10:13am

Hello,

I am designing an index for logs from a search API. Each log entry records the type and count of searches, and the type and count of results served in a search response.

Sample Information in the Log:
{
  "SearchId": "12345678",
  "SearchLocation": "India",
  "Results": [
    { "position": 1, "result_accuracy": "high" },
    { "position": 2, "result_accuracy": "medium" }
  ]
}

For this information, I want to create an index to visualize the following in Kibana:

Use Cases:

1. Count of Searches.
2. Count of Results with medium accuracy at position 1.
3. Count of Searches with medium accuracy results at position 1.

Approach

Can't use an object type field for "results", because arrays of objects are flattened at index time and the link between position and result_accuracy within a single result is lost. That gives false counts for use cases 2 and 3.
Can't use nested or parent-child field types for results, because they are not yet supported in Kibana visualizations.
One possible solution is to flatten the results and add the search-related information to each result doc, like:

{
  "resultId": "87654",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 1,
  "result_accuracy": "high"
}
{
  "resultId": "234",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 2,
  "result_accuracy": "medium"
}
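
The flattening step could be sketched in Python roughly as follows. This is only an illustration: the function name `flatten_search_log` and the `resultId` scheme (SearchId plus position) are assumptions, since the post does not say how resultId is assigned.

```python
from typing import Any


def flatten_search_log(log: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one search log entry into one document per result."""
    docs = []
    for result in log.get("Results", []):
        docs.append({
            # Hypothetical ID scheme; the real source of resultId is not given.
            "resultId": f'{log["SearchId"]}-{result["position"]}',
            # Search-level fields are copied onto every result document.
            "SearchId": log["SearchId"],
            "SearchLocation": log["SearchLocation"],
            "position": result["position"],
            "result_accuracy": result["result_accuracy"],
        })
    return docs


sample_log = {
    "SearchId": "12345678",
    "SearchLocation": "India",
    "Results": [
        {"position": 1, "result_accuracy": "high"},
        {"position": 2, "result_accuracy": "medium"},
    ],
}
result_docs = flatten_search_log(sample_log)
```

One consequence of this shape: a search that returned no results produces no documents at all, so it would be missing from any unique count of SearchId.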

How Use Cases will be solved:

1. Count of Searches with some filtering criteria --> Count Unique of SearchId
2. Count of Results with medium accuracy at position 1 --> Count of docs with filtering on result_accuracy and position
3. Count of Searches with medium accuracy results at position 1 --> Count Unique of SearchId with some filters
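
For reference, the requests behind these three counts would look roughly like the bodies built below. This is a sketch, assuming SearchId, SearchLocation and result_accuracy are mapped as keyword and position as a numeric type; the helper name `unique_search_count_body` is made up, and the SearchLocation filter is just an example criterion. Kibana's "Unique count" metric is backed by the cardinality aggregation.

```python
import json


def unique_search_count_body(filters: list[dict], precision_threshold: int = 40000) -> dict:
    """Search body that returns an approximate count of distinct SearchId values."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "unique_searches": {
                "cardinality": {
                    "field": "SearchId",
                    # 40000 is the highest value the aggregation supports.
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


position_1_medium = [
    {"term": {"position": 1}},
    {"term": {"result_accuracy": "medium"}},
]

# Use case 1: unique searches under some filtering criteria (example filter).
use_case_1 = unique_search_count_body([{"term": {"SearchLocation": "India"}}])
# Use case 2: plain document count, e.g. as a body for the _count API.
use_case_2 = {"query": {"bool": {"filter": position_1_medium}}}
# Use case 3: unique searches among results at position 1 with medium accuracy.
use_case_3 = unique_search_count_body(position_1_medium)

print(json.dumps(use_case_3, indent=2))
```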

Now, the SearchId field will have high cardinality: 2 million searches per day at 10 results per search gives 20 million docs per day with 2 million unique SearchIds.

Challenge 1
Count Unique uses the cardinality aggregation, which is approximate, so the result carries some percentage error.

Challenge 2
When these result documents are distributed across multiple shards, the number of unique searches will be inflated, because unique counts will be calculated with respect to each shard.
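
To make Challenge 2 concrete, here is a minimal Python sketch with a hypothetical two-shard layout (the shard contents are made up for illustration). Summing per-shard distinct counts double-counts a SearchId whose results land on different shards, while merging the per-shard sets does not. As far as I know, the cardinality aggregation combines per-shard sketches in the second way rather than summing counts, but that is worth verifying before designing around it.

```python
# Hypothetical layout: the results of search "A" are split across two shards.
shards = [
    {"A", "B"},  # SearchIds present on shard 0
    {"A", "C"},  # SearchIds present on shard 1
]

# Naive approach: add up each shard's distinct count ("A" is counted twice).
summed_count = sum(len(ids) for ids in shards)

# Merge approach: combine the per-shard sets first, then count ("A" counted once).
merged_count = len(set().union(*shards))

print(summed_count, merged_count)  # 4 3
```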

Any help with, or validation of, this design would be appreciated.

Carlos_D · February 13, 2025, 8:27am

Hey @Jatin_Garg1,

I believe using the nested field type would help address points 2 and 3, something like:

{
  "mappings": {
    "properties": {
      "search_id": { "type": "unsigned_long" },
      "search_location": { "type": "keyword" },
      "results": {
        "type": "nested",
        "properties": {
          "position": { "type": "integer" },
          "accuracy": { "type": "keyword" }
        }
      }
    }
  }
}

That way, you'll be able to filter using a nested query:

{
  "query": {
    "nested": {
      "path": "results",
      "query": {
        "bool": {
          "filter": [
            { "match": { "results.position": 1 } },
            { "match": { "results.accuracy": "medium" } }
          ]
        }
      }
    }
  }
}

That query will effectively match against any document that contains an element in the results array that matches the query.
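
The difference from a plain object field can be simulated in a few lines of Python. This is only a model of the matching semantics, not Elasticsearch code, and the function names are made up: with an object field the array is flattened into independent per-field value lists, whereas a nested field keeps each array element together.

```python
def matches_as_object(doc: dict, position: int, accuracy: str) -> bool:
    """Object semantics: each field becomes a flat list of values, pairing is lost."""
    positions = {r["position"] for r in doc["results"]}
    accuracies = {r["accuracy"] for r in doc["results"]}
    return position in positions and accuracy in accuracies


def matches_as_nested(doc: dict, position: int, accuracy: str) -> bool:
    """Nested semantics: both conditions must hold within the same array element."""
    return any(
        r["position"] == position and r["accuracy"] == accuracy
        for r in doc["results"]
    )


doc = {
    "search_id": 12345678,
    "results": [
        {"position": 1, "accuracy": "high"},
        {"position": 2, "accuracy": "medium"},
    ],
}

print(matches_as_object(doc, 1, "medium"))  # True: a false positive
print(matches_as_nested(doc, 1, "medium"))  # False: no single result matches both
```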

I hope that helps!

Jatin_Garg1 · February 13, 2025, 9:17am

Hi @Carlos_D,

The end goal of the index is to visualize these in Kibana.
Nested fields aren't supported in Kibana yet:

github.com/elastic/kibana
Nested field support (issue #1084, opened 21 Mar 2014 by Alex-Ikanow)

This is sort of a duplicate of some other issues I searched for but I haven't seen this particular aspect discussed, so I thought this was worth a separate issue. You read the _mapping field, so you should know when a particular field is nested, so can it not automatically apply the correct nested facet/query when such a field is selected in queries or facets? (Alternatively/in addition as suggested by #532, you could have a checkbox to allow users to select it themselves, perhaps as an interim measure) I'm sure there are some cases where this gets complicated, but there are also a bunch of cases where it's a straightforward changing of one block of JSON to another.

Latest update: https://github.com/elastic/kibana/issues/1084#issuecomment-585178079

Christian_Dahlqvist · February 13, 2025, 9:41am

It sounds to me like you want to search and analyse 2 different things - searches and results. One way to do this without requiring parent-child or nested mappings (not supported by Kibana) might be to create 2 separate indices, one for searches and one for results.

The searches index might look something like this:

{
  "SearchId": "12345678",
  "SearchLocation": "India",
  "Results": ["1 high", "2 medium"]
}

Here I concatenated each result into a single string, which would be analyzed using a whitespace analyzer as well as mapped as a keyword field. You can use different types of queries depending on what you are looking for.
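
Building that document from the raw log entry might look like the Python sketch below. The function name `to_search_doc` is made up, and the input is assumed to have the shape of the sample log from the first post.

```python
from typing import Any


def to_search_doc(log: dict[str, Any]) -> dict[str, Any]:
    """Build one searches-index document, with each result as a 'position accuracy' string."""
    return {
        "SearchId": log["SearchId"],
        "SearchLocation": log["SearchLocation"],
        "Results": [
            f'{r["position"]} {r["result_accuracy"]}'
            for r in log.get("Results", [])
        ],
    }


sample_log = {
    "SearchId": "12345678",
    "SearchLocation": "India",
    "Results": [
        {"position": 1, "result_accuracy": "high"},
        {"position": 2, "result_accuracy": "medium"},
    ],
}
search_doc = to_search_doc(sample_log)
print(search_doc["Results"])  # ['1 high', '2 medium']
```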

In the results index you would index each result separately with the search information denormalised as in your example:

{
  "resultId": "87654",
  "SearchId": "12345678",
  "SearchLocation": "India",
  "position": 1,
  "result_accuracy": "high"
}

Use cases 1 and 3 would use the searches index while use case 2 would run against the results index.
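
As a rough Python sketch, the three counts could then be expressed as plain count requests, one per use case. The index names `searches` and `results` are hypothetical, and the `Results` field is assumed to be queryable as a keyword under that name. Since the searches index holds exactly one document per search, use cases 1 and 3 become ordinary document counts and need no cardinality aggregation.

```python
SEARCHES_INDEX = "searches"  # hypothetical index name
RESULTS_INDEX = "results"    # hypothetical index name


def count_requests(position: int, accuracy: str) -> dict[str, tuple[str, dict]]:
    """Map each use case to (index name, request body for a count query)."""
    return {
        # Use case 1: every document in the searches index is one search.
        "use_case_1": (SEARCHES_INDEX, {"query": {"match_all": {}}}),
        # Use case 2: one document per result, filtered on both fields.
        "use_case_2": (RESULTS_INDEX, {
            "query": {"bool": {"filter": [
                {"term": {"position": position}},
                {"term": {"result_accuracy": accuracy}},
            ]}}
        }),
        # Use case 3: exact match on the concatenated "position accuracy" value.
        "use_case_3": (SEARCHES_INDEX, {
            "query": {"term": {"Results": f"{position} {accuracy}"}}
        }),
    }


requests = count_requests(1, "medium")
for name, (index, body) in requests.items():
    print(name, index, body)
```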

Carlos_D · February 13, 2025, 9:42am

Oh I see, sorry about that! I thought the question was about index design and missed the Kibana requirement.

Some alternatives for this have been discussed here: basically using a flattened structure, or using a custom Vega visualization over a nested query.
