Elasticsearch: Single Index vs Multiple Indexes

Thanks @Mark_Harwood

We will surely work on benchmarking and update the results in this same thread.

Thank you again for your constant guidance.

@Mark_Harwood

Considering the same quoted scenarios, I am not worried much about performance.
I understand a mapping explosion is not a specific event but a general condition of having a lot of fields.

My main concern is this: if I have 30 indexes, each with 100 fields, and then search across all 30 indexes with a single query, that means searching across all 3000 fields (30 indexes × 100 fields each). Can there be a mapping explosion in this case?

The reason I am asking is that each index in itself contains a low number of fields, but the search query is spread across multiple indexes, which cumulates to a large number of fields being searched.

So you're asking "is 30 x 100 a lot"?

Maybe I did not put my question correctly the first time.
The only confusion I have is whether a mapping explosion is restricted to a search on a single index, or whether it also applies to a search across the cumulative fields of multiple indexes?

They all add up

Thanks @Mark_Harwood for your continuous help.
I have read through this link https://issues.apache.org/jira/browse/LUCENE-6842, which talks about a similar situation in Lucene.

Thanks @Mark_Harwood for your continuous help. We are working on benchmarking for our use cases.

I still have one question:
If I am searching only a single specified field on an index that has 1000 fields, can a mapping explosion occur?

This situation is different from my previous use cases, where I was searching on all fields. Now the search will only be on one specified field of an index containing 1000 fields.

Having a very large number of fields can, as Mark points out, lead to a lot of performance problems. If the number of fields is static, however, it is in my opinion wrong to use the term mapping explosion. As outlined in this rather old but still useful blog post, a mapping explosion is when the number of mapped fields continuously increases due to how the data model is structured. Each change requires the cluster state to be updated and propagated, and this typically gets slower the larger the state gets, at some point causing severe performance and stability problems.
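To illustrate the distinction Christian draws (the field and value names here are purely hypothetical), a data model that turns values into field names makes the mapping grow with every new document:

```json
{
  "user_12345": { "last_login": "2021-01-01" },
  "user_67890": { "last_login": "2021-01-02" }
}
```

Every new user adds newly mapped fields, forcing a cluster state update each time. Reshaping to fixed keys, e.g. one document of the form `{ "user_id": "12345", "last_login": "2021-01-01" }` per user, keeps the field count static, which is the situation discussed in this thread.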

Thanks for your response

@Mark_Harwood @Christian_Dahlqvist
As per your suggestion, we have proceeded with benchmarking.
Although we haven't faced any mapping explosion, our search has become drastically slower for free-text search. Here is a summary.

| Total field count | Time     | Calls          |
|-------------------|----------|----------------|
| 2000              | 2-5 s    | Single call    |
| 2000              | 2-15 s   | Multiple calls |
| 5000              | 15-40 s  | Single call    |
| 5000              | 70-90 s  | Multiple calls |
| 10000             | 25-35 s  | Single call    |
| 10000             | 2-4 min  | Multiple calls |

By "multiple calls", I mean 5 users searching simultaneously.

More indexed fields = more data structures.
More data structures = more random disk seeks.
More random disk seeks = more time.


Based on our benchmarking results, we can conclude that an extensive number of fields slows down free-text search.

This leads to a few questions:

  1. Suppose we have two different indexes that contain the same set and number (let's say 20) of fields (identical field names and data types). When two users search these two indexes simultaneously, will 20 fields be loaded in the cluster state (since the fields are common), or 40? If 40 fields are loaded, is there a provision to make these fields common among the indexes, given that they have identical properties (name and data type)?

  2. When 100,000 users simultaneously search a single index containing 20 fields, is it safe to assume that 100,000 × 20 = 2,000,000 fields will be loaded in the cluster state?

No and no.
The cluster state changes with mapping changes, not with active searches.

Thanks for the response

When 10 different indexes are searched simultaneously, does the cluster state load the mappings of all 10 indexes?

Cluster state is always loaded.

Have you tried benchmarking our suggestion of a single "copy_to" field?
Focus your efforts on minimising disk seeks. If you search a lot of indexed fields you are searching a lot of independent term collections and posting lists each requiring disk seeks. Minimise the number of indexed fields you search. Try SSDs to minimise the cost of each disk seek.
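For reference, a minimal sketch of the "copy_to" approach (the index and field names are hypothetical): each source field is copied at index time into a single catch-all field, so a free-text search hits one indexed field instead of thousands.

```json
PUT my_index
{
  "mappings": {
    "properties": {
      "title":    { "type": "text", "copy_to": "all_text" },
      "body":     { "type": "text", "copy_to": "all_text" },
      "all_text": { "type": "text" }
    }
  }
}
```

A free-text search then becomes a single-field query, e.g. `{ "query": { "match": { "all_text": "test" } } }`.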


Thanks for your inputs Mark,

"copy_to" doesn't support features like Highlight which is a must in my use case.
Although, I have benchmarked with the following parameters.

  1. A single index with 5000 fields: a free-text search (i.e., a search on all fields at once) takes around 15 s.
  2. 10 indexes with 500 fields each (unique at the index level): a free-text search across all 10 indexes at once takes around 3 s.
     That is almost a 5x increase in time for the single index with 5000 fields.

Although the fields all add up either way, there is a difference in search speed between these two use cases.
Can you please explain why there is a difference? Or am I missing something?

It does. You just need to set require_field_match to false on the fields you highlight.
Highlighting is another operation that, not surprisingly, takes longer the more fields you throw at it.
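A sketch of what Mark describes, assuming a "copy_to" catch-all field named all_text and hypothetical source fields title and body: you query the combined field but highlight the original fields by disabling require_field_match.

```json
GET my_index/_search
{
  "query": { "match": { "all_text": "test" } },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "title": {},
      "body":  {}
    }
  }
}
```

With require_field_match set to false, the title and body fields are highlighted even though the query itself only targeted all_text.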

One is searching multiple indices in parallel. Of course, dividing fields into multiple indices will not help you if all the fields can exist in the same document and users need to search for docs with "field A AND field B" type queries.

Thanks Mark for the quick response.

Neither of the above scenarios applies in our case: a single document is specific to only one index. So, based on our benchmarking results described earlier, we are planning to divide our single index of 5000 unique fields into multiple indices of 500 unique fields each. The only reason we originally went with a single index containing 5000 fields was that we didn't want a lot of indices.

However, I am still not able to understand the 15 s figure, so let me explain my situation. My hardware configuration is a single cluster with one master node and two data nodes; each instance has 16 GB RAM and an 8 GB heap.

  1. We have created an index with 5000 fields. Each indexed document contains 100 to 150 fields at most, and we have indexed a total of only 500 documents. But the results for a free-text search on all fields are not satisfactory: to be precise, around 15 s, which is very slow. Frankly speaking, I expected it to be in milliseconds. I am still curious why such a simple query takes this much time.

  2. On the same index, my search was slow even after specifying the exact document ID as a filter parameter. A document ID is unique per document, so I understand this filter should narrow the search down to a single document, and the time taken should be in milliseconds only; however, what I got was around 9 s. I would like to know why there is so little variance with and without the filter.

Given you were talking about highlighting earlier my money's on that.
Probably needs some JSON query examples to know more.

Thanks Mark for the response.
We are not doing any highlighting as part of this benchmarking exercise.

{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "query": "\"test\"",
                  "fields": [],
                  "boost": "2000"
                }
              },
              {
                "multi_match": {
                  "query": "test",
                  "type": "phrase",
                  "boost": "500",
                  "slop": "1",
                  "fields": []
                }
              },
              {
                "multi_match": {
                  "query": "test",
                  "type": "phrase",
                  "boost": "200",
                  "slop": "10000",
                  "fields": []
                }
              },
              {
                "multi_match": {
                  "query": "test ",
                  "boost": "10",
                  "operator": "and",
                  "analyzer": "whitespace",
                  "fuzziness": "AUTO:4,7",
                  "prefix_length": 1,
                  "max_expansions": 2,
                  "fields": []
                }
              },
              {
                "multi_match": {
                  "query": "test ",
                  "boost": "10",
                  "analyzer": "whitespace",
                  "fields": []
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ],
      "minimum_should_match": 1,
      "filter": [
        {
          "terms": {
            "field1.keyword": ["30907cd80a174ad68dd0a2a2acdcd80e"]
          }
        }
      ]
    }
  }
}

This is the JSON query used in my benchmarking exercise. My requirement states that an exact match should have the highest priority, followed by a phrase with one slop, a phrase with high slop, and fuzzy matches.