Search query is slow and first query always takes too much time

Hi, search queries are slow when I use should clauses (OR matching) with multiple search terms, and also when matching nested documents. The first query takes 7-10 seconds and later ones take 5-6 seconds due to Elasticsearch caching, whereas queries on non-nested fields with a plain match are fast, i.e. within 100 ms.

I'm running Elasticsearch on an AWS instance with 250 GB RAM and 500 GB disk space. I have one template and 204 indices holding around 107 million documents in total, with 2 shards per index on a single node, and I have set the heap size to 30 GB.

A document can have more than 50k nested objects, so I have increased the nested objects limit to 500k. Searching on these nested objects takes too much time, and any OR (should match) operation on non-nested fields is also slow. Is there any way I can boost query performance for nested objects? Or is there anything wrong in my configuration?
And is there any way I can make the first query faster as well?

Following is my sample mapping.

      {
      "index_patterns": [
        "product_*"
      ],
      "template": {
        "settings": {
          "index.store.type": "mmapfs",
          "number_of_shards":2,
          "number_of_replicas": 0,
          "index": {
            "store.preload": [
              "*"
            ],
            "mapping.nested_objects.limit": 500000,
            "analysis": {
              "analyzer": {
                "cust_product_name": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [
                    "lowercase",
                    "english_stop",
                    "name_wordforms",
                    "business_wordforms",
                    "english_stemmer",
                    "min_value"
                  ],
                  "char_filter": [
                    "html_strip"
                  ]
                },
                "entity_name": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [
                    "lowercase",
                    "english_stop",
                    "business_wordforms",
                    "name_wordforms",
                    "english_stemmer"
                  ],
                  "char_filter": [
                    "html_strip"
                  ]
                },
                "cust_text": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter": [
                    "lowercase",
                    "english_stop",
                    "name_wordforms",
                    "english_stemmer",
                    "min_value"
                  ],
                  "char_filter": [
                    "html_strip"
                  ]
                }
              },
              "filter": {
                "min_value": {
                  "type": "length",
                  "min": 2
                },
                "english_stop": {
                  "type": "stop",
                  "stopwords": "_english_"
                },
                "business_wordforms": {
                  "type": "synonym",
                  "synonyms_path": "<some path>/business_wordforms.txt"
                },
                "name_wordforms": {
                  "type": "synonym",
                  "synonyms_path": "<some path>/name_wordforms.txt"
                },
                "english_stemmer": {
                  "type": "stemmer",
                  "language": "english"
                }
              }
            }
          }
        },
        "mappings": {
          "dynamic": "strict",
          "properties": {
            "product_number": {
              "type": "text",
              "analyzer": "keyword"
            },
            "product_name": {
              "type": "text",
              "analyzer": "cust_product_name"
            },
            "first_fetch_date": {
              "type": "date",
              "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
            },
            "last_fetch_date": {
              "type": "date",
              "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
            },
            "review": {
              "type": "nested",
              "properties": {
                "text": {
                  "type": "text",
                  "analyzer": "cust_text"
                },
                "review_date": {
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||yyyy-MM||yyyy"
                }
              }
            }
          }
        },
        "aliases": {
          "all_products": {}
        }
      },
      "priority": 200,
      "version": 1
    }

If I search for any specific term in the review text, the response takes too much time.

{
    "_source":{
        "excludes":["review"]
    },
    "size":1,
    "track_total_hits":true,
    "query":{
        "nested":{
            "path":"review",
            "query":{
                "match":{
                    "review.text":{
                        "query":"good",
                        "zero_terms_query":"none"
                    }
                }
            }
        }
    },
    "highlight":{
        "pre_tags":[
            "<b>"
        ],
        "post_tags":[
            "</b>"
        ],
        "fields":{
            "product_name":{
                
            }
        }
    }
}

I'm sure I'm missing something obvious!

Perhaps worth revisiting your rationale for using nested docs:

The example query is not one that warrants the use of nested docs.

Thanks for the reply. Nested objects fit our data, but you say the example query is not one that warrants the use of nested docs. May I know why?

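Nested docs and queries are only ever needed when you query two or more fields of the same nested object together, e.g. text:good AND author:john. Without nested docs we have a problem called "cross matching", where the conditions can be satisfied by values taken from different objects in the array. This problem is explained in these slides that first proposed adding support to Lucene.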

If you only query one field at a time, then the object type can be used instead of nested, saving resources and reducing query complexity.
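
To make cross matching concrete, here is a minimal plain-Python sketch (no Elasticsearch involved). It assumes a hypothetical `author` field on each review, as in the text:good AND author:john example; the sample mapping has no such field. The first function mimics how an object field pools the values of each field across the array, and the second mimics nested, where each object is matched on its own.

```python
# Illustrative simulation only; the data and the `author` field are hypothetical.
reviews = [
    {"text": "good", "author": "mary"},
    {"text": "bad", "author": "john"},
]


def matches_flattened(objs, text, author):
    """Mimic the object type: values of each field are pooled per document."""
    texts = {o["text"] for o in objs}
    authors = {o["author"] for o in objs}
    return text in texts and author in authors


def matches_nested(objs, text, author):
    """Mimic the nested type: both conditions must hold within one object."""
    return any(o["text"] == text and o["author"] == author for o in objs)


flattened_hit = matches_flattened(reviews, "good", "john")
nested_hit = matches_nested(reviews, "good", "john")
print(flattened_hit, nested_hit)
```

The pooled version reports a match even though no single review is both "good" and written by john, which is exactly the false positive that nested docs avoid.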

We do query two or more fields together, for example review.text:good AND review.review_date:[now-2d TO now] AND product_name:abc. For this, a nested query seems to be the best approach. We also have some other fields that we kept as object, which are not shown in the sample data above.
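
For reference, the combined query described here could be built roughly as follows. This is only a sketch: `build_review_query` is a hypothetical helper, the field names come from the sample mapping, and putting the date range in a filter clause is my assumption (filter clauses do not contribute to scoring), not something taken from the original post.

```python
import json


def build_review_query(term, product_name, since="now-2d"):
    """Return a search body whose review conditions apply to the same nested review."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Condition on the parent (product) document.
                    {"match": {"product_name": product_name}},
                    {
                        "nested": {
                            "path": "review",
                            "query": {
                                "bool": {
                                    "must": [{"match": {"review.text": term}}],
                                    # Assumption: the date range does not need to affect scoring.
                                    "filter": [
                                        {
                                            "range": {
                                                "review.review_date": {
                                                    "gte": since,
                                                    "lte": "now",
                                                }
                                            }
                                        }
                                    ],
                                }
                            },
                        }
                    },
                ]
            }
        }
    }


body = build_review_query("good", "abc")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the all_products alias.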

Might that be a case for keeping a “reviews” index separate from the products index?

I had thought about it, but it means redundant product data in every review. I can have more than 50k reviews for a product, and if I split the data into a review index and a product index, I need the product details in each review; otherwise I can't search on product details together with reviews. I don't feel that's a good approach. Hope you get my point!

Your original question was about improving speed. Denormalisation improves query speed but is paid for with added disk space and cost of updates.
It comes down to physics. You just can't magic certain things to be faster without physically re-organising things.

That's a scary number of review objects in a single JSON document.

I agree. What is the optimal way for my use case, when you say re-organising things?

That's for you to determine. "Optimal" depends on the content of your queries/docs, any SLAs, disk costs, update latencies, rates of product changes etc.

One option is denormalizing where appropriate: copying small subsets of product data onto review documents in a "reviews" index.

It's hard to advise without knowing the costs of the various trade-offs required.
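
As a rough illustration of the denormalizing suggestion above, the sketch below turns one product document into flat review documents that each carry a small subset of product fields. The `denormalize` helper, the choice of copied fields, and the sample data are all hypothetical; which product fields are worth copying depends on the trade-offs mentioned.

```python
def denormalize(product, product_fields=("product_number", "product_name")):
    """Yield one flat review document per nested review, with a few product fields copied on."""
    subset = {f: product[f] for f in product_fields if f in product}
    for review in product.get("review", []):
        yield {
            **subset,
            "text": review.get("text"),
            "review_date": review.get("review_date"),
        }


# Hypothetical product document shaped like the sample mapping.
product = {
    "product_number": "P-1",
    "product_name": "abc",
    "first_fetch_date": "2021-01-01",
    "review": [
        {"text": "good", "review_date": "2021-01-01"},
        {"text": "bad", "review_date": "2021-01-02"},
    ],
}

review_docs = list(denormalize(product))
for doc in review_docs:
    print(doc)
```

Each resulting document could be indexed into a separate "reviews" index and searched with ordinary, non-nested queries, at the cost of duplicated product data and of rewriting the reviews whenever a copied product field changes.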

