The _refresh operation is very slow on my index

DENGBO_SUN · October 17, 2019, 8:51am

ES version: 6.5.1 (Docker)
Cluster machines: 3 × 16 cores, 64 GB RAM

status:
I have an index with 196 million docs and a total size of 377 GB. Mapping:

{
  "mapping": {
    "t": {
      "properties": {
....
        "briefIntroduction": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "business": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "businessId": {
          "type": "long"
        },
        "businessScope": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "capitalUnit": {
          "type": "keyword"
        },
        "cityCode": {
          "type": "keyword"
        },
        "clueSource": {
          "type": "long"
        },
        "clue_relation": {
          "type": "join",
          "eager_global_ordinals": true,
          "relations": {
            "company": [
              "product",
              "to_clue",
              "statistics",
              "browse"
            ]
          }
        },
        "companyId": {
          "type": "long"
        },
        "companyName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "analyzer": "index_ansj"
        },
        "companyNature": {
          "type": "long"
        },
        "companyOrgType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "companyScale": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "companyScore": {
          "type": "long"
        },
        "competingCount": {
          "type": "long"
        },
        "createTime": {
          "type": "date"
        },
        "dataType": {
          "type": "long"
        },
        "detailUrl": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "enterpriseId": {
          "type": "long"
        },
        "establishTime": {
          "type": "date"
        },
        "fromTime": {
          "type": "date"
        },
        "fromUserId": {
          "type": "long"
        },
        "fromUserName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "id": {
          "type": "long"
        },
        "industryCode": {
          "type": "keyword"
        },
        "intro": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
....
      }
    }
  }
}

problem:
After inserting about 30 docs, I found that they could not be searched for 10 seconds, so I invoked the 'clue-test/_refresh' API manually. It takes 10+ seconds to respond, sometimes 20+ seconds. In the meantime the only log I see is "overhead, spent [490ms] collecting in the last [1s]". I don't know what is happening during that time.

DENGBO_SUN · October 17, 2019, 9:02am

I guess the join type field affects the performance, but I don't know why. Can anyone tell me how to find out what happens during the refresh?

other info:
CPU usage is higher during that time; disk is fine.

Please help, thanks very much

ywelsch · October 17, 2019, 10:20am

The docs have some info on this: https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_global_ordinals
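For readers who do not follow the link: that section describes global ordinals on the join field being built eagerly on refresh, and turning eager loading off so that they are built on first use by a parent-join query or aggregation instead. A minimal sketch of the corresponding mapping body for the `clue_relation` field from the question is below. The helper name is hypothetical, the request itself (for example a put-mapping call against `clue-test`, type `t`) is not shown, and whether an existing index accepts this change should be checked against the documentation for the version in use.

```python
import json


def join_mapping_without_eager_ordinals(field, relations):
    """Build a mapping body for a join field with eager loading switched off.

    Hypothetical helper: it only assembles the JSON body, it does not
    talk to a cluster.
    """
    return {
        "properties": {
            field: {
                "type": "join",
                "eager_global_ordinals": False,
                "relations": relations,
            }
        }
    }


# Field name and relations are copied from the mapping in the question.
body = join_mapping_without_eager_ordinals(
    "clue_relation",
    {"company": ["product", "to_clue", "statistics", "browse"]},
)
print(json.dumps(body, indent=2))
```

With eager loading off, the cost does not disappear: it moves from each refresh to the first parent-join query or aggregation that runs after a refresh.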

DENGBO_SUN · October 18, 2019, 2:02am

I see, thanks very much. I have turned off 'eager_global_ordinals'.
But I still have another question.
The documentation (https://www.elastic.co/guide/en/elasticsearch/reference/master/eager-global-ordinals.html#eager-global-ordinals) says:
"To support aggregations and other operations that require looking up field values on a per-document basis"
In my view, a parent-child relation is just like the parent doc having a list field that holds its child docs, so inserting a doc should only affect the given parent doc; updating all docs seems unnecessary. From another perspective, for a has_parent or has_child query, I would expect ES to apply the parent filter and the child filter and then take the cartesian product. I don't understand when 'eager_global_ordinals' is needed, or what the effect is now that I have turned it off.

Maybe a stupid question, thanks very much anyway.

jimczi · October 18, 2019, 7:32pm

We use global ordinals to quickly check whether two documents have the same parent/child, i.e. whether they are linked. At query time, global ordinals let us avoid the big map that would otherwise be needed to keep track of all the parents/children we've seen. We do resolve the parent first and then the child, as you suggested, but to keep track of the parent ids we use a bitset that records which global ordinals were visited during the first phase. Using a map would make queries that match a lot of documents very costly in terms of memory, so we compute this extra data structure to ensure that queries can run seamlessly.

Global ordinals are not used at index time, so eager_global_ordinals only means that we eagerly build them when publishing a new searcher (on refresh). The rebuild is a costly operation since we need to restart from scratch every time there is a refresh in the index.

One thing I can think of to speed up this process would be to allow incremental rebuilds of this data structure, but that's not possible currently since we guarantee that a value greater than another will also be assigned a greater global ordinal. This guarantee is not really needed for joins, since we don't use this property in search but only in terms aggregations. An incremental rebuild sounds much more appealing than the map execution, mainly because we would retain the performance that we have today with global ordinals while speeding up every build after the initial one.
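To make the explanation above concrete, here is a toy sketch in Python. It is a simplification under stated assumptions, not Elasticsearch's actual implementation: segments are modelled as plain lists of parent ids, the function names are hypothetical, and a `bytearray` stands in for the bitset. It shows two things: ordinals follow the sort order of the values, so a new segment can shift every existing ordinal, and a has_child-style query can track visited parents with one slot per ordinal instead of a map of ids.

```python
from typing import Dict, List


def build_global_ordinals(segments: List[List[str]]) -> Dict[str, int]:
    """Assign every distinct join value an ordinal, in sorted order,
    across all segments. Rebuilt from scratch on each call."""
    distinct = sorted({value for segment in segments for value in segment})
    return {value: ordinal for ordinal, value in enumerate(distinct)}


def parents_with_matching_child(
    matching_child_parent_ids: List[str],
    candidate_parents: List[str],
    ordinals: Dict[str, int],
) -> List[str]:
    """Two-phase join: mark the ordinals of parents referenced by matching
    children, then keep the candidate parents whose ordinal is marked."""
    visited = bytearray(len(ordinals))  # stand-in for a bitset
    for parent_id in matching_child_parent_ids:
        visited[ordinals[parent_id]] = 1
    return [p for p in candidate_parents if visited[ordinals[p]]]


# Two existing segments, then a refresh that publishes a third one.
before = build_global_ordinals([["c2", "c5"], ["c3"]])
after = build_global_ordinals([["c2", "c5"], ["c3"], ["c1"]])

# "c1" sorts first, so every previously assigned ordinal moves up by one.
shifted = [value for value in before if before[value] != after[value]]

# Children that matched the child query point at parents c5, c2 and c5.
hits = parents_with_matching_child(["c5", "c2", "c5"], ["c2", "c3", "c5"], after)
print(before, after, shifted, hits)
```

Because the ordering guarantee ties each ordinal to the value's rank among all values, a single new value can renumber everything after it, which is why the sketch rebuilds the whole table rather than appending to it.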

system · November 15, 2019, 7:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
