_refresh operation is very slow on my index

ES version: 6.5.1 (Docker)
Cluster machines: 3 × 16 cores, 64 GB RAM

status:
I have an index with 196 million docs and a total size of 377 GB. Mapping:

{
  "mapping": {
    "t": {
      "properties": {
....
        "briefIntroduction": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "business": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "businessId": {
          "type": "long"
        },
        "businessScope": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "capitalUnit": {
          "type": "keyword"
        },
        "cityCode": {
          "type": "keyword"
        },
        "clueSource": {
          "type": "long"
        },
        "clue_relation": {
          "type": "join",
          "eager_global_ordinals": true,
          "relations": {
            "company": [
              "product",
              "to_clue",
              "statistics",
              "browse"
            ]
          }
        },
        "companyId": {
          "type": "long"
        },
        "companyName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          },
          "analyzer": "index_ansj"
        },
        "companyNature": {
          "type": "long"
        },
        "companyOrgType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "companyScale": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "companyScore": {
          "type": "long"
        },
        "competingCount": {
          "type": "long"
        },
        "createTime": {
          "type": "date"
        },
        "dataType": {
          "type": "long"
        },
        "detailUrl": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "enterpriseId": {
          "type": "long"
        },
        "establishTime": {
          "type": "date"
        },
        "fromTime": {
          "type": "date"
        },
        "fromUserId": {
          "type": "long"
        },
        "fromUserName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "id": {
          "type": "long"
        },
        "industryCode": {
          "type": "keyword"
        },
        "intro": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
....
      }
    }
  }
}

problem:
After inserting about 30 docs, I found that they still could not be found by search after 10 seconds, so I called the 'clue-test/_refresh' API manually. It took 10+ seconds to respond, sometimes 20+ seconds. In the meantime, the only log message I see is "overhead, spent [490ms] collecting in the last [1s]", and I don't know what is happening during that time.

I guess the join field type affects performance, but I don't know why. Can anyone suggest a way to find out what happens during the refresh?

other info:
CPU usage is higher during that time; disk is OK.

Please help, thanks very much

The docs have some info on this: https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html#_global_ordinals
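For reference, a minimal sketch of a mapping-update body that turns off eager loading on the join field from the question. The field name `clue_relation`, the type name `t`, the index name `clue-test` and the relation list are taken from the post above; the endpoint shown in the comment and the assumption that this setting can be changed on an existing 6.x index should be checked against the docs for your version. The snippet only builds and prints the JSON body, it does not contact a cluster.

```python
import json

# Body for a request such as: PUT clue-test/_mapping/t  (assumed 6.x typed endpoint)
# The relations are repeated unchanged; only eager_global_ordinals differs.
mapping_update = {
    "properties": {
        "clue_relation": {
            "type": "join",
            "eager_global_ordinals": False,
            "relations": {
                "company": ["product", "to_clue", "statistics", "browse"]
            },
        }
    }
}

# Serialize to the JSON text that would be sent as the request body.
request_body = json.dumps(mapping_update, indent=2)
print(request_body)
```

With eager loading off, global ordinals are built lazily by the first query that needs them after a refresh, so the cost moves from refresh time to search time rather than disappearing.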

I see, thanks very much. I have turned off 'eager_global_ordinals'.
But I still have another question.
In the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/master/eager-global-ordinals.html#eager-global-ordinals), it says:
"To support aggregations and other operations that require looking up field values on a per-document basis"
In my view, a parent-child relation is just like the parent doc having a list field that contains its child docs, so inserting a doc should only affect the given parent doc. I think it's unnecessary to update all docs. From another perspective, when a query uses has_parent or has_child, I would expect ES to run the parent filter, run the child filter, and then compute the Cartesian product. I don't understand when 'eager_global_ordinals' is needed, or what the effect is after I turn it off.

Maybe a stupid question, thanks very much anyway.

We use global ordinals to quickly check whether two documents have the same parent or child, that is, whether they are linked. At query time, global ordinals let us avoid the big map that would otherwise be needed to keep track of every parent/child we've seen. We do resolve the parent first and then the child, like you suggested, but to keep track of the parent ids we use a bitset that records which global ordinals were visited during the first phase. Using a map would make queries that match a lot of documents very costly in terms of memory, so we compute this extra data structure to ensure that queries can run seamlessly.

Global ordinals are not used at index time, so eager_global_ordinals only means that we build them eagerly when publishing a new searcher (on refresh). The rebuild is costly because we have to start from scratch every time the index is refreshed. One thing I can think of to speed this up would be to allow an incremental rebuild of this data structure, but that's not possible currently because we guarantee that a value greater than another is also assigned a greater global ordinal. This guarantee is not really needed here, since we don't rely on it in search, only in the terms aggregation. An incremental rebuild sounds much more appealing than the map execution, mainly because we would retain the performance we have today with global ordinals while speeding up every build after the initial one.
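To make the idea above concrete, here is a toy sketch in plain Python. It is not Elasticsearch or Lucene code: the function names, the segment layout and the document tuples are all hypothetical, and a `bytearray` stands in for the bitset. It shows per-segment ordinals being merged into sorted global ordinals, a two-phase join that marks visited ordinals, and how a new segment shifts existing ordinals when sort order must be preserved.

```python
def build_global_ordinals(segment_terms):
    """Merge per-segment sorted join keys into one sorted global ordinal space."""
    global_terms = sorted({term for terms in segment_terms for term in terms})
    position = {term: ordinal for ordinal, term in enumerate(global_terms)}
    # For each segment: segment ordinal -> global ordinal.
    seg_to_global = [[position[term] for term in terms] for terms in segment_terms]
    return global_terms, seg_to_global


def join_via_bitset(first_phase_docs, second_phase_docs, seg_to_global, num_ordinals):
    """Docs are (doc_id, segment, segment_ordinal) tuples."""
    visited = bytearray(num_ordinals)  # one flag per global ordinal
    for _doc_id, segment, segment_ord in first_phase_docs:
        visited[seg_to_global[segment][segment_ord]] = 1
    # Keep only second-phase docs whose join key was seen in the first phase.
    return [doc_id for doc_id, segment, segment_ord in second_phase_docs
            if visited[seg_to_global[segment][segment_ord]]]


# Two segments, each with its own sorted list of join keys.
segments = [["c1", "c3"], ["c2", "c3"]]
global_terms, seg_to_global = build_global_ordinals(segments)

matching_children = [("child-a", 0, 1), ("child-b", 1, 0)]   # keys c3 and c2
parents = [("p1", 0, 0), ("p2", 1, 0), ("p3", 1, 1)]         # keys c1, c2, c3
linked = join_via_bitset(matching_children, parents, seg_to_global, len(global_terms))
print(linked)

# A refresh adds a segment whose key sorts first: every existing global
# ordinal shifts, so the whole mapping has to be rebuilt.
_, seg_to_global_after = build_global_ordinals(segments + [["c0"]])
print(seg_to_global, seg_to_global_after)
```

The memory used by the join is one flag per distinct join key, regardless of how many documents match, which is the advantage over keeping a map of ids. The last two lines illustrate why the ordering guarantee blocks an incremental rebuild in this toy model.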