Write Performance at Scale

Problem:

Updating 100,000 records at once takes 30 minutes to complete

Environment:

2 Elastic VMs
16 vCPUs, 60 GB RAM each
1 TB SSD each
10 Gb connection between nodes
JVM heap: -Xms and -Xmx set to 24g
"refresh_interval": "30s"
"number_of_shards": "5"
"number_of_replicas": "1"
indices.memory.index_buffer_size: 30%
bootstrap.memory_lock: true

Data:

1.8M records
1,030 field mappings per record
80 KB average record size

Things we've tried that don't work:

  1. Using the Bulk API for all 100,000 records
  2. Using the Bulk API in increments of 500, 1,000, and 10,000 (sketched below)
  3. Adjusting the refresh interval from 1s to 1 minute (5s, 15s, 30s)
  4. update_by_query, with and without slices (this is slower than the Bulk API)
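
For reference, the chunked bulk updates (items 1 and 2) looked roughly like this with the Python elasticsearch client - a minimal sketch where the index name my_index, the id and changed_fields keys, and the chunk size are placeholders, and the typed "doc" API matches the examples later in this thread:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def chunked_bulk_update(records, chunk_size=1000):
    # Each action is a partial update: only the changed fields are sent,
    # although Elasticsearch still rewrites the whole document internally.
    actions = (
        {
            "_op_type": "update",
            "_index": "my_index",          # placeholder index name
            "_type": "doc",                # matches the typed examples below
            "_id": record["id"],
            "doc": record["changed_fields"],
        }
        for record in records
    )
    # helpers.bulk splits the action stream into chunk_size-sized _bulk requests
    return helpers.bulk(es, actions, chunk_size=chunk_size)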

Questions/Comments:

  1. Is there any amount of tuning that can fix these performance issues? Or do we have to split the index out into read-only and dynamic field mappings?
  2. We keep the _source field enabled to allow partial updates - some people have suggested turning it off, but that's not an option for us.
  3. We have gone through the Elasticsearch documentation and followed all of the tuning steps.

Specifically, you are talking about update performance at scale, correct? As opposed to indexing documents for the first time?

In my experience, updates do not perform well at scale. If you are going to update a lot, updating the smallest possible document will help. If you are updating the same document multiple times, coalesce the changes in memory and write them out less often.
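
A minimal sketch of that coalescing idea, reusing the hypothetical my_index from the sketch above - when to call flush (timer, queue size, etc.) is up to you:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
pending = {}  # doc _id -> merged partial document

def record_change(doc_id, changed_fields):
    # Later changes to the same document overwrite earlier ones in memory,
    # so repeated updates collapse into a single write.
    pending.setdefault(doc_id, {}).update(changed_fields)

def flush():
    # One bulk request, at most one update per document
    actions = (
        {"_op_type": "update", "_index": "my_index", "_type": "doc",
         "_id": doc_id, "doc": fields}
        for doc_id, fields in pending.items()
    )
    helpers.bulk(es, actions)
    pending.clear()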

Yes, updating, as opposed to indexing for the first time.

If we split the index into two different indexes - one with the fields that will (almost) never change, and the other with fields that change often - what's the best way to query both indexes at the same time? Server-side joins?

We considered modeling it as parent-child in Elasticsearch, with the read-only data as the parent and the writable data as the child, but the documentation said that both can't be queried at the same time.

I guess you could do server-side joins, but I've never done this in practice. You'd index something like this:

PUT never_updated/doc/1
{
  "field1": "foo",
  "field2": "bar"
}
PUT often_updated/doc/1
{
  "count": 3,
  "total": 100
}

GET never_updated,often_updated/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "field1": "foo"
          }
        },
        {
          "term": {
            "count": 3
          }
        }
      ]
    }
  }
}

and then zip up each hit's _source by _id on the client side:

  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "often_updated",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "count": 3,
          "total": 100
        }
      },
      {
        "_index": "never_updated",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "field1": "foo",
          "field2": "bar"
        }
      }
    ]
  }
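
Client-side, the zip step could be as simple as merging _source by _id - a sketch that assumes field names don't collide across the two indexes:

def zip_hits_by_id(response):
    # Merge the _source of hits that share an _id into one combined document
    merged = {}
    for hit in response["hits"]["hits"]:
        merged.setdefault(hit["_id"], {}).update(hit["_source"])
    return merged

With the response above this yields {"1": {"count": 3, "total": 100, "field1": "foo", "field2": "bar"}}.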

Other than throwing hardware at the problem (lots of smaller SSDs), I can't think of a better way using Elasticsearch.

Makes sense. Thanks for the info.
