Write Performance at Scale

jdebois · May 22, 2018, 8:50pm

Problem:

Updating 100,000 records at once takes 30 minutes to complete

Environment:

2 Elastic VMs
16VCPU 60GB RAM each
1 TB SSD each
10G connection between nodes
Xms Xmx set to 24G
"refresh_interval": "30s",
"number_of_shards": "5",
"number_of_replicas": "1",
Index_buffer_size =30%
Memory_lock =true

Data:

1.8M records
1,030 field mappings per record
80kb average record size

Things we've tried but doesn't work:

Using the Bulk API for all 100,000 records
Using the Bulk API at increments of 500, 1,000, 10,000
Adjusting refresh interval from 1s to 1 minute (5s, 15s, 30s)
update_by_query - with slices and no slices (this is slower than the Bulk API)

Questions/Comments:

Is there any amount of tuning that can fix these performance issues? Or do we have to split the index out into read-only and dynamic field mappings?
We are enabling the source field in order to allow dynamic updates - some people have suggested turning this off but it's not an option for us.
We have gone through the Elasticsearch documentation and followed all of the tuning steps

loren · May 23, 2018, 12:30am

Specifically you are talking about update performance at scale, correct? Versus indexing documents for the first time?

In my experience, updates do not perform at scale. If you are going to update a lot, then updating the smallest possible document will help. If you are updating the same document multiple times, do this in memory and write out the change less often.

jdebois · May 23, 2018, 1:02am

Yes, updating, as opposed to indexing for the first time.

If we split the index out into 2 different indexes - one with the fields that will (almost) never change, and the other with fields that will change often - what's the best way to query both indexes at the same time? Server side joins?

We considered modeling it as parent-child in elastic, with the read-only data as the parent and the writable as the child but the documentation said that both can't be queried at the same time.

loren · May 23, 2018, 3:24pm

I guess you could do server-side joins but I've never done this in practice. You'd take something like this:

PUT never_updated/doc/1
{
  "field1": "foo",
  "field2": "bar"
}
PUT often_updated/doc/1
{
  "count": 3,
  "total": 100
}

GET never_updated,often_updated/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "field1": "foo"
          }
        },
        {
          "term": {
            "count": 3
          }
        }
      ]
    }
  }
}

and then zip up the results _source by _id:

  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "often_updated",
        "_type": "doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "count": 3,
          "total": 100
        }
      },
      {
        "_index": "never_updated",
        "_type": "doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "field1": "foo",
          "field2": "bar"
        }
      }
    ]
  }

Other than throwing hardware at the problem (lots of smaller SSDs), I can't think of a better way using Elasticsearch.

jdebois · May 24, 2018, 6:49pm

Makes sense. Thanks for the info

system · June 21, 2018, 6:56pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Bulk update is too slow elasticsearch 6.2 Elasticsearch	25	6987	June 4, 2018
Elasticsearch bulk update is extremely slow Elasticsearch	11	11973	April 10, 2017
Indexing Performance, Threads + Bulk Size Elasticsearch	2	409	July 6, 2017
Bulk update performance Elasticsearch	1	945	January 9, 2019
Poor performance updating Elasticsearch	7	1292	July 6, 2017

Write Performance at Scale

Related topics