Upsert and script on a large index cause the cluster to time out

Christophe_Verbinnen · December 8, 2014, 7:49pm

Hello,

We have a small cluster with 3 nodes running 1.3.6.

I have an index setup with only two fields.

      {
        index: index_name,
        body: {
          settings: {
              number_of_shards: 3,
              store: {
                type: :mmapfs
              }
          },
          mappings: {
            mapping_name => {
              properties: {
                :value => {type: 'string', analyzer: 'keyword'},
                :post_ids => {type: 'long', index: 'not_analyzed'}
              }
            }
          }
        }
      }

We are basically storing strings and all the posts they are related to.

The problem is that this data is not stored this way in the database so I
don't have an id to represent each string nor do I have all the post_ids
from the start.

So I use the SHA1 of the string value as the id, and I use a script to append
to the post_ids.

Here is the code I use to index through the bulk API endpoint.

      def index!
        post_ids = Post.where...
        bulk_data = []
        strings.uniq.each do |string|
          # Use the SHA1 of the string as a stable document id.
          string_id = Digest::SHA1.hexdigest string
          bulk_data << {
            update: {
              _index: 'post_strings',
              _type: 'post_string',
              _id: string_id,
              data: {
                script: "ctx._source.post_ids += additional_post_ids",
                params: {
                  additional_post_ids: post_ids
                },
                upsert: {
                  value: string,
                  post_ids: post_ids
                }
              }
            }
          }
          # Flush every 100 actions.
          if bulk_data.count == 100
            $elasticsearch.bulk :body => bulk_data
            bulk_data = []
          end
        end
        # Send whatever is left over.
        $elasticsearch.bulk :body => bulk_data if bulk_data.any?
      end
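For readers who want to try the batching logic without a cluster, here is a minimal sketch that only builds the bulk payloads. The helper names `build_update_action` and `each_bulk_batch` are hypothetical and not part of the original code; the action hash copies the shape used in the post, and nothing is sent to Elasticsearch.

```ruby
require 'digest'

# Hypothetical helper: builds one bulk "update" action for a string,
# using the same hash shape as the post above.
def build_update_action(string, post_ids)
  {
    update: {
      _index: 'post_strings',
      _type: 'post_string',
      _id: Digest::SHA1.hexdigest(string),
      data: {
        script: 'ctx._source.post_ids += additional_post_ids',
        params: { additional_post_ids: post_ids },
        upsert: { value: string, post_ids: post_ids }
      }
    }
  }
end

# Hypothetical helper: yields arrays of at most batch_size actions.
# Returns an Enumerator when called without a block.
def each_bulk_batch(strings, post_ids, batch_size = 100)
  return enum_for(:each_bulk_batch, strings, post_ids, batch_size) unless block_given?

  strings.uniq.each_slice(batch_size) do |slice|
    yield slice.map { |string| build_update_action(string, post_ids) }
  end
end

# Example with a tiny batch size so the slicing is visible.
batches = each_bulk_batch(%w[a b a c], [1, 2], 2).to_a
puts batches.map(&:size).inspect
```

In real use, each yielded batch would presumably be passed to the client the same way as in the post, for example `$elasticsearch.bulk :body => batch`. Using `each_slice` removes the need to reset the buffer by hand and to send the leftover batch separately.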

This worked fine for the first 75 million strings, but it kept getting
slower until it reached an indexing rate of only 50 docs per second.

After that the cluster just killed itself because the nodes couldn't talk
to each other.

I'm guessing all the threads were blocked trying to index and the nodes had no
available threads to respond.

At first I thought it was related to the SHA1 id not being very
efficient, but in my test with sequential ids it did not get any better.

I'm out of ideas right now. Any help would be greatly appreciated.

Cheers.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/82c27f2c-bf56-4064-80bc-b348203edcb5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nik9000 · December 8, 2014, 8:03pm

I'm not sure what is up, but remember that post_ids in the script is a list,
not a set. You might be growing it without bound.
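To make the list-versus-set point concrete, here is a small plain-Ruby illustration. It is not the Elasticsearch script itself, and the helper names `append_as_list` and `append_as_set` are made up for this sketch; it only shows how appending keeps duplicates while a union does not.

```ruby
# Hypothetical helper: list semantics, duplicates are kept on every append.
def append_as_list(existing, additional)
  existing + additional
end

# Hypothetical helper: set-like semantics, duplicates are dropped (Array#| is a union).
def append_as_set(existing, additional)
  existing | additional
end

stored   = [1, 2, 3]
incoming = [2, 3, 4]

as_list = append_as_list(stored, incoming)
as_set  = append_as_set(stored, incoming)

puts as_list.inspect # [1, 2, 3, 2, 3, 4]
puts as_set.inspect  # [1, 2, 3, 4]
```

If the same post ids are ever sent twice for a string, the list version grows on every update while the set version stays the same size.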

Christophe_Verbinnen · December 8, 2014, 10:08pm

I see what you mean, but given how my records are structured, that cannot
happen unless I reindex.

