Upsert and script on a large index cause the cluster to time out

Hello,

We have a small cluster with 3 nodes running 1.3.6.

I have an index setup with only two fields.

      {
        index: index_name,
        body: {
          settings: {
              number_of_shards: 3,
              store: {
                type: :mmapfs
              }
          },
          mappings: {
            mapping_name => {
              properties: {
                :value => {type: 'string', analyzer: 'keyword'},
                :post_ids => {type: 'long', index: 'not_analyzed'}
              }
            }
          }
        }
      }

We are basically storing strings and all the posts they are related to.

The problem is that this data is not stored this way in the database so I
don't have an id to represent each string nor do I have all the post_ids
from the start.

So I use the SHA1 of the string value as the id and I use a script to append
to the post_ids.
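
As a minimal sketch of the id scheme described above: hashing the string with SHA1 gives the same 40-character hex id every time, so a repeated string targets the same document instead of creating a new one. The helper name `string_id_for` is hypothetical and only used for illustration.

```ruby
require 'digest/sha1'

# Hypothetical helper: derive a stable document id from the string value.
def string_id_for(string)
  Digest::SHA1.hexdigest(string)
end

first  = string_id_for('some string')
second = string_id_for('some string')
other  = string_id_for('another string')

puts first == second  # same input, same id
puts first == other   # different input, different id
puts first.length     # SHA1 hex digests are 40 characters long
```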

Here is the code I use to index through the bulk API endpoint.

      def index!
        # Query left elided, as in the original post.
        post_ids = Post.where...
        bulk_data = []
        strings.uniq.each do |string|
          # The SHA1 of the string value is used as the document id.
          string_id = Digest::SHA1.hexdigest string
          bulk_data << {
            update: {
              _index: 'post_strings',
              _type: 'post_string',
              _id: string_id,
              data: {
                # Append to post_ids if the document exists, otherwise create it.
                script: "ctx._source.post_ids += additional_post_ids",
                params: {
                  additional_post_ids: post_ids
                },
                upsert: {
                  value: string,
                  post_ids: post_ids
                }
              }
            }
          }
          # Flush every 100 actions.
          if bulk_data.count == 100
            $elasticsearch.bulk :body => bulk_data
            bulk_data = []
          end
        end
        # Send whatever is left over.
        $elasticsearch.bulk :body => bulk_data if bulk_data.any?
      end

So this worked fine for the first 75 million strings, but it kept getting
slower until the indexing rate dropped to only 50 docs per second.

After that the cluster just killed itself because the nodes couldn't talk
to each other.

I'm guessing all the threads were blocked trying to index and the nodes had
no available threads to respond.

At first I thought it was related to the SHA1 ids not being very
efficient, but in my test with sequential ids it did not get any better.

I'm out of ideas right now. Any help would be greatly appreciated.

Cheers.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/82c27f2c-bf56-4064-80bc-b348203edcb5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I'm not sure what is up but remember that post_ids in the script is a list
not a set. You might be growing it without bounds.
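
To illustrate the list-versus-set point, here is a small sketch in plain Ruby. It only models the two merge behaviours on arrays and is not the Elasticsearch script itself; the helper names `append_as_list` and `append_as_set` are hypothetical. It assumes the same batch of ids is sent twice, for example when the indexer is re-run.

```ruby
# Models list-style appending: existing ids are kept and the new ones are
# added at the end, duplicates included.
def append_as_list(existing_ids, additional_ids)
  existing_ids + additional_ids
end

# Hypothetical alternative: a set-style merge that only adds missing ids.
def append_as_set(existing_ids, additional_ids)
  existing_ids | additional_ids
end

stored = [1, 2, 3]
batch  = [3, 4]

# Apply the same batch twice with each strategy.
as_list = append_as_list(append_as_list(stored, batch), batch)
as_set  = append_as_set(append_as_set(stored, batch), batch)

p as_list  # => [1, 2, 3, 3, 4, 3, 4]
p as_set   # => [1, 2, 3, 4]
```

With list-style appending the stored array grows on every repeated update, while the set-style merge stays the same size once all ids are present.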
(In reply to the original post by Christophe Verbinnen, Dec 8, 2014, 2:49 PM.)

I see what you mean, but given the way my records are, that cannot happen
unless I reindex.

On Monday, December 8, 2014 at 12:05:13 UTC-8, Nikolas Everett wrote:

I'm not sure what is up but remember that post_ids in the script is a list
not a set. You might be growing it without bounds.