Adding 1 million+ aliases is really slow


(Michał Zgliczyński) #1

First of all, thank you for building ElasticSearch. It is a truly awesome
product.

I am trying to use the "User data flow". For this, I create a single index
and multiple aliases pointing to it. In my use case, I have about 5 million
aliases to add.

The alias structure roughly looks like this:
{
  'index' : 'index_name',
  'alias' : 'user_' + user_id,
  'filter' : {
    'term' : {
      'user' : user_id
    }
  },
  'routing' : 'r' + user_id
}
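
A minimal sketch of how alias definitions like the one above might be grouped into request bodies for the `_aliases` endpoint, assuming the usual `actions` / `add` body shape. The helper names `alias_action` and `alias_batches` are hypothetical and only serve the illustration; nothing is sent over the network here.

```python
import json
from itertools import islice


def alias_action(index, user_id):
    """Build one 'add' action for a filtered, routed per-user alias."""
    return {
        "add": {
            "index": index,
            "alias": f"user_{user_id}",
            "filter": {"term": {"user": user_id}},
            "routing": f"r{user_id}",
        }
    }


def alias_batches(index, user_ids, batch_size=5000):
    """Yield JSON request bodies, each holding up to batch_size actions."""
    it = iter(user_ids)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield json.dumps({"actions": [alias_action(index, u) for u in chunk]})


if __name__ == "__main__":
    # Each body would be POSTed to /_aliases by whatever HTTP client is in use.
    bodies = list(alias_batches("index_name", range(12000)))
    print(len(bodies), "request bodies")
```

Each yielded string is one batch, so the client only ever holds a single batch of actions in memory at a time.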

I create the index with these settings:
{
  "index_name" : {
    "settings" : {
      "index.number_of_replicas" : "1",
      "index.number_of_shards" : "100"
    }
  }
}

Adding aliases works reasonably well for up to about 100k aliases, but it
slows down for later updates.

The following timings are shown after creating an index and then adding
aliases. No other operations were performed during that time on the
cluster and index.
These are the times needed to send and add aliases in batches of 5000:
batch: 5000 - time: 2311ms
batch: 5000 - time: 4096ms
batch: 5000 - time: 6022ms
batch: 5000 - time: 8127ms
batch: 5000 - time: 10174ms
batch: 5000 - time: 11403ms
batch: 5000 - time: 13126ms
batch: 5000 - time: 14335ms
batch: 5000 - time: 16500ms
batch: 5000 - time: 20663ms
batch: 5000 - time: 23002ms
batch: 5000 - time: 24457ms
batch: 5000 - time: 26375ms
batch: 5000 - time: 28984ms
batch: 5000 - time: 30559ms
batch: 5000 - time: 32234ms
batch: 5000 - time: 35098ms
batch: 5000 - time: 38922ms
batch: 5000 - time: 41776ms
batch: 5000 - time: 53402ms
batch: 5000 - time: 58600ms
batch: 5000 - time: 65567ms
batch: 5000 - time: 79885ms
batch: 5000 - time: 89900ms
batch: 5000 - time: 89368ms
batch: 5000 - time: 104109ms

As you can see, it gradually slows down. Is this expected? It looks like the
time per batch grows linearly with the number of aliases. Is that correct?
Thanks!
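
To put numbers on the question, here is a rough sketch that fits a straight line to the batch timings listed above and extrapolates to 5 million aliases. The linear model is an assumption: the last few batches grow faster than a straight line would suggest, so the projection is only an order-of-magnitude guide. If batch k costs roughly slope × k, the total over n batches is slope × n(n + 1)/2, which means the total time grows quadratically.

```python
# Batch timings in milliseconds, copied from the list above.
TIMINGS_MS = [
    2311, 4096, 6022, 8127, 10174, 11403, 13126, 14335, 16500, 20663,
    23002, 24457, 26375, 28984, 30559, 32234, 35098, 38922, 41776, 53402,
    58600, 65567, 79885, 89900, 89368, 104109,
]
BATCH_SIZE = 5000


def fit_slope(ys):
    """Least-squares slope of ys against batch number 1..n (ms per batch)."""
    n = len(ys)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def projected_total_ms(slope_ms, batches):
    """Total time if batch k costs slope_ms * k (assumed linear growth)."""
    return slope_ms * batches * (batches + 1) / 2


if __name__ == "__main__":
    slope = fit_slope(TIMINGS_MS)
    total_batches = 5_000_000 // BATCH_SIZE
    done = len(TIMINGS_MS) * BATCH_SIZE
    print(f"measured: {sum(TIMINGS_MS) / 60_000:.1f} min for {done} aliases")
    print(f"fitted growth: {slope:.0f} ms per additional batch")
    days = projected_total_ms(slope, total_batches) / 86_400_000
    print(f"projected total for {total_batches} batches: {days:.1f} days")
```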

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d8a03b09-9ed2-49c7-9ca8-a2285478d933%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

I have never seen that number of aliases. Does that mean you have 5 million users?
Nice project :wink:

My guess is that the cluster state is getting so big that it takes more and more time to update it and publish it to all nodes.
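
A rough way to gauge how much the aliases alone might add to the cluster state is to measure one alias definition as JSON and multiply. This is only an order-of-magnitude sketch: the field names `index_routing` and `search_routing` and the sample user id are assumptions, and the format Elasticsearch uses internally to store and publish the state differs from plain JSON.

```python
import json


def alias_entry_bytes(user_id):
    """Approximate JSON size of one alias definition (filter plus routing)."""
    entry = {
        f"user_{user_id}": {
            "filter": {"term": {"user": user_id}},
            "index_routing": f"r{user_id}",
            "search_routing": f"r{user_id}",
        }
    }
    return len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))


def estimated_alias_bytes(num_aliases, sample_id=1_234_567):
    """Rough total, assuming every alias is about as large as the sample."""
    return num_aliases * alias_entry_bytes(sample_id)


if __name__ == "__main__":
    total = estimated_alias_bytes(5_000_000)
    print(f"roughly {total / 1_000_000:.0f} MB of alias definitions")
```

If every alias update means the whole state has to be rebuilt and sent to each node, a state of that size would explain why each batch takes longer than the one before.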

By the way, how many nodes do you have for those 200 shards?

Do you see anything in the logs?

Thinking out loud:
I wonder whether some kind of alias template could help here to keep the cluster state small.
Something like exactly what you describe:
{
  'index' : 'index_name',
  'alias' : 'user_{user_id}',
  'filter' : {
    'term' : {
      'user' : '{user_id}'
    }
  },
  'routing' : 'r{user_id}'
}

It looks somewhat similar to what Luca just did in https://github.com/elasticsearch/elasticsearch/pull/5180
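
To make the template idea concrete, here is a small client-side sketch that expands such a template into a full alias definition for one user. It is purely hypothetical: `ALIAS_TEMPLATE` and `render` are illustrative names, and nothing here is an existing Elasticsearch feature.

```python
# Hypothetical template, mirroring the structure sketched above.
ALIAS_TEMPLATE = {
    "index": "index_name",
    "alias": "user_{user_id}",
    "filter": {"term": {"user": "{user_id}"}},
    "routing": "r{user_id}",
}


def render(node, user_id):
    """Return a copy of node with {user_id} replaced in every string value."""
    if isinstance(node, dict):
        return {key: render(value, user_id) for key, value in node.items()}
    if isinstance(node, str):
        return node.replace("{user_id}", str(user_id))
    return node


if __name__ == "__main__":
    print(render(ALIAS_TEMPLATE, 42))
```

Note that plain string substitution turns the filter value into the string "42" and not the number 42; whether that matters depends on how the `user` field is mapped.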

Does anyone else have an idea?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Michał Zgliczyński) #3

Currently this runs on 4 nodes, with the possibility of adding more.
I get nothing in the logs.
I don't completely understand how
https://github.com/elasticsearch/elasticsearch/pull/5180 applies to my
use case.
Ideally, I would like not to create so many aliases, but to have a
template do the work for me. The template could hold the filtering and
routing data. It could be very simple: a request to
host:9200/user_{id} would automatically match the template
"user_*" and use its options. This would also greatly simplify my work
later on: when a new user appears while the server is running, the
template's settings would apply automatically, instead of me checking
whether the alias exists and then adding it.

This would allow me to create 1 template instead of so many similar
aliases. Or maybe this is already implemented?
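
Until something like that exists, the same effect can be approximated on the client by resolving a `user_{id}` target into the index, routing value, and filter at request time, with no alias stored in the cluster. The sketch below is hypothetical: `resolve` and `search_request` are illustrative names, user ids are assumed to be numeric, and the `bool` query with a `filter` clause assumes an Elasticsearch version that supports it (older versions used a different query for filtering).

```python
import re

# Assumption: targets look like "user_<digits>".
USER_PATTERN = re.compile(r"user_(\d+)")


def resolve(target, index="index_name"):
    """Map 'user_<id>' to index, routing and filter; None if no match."""
    match = USER_PATTERN.fullmatch(target)
    if match is None:
        return None
    user_id = int(match.group(1))
    return {
        "index": index,
        "routing": f"r{user_id}",
        "filter": {"term": {"user": user_id}},
    }


def search_request(target, query):
    """Build the path and body of a search equivalent to querying the alias."""
    resolved = resolve(target)
    if resolved is None:
        raise ValueError(f"not a per-user target: {target!r}")
    path = f"/{resolved['index']}/_search?routing={resolved['routing']}"
    body = {"query": {"bool": {"must": query, "filter": resolved["filter"]}}}
    return path, body


if __name__ == "__main__":
    print(search_request("user_42", {"match_all": {}}))
```

The trade-off is that every client has to apply the same rule consistently, whereas an alias enforces the filter and routing on the server side.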

Thanks!



(David Pilato) #4

This is exactly what I tried to describe with:

Something like exactly what you describe:
{
  'index' : 'index_name',
  'alias' : 'user_{user_id}',
  'filter' : {
    'term' : {
      'user' : '{user_id}'
    }
  },
  'routing' : 'r{user_id}'
}

It does not exist yet (or I missed it), but I think it would make a nice feature request.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


