Importing Large Amounts of Data to Production Indices

webish · January 28, 2015, 12:45am

I have some production indices that need to have a large amount of data
imported into them fairly frequently. Each time we import data, the ES
nodes become a huge bottleneck. I honestly expected a lot better
performance out of them. Regardless, I would like to import data in a
production ES setup with the least amount of interruption or performance
issues.

What are some options I can take to import large quantities of data without
affecting data that is already being used by applications?

I was thinking I could use a combination of aliases or temp indices to
migrate the data over...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/410c454e-7e8d-4f1b-b70a-68e18fa7c732%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

warkolm · January 28, 2015, 1:02am

How much data are you talking about? Are you using the bulk API? What is
your bulk sizing?

You can also set an index to not refresh while you ingest into it
(index.refresh_interval = -1), then once the data has been sent to ES turn
refresh back on.
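
As a rough sketch of the refresh toggle described above: the setting lives under `index.refresh_interval` and is changed with a `PUT` to the index's `_settings` endpoint. The index name `products` and the helper `refresh_settings_request` are made up for illustration, and the snippet only builds the requests; sending them with whatever HTTP client you already use is left out.

```python
import json


def refresh_settings_request(index, interval):
    """Return (method, path, body) for changing an index's refresh interval.

    `index` is a hypothetical index name; `interval` is a string such as
    "-1" (refresh disabled) or "1s" (the default).
    """
    body = json.dumps({"index": {"refresh_interval": interval}})
    return "PUT", f"/{index}/_settings", body


# Disable refresh before the import, then restore the default afterwards.
disable = refresh_settings_request("products", "-1")
restore = refresh_settings_request("products", "1s")

print(disable)
print(restore)
```

While refresh is disabled, newly indexed documents are still written but do not become visible to searches until refresh is turned back on, so this fits best on an index that users are not writing to at the same time.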

webish · January 29, 2015, 8:13am

Hi Mark,

Right now it's 28 GB across two indices, with 5 shards and 1 replica per
index, on 3 AWS large servers.

Frequently 1-10 million records or more get imported. During this time all
ES nodes hit a CPU usage of over 75%. We want to break the index down and
add routing at some point.

Refresh is using the default (1s) and, because of coupling to an old import
system, the bulk API is NOT used... The problem is that the index gets
accessed and written to constantly by users, so disabling refresh would
delay their content from becoming searchable.

I was debating using a separate index per import and grouping all the
indices under an alias. Not certain how that will affect performance.
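
To make the index-per-import idea concrete, here is a minimal sketch of the request body for `POST /_aliases`, which accepts a list of `add` and `remove` actions and applies them together. The alias name `products`, the dated index names, and the helper `alias_actions_body` are placeholders, not anything from the real setup; the import itself would go into the new index before the alias is pointed at it.

```python
import json


def alias_actions_body(alias, add_indices, remove_indices=()):
    """Build the JSON body for POST /_aliases.

    Removes `alias` from each index in `remove_indices` and adds it to each
    index in `add_indices`, all in a single request.
    """
    actions = [{"remove": {"index": name, "alias": alias}} for name in remove_indices]
    actions += [{"add": {"index": name, "alias": alias}} for name in add_indices]
    return json.dumps({"actions": actions})


# Hypothetical names: attach the freshly imported index to the alias that
# the applications query, and detach an older import in the same call.
body = alias_actions_body(
    "products",
    add_indices=["products_import_2015_01_29"],
    remove_indices=["products_import_2015_01_28"],
)
print(body)
```

Leaving `remove_indices` empty just grows the group, which matches the "one index per import, all under one alias" variant.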

warkolm · January 29, 2015, 10:34am

You should be using the bulk API; that's what it exists for!
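
For reference, a minimal sketch of what batching for the bulk API can look like. The body sent to `POST /_bulk` is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The index name, the sample records, the batch size of 2, and the helpers `chunked` and `build_bulk_body` are all illustrative assumptions; ES versions from the time of this thread also expect a `_type` in the action line, and a real batch size needs tuning against your own cluster.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_bulk_body(index, docs):
    """Build a newline-delimited body for POST /_bulk.

    `docs` is an iterable of (doc_id, source_dict) pairs. Each document
    becomes an "index" action line followed by its source line.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical records; in practice these would stream from the import job.
records = [(str(i), {"name": f"item {i}"}) for i in range(5)]
bodies = [build_bulk_body("products", batch) for batch in chunked(records, 2)]

for body in bodies:
    print(body)
```

Each element of `bodies` would be sent as one request, so 1-10 million records turn into a few thousand requests instead of millions of single-document calls.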
