Bulk Indexing Problems

Hi there!

I'm trying to do a one-time index of about 800,000 records into an instance
of Elasticsearch, but I'm having a bit of trouble. It consistently fails at
around 200,000 records. Looking at it in the Elasticsearch Head plugin, my
index goes offline and becomes unrecoverable.

For now, I have it running on a VM on my personal machine.

VM Config:
Ubuntu Server 14.04 64-Bit
8 GB RAM
2 Processors
32 GB SSD

Java
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Elasticsearch is using mostly the defaults. This is the output of:
curl http://localhost:9200/_nodes/process?pretty
{
  "cluster_name" : "property_transaction_data",
  "nodes" : {
    "KlFkO_qgSOKmV_jjj5xeVw" : {
      "name" : "Marvin Flumm",
      "transport_address" : "inet[/192.168.133.131:9300]",
      "host" : "ubuntu-es",
      "ip" : "127.0.1.1",
      "version" : "1.3.2",
      "build" : "dee175d",
      "http_address" : "inet[/192.168.133.131:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1092,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      }
    }
  }
}

I adjusted ES_HEAP_SIZE to 512mb.

I'm using the following code to pull data from SQL Server and index it.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f94f96d4-8c3f-462f-bdcf-df717cbc6269%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hello Joshua,

I have a feeling this has something to do with the threadpool.
There is a limit on the number of requests that can be queued for indexing.

Try increasing the queue size of the index and bulk threadpools to a large
number.
Also, through the cluster node stats API for the threadpool, you can see
whether any requests have failed.
Monitor this API for requests rejected due to the large volume.

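Threadpool - reference/current/modules-threadpool.html (Elasticsearch reference guide)

Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Having said that, I wouldn't recommend bulk indexing that much information at
one time, and a 512 MB heap is not going to help much.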
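
For anyone who wants to automate the "monitor this API" step, here is a rough sketch in Python. It assumes the node stats response nests a per-pool `rejected` counter under `nodes.<id>.thread_pool.<pool>`, which is my understanding of the 1.x stats API; the `/_nodes/stats/thread_pool` path, the helper names, and the sample numbers are illustrative assumptions, not taken from this thread.

```python
import json
from urllib.request import urlopen


def rejected_counts(stats, pools=("bulk", "index")):
    """Return {node_name: {pool: rejected}} from a parsed nodes-stats response."""
    result = {}
    for node_id, node in stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        thread_pool = node.get("thread_pool", {})
        result[name] = {
            pool: thread_pool.get(pool, {}).get("rejected", 0) for pool in pools
        }
    return result


def fetch_thread_pool_stats(base_url="http://localhost:9200"):
    """Fetch thread pool stats over HTTP (needs a running node; assumed endpoint)."""
    with urlopen(base_url + "/_nodes/stats/thread_pool") as response:
        return json.loads(response.read().decode("utf-8"))


# Hand-written sample in the shape the stats API is assumed to return.
# The counters are made up for illustration, not measured values.
sample = {
    "nodes": {
        "KlFkO_qgSOKmV_jjj5xeVw": {
            "name": "Marvin Flumm",
            "thread_pool": {
                "bulk": {"threads": 2, "queue": 50, "active": 2, "rejected": 17},
                "index": {"threads": 2, "queue": 0, "active": 0, "rejected": 0},
            },
        }
    }
}

print(rejected_counts(sample))
```

A `rejected` count that keeps growing during the import would suggest the client is sending faster than the node can queue, which points to smaller or slower batches rather than only a bigger queue.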

Thanks
Vineeth

Thanks for the reply, Vineeth!

What's a practical heap size? I've seen some people say they set it to
30 GB, but this confuses me because in the /etc/default/elasticsearch file,
the comment suggests the max is only 1 GB.

I'll look into the threadpool issue. Is there a Java API for monitoring
cluster node health? Can you point me at an example or give me a link to
that?

Thanks!

You also said you wouldn't recommend indexing that much information at
once. How would you suggest breaking it up and what status should I look
for before doing another batch? I have to come up with some process that is
repeatable and mostly automated.

Hello Joshua,

I am not sure which variable you are referring to for the memory settings in
the config file. Please paste the comment and the config.
I usually change the config in the init.d script.

The best approach would be to bulk index, say, 10,000 documents in sync mode,
wait until everything is indexed, and then proceed to the next batch.
I am not sure about the Java API, but a while back I used to curl this
stats API and see how many requests were rejected.
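
A rough Python sketch of that batch-then-wait loop, in case it helps. The `send` callable is a hypothetical stand-in for whatever POSTs the body to `/_bulk` and returns the parsed JSON response; the index and type names are made up, and the per-item `error` field is my assumption about the bulk response shape.

```python
import json


def batches(records, size=10000):
    """Yield successive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(batch, index, doc_type):
    """Build the newline-delimited body for one _bulk request."""
    lines = []
    for record in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def failed_items(bulk_response):
    """Return the per-item results that carry an error (assumed field name)."""
    failed = []
    for item in bulk_response.get("items", []):
        for result in item.values():
            if "error" in result:
                failed.append(result)
    return failed


def index_all(records, send, index, doc_type, size=10000):
    """Send one batch at a time and stop at the first batch with failures."""
    indexed = 0
    for batch in batches(records, size):
        # `send` is expected to block until the node has answered.
        response = send(bulk_body(batch, index, doc_type))
        failures = failed_items(response)
        if failures:
            raise RuntimeError(
                "%d items failed after %d were indexed" % (len(failures), indexed)
            )
        indexed += len(batch)
    return indexed


def fake_send(body):
    """Stand-in for a real HTTP call so the sketch runs without a cluster."""
    return {"errors": False, "items": []}


rows = ({"id": i} for i in range(25))
print(index_all(rows, fake_send, "transactions", "record", size=10))
```

Because each request only goes out after the previous response comes back, at most one batch is in flight at a time, which makes the process repeatable and lets it stop cleanly at the first batch that reports errors.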

Thanks
Vineeth

Here is /etc/default/elasticsearch

# Run Elasticsearch as this user ID and group ID
#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=512m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Maximum number of open files, defaults to 65535.
MAX_OPEN_FILES=65535

# Maximum locked memory size. Set to "unlimited" if you use the
# bootstrap.mlockall option in elasticsearch.yml. You must also set
# ES_HEAP_SIZE.
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
#MAX_MAP_COUNT=262144

# Elasticsearch log directory
#LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
#DATA_DIR=/var/lib/elasticsearch

# Elasticsearch work directory
#WORK_DIR=/tmp/elasticsearch

# Elasticsearch configuration directory
#CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
#CONF_FILE=/etc/elasticsearch/elasticsearch.yml

# Additional Java OPTS
#ES_JAVA_OPTS=

# Configure restart on package upgrade (true, every other setting will lead
# to not restarting)
#RESTART_ON_UPGRADE=true

I also see the same setting in /etc/init.d/elasticsearch. Do you know which
file takes priority? And what would a good size be?

Set ES_HEAP_SIZE to at least 1 GB. With a smaller heap like 512m and around
1 million docs to index, you need some more fine tuning, which is
complicated. Your machine can handle a 4 GB heap, which is 50% of its 8 GB
of RAM.
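
The arithmetic behind that suggestion, as a small Python sketch. The function names are made up for illustration, and the only rules encoded are the two from this reply (half of RAM, at least 1 GB); treat it as a rule of thumb, not a sizing tool.

```python
def suggested_heap_mb(total_ram_mb, fraction=0.5, floor_mb=1024):
    """Take a fraction of RAM (default half), but never go below the floor."""
    return max(floor_mb, int(total_ram_mb * fraction))


def as_heap_setting(heap_mb):
    """Format a size in MB as an ES_HEAP_SIZE value, e.g. 4096 -> '4g'."""
    if heap_mb % 1024 == 0:
        return "%dg" % (heap_mb // 1024)
    return "%dm" % heap_mb


# The 8 GB VM described in this thread.
print(as_heap_setting(suggested_heap_mb(8 * 1024)))  # prints 4g
```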

Jörg

Hi Jörg,

Can you elaborate on what you mean by needing "more fine tuning"?

I've upped the heap size to 4g (in both places I mentioned before because
it's not clear to me which one ES actually uses). I haven't tried to index
again yet.
Other than throttling my indexing, what are some other things I need to be
thinking about?
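
One way to see which heap setting the running node actually picked up, rather than guessing between the two files, is to ask the node itself. A minimal Python sketch, assuming the nodes info API exposes `jvm.mem.heap_max_in_bytes` under the `/_nodes/jvm` path (my understanding of the 1.x API); the helper names and the sample value are illustrative, not output from this cluster.

```python
import json
from urllib.request import urlopen


def heap_max_by_node(nodes_info):
    """Return {node_name: heap_max_in_mb} from a parsed nodes-info response."""
    result = {}
    for node_id, node in nodes_info.get("nodes", {}).items():
        heap_bytes = node.get("jvm", {}).get("mem", {}).get("heap_max_in_bytes")
        if heap_bytes is not None:
            result[node.get("name", node_id)] = heap_bytes // (1024 * 1024)
    return result


def fetch_jvm_info(base_url="http://localhost:9200"):
    """Fetch JVM info over HTTP (needs a running node; assumed endpoint)."""
    with urlopen(base_url + "/_nodes/jvm") as response:
        return json.loads(response.read().decode("utf-8"))


# Hand-written sample; the byte count is a made-up value for illustration.
sample = {
    "nodes": {
        "KlFkO_qgSOKmV_jjj5xeVw": {
            "name": "Marvin Flumm",
            "jvm": {"mem": {"heap_max_in_bytes": 4277534720}},
        }
    }
}

print(heap_max_by_node(sample))
```

The JVM usually reports a maximum slightly below the configured size, so a value just under 4096 MB would be consistent with a 4g setting, while something near 512 MB would mean the old value is still in effect.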

Just reran the indexer and found this error coming up. I'm running out of
disk space on the partition ES wants to write to.

F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
TranslogException[[index_type][0] Failed to write operation
[org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
IOException[No space left on device]; -- index_type

Where would I change the write location? Which config file?

You mentioned problems around 200,000 docs. What are these problems, and how
do you think you can fix them? What does your bulk indexing procedure look
like?

By fine tuning I mean slimming down all ES settings to the absolute minimum,
to slow down indexing and allocate fewer resources. But in your case, unless
you are tied to 512 MB, you really don't need to think about that.

Jörg
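
For reference, a throttled bulk procedure of the kind discussed in this thread could be sketched as follows: split the records into fixed-size batches and send each batch synchronously, so the next one starts only after the previous call has returned. This is a plain-Java illustration, not the Elasticsearch client API. BatchSender is a hypothetical stand-in for whatever actually issues the bulk request, and the batch size of 10 in main is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchedIndexingSketch {

    /** Hypothetical stand-in for code that sends one bulk request and blocks until it completes. */
    interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /** Splits records into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    /** Sends the batches one at a time and returns the number of records handed to the sender. */
    static <T> int indexInBatches(List<T> records, int batchSize, BatchSender<T> sender)
            throws Exception {
        int sent = 0;
        for (List<T> batch : partition(records, batchSize)) {
            sender.send(batch); // blocks until this batch is done
            sent += batch.size();
        }
        return sent;
    }

    public static void main(String[] args) throws Exception {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            records.add(i);
        }
        int sent = indexInBatches(records, 10, new BatchSender<Integer>() {
            @Override
            public void send(List<Integer> batch) {
                System.out.println("sending " + batch.size() + " records");
            }
        });
        System.out.println("sent " + sent + " records in total");
    }
}
```

In a real indexer the send step would issue one bulk request and inspect its response for failures before moving on to the next batch.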

Check the path.data setting in config/elasticsearch.yml

Jörg
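
Before moving path.data to another partition, it can help to confirm how much room the candidate directory really has. The snippet below is a minimal sketch using only java.io.File from the standard library. The class name DataPathSpaceCheck is made up for this example, and the directory to check is passed as an argument, since the right location depends on the machine.

```java
import java.io.File;

final class DataPathSpaceCheck {

    /** Returns the usable bytes on the partition that holds the given directory. */
    static long usableBytes(File dir) {
        return dir.getUsableSpace();
    }

    /** True if the partition holding dir has at least requiredBytes available. */
    static boolean hasRoom(File dir, long requiredBytes) {
        return usableBytes(dir) >= requiredBytes;
    }

    public static void main(String[] args) {
        // Assumption: the first argument is the directory configured (or planned) as path.data;
        // the current directory is used when no argument is given.
        File dataDir = new File(args.length > 0 ? args[0] : ".");
        long freeMb = usableBytes(dataDir) / (1024L * 1024L);
        System.out.println(dataDir.getAbsolutePath() + " has " + freeMb + " MB usable");
    }
}
```

Running it against both the current data directory and the intended new location shows whether the move would actually solve the "No space left on device" error.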

I'm going to try to fix the running out of space issue and then try
slimming down settings. Thank you.

This is the code I've been using to index:

// Imports for the Elasticsearch 1.x client. ObjectMapper (Jackson),
// Logger/LogManager, DBConnection, PropertyGeneralInfoRow and geocode need
// their own imports from the project's dependencies.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class Indexer {

    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {

        DBConnection dbConn = new DBConnection("");

        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo "
                + "WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();

        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data").build();

        Client client = new TransportClient(settings).addTransportAddress(
                new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // print every item in this bulk that failed
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- " + failure.getStatus().name()
                                    + " -- " + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                } else {
                    // only report success when no item in the bulk failed
                    System.out.println("Successfully indexed " + request.numberOfActions()
                            + " records in " + response.getTook() + ".");
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (PropertyGeneralInfoRow pgiRow : pgiTable) {
            // prep location field
            Double[] location = {pgiRow.getLon_dbl(), pgiRow.getLat_dbl()};
            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
                    + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
                    + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);

            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                // encode explicitly as UTF-8 instead of the platform default charset
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty")
                        .source(jsonRow.getBytes(StandardCharsets.UTF_8)));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- "
                        + pgiRow.getProperty_id().toString());
            }
        }

        bulkProcessor.flush();
        bulkProcessor.close();
    }
}

On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:

Check the path.data setting in config/elasticsearch.yml

Jörg

On Tue, Sep 9, 2014 at 7:50 PM, Joshua P jpeter...@gmail.com wrote:

Just reran the indexer and found this error coming up. I'm running out of
disk space on the partition ES wants to write to.

F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
TranslogException[[index_type][0] Failed to write operation
[org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
IOException[No space left on device]; -- index_type

Where would I change the write location? Which config file?
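
The "No space left on device" failure above could be caught before indexing starts by checking free space on the data partition. What follows is a minimal sketch, not something from this thread: it assumes it runs on the Elasticsearch host itself, and the class name `DiskSpaceCheck`, the 5 GB threshold, and the fallback directory /var/lib/elasticsearch (the commented-out DATA_DIR value in the /etc/default/elasticsearch listing) are illustrative assumptions. The directory to check is whatever path.data actually resolves to.

```java
import java.io.File;

/** Hypothetical pre-flight check: is there enough room on the data partition? */
final class DiskSpaceCheck {

    /**
     * Returns true if the partition holding dir reports at least requiredBytes
     * of usable space. File.getUsableSpace() returns 0 when the path does not
     * name a partition, so a missing directory fails any positive threshold.
     */
    static boolean hasAtLeast(File dir, long requiredBytes) {
        return dir.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        // Assumed default; pass the real data directory as the first argument.
        File dataDir = new File(args.length > 0 ? args[0] : "/var/lib/elasticsearch");
        long required = 5L * 1024 * 1024 * 1024; // assumed threshold: 5 GB

        if (hasAtLeast(dataDir, required)) {
            System.out.println("Enough free space on " + dataDir);
        } else {
            System.err.println("Not enough free space on " + dataDir
                    + ": " + dataDir.getUsableSpace() + " bytes usable");
        }
    }
}
```

The usable-space figure is only a hint from the operating system, so a check like this reduces surprises but does not replace looking at the per-item failures the bulk response reports.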

On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:

Hi Jörg,

Can you elaborate on what you mean by I still need more fine tuning?

I've upped the heap size to 4g (in both places I mentioned before
because it's not clear to me which one ES actually uses). I haven't tried
to index again yet.
Other than throttling my indexing, what are some other things I need to
be thinking about?

On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:

Set ES_HEAP_SIZE to at least 1 GB. With smaller heaps like 512m and
around 1 million docs to index, you need some more fine tuning, which is
complicated. Your machine is fine for a 4 GB heap, which is 50% of the 8
GB RAM.

Jörg

On Tue, Sep 9, 2014 at 5:39 PM, Joshua P jpeter...@gmail.com wrote:

Here is /etc/default/elasticsearch

# Run Elasticsearch as this user ID and group ID
#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=512m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Maximum number of open files, defaults to 65535.
MAX_OPEN_FILES=65535

# Maximum locked memory size. Set to "unlimited" if you use the
# bootstrap.mlockall option in elasticsearch.yml. You must also set
# ES_HEAP_SIZE.
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
#MAX_MAP_COUNT=262144

# Elasticsearch log directory
#LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
#DATA_DIR=/var/lib/elasticsearch

# Elasticsearch work directory
#WORK_DIR=/tmp/elasticsearch

# Elasticsearch configuration directory
#CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
#CONF_FILE=/etc/elasticsearch/elasticsearch.yml

# Additional Java OPTS
#ES_JAVA_OPTS=

# Configure restart on package upgrade (true, every other setting will
# lead to not restarting)
#RESTART_ON_UPGRADE=true

I also see the same setting in /etc/init.d/elasticsearch. Do you know
which file takes priority? And what a good size would be?

On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I am not sure which variable you are referring to in the memory
settings in the config file; please paste the comment and config.
I usually change the config from the init.d script.

The best approach would be to bulk index, say, 10,000 feeds in sync mode,
wait until everything is indexed, and then proceed to the next batch.
I am not sure about the Java API, but a while back I used to curl
this stats API to see how many requests were rejected.

Thanks
Vineeth
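
The batch-at-a-time approach described above can be sketched as follows. This is a minimal illustration and assumes nothing about the Elasticsearch client itself: `BatchRunner` and `BatchSink` are hypothetical names, and `BatchSink.indexBatch` stands in for whatever synchronous bulk call is used (for example, building one bulk request per batch and blocking on its response). The batch size is a parameter; 10,000 is the figure from the message above, not a tuned value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: feed records to a sink one fixed-size batch at a time. */
final class BatchRunner {

    /** Stand-in for a synchronous bulk call; returns how many records in the batch failed. */
    interface BatchSink<T> {
        int indexBatch(List<T> batch);
    }

    /** Splits records into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            batches.add(new ArrayList<T>(records.subList(start, end)));
        }
        return batches;
    }

    /**
     * Sends each batch and waits for it to finish before sending the next.
     * Stops after the first batch that reports failures and returns the number
     * of records indexed successfully up to that point.
     */
    static <T> int run(List<T> records, int batchSize, BatchSink<T> sink) {
        int indexed = 0;
        for (List<T> batch : partition(records, batchSize)) {
            int failed = sink.indexBatch(batch);
            indexed += batch.size() - failed;
            if (failed > 0) {
                break; // do not pile more work onto a node that is already failing
            }
        }
        return indexed;
    }
}
```

Because each call to the sink blocks until the batch is done, the client never has more than one batch outstanding, which is the "status to look for" before starting the next one.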

On Tue, Sep 9, 2014 at 8:58 PM, Joshua P jpeter...@gmail.com wrote:

You also said you wouldn't recommend indexing that much information
at once. How would you suggest breaking it up and what status should I look
for before doing another batch? I have to come up with some process that is
repeatable and mostly automated.

On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:

Thanks for the reply, Vineeth!

What's a practical heap size? I've seen some people saying they set
it to 30gb but this confuses me because in the /etc/default/elasticsearch
file, the comment suggests the max is only 1gb?

I'll look into the threadpool issue. Is there a Java API for
monitoring Cluster Node health? Can you point me at an example or give me a
link to that?

Thanks!

On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I have a feeling this has something to do with the threadpool.
There is a limit on the number of feeds that can be queued for indexing.

Try increasing the size of the threadpool queue for index and bulk to a
large number.
Also, through the cluster node API on threadpool, you can see if any
request has failed.
Monitor this API for any requests that fail due to large volume.

Threadpool - reference/current/modules-threadpool.html
Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Having said that, I wouldn't recommend bulk indexing that much
information at a time, and 512 MB is not going to help much.

Thanks
Vineeth

Code looks okay, so it might just be the full volume that is in the way.

Jörg

Thanks!

Turns out the VM had less disk space than I thought. Combined with a lack
of decent error checking, that meant I didn't catch the out-of-space
problem. As soon as I added more space, I was able to index everything
without a problem.

Thanks again.
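
The missing error checking mentioned above could look something like the following. This is a minimal sketch, not the code from this thread: `BulkFailureTracker` is a hypothetical helper that would be fed the per-item failure id and message already printed in the `afterBulk` callback, and it marks the run as fatal when a message indicates a full disk. The matched substring is taken from the error quoted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects bulk item failures and spots a full disk. */
final class BulkFailureTracker {

    // Text from the IOException nested in the TranslogException reported in this thread.
    private static final String DISK_FULL_MARKER = "No space left on device";

    private final List<String> failures = new ArrayList<String>();
    private boolean fatal;

    /** Records one failed item; synchronized because callbacks may run on another thread. */
    synchronized void record(String id, String message) {
        failures.add(id + " -- " + message);
        if (message != null && message.contains(DISK_FULL_MARKER)) {
            fatal = true;
        }
    }

    /** True once any recorded failure looked like an out-of-space error. */
    synchronized boolean isFatal() {
        return fatal;
    }

    /** Total number of failed items recorded so far. */
    synchronized int failureCount() {
        return failures.size();
    }
}
```

The indexing loop would then check `isFatal()` before adding each document and stop early, rather than continuing to push records at a node that can no longer write.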

On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:

Check the path.data setting in config/elasticsearch.yml

Jörg

On Tue, Sep 9, 2014 at 7:50 PM, Joshua P jpeter...@gmail.com wrote:

Just reran the indexer and found this error coming up. I'm running out of
disk space on the partition ES wants to write to.

F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
TranslogException[[index_type][0] Failed to write operation
[org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
IOException[No space left on device]; -- index_type

Where would I change the write location? Which config file?

On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:

Hi Jörg,

Can you elaborate on what you mean by I still need more fine tuning?

I've upped the heap size to 4g (in both places I mentioned before because
it's not clear to me which one ES actually uses). I haven't tried to index
again yet.
Other than throttling my indexing, what are some other things I need to be
thinking about?

On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:

Let ES_HEAP_SIZE at least to 1 GB, for smaller heaps like 512m and
indexing around 1 million docs, you need some more fine tuning, which is
complicated. Your machine is ok to set the heap to 4 GB which is 50% of 8
GB RAM.

Jörg

On Tue, Sep 9, 2014 at 5:39 PM, Joshua P jpeter...@gmail.com wrote:

Here is /etc/default/elasticsearch

Run Elasticsearch as this user ID and group ID

#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

Heap Size (defaults to 256m min, 1g max)

ES_HEAP_SIZE=512m

Heap new generation

#ES_HEAP_NEWSIZE=

max direct memory

#ES_DIRECT_SIZE=

Maximum number of open files, defaults to 65535.

MAX_OPEN_FILES=65535

Maximum locked memory size. Set to "unlimited" if you use the

bootstrap.mlockall option in elasticsearch.yml. You must also set

ES_HEAP_SIZE.

MAX_LOCKED_MEMORY=unlimited

Maximum number of VMA (Virtual Memory Areas) a process can own

#MAX_MAP_COUNT=262144

Elasticsearch log directory

#LOG_DIR=/var/log/elasticsearch

Elasticsearch data directory

#DATA_DIR=/var/lib/elasticsearch

Elasticsearch work directory

#WORK_DIR=/tmp/elasticsearch

Elasticsearch configuration directory

#CONF_DIR=/etc/elasticsearch

Elasticsearch configuration file (elasticsearch.yml)

#CONF_FILE=/etc/elasticsearch/elasticsearch.yml

Additional Java OPTS

#ES_JAVA_OPTS=

Configure restart on package upgrade (true, every other setting will

lead to not restarting)
#RESTART_ON_UPGRADE=true

I also see the same setting in /etc/init.d/elasticsearch. Do you know
which file takes priority? And what a good size would be?

On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:

Hello Joshua ,

I am not sure which variable you are referring to on the memory settings
in the config file , please paste the comment and config.
I usually change the config from init.d script.

Best approach would be to bulk index say 10,000 feeds in sync mode , wait
until is everything is indexed and then proceed to the next batch.
I am not sure about the java API , but long back i used to curl to this
stats API and see how much request was rejected.

Thanks
Vineeth

On Tue, Sep 9, 2014 at 8:58 PM, Joshua P jpeter...@gmail.com wrote:

You also said you wouldn't recommend indexing that much information at
once. How would you suggest breaking it up and what status should I look
for before doing another batch? I have to come up with some process that is
repeatable and mostly automated.

On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:

Thanks for the reply, Vineeth!

What's a practical heap size? I've seen some people saying they set it to
30gb but this confuses me because in the /etc/default/elasticsearch file,
the comment suggests the max is only 1gb?

I'll look into the threadpool issue. Is there a Java API for monitoring
Cluster Node health? Can you point me at an example or give me a link to
that?

Thanks!

On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I have a feeling this has something to do with the threadpool.
There is a limit on the number of feeds that can be queued for indexing.

Try increasing the size of the threadpool queue for index and bulk to a large
number.
Also, through the cluster nodes API for the threadpool, you can see whether any
requests have failed.
Monitor this API for requests that fail due to large volume.

Threadpool - ...nce/current/modules-threadpool.html
Threadpool stats - ...reference/current/cluster-nodes-stats.html
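
As a rough illustration of the monitoring suggested above, the sketch below sums the `rejected` counters of the index and bulk thread pools from a nodes stats response. It assumes the stats are fetched from `/_nodes/stats/thread_pool` on `localhost:9200` and that each pool reports a `rejected` field; check both against the version you are running.

```python
import json
from urllib.request import urlopen


def fetch_nodes_stats(base_url="http://localhost:9200"):
    """Fetch and parse thread pool stats (endpoint path is an assumption)."""
    with urlopen(base_url + "/_nodes/stats/thread_pool") as response:
        return json.loads(response.read().decode("utf-8"))


def rejected_counts(nodes_stats, pools=("index", "bulk")):
    """Sum the `rejected` counter of each named pool across all nodes."""
    totals = {pool: 0 for pool in pools}
    for node in nodes_stats.get("nodes", {}).values():
        thread_pool = node.get("thread_pool", {})
        for pool in pools:
            totals[pool] += thread_pool.get(pool, {}).get("rejected", 0)
    return totals
```

Calling `rejected_counts(fetch_nodes_stats())` before and after a batch and comparing the two results shows whether any requests were rejected while that batch was being indexed.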

Having said that, I wouldn't recommend bulk indexing that much information at
a time, and 512 MB is not going to help much.

Thanks
Vineeth

On Tue, Sep 9, 2014 at 7:48 PM, Joshua P jpeter...@gmail.com wrote:

Hi there!

I'm trying to do a one-time index of about 800,000 records into an
instance of Elasticsearch, but I'm having a bit of trouble. It continually
fails around 200,000 records. Looking at it in the Elasticsearch Head plugin,
my index goes offline and becomes unrecoverable.

For now, I have it running on a VM on my personal machine.

VM Config:
Ubuntu Server 14.04 64-Bit
8 GB RAM
2 Processors
32 GB SSD

Java
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Elasticsearch is using mostly the defaults. This is the output of:
curl http://localhost:9200/_nodes/process?pretty
{
  "cluster_name" : "property_transaction_data",
  "nodes" : {
    "KlFkO_qgSOKmV_jjj5xeVw" : {
      "name" : "Marvin Flumm",
      "transport_address" : "inet[/192.168.133.131:9300]",
      "host" : "ubuntu-es",
      "ip" : "127.0.1.1",
      "version" : "1.3.2",
      "build" : "dee175d",
      "http_address" : "inet[/192.168.133.131:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1092,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      }
    }
  }
}

I adjusted ES_HEAP_SIZE to 512mb.

I'm using the following code to pull data from SQL Server and index it.
