Usage of Groovy to index data with ElasticSearch

Jason_N · August 3, 2011, 3:29pm

Hello,

I'm trying to chase down some memory errors, and although I've searched
around, the answer eludes me. I'm trying to index ~500m records with an
average record size of 12k from a server that isn't participating in the
cluster. At a high level, I fetch ~300k rows from Oracle and process each
row:
    sql.query(query, [startDate, endDate]) { rs ->
        while (rs.next()) {
            queue << buildResult(rs)
        }
    }

buildResult just takes the result set and converts it to a map, doing a
little text processing and converting a single CLOB field to a String.
Because the out-of-memory errors hit so quickly, I pared the threads
processing the database down to 3.

I use GPars operators to load into ElasticSearch: each item is popped off
the queue and passed to the index function.

    GNodeBuilder nodeBuilder = nodeBuilder()
    nodeBuilder.settings {
        node {
            client = true
        }
        // ... set cluster name, discovery to zen unicast, and set hosts
    }

    node = nodeBuilder.node()
    esClient = node.client
    def index = esClient.index {
        index "someIndex"
        type "someType"
        source(record)
    }

Using either 3 threads or 10 (10 just runs out quicker), I can usually run
for as long as 2 days, or as short as a few hours, before I run out of
memory with a "GC overhead limit exceeded" error. I thought everything was
fine with the last run at 3 threads, since heap usage was minimal 3+ hours
into the ETL routine. So far I haven't had a chance to analyze a heap dump
(mainly because the JVM hasn't produced one yet).
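
One thing worth ruling out in a reader/queue/indexer pipeline like this is the queue itself: if the database readers can add records faster than the indexing workers remove them, an unbounded queue grows until the heap is gone. Whether the queue here is bounded is not stated, so the following is only a minimal sketch under that assumption. It uses plain java.util.concurrent rather than GPars; the capacity of 1000, the POISON end marker, and the counter standing in for the index call are all illustrative.

```groovy
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.BlockingQueue
import java.util.concurrent.atomic.AtomicInteger

// Bounded hand-off between the database readers and the indexing workers.
// The capacity (1000) is an illustrative assumption, not a tuned value.
BlockingQueue<Map> queue = new ArrayBlockingQueue<Map>(1000)
final Map POISON = [:].asImmutable()   // hypothetical end-of-stream marker
def indexed = new AtomicInteger()

// Consumer: stands in for the worker that calls the index function.
def consumer = Thread.start {
    while (true) {
        Map record = queue.take()      // blocks while the queue is empty
        if (record.is(POISON)) break
        indexed.incrementAndGet()      // placeholder for the real index call
    }
}

// Producer: stands in for the loop over the Oracle result set.
(1..5000).each { i ->
    queue.put([id: i, body: "row $i"]) // blocks while the queue is full
}
queue.put(POISON)
consumer.join()

println "indexed ${indexed.get()} records through a queue capped at 1000"
```

With put() blocking on a full queue, the readers are throttled to the speed of the indexers instead of buffering the difference in memory.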

Node configuration: 11-node cluster, 16 cores each, 16 GB RAM, 4x1 TB drives
(currently the only thing running on the cluster is ES 0.17.2).

I guess I have a few questions:
What type of speed should I be seeing? With three threads against Oracle
I'm getting ~120k records per minute. With 10 threads it's closer to 400k.
Has anyone else had experience using the groovy-client to load a record set
of this size?
Should I just move back to the Java-based API and do bulk loads?
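
On the bulk-load question, the batching part can be tried independently of which client API ends up sending the data. The sketch below is only an illustration: Batcher is a hypothetical helper, the batch size of 500 is an arbitrary assumption, and the flush closure is a placeholder for whatever builds and sends one bulk request. Here it only records batch sizes so the code runs without a cluster.

```groovy
// Collect records into fixed-size batches and hand each batch to a flush
// closure. In a real loader the closure would send one bulk request; here
// it only records the batch size so the sketch runs standalone.
class Batcher {
    final int batchSize
    final Closure flush
    private List<Map> buffer = []

    Batcher(int batchSize, Closure flush) {
        this.batchSize = batchSize
        this.flush = flush
    }

    void add(Map record) {
        buffer << record
        if (buffer.size() >= batchSize) drain()
    }

    // Send whatever is buffered; call once more at the end of the run.
    void drain() {
        if (buffer) {
            flush.call(buffer)
            buffer = []
        }
    }
}

def sentSizes = []
def batcher = new Batcher(500, { List<Map> batch -> sentSizes << batch.size() })
(1..1200).each { batcher.add([id: it]) }
batcher.drain()   // flush the final partial batch
println sentSizes
```

Starting a fresh buffer after each flush means the sent batch can be garbage collected as soon as the request completes, so at most one batch per worker is held in memory.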

Thanks,

Jason

kimchy · August 4, 2011, 5:52pm

Where do you see that OOM? Is it on the client indexing the data, or in
the cluster?

Jason_N · August 4, 2011, 7:38pm

I'm seeing it on the client side. At this point I'm not sure if it's
Groovy, GPars, or the ElasticSearch client. When the OOM happened quickly
with 10 database threads running simultaneously, I thought that might be
the issue. Now with 3 threads the same problem occurs; it just takes much
longer to show up. I was just hoping to find someone who has done a
Groovy-based ETL process.

kimchy · August 4, 2011, 7:46pm

The groovy client simply delegates to the Java client, and people have
loaded considerable amounts of data using it. One simple way to see where
the memory is spent is to use VisualVM with memory sampling.
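
If attaching VisualVM to a run that lasts days is awkward, a rough complement is to log heap usage from inside the loader at a fixed record interval, so steady growth shows up well before the OOM. A minimal sketch using the JVM's standard MemoryMXBean; the interval of 100000 records and the loop standing in for the real ETL work are illustrative assumptions. Separately, starting the JVM with -XX:+HeapDumpOnOutOfMemoryError should make HotSpot write a heap dump when the error is thrown.

```groovy
import java.lang.management.ManagementFactory

// Read current heap usage from the JVM's standard MemoryMXBean.
// maxMb can be negative if the JVM does not define a maximum.
def heapSnapshot = {
    def usage = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()
    [usedMb: usage.getUsed().intdiv(1024 * 1024),
     maxMb : usage.getMax().intdiv(1024 * 1024)]
}

int logEvery = 100000   // illustrative interval, not a recommendation
long processed = 0
def snapshots = []

// Stand-in for the real loop that reads and indexes records.
(1..300000).each {
    processed++
    if (processed % logEvery == 0) {
        def snap = heapSnapshot()
        snapshots << snap
        println "after ${processed} records: heap ${snap.usedMb} MB of ${snap.maxMb} MB"
    }
}
```

Used heap naturally rises and falls between collections, so the trend over many samples matters more than any single reading.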

Frederic · September 9, 2011, 6:39am

Hi Jason, have you had any news on this?
I faced a similar problem recently: the same OOM error using the Java API,
running about 15 threads for indexing docs.

Is it a good approach to start X client threads (either Transport
or Node) for indexing docs massively, or would it be better to use
some other solution, like Bulk, as Jason says?

Paul_Brown · September 9, 2011, 6:57am

There's a very good chance that this isn't Elasticsearch.

If you're using Hibernate (or GORM, since that depends on Hibernate) for the database access, you need to clean up the session every so often or you'll run out of memory as the session-level cache fills up. (Google "hibernate batch" for some pointers on doing bulk operations with Hibernate.)

-- Paul

Frederic · September 9, 2011, 1:44pm

Thanks Paul. Unfortunately I'm not using Hibernate; I'm reading a lot
of files simultaneously to index their data. I hope it helps Jason,
though.
Anyway, I have a strong feeling that clients/nodes are not being
closed after indexing a file, and new ones are being created for the
next files (code inherited from another team...)
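
If that suspicion is right, the usual shape of the fix is to create one client for the whole run, reuse it for every file, and close it exactly once. The sketch below only illustrates that shape: StubClient is a hypothetical stand-in that counts open instances so the code runs without a cluster, and the file names are made up.

```groovy
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a Node or Transport client; it only counts how
// many instances are open so the sketch runs without a cluster.
class StubClient implements Closeable {
    static final AtomicInteger open = new AtomicInteger()
    StubClient() { open.incrementAndGet() }
    void index(Map doc) { /* a real client would send the document here */ }
    void close() { open.decrementAndGet() }
}

def files = ['a.txt', 'b.txt', 'c.txt']   // illustrative file names
int peakOpen = 0

// One client for the whole run, closed exactly once in finally.
def client = new StubClient()
try {
    files.each { name ->
        client.index([file: name])
        peakOpen = Math.max(peakOpen, StubClient.open.get())
    }
} finally {
    client.close()
}
println "peak open clients: ${peakOpen}, open after run: ${StubClient.open.get()}"
```

A counter like this, added temporarily around the real client creation and close calls, is also a cheap way to confirm whether instances are piling up.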
