Java bulk API slows down if client is not closed and reopened


(Ondřej Spilka) #1

Hi all,

I'm using the Java API on ES 1.0.1 to bulk index medium-sized docs.
The documents come from a 150 MB XML file.
The average JSON document is about 500 bytes with 10 properties; I'm currently testing
on 275,000 documents. Only some key properties are indexed; the rest are only
stored in _source.
Bulk indexing is done in blocks of 5,000 documents.

When indexing continuously, the indexing speed degrades linearly; at
approximately the 100,000th item, a chunk took 5 times longer than the first chunk.
But when I close the TransportClient after each successful bulk index,
performance stays constant and indexing is excellent.

What causes this problem? Is it correct to close the TransportClient
connection each time a bulk index is done?
It seems to work: the index is ready and functional.

1GB given to Java, bootstrap.mlockall: true, ES_HEAP_SIZE = ES_MIN_MEM
= ES_MAX_MEM = 1GB
Windows 8, i7, 8GB RAM, SSD disk.

Thanks in advance

Ondra

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/42e50b9e-3078-462f-b5c5-51b867a34ae9%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

Without seeing the code, it is impossible to make helpful statements.

1G is in general a small heap for bulk indexing. 275k documents will work
anyway; they should be done in ~30 seconds. Maybe you are seeing GC starting to
kick in. To draw conclusions about ES, you should run bulk indexing for at
least 30-60 minutes, not just seconds.
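
One way to check the GC guess on the client side is to log elapsed time and accumulated GC time per chunk. The following is only a minimal sketch: `GcProbe` and `timeChunk` are hypothetical names, the workload in `main` is a placeholder for the code that sends one bulk chunk, and the numbers cover only the JVM this code runs in (the indexing client), not the heap of the ES node itself.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcProbe {
    /** Accumulated GC time in ms over all collectors of this JVM (undefined values are skipped). */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs one chunk of work and returns {elapsedMillis, gcMillisDuringChunk}. */
    static long[] timeChunk(Runnable chunk) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        chunk.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new long[] { elapsedMillis, totalGcMillis() - gcBefore };
    }

    public static void main(String[] args) {
        for (int chunk = 1; chunk <= 5; chunk++) {
            long[] r = timeChunk(() -> {
                // Placeholder workload; replace with the code that sends one bulk chunk.
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 200_000; i++) {
                    sb.append(i);
                }
            });
            System.out.println("chunk " + chunk + ": " + r[0] + " ms, GC " + r[1] + " ms");
        }
    }
}
```

If per-chunk time grows while GC time stays flat, GC in the client is probably not the cause.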

Note that mlockall does not work on Windows.

Jörg

In reply to Ondřej Spilka's message of Tue, Mar 4, 2014, 2:58 PM.


(Brian Yoder) #3

Are you using the BulkRequestBuilder? If so, create a new one for each bulk
operation (and let the de-referenced old one be garbage collected);
otherwise you'll keep filling it up and throughput will drop as you have seen. At least,
that's what I do, and it runs like the blazes for the entire 97M document
load.
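
A minimal sketch of why a reused builder would slow things down. `FakeBulkBuilder` is a hypothetical stand-in that only mimics how a bulk request builder accumulates actions; it is not the Elasticsearch class, and no request is actually sent. With the real Java client, the fresh builder would presumably come from `client.prepareBulk()` once per chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a bulk request builder: it only accumulates actions. */
final class FakeBulkBuilder {
    private final List<String> actions = new ArrayList<>();

    void add(String doc) {
        actions.add(doc);
    }

    int numberOfActions() {
        return actions.size();
    }
}

final class BulkReuseDemo {
    /** One shared builder for all chunks: every request also carries all earlier docs. */
    static List<Integer> requestSizesWithSharedBuilder(int chunks, int chunkSize) {
        List<Integer> sizes = new ArrayList<>();
        FakeBulkBuilder bulk = new FakeBulkBuilder();
        for (int c = 0; c < chunks; c++) {
            for (int i = 0; i < chunkSize; i++) {
                bulk.add("doc-" + c + "-" + i);
            }
            sizes.add(bulk.numberOfActions()); // size of the request that would be sent
        }
        return sizes;
    }

    /** A new builder per chunk: every request carries only its own docs. */
    static List<Integer> requestSizesWithFreshBuilder(int chunks, int chunkSize) {
        List<Integer> sizes = new ArrayList<>();
        for (int c = 0; c < chunks; c++) {
            FakeBulkBuilder bulk = new FakeBulkBuilder(); // fresh builder for this chunk
            for (int i = 0; i < chunkSize; i++) {
                bulk.add("doc-" + c + "-" + i);
            }
            sizes.add(bulk.numberOfActions());
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println("shared: " + requestSizesWithSharedBuilder(4, 5000));
        System.out.println("fresh:  " + requestSizesWithFreshBuilder(4, 5000));
    }
}
```

With the shared builder the request grows by one chunk on every round, so each bulk call has more to send and process than the previous one; with a fresh builder the request size stays constant.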

Just a guess. As Jörg said, it's difficult to investigate without more
details. But from the outside looking in, this is the first thing I'd check
for.

Brian



(Ondřej Spilka) #4

Thanks for the tips. Yes, I am reusing the request builder, as in the example in the docs, so this may be the cause.
I will try re-instantiating the request builder and let you know.

By the way, is there a simple way to bulk index a JSON/XML file, as in Solr? It is an extremely useful feature that separates large-document preprocessing from indexing...

Thanks for support.



(Jörg Prante) #5

There is a special ES indexing data model, as you surely already have
noted. You can only index a subset of valid JSON into ES. For example, each
ES JSON doc must be an object. Arrays must be single-valued, unnested. So,
arbitrary source JSON must be transformed, and due to the field/value
indexing, there is more than one possible model, which depends on your data
domain.

XML is also not straightforward to translate. Attributes and values have to
be mapped to JSON fields and there is more than one possibility to do so.

Another question is how to build identifiers from documents for ES doc _id.
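
One possible answer, given here purely as an illustration and not as something proposed in this thread, is to derive a deterministic `_id` from the fields that identify a record, so that re-indexing the same record overwrites it instead of creating a duplicate. The class name `DocIds`, the method `idFromKeyFields`, and the choice of SHA-256 are all assumptions made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DocIds {
    /** Hex-encoded SHA-256 over the key fields, each followed by a zero byte as separator. */
    static String idFromKeyFields(String... keyFields) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (String field : keyFields) {
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator, so ("ab", "c") differs from ("a", "bc")
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical key fields: a source name and a record number.
        System.out.println(idFromKeyFields("catalog-a", "000123"));
    }
}
```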

In my domain, I transform all my input data (K/V, ISO 2709, JSON, XML) to
RDF, create an IRI, and this RDF can be serialized to JSON-LD which fits
well into the ES JSON indexing model. YMMV.

Jörg



(Ondřej Spilka) #6

Thanks Joerg, I completely forgot about indexing via JSON documents,
which I already did for ES from PowerShell months ago...

I understand that the ES JSON format is very versatile. On the other hand, a
Solr-compatible option to index a plain POCO JSON file consisting only of an
array of objects would be helpful when migrating from Solr to ES.
There is no problem as long as an ID property can be specified in the schema, as
in Solr.
So when you have a schema with an ID property you are on the right track, and ES
should be able to perform index/update on a POCO JSON array.

Then I can imagine having a preprocessor that converts XML/CSV or whatever to
collection-schema-compatible JSON, so that the search engine can easily be chosen
between ES and Solr.
It seems like a nice idea to me, as I'm just a user of search engines...

Ondra

In reply to Jörg Prante's message of Monday, March 10, 2014, 8:19:20 PM UTC+1.


(Jörg Prante) #7

I'm not using JSON with Solr but from what I can see at
https://wiki.apache.org/solr/UpdateJSON there is no difference in what I
said. Solr borrowed all the JSON things from ES. That is, Solr seems not to
accept arbitrary JSON either. So you can index JSON into ES as you would
index it into Solr.

If you already have data indexed in Solr, this river may help:
https://github.com/javanna/elasticsearch-river-solr

The Solr mock plugin
https://github.com/mattweber/elasticsearch-mocksolrplugin needs some love
for JSON but could also be a valuable start to migrate from Solr.

Jörg

In reply to Ondřej Spilka's message of Tue, Mar 11, 2014, 7:18 AM.


(Ondřej Spilka) #8

Yes, I can easily index plain POCO JSON into Solr, as I'm currently doing,
Joerg.
The format is slightly different: in ES you have to specify the operation,
whereas in Solr you need not.
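
To make the difference concrete: the ES bulk REST format expects an action line in front of each document, while a plain array of objects carries none. The sketch below shows one way a preprocessor could add those lines. `BulkBodyBuilder`, the index name `articles`, and the type name `article` are made up for illustration, and the code assumes that each document is already a single-line JSON object and that ids need no JSON escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BulkBodyBuilder {
    /**
     * Builds a newline-delimited bulk body: one "index" action line per document,
     * followed by the document source on its own line.
     */
    static String toBulkBody(String index, String type, Map<String, String> docsById) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : docsById.entrySet()) {
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type)
                .append("\",\"_id\":\"").append(e.getKey())
                .append("\"}}\n");
            body.append(e.getValue()).append('\n'); // source must stay on a single line
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("1", "{\"title\":\"first\"}");
        docs.put("2", "{\"title\":\"second\"}");
        System.out.print(toBulkBody("articles", "article", docs));
    }
}
```

The resulting text is what would be sent to the `_bulk` endpoint; the trailing newline after the last document is part of the format.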

Regarding the history, I'm not an expert in that...
Nevertheless, both of them are built upon Lucene, which uses BerkeleyDB.
That is how I first came across Solr/ES: I was searching for some high-level
Berkeley DB-based store other than MongoDB, which is a big space glutton...

On Tuesday, March 11, 2014 9:04:43 AM UTC+1, Jörg Prante wrote:

I'm not using JSON with Solr but from what I can see at
https://wiki.apache.org/solr/UpdateJSON there is no difference in what I
said. Solr borrowed all the JSON things from ES. That is, Solr seems not to
accept arbitrary JSON either. So you can index JSON into ES as you would
index it into Solr.

If you had already data indexed into Solr, this river may help
https://github.com/javanna/elasticsearch-river-solr

The Solr mock plugin
https://github.com/mattweber/elasticsearch-mocksolrplugin needs some love
for JSON but could also be a valuable start to migrate from Solr.

Jörg



(Jörg Prante) #9

BerkeleyDB was never part of Lucene, Solr, Elasticsearch, or MongoDB;
it is a completely different piece of software (a transactional key/value store
on a single server, with no inverted index at all).

Jörg

On Tue, Mar 11, 2014 at 11:06 AM, Ondřej Spilka <spilka.ondrej@gmail.com> wrote:

Nevertheless both of them are build upon Lucene which uses BerkeleyDB.
This was how I met Solr/ES at first, I was searching for some high-level
Berkeley DB other than MongoDB, which is big space glutton...


