Indexing a large number of files, each with a huge size


(ElasticSearch Users mailing list) #1

Hi,

I am trying to index documents, each file approximately 10-20 MB. I start seeing
memory issues if I try to index them all in a multi-threaded environment
from a single TransportClient on one machine to a single node cluster with
32GB ES server. It seems like the memory is an issue on the client as well
as server side, and I probably understand and expect that :).

I have tried tuning the heap sizes and batch sizes in Bulk APIs. However,
am I trying to push the limits too much? One thought is to probably stream
the data so that I do not hold it all in memory. Is it possible? Is this a
general problem or just that my usage is wrong?

Thanks,
Sandeep

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d2612109-b31c-4127-857b-f8aa27fb0aeb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
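
As a rough illustration of why memory fills up on the client in the setup described above, the raw payload held in flight can be estimated as threads x documents per bulk request x document size. The sketch below is only that arithmetic; the thread count and batch size are hypothetical values, not figures from the thread, and `BulkMemoryEstimate` and `inFlightBytes` are made-up names. Actual heap use is likely higher, since the same content can exist both as a Java String and as serialized request bytes.

```java
class BulkMemoryEstimate {

    /** Estimate of raw document bytes held in flight on the client at once. */
    public static long inFlightBytes(int threads, int docsPerBulk, long maxDocBytes) {
        // Widen to long before multiplying to avoid int overflow.
        return (long) threads * docsPerBulk * maxDocBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed settings: 8 indexing threads and 10 documents per bulk request.
        // The 20 MB figure is the upper document size mentioned in the question.
        long bytes = inFlightBytes(8, 10, 20 * mb);
        System.out.println(bytes / mb + " MB"); // prints "1600 MB"
    }
}
```

With those assumed settings the client would hold about 1600 MB of raw file content at once, before any per-object overhead.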


(Jörg Prante) #2

Can you show the program you use to index?

Before tuning heap sizes or batch sizes, it is good to check whether the program
works correctly.

Jörg




(Sandeep Ramesh Khanzode) #3

Hi Jorg,

This is mostly standard code. It is called from multiple threads, each with
a different set of files on disk.
Please share any suggestions. Thanks,

============================================================================================
BulkRequestBuilder bulkRequest = client.prepareBulk();
bulkRequest.setRefresh(false);

// For every input file in the input list
// (inputFiles and the Path type are assumed names, not from the original):
for (Path filePath : inputFiles) {
    Map<String, Object> jsonDocument = new HashMap<String, Object>();

    jsonDocument.put("fileContent", <STRING_CONTENT_OF_FILE>);
    // Distinct keys; reusing "fileProperty1" would overwrite the earlier values.
    jsonDocument.put("fileProperty1", <FILE_PROPERTY_1_STRING>);
    jsonDocument.put("fileProperty2", <FILE_PROPERTY_2_STRING>);
    jsonDocument.put("fileProperty3", <FILE_PROPERTY_3_STRING>);
    jsonDocument.put("filePath", new BytesRef(filePath.toString()));

    bulkRequest.add(client.prepareIndex(indexName, typeName).setSource(jsonDocument));
}

// All documents added above are sent in a single bulk request.
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
============================================================================================

Thanks,
Sandeep
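
One way to keep each bulk request bounded, sketched here under assumptions, is to cap a batch by total bytes rather than by document count and start a new batch once the cap would be exceeded. The class and method names and the 32 MB budget are hypothetical; the code only groups file sizes and makes no Elasticsearch calls, so the actual bulk execute step would go where each batch is completed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteBoundedBatcher {

    /**
     * Groups document sizes (in bytes) into consecutive batches whose total
     * stays at or below maxBatchBytes. A document larger than the budget
     * ends up in a batch of its own.
     */
    public static List<List<Long>> batchBySize(List<Long> docSizes, long maxBatchBytes) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : docSizes) {
            // Close the current batch first if this document would exceed the budget.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Keep the trailing, partially filled batch.
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Four files of 10, 20, 15 and 5 MB with an assumed 32 MB budget per batch.
        List<Long> sizes = Arrays.asList(10 * mb, 20 * mb, 15 * mb, 5 * mb);
        List<List<Long>> batches = batchBySize(sizes, 32 * mb);
        System.out.println(batches.size() + " batches"); // prints "2 batches"
    }
}
```

In this run the first batch holds the 10 MB and 20 MB files and the second holds the 15 MB and 5 MB files, so no single request carries more than 30 MB of file content.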



