How to index a text file larger than the system memory

sbbagal · January 15, 2013, 1:21pm

Hi all,
I am trying to index a text file with the attachment plugin, using the transport client in Java. I have a 4-node cluster.
If I try it through the REST API, curl gives an "argument is too large" type of error and hangs, and when I try from the transport client it gives a Java heap size error. I am not sure where I am going wrong.

Please help!

Thanks,
Sanjay B Bagal

Igor_Motov · January 15, 2013, 7:19pm

How big is the text file and how much memory did you give to elasticsearch?
How are you trying to index it? Can you post your curl command (without the
content of the file obviously)?

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/How-to-index-text-file-having-size-more-than-the-system-memory-tp4028184.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

sbbagal · January 29, 2013, 2:04pm

Hi,
Thanks for the reply.
I gave Elasticsearch 1.5 GB of memory, and the file is about 240 MB. When I try to index it with the curl commands below, I get an error saying the argument is too long, and when I try to index it with the Java transport client I get a Java heap size error.

These are the different curl commands I tried:

curl -XPUT 'http://10.102.103.196:9200/textattach/?pretty=1' -d '
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "file" : {
          "type" : "attachment"
        }
      }
    }
  }
}
'

{
"file" : 'base64 /home/hduser/sanjay/testinput.txt | perl -pe 's/\n/\\n/g''
}
curl -XPOST 'http://10.102.103.196:9200/textattach/doc?pretty=1' --data-binary '
{
"file" : 'base64 /home/hduser/inputes.txt | perl -pe 's/\n/\\n/g''
}
'

curl -s -XPOST localhost:9200/_bulk --data-binary @inputes.txt

{
  "my_attachment" : {
    "_content_type" : "text/plain",
    "_name" : "resource/name/of/my.txt",
    "content" : "... base64 encoded attachment ..."
  }
}

curl -XPOST 'http://localhost:9200/textattach/doc?pretty=1' -d '
{
"file" : 'base64 /home/hduser/inputes.txt | perl -pe 's/\n/\\n/g''
}
'

Thanks in advance
Sanjay

Clinton_Gormley · January 30, 2013, 9:00am

You'll either have to break the document down into smaller documents, or
get a lot more memory
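
As a rough sketch of the first option, assuming GNU coreutils (`split -C`, `base64`) and the `file` attachment field from the mapping posted earlier in the thread, a large text file could be cut into line-aligned pieces, with each piece wrapped in its own small JSON document. The function name `make_chunk_docs` and the `chunk_` prefix are made up for illustration.

```shell
# make_chunk_docs INPUT OUTDIR SIZE   (hypothetical helper, not part of any tool)
#   INPUT   large text file to split
#   OUTDIR  directory that receives the pieces and their JSON documents
#   SIZE    upper bound per piece in the form GNU `split -C` accepts, e.g. 10M
make_chunk_docs() {
  local input=$1 outdir=$2 size=$3 piece
  mkdir -p "$outdir"
  # -C (GNU split) keeps whole lines together while staying under SIZE.
  split -C "$size" "$input" "$outdir/chunk_"
  for piece in "$outdir"/chunk_*; do
    case "$piece" in *.json) continue ;; esac   # skip documents left by an earlier run
    # Stream the base64 text straight into the file, so the encoded piece
    # is never passed as a command-line argument.
    {
      printf '{"file":"'
      base64 < "$piece" | tr -d '\n'   # base64 wraps its output; JSON strings cannot hold raw newlines
      printf '"}\n'
    } > "$piece.json"
  done
}
```

Each resulting file can then be sent on its own, for example with `curl -XPOST 'http://localhost:9200/textattach/doc?pretty=1' --data-binary @out/chunk_aa.json`. Because curl reads the body from the file, the encoded text never goes through the shell's argument list, which is the usual cause of "argument too long" errors.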

clint

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

sbbagal · January 30, 2013, 9:50am

Hi Clinton,
If I break the document down into smaller documents, how will it be indexed? I mean, will each piece be indexed as a separate document?
One more question: is there any other way to speed up indexing documents? How can I calculate the time needed for indexing? Which method is the more efficient way to index documents, curl (REST API) or the Java transport client? Does a curl command work in a distributed way or against a single node? How can I index documents in a distributed manner?

Thanks in advance!

Sanjay

Clinton_Gormley · January 30, 2013, 10:12am

On Wed, 2013-01-30 at 01:50 -0800, sbbagal wrote:

Hi Clinton,
If I break the document down into smaller documents, how will it be indexed?
I mean, will each piece be indexed as a separate document?

Yes.

One more question: is there any other way to speed up indexing documents?

I'd normally suggest using bulk indexing, but your documents are already
huge, and so processing several documents at once will probably just
result in more OOMs.
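
For reference, once a file has been cut into small pieces, a bulk request body is just newline-delimited JSON: one action line followed by one source line per document. The sketch below assembles such a body from small one-document JSON files. The function name `build_bulk_body` is hypothetical, and the index and type names are the ones used in the commands earlier in the thread. The caveat above still applies: each request has to stay small enough to fit in the heap.

```shell
# build_bulk_body INDEX TYPE OUT DOC...   (hypothetical helper)
#   INDEX, TYPE  target index and type written into every action line
#   OUT          file that receives the bulk body
#   DOC...       JSON files holding one document each
build_bulk_body() {
  local index=$1 type=$2 out=$3 doc
  shift 3
  : > "$out"   # start with an empty output file
  for doc in "$@"; do
    # Action line: tells Elasticsearch what to do with the line that follows.
    printf '{"index":{"_index":"%s","_type":"%s"}}\n' "$index" "$type" >> "$out"
    # Source line: a bulk body needs each document on exactly one line.
    tr -d '\n' < "$doc" >> "$out"
    printf '\n' >> "$out"
  done
}
```

The resulting file could then be posted with `curl -s -XPOST localhost:9200/_bulk --data-binary @bulk.ndjson`, passing only a few small documents per call.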

How can I calculate the time needed for indexing?

By trial and error.
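
One simple way to run such trials is to time a single request and extrapolate, keeping in mind that the numbers depend heavily on hardware and document size. A minimal sketch follows; `time_cmd` is a made-up helper, not part of curl or Elasticsearch.

```shell
# time_cmd COMMAND [ARG...]   (hypothetical helper)
# Runs the command, then prints the elapsed wall-clock time in whole seconds
# on stderr, so the command's own stdout stays untouched.
time_cmd() {
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?          # remember the exit status without aborting
  end=$(date +%s)
  echo "elapsed: $((end - start))s" >&2
  return "$status"           # pass the command's exit status through
}
```

For example: `time_cmd curl -s -XPOST 'http://localhost:9200/textattach/doc' --data-binary @chunk_aa.json`, where `chunk_aa.json` stands for any small JSON document on disk.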

Which method is the more efficient way to index documents, curl (REST API)
or the Java transport client?

The Java client may be slightly faster than the REST API.

Does a curl command work in a distributed way or against a single node?
How can I index documents in a distributed manner?

ES does this out of the box. You can speak to any node in ES and it
will forward the request to the appropriate node (assuming you are
running more than one node)

But really, you need more memory and probably more powerful boxes. You
can't expect something with the power of a calculator to perform well.

clint


sbbagal · February 1, 2013, 5:00am

Thanks Clinton,
I am trying to use the Java client to index documents.
Can you please suggest how I should do that? I mean the sequence of creating the mapping, creating the index, and indexing the documents. Please give me some hints for a generic mapping for any type of document (.pdf, .txt, .doc, etc.) using the Java API.
Thanks in advance.

Clinton_Gormley · February 1, 2013, 9:27am

I suggest you start by reading the documentation. Come back when you
have a specific problem that you are struggling with.

Clint

