How to index a text file larger than the system memory

Hi all,
I am trying to index a text file with the attachment plugin, using the transport client in Java. I have a 4-node cluster.
If I try it through the REST API, curl fails with an argument-too-large error and hangs; from the transport client I get a Java heap size error. I am not sure where I am going wrong.

Please help!

Thanks,
Sanjay B Bagal

How big is the text file and how much memory did you give to elasticsearch?
How are you trying to index it? Can you post your curl command (without the
content of the file obviously)?

In reply to Sanjay's message of Tuesday, January 15, 2013 8:21:16 AM UTC-5.

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/How-to-index-text-file-having-size-more-than-the-system-memory-tp4028184.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.


Hi,
Thanks for the reply.
I have given Elasticsearch 1.5 GB of memory, and the file is about 240 MB. When I try to index it with the curl commands below, I get an error saying the argument is too large; when I index it with the Java transport client, I get a Java heap size error.

The different curl commands I used:

curl -XPUT 'http://10.102.103.196:9200/textattach/?pretty=1' -d '
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "file" : {
          "type" : "attachment"
        }
      }
    }
  }
}
'

{
  "file" : "'`base64 /home/hduser/sanjay/testinput.txt | perl -pe 's/\n/\\n/g'`'"
}
curl -XPOST 'http://10.102.103.196:9200/textattach/doc?pretty=1' --data-binary '
{
  "file" : "'`base64 /home/hduser/inputes.txt | perl -pe 's/\n/\\n/g'`'"
}
'

curl -s -XPOST localhost:9200/_bulk --data-binary @inputes.txt

{
  "my_attachment" : {
    "_content_type" : "text/plain",
    "_name" : "resource/name/of/my.txt",
    "content" : "... base64 encoded attachment ..."
  }
}

curl -XPOST 'http://localhost:9200/textattach/doc?pretty=1' -d '
{
  "file" : "'`base64 /home/hduser/inputes.txt | perl -pe 's/\n/\\n/g'`'"
}
'

Thanks in advance
Sanjay
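
The argument-too-large error in the curl attempts above most likely comes from the shell rather than from Elasticsearch: the backtick substitution expands the whole base64 output into curl's command line, and the operating system caps the size of a command line. A minimal sketch of one workaround, assuming the same host and index as above, is to write the JSON body to a file and let curl read it with @. The helper name make_body and the /tmp/body.json path are hypothetical.

```shell
#!/bin/sh
# make_body is a hypothetical helper (not part of curl or Elasticsearch).
# It writes {"file":"<base64 of input>"} to a file, so the encoded content
# never has to be passed as a command-line argument.
make_body() {
    input=$1    # file to attach
    body=$2     # where to write the JSON request body
    {
        printf '{"file":"'
        # Base64 output uses only JSON-safe characters; remove line wraps.
        base64 < "$input" | tr -d '\n'
        printf '"}\n'
    } > "$body"
}

# Usage (assumes the textattach index from this thread already exists):
# make_body /home/hduser/inputes.txt /tmp/body.json
# curl -XPOST 'http://10.102.103.196:9200/textattach/doc?pretty=1' \
#      --data-binary @/tmp/body.json
```

This only gets the request past curl. A single 240 MB document can still exhaust a 1.5 GB heap on the server side.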

In reply to sbbagal's message of Tue, 2013-01-29 at 06:04 -0800:

You'll either have to break the document down into smaller documents, or
get a lot more memory :)

clint
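
Breaking the file down could be sketched as follows, assuming the text can be cut on line boundaries. split_file is a hypothetical helper, and the 1000-line default is an arbitrary starting point, not a figure from this thread.

```shell
#!/bin/sh
# split_file is a hypothetical helper: it cuts a large text file into
# pieces of at most $lines lines each, without splitting any line.
split_file() {
    input=$1            # large text file
    outdir=$2           # directory that receives the pieces
    lines=${3:-1000}    # lines per piece (arbitrary default)
    mkdir -p "$outdir"
    # -a 4 gives four-letter suffixes (part-aaaa, part-aaab, ...),
    # leaving room for many more pieces than the two-letter default.
    split -a 4 -l "$lines" "$input" "$outdir/part-"
}

# Usage:
# split_file /home/hduser/inputes.txt /tmp/parts 1000
# ls /tmp/parts
```

Each piece is then indexed as its own document, so a search returns the matching piece rather than the whole file.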

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Clinton,
If I break the document down into smaller documents, how will they be indexed? I mean, will each one be indexed as a separate document?
One more question: is there any other way to speed up indexing? How can I calculate the time taken for indexing? Which method is more efficient for indexing documents, curl (the REST API) or the Java transport client? Does a curl command work in a distributed way or against a single node? How can I index documents in a distributed manner?

Thanks in advance!

Sanjay

On Wed, 2013-01-30 at 01:50 -0800, sbbagal wrote:

> Hi Clinton,
> If I break the document down into smaller documents, how will they be indexed?
> I mean, will each one be indexed as a separate document?

Yes.

> One more question: is there any other way to speed up indexing?

I'd normally suggest using bulk indexing, but your documents are already
huge, so processing several documents at once will probably just
result in more OOMs.
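
For reference, a bulk request body is newline-delimited JSON: one action line followed by one source line per document. The sketch below builds such a file from pieces that have already been split; build_bulk is a hypothetical helper, and the textattach index, doc type and file field are assumptions carried over from the earlier messages. Newer Elasticsearch versions no longer accept _type in the action line.

```shell
#!/bin/sh
# build_bulk is a hypothetical helper: for every input file it appends an
# action line and a source line (base64 content in "file") to a bulk file.
build_bulk() {
    out=$1
    shift
    : > "$out"    # start with an empty bulk file
    for part in "$@"; do
        {
            # Action line: index into textattach/doc (names assumed).
            printf '{"index":{"_index":"textattach","_type":"doc"}}\n'
            # Source line: the piece, base64-encoded on a single line.
            printf '{"file":"'
            base64 < "$part" | tr -d '\n'
            printf '"}\n'
        } >> "$out"
    done
}

# Usage (only a few pieces per request, to keep memory use low):
# build_bulk /tmp/bulk.json /tmp/parts/part-aaaa /tmp/parts/part-aaab
# curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @/tmp/bulk.json
```

Because the whole request is held in memory while it is processed, each bulk file should stay small; that is the same concern as the OOM warning above.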

> How can I calculate the time taken for indexing?

By trial and error.

> Which method is more efficient for indexing documents, curl (the REST API)
> or the Java transport client?

The Java client may be slightly faster than the REST API.

> Does a curl command work in a distributed way or against a single node?
> How can I index documents in a distributed manner?

ES does this out of the box. You can speak to any node in ES and it
will forward the request to the appropriate node (assuming you are
running more than one node).

But really, you need more memory and probably more powerful boxes. You
can't expect something with the power of a calculator to perform well.

clint
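
Timing by trial and error can be as simple as wrapping each attempt in a wall-clock measurement and comparing runs with different piece sizes. A minimal sketch; time_run is a hypothetical helper, and the commented curl line assumes a body file prepared beforehand.

```shell
#!/bin/sh
# time_run is a hypothetical helper: it runs the given command, prints the
# elapsed wall-clock time in whole seconds, and returns the command's status.
time_run() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed_seconds=$((end - start))"
    return $status
}

# Usage (any node of the cluster can receive the request):
# time_run curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @/tmp/bulk.json
```

The resolution is one second, which is coarse for small requests but enough to compare runs that index many pieces.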


Thanks Clinton,
I am trying to use the Java client to index documents.
Can you please suggest the order in which I should do this? I mean the sequence of mapping, index creation, and document indexing. Please also give me a hint for a generic mapping for any type of document (.pdf, .txt, .doc, etc.) using the Java API.
Thanks in advance.

In reply to sbbagal's message of Thu, 2013-01-31 at 21:00 -0800:

I suggest you start by reading the documentation. Come back when you
have a specific problem that you are struggling with.

Clint
