Indexing very large document in ES


(shgeorge) #1

Hi ES team
I am facing issues indexing large documents (~35 MB). Is there any size limit on the documents that we index? We are using the nested type and nested queries, and they work fine for smaller documents. But when I try a large document, the ES client hangs. I see the same issue when I use a curl command.

Do you have any suggestions for indexing large documents? I read about splitting them into parent/child documents.
If we go this route, can we query the parent using the has_child filter and return fields from both the parent and the matching children?

Is the delete operation atomic?

Appreciate your inputs.

Thanks
Sheeba


(Igor Motov) #2

Hi Sheeba,

There are no explicit size limitations. As long as you have enough memory,
it should work. Moreover, 35MB doesn't seem excessively large. It's
not very clear from your question when the ES client hangs (during indexing or
during searching) and what you mean by hanging. Does it take a very long time
to execute, or does it never come back no matter how long you wait? Assuming
that it hangs during searching, could you run the hot_threads command (http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-hot-threads/) while your long-running query is hanging and post the output here?
Any chance you can provide a repro of this issue?

Igor
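
For readers who want to capture the hot threads output Igor asks for, here is a minimal sketch in Python. It assumes a node reachable at http://localhost:9200 (an assumption, adjust to your cluster); the helper names are made up for illustration, and the `threads` and `interval` values are just the usual defaults.

```python
import urllib.request


def hot_threads_url(base_url="http://localhost:9200", threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads API (base_url is an assumption)."""
    base = base_url.rstrip("/")
    return f"{base}/_nodes/hot_threads?threads={threads}&interval={interval}"


def fetch_hot_threads(base_url="http://localhost:9200", timeout=30):
    """Fetch the plain-text hot threads report from a running node."""
    with urllib.request.urlopen(hot_threads_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage while the slow request is still in flight (needs a live node):
#   print(fetch_hot_threads())
```

Running it a few times while the request is stuck should give a rough picture of where the node is spending its time.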

On Tuesday, July 23, 2013 2:03:03 AM UTC-4, george wrote:

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Indexing-very-large-document-in-ES-tp4038484.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(shgeorge) #3

Hi Igor
Thanks for your reply. Sorry for not being clear. I am facing the issue while indexing the data. We have a 6GB ES_HEAP_SIZE. I am using the Java client. I got an out-of-memory exception, and another time I tried I got a TimeoutException. I will try a few more options and will update you if I am able to insert the document.

In our example, we have a type called "nested" and around 160K nested documents. In case of such big documents, do you suggest using a parent/child relationship? Reading the docs, it seems like we can retrieve data either from the parent or from the child, but not from both, using the has_child/has_parent filter. Does this mean that if we need data from both parent and child docs, we have to issue 2 queries: one query to fetch all parent documents that have child documents matching a particular filter, then another query to fetch the matching child documents which have the above parent? Is there any way to do this in a single query?

Thanks
Sheeba
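
If the parent/child route is taken, one possible way to avoid sending a single very large request is to index the parent once and send the ~160K points as child documents in batches through the bulk API. The sketch below only builds the request bodies. The index name `docs`, the type names `parent_doc` and `point`, and the batch size are all hypothetical, and the `_type`/`_parent` action metadata is assumed from the older parent/child releases of the era of this thread; newer versions model this differently, so check the docs for your version.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_bodies(parent_id, parent_doc, points, index="docs", batch_size=5000):
    """Yield bulk request bodies: the parent first, then the points in batches."""
    # Hypothetical type names; `_type` is assumed from older releases.
    parent_action = {"index": {"_index": index, "_type": "parent_doc", "_id": parent_id}}
    yield json.dumps(parent_action) + "\n" + json.dumps(parent_doc) + "\n"
    for batch in chunked(points, batch_size):
        lines = []
        for point in batch:
            # `_parent` ties each child to its parent (assumed older metadata form).
            action = {"index": {"_index": index, "_type": "point", "_parent": parent_id}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(point))
        # The bulk format is newline-delimited and must end with a newline.
        yield "\n".join(lines) + "\n"
```

Each yielded body would be POSTed to the `_bulk` endpoint as a separate request, so no single request has to carry the whole 35 MB document.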


(Igor Motov) #4

Hi Sheeba,

I need to know more about the structure of your documents, how often
different parts of your document change and what queries you run on your
documents in order to recommend one or the other. But your understanding of
the parent/child issue is correct: you will have to execute 2 queries in order
to get both parents and children, but you can combine them into a single
multi-search request (http://www.elasticsearch.org/guide/reference/api/multi-search/).

Igor
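
As a rough illustration of the multi-search request mentioned above, the sketch below serializes two searches into the newline-delimited body that the `_msearch` endpoint expects: a header line followed by a body line for each search. The index name `docs`, the child type `point`, and the `status` field are hypothetical placeholders, not taken from Sheeba's mapping.

```python
import json


def build_msearch_body(searches):
    """Serialize (header, body) pairs into the newline-delimited msearch format."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    # The request body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical child criteria shared by both searches.
child_criteria = {"term": {"status": "active"}}

# Search 1: parents that have at least one matching child.
parents_search = (
    {"index": "docs"},
    {"query": {"has_child": {"type": "point", "query": child_criteria}}},
)
# Search 2: the matching children themselves.
children_search = (
    {"index": "docs"},
    {"query": child_criteria},
)

body = build_msearch_body([parents_search, children_search])
```

The resulting `body` would be sent in one POST to `/_msearch`, and the response carries one entry per search in the same order.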

On Wednesday, July 24, 2013 2:05:51 AM UTC-4, george wrote:



(shgeorge) #5

Hi Igor
We inserted a document of size 10MB (the original document is around 45MB).
Indexing took around 13 secs and searching took around 800ms. We have some
mvel scripts running on the ES server as well.
I am attaching a sample doc that we insert in our ES node and the
corresponding query. This doc is small for the sake of clarity. In our big
docs the # of points will increase to almost 160K. Please let me know if
there are any issues in our query.

Can we use multisearch so that we get a parent AND only the child docs of
that parent matching a filter criterion? I thought multisearch queries have
to be independent of each other. We have a first query that matches the
parent which has at least one child matching a criterion, and another query to
fetch the matched child documents which have the parent from the previous
query. How can these be combined? It's a join based on parent id. Is it
possible in multisearch?

Thanks
Sheeba
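
On the question of dependence between the two searches: multisearch entries are indeed independent, so the second cannot consume IDs returned by the first. If both searches use the same child criteria, though, the second may not need those IDs at all, because every child matching the criteria belongs to a parent that the has_child search would return (pagination and any extra parent-side filters aside). The join on parent ID can then happen on the client. A minimal sketch follows; how a child hit exposes its parent ID varies by version, so `parent_id_of` is a caller-supplied function and the hit shapes in the usage lines are simplified assumptions.

```python
from collections import defaultdict


def join_parents_with_children(parent_hits, child_hits, parent_id_of):
    """Attach each child hit to its parent hit, joining on the parent ID."""
    children_by_parent = defaultdict(list)
    for hit in child_hits:
        children_by_parent[parent_id_of(hit)].append(hit)
    return [
        {"parent": parent, "children": children_by_parent.get(parent["_id"], [])}
        for parent in parent_hits
    ]


# Usage with simplified, hypothetical hits:
parents = [{"_id": "p1"}, {"_id": "p2"}]
children = [
    {"_id": "c1", "parent": "p1"},
    {"_id": "c2", "parent": "p1"},
]
joined = join_parents_with_children(parents, children, lambda hit: hit["parent"])
```

With this approach the two searches can go into one multisearch request, and the grouping costs a single pass over the child hits.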

On Wed, Jul 24, 2013 at 5:32 PM, Igor Motov-3 [via ElasticSearch Users] <
ml-node+s115913n4038613h55@n3.nabble.com> wrote:


--
Sheeba Ann George


(Sergio Henrique) #6

Hi everyone!

I just want to bump this post.

I have the same problem. My docs have a nested structure and the nested
part can have more than 5K records. When I search in Elasticsearch,
the response time is very slow for these large docs with the nested
type.

Another interesting thing: when I do a sort operation on this kind of
document, the response time is very slow too.

Does anyone have an idea of the best practices for using nested types in
Elasticsearch?

At the moment we have a cluster with 4 machines, and each machine has 4GB of
RAM. Our cluster starts to slow down when searches on the complex documents
begin.

On Thursday, July 25, 2013 5:02:27 AM UTC-3, george wrote:


query.txt (12K): http://elasticsearch-users.115913.n3.nabble.com/attachment/4038631/0/query.txt
IBI_small.txt.zip (101K): http://elasticsearch-users.115913.n3.nabble.com/attachment/4038631/1/IBI_small.txt.zip




(system) #7