Skip "action_and_metadata" for every document in bulk indexing via HTTP API & other bulk loading questions


(Aditya Alurkar) #1

I am trying to bulk load a large number of documents (15 billion+) into an ES
cluster via the HTTP bulk API. Per the docs I generate batches of documents
to comply with the format

{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}

Is it possible to mark the {action....} clause for an entire group of
{document......} i.e. generate the payload to look like

{action.....}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}

If this is not supported:
1 - are there alternatives which would allow me to index my data w/o having
to repeat the same action, which in my case is "index"?
2 - I am currently using urllib2 from Python to insert the documents via the
HTTP API; would pyes be better/more efficient?
3 - in addition to disabling or increasing index.refresh_interval, dropping
the replicas, and evenly distributing the posts to all nodes of the cluster,
are there any other optimizations I should consider to improve perf?

-Adi

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Honza Král) #2

Hi,

On Wed, Nov 13, 2013 at 1:44 AM, Aditya Alurkar alurkar@gmail.com wrote:

I am trying to bulk load a large number of documents (15Billion+) into ES
cluster via the http bulk api. Per the docs I generate batches of documents
to comply with the format

{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}
{action.....}
{document......}

Is it possible to mark the {action....} clause for an entire group of
{document......} i.e. generate the payload to look like

{action.....}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}
{document......}

No, unfortunately this is not supported: the receiving node would then
have to parse every single line, defeating the purpose of the format. As it
stands, the node that receives your request only has to parse every other
line (the action lines) and can forward the document payloads unparsed to
the nodes that will actually handle them.
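To make the format concrete, here is a minimal sketch (in Python, with placeholder index/type names) of building a newline-delimited bulk body with one action line per document:

```python
import json

def build_bulk_body(docs, index="my_index", doc_type="my_type"):
    # One action line, then one document line, per document; the bulk body
    # must be terminated by a trailing newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_bulk_body([{"title": "a"}, {"title": "b"}])
```

This body is what gets POSTed to the `_bulk` endpoint in the question above.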

If this is not supported :
1 - are there alternatives which would allow me to index my data w/o
having to repeat the same action, which in my case is "index" ?

not on the protocol level; if you are using Python, however, we can take some
of the pain away via elasticsearch-py ([0])

0 -
http://elasticsearch-py.readthedocs.org/en/latest/helpers.html#elasticsearch.helpers.bulk_index
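That helper is typically fed a generator of action dicts, so the full document stream never has to sit in memory. A minimal sketch (the names and chunk size are illustrative; the client call is commented out because it needs a running cluster):

```python
def generate_actions(docs, index="my_index", doc_type="my_type"):
    # Lazily yield one action dict per document so a 15B-document stream
    # never has to be materialized in memory at once.
    for doc in docs:
        yield {"_index": index, "_type": doc_type, "_source": doc}

# With the official client installed and a cluster reachable, the helper
# consumes the generator in fixed-size chunks, e.g.:
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import bulk
#   bulk(Elasticsearch(), generate_actions(my_docs), chunk_size=5000)
```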

2 - I am currently using urllib2 via python to insert the documents via
http api, would pyes be better/more efficient?

we recommend the official client. If you are aiming for the best possible
speed, consider the thrift transport - [1]

1 -
http://elasticsearch-py.readthedocs.org/en/latest/transports.html#transport-classes

3 - in addition to disabling or increasing index.refresh_interval,
dropping the replica and evenly distributing the posts to all nodes of the
cluster are there any other optimizations I should consider to improve perf?

also disable flush for the duration of the bulk indexing; don't forget to
turn it back on afterwards.
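A sketch of what toggling those settings might look like (setting names follow the 0.90-era docs; the values are illustrative, and the replica/refresh settings from the question are included for completeness):

```python
import json

# Applied to the index settings endpoint before a large bulk load:
bulk_load_settings = {
    "index": {
        "refresh_interval": "-1",        # no near-real-time refresh
        "translog.disable_flush": True,  # no translog flush during the load
        "number_of_replicas": 0,         # re-add replicas afterwards
    }
}

# Restored once the load completes:
restore_settings = {
    "index": {
        "refresh_interval": "1s",
        "translog.disable_flush": False,
        "number_of_replicas": 1,
    }
}

payload = json.dumps(bulk_load_settings)
```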

-Adi



(Jörg Prante) #3

Please use the official python clients.

Also monitor the network interface if you index from a remote host.

You have 15B+ docs, and I assume that amounts to quite a few GBs. If the
network is saturated and you can spare CPU cycles, use gzip compression on
the HTTP bulk requests. If you do not feel like using the official client,
check whether httplib2 is a better choice than urllib2; it supports
compression.
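For illustration, compressing the bulk payload before sending is a one-liner with the standard library (a sketch; the server side must accept compressed bodies, e.g. via the `http.compression` setting):

```python
import gzip

def compress_bulk_body(body):
    # gzip the newline-delimited bulk payload; the request should then carry
    # a Content-Encoding: gzip header so the node decompresses it on arrival.
    return gzip.compress(body.encode("utf-8"))

body = '{"index": {}}\n{"title": "a"}\n'
compressed = compress_bulk_body(body)
headers = {"Content-Type": "application/json", "Content-Encoding": "gzip"}
```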

Also check that you use fast JSON encoding on the client side. ujson is a
fast drop-in replacement for the slow standard json Python lib.
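A common pattern for that drop-in replacement (ujson is optional; the fallback keeps the code working without it):

```python
# Prefer ujson when installed; fall back to the standard library. Both
# expose the same dumps/loads surface for plain dicts, lists and scalars.
try:
    import ujson as json
except ImportError:
    import json

doc = {"title": "example", "count": 3}
line = json.dumps(doc)
```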

For fast persisting in the cluster, use SSDs instead of spinning disks. On
file systems backed by spinning disks, you should disable access-time
updates (noatime) in the Linux mount options for the data directory for
better I/O throughput.
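A sketch of the mount change (the data path and device are placeholders):

```shell
# Remount the data-directory file system without access-time updates:
mount -o remount,noatime /data/elasticsearch

# Or make it permanent in /etc/fstab:
# /dev/sdb1  /data/elasticsearch  ext4  defaults,noatime  0  2
```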

Jörg



(Aditya Alurkar) #4

This is great thank you!

BTW do you see any value in supporting dedicated REST endpoints of the form:

/<index>/<type>/_bulk/<action>

which would make it easy to just stream data where _ids need to be
auto-generated.
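For what it's worth, the existing bulk endpoint already gets close to this when _ids are auto-generated: POSTing to an index/type-scoped bulk URL lets every action line collapse to an empty `{"index": {}}`. A sketch (index and type would come from the URL path):

```python
import json

def build_auto_id_bulk_body(docs):
    # When POSTing to a URL like /{index}/{type}/_bulk, the index and type
    # come from the path and _id is auto-generated, so every action line
    # is the same constant.
    lines = []
    for doc in docs:
        lines.append('{"index": {}}')
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = build_auto_id_bulk_body([{"v": 1}, {"v": 2}])
```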

Also, one of the recommendations you mentioned is to disable flushing. There
is no way for the entire dataset to fit in RAM on the ES cluster; what would
happen if I were to disable flush? Were you referring to disabling flush
completely, or to increasing the flush interval to a larger value?

-Adi




(Aditya Alurkar) #5

Thank you for the recommendations.

I am not saturating the network at all; I am CPU-bound on the ES cluster
side during the load phase. These servers are dedicated to ES and are
currently only responsible for the initial loading of the data.

Unfortunately I do not have the luxury of solid state, but I will look at
storage-layer optimizations when that becomes the bottleneck.

-Adi


