Slow to index

Hi, I'm on elasticsearch master with two nodes and one index of 5 shards and 1
replica. I'm rivering in a couchdb database of about 3GB and 400000
documents. It's going very slowly, something like 50 documents a minute. I
can see on the rivering node that CPU usage is maxed out, so it's working
pretty hard.

The river and index are defined as per https://gist.github.com/1154061 and
node config is https://gist.github.com/1156885 .

How would I go about debugging this?

Regards,
Harry

Maybe try increasing the bulk timeout, so that more documents get into a
single bulk request? How complex are your docs?

On Wed, Aug 24, 2011 at 5:46 PM, Harry Waye hwaye@microwayes.net wrote:

Hi, I'm on elasticsearch master with two nodes and one index of 5 shards and 1
replica. I'm rivering in a couchdb database of about 3GB and 400000
documents. It's going very slowly, something like 50 documents a minute. I
can see on the rivering node that CPU usage is maxed out, so it's working
pretty hard.

The river and index are defined as per https://gist.github.com/1154061 and
node config is https://gist.github.com/1156885 .

How would I go about debugging this?

Regards,
Harry

I just tried increasing to 1s to no avail. The documents vary from about 1k
to 150k, with 100 distinct attributes overall, but each document includes
only a small number of these, about 10. I'm going to try a 30s
timeout now to see if that makes a difference, and will play around with the
bulk_size. Are there any other settings that I can twiddle?
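
For reference, a minimal sketch of where those two settings sit in a CouchDB river definition, written in Python only to build and print the JSON. The database and index name my_db, the host and port, and the chosen values are placeholders, not the settings from the gists above. The bulk_size and bulk_timeout keys under "index" are assumed to follow the layout shown in the river's README.

```python
import json


def build_river_meta(db, bulk_size, bulk_timeout):
    """Return a CouchDB river definition as a plain dict.

    Assumption: bulk_size and bulk_timeout live under "index",
    as in the river README; every value here is a placeholder.
    """
    return {
        "type": "couchdb",
        "couchdb": {"host": "localhost", "port": 5984, "db": db},
        "index": {
            "index": db,
            "type": db,
            "bulk_size": str(bulk_size),   # upper bound on docs per bulk request
            "bulk_timeout": bulk_timeout,  # how long to wait for more changes
        },
    }


if __name__ == "__main__":
    # Larger batches with a longer wait, the combination discussed above.
    print(json.dumps(build_river_meta("my_db", 500, "1s"), indent=2))
```

The printed JSON is what would normally be PUT to the river's _meta document; changing only one of the two values at a time makes it easier to see which one matters.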

On Thu, 2011-08-25 at 02:27 -0700, Harry Waye wrote:

I just tried increasing to 1s to no avail. The documents vary from
about 1k to 150k, with 100 distinct attributes overall, but each
document includes only a small number of these, about 10. I'm going
to try a 30s timeout now to see if that makes a difference, and will
play around with the bulk_size. Are there any other settings that I
can twiddle?

How much memory have you allocated to ES? And have you made sure that
swap is disabled, either by turning swap off completely (swapoff), or by
using mlockall?

Swap is your enemy - as soon as any part of the heap is in swap, the JVM
will grind to a halt.

clint
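
A quick way to check the swap point on a Linux box, as a rough sketch: it reads /proc/meminfo and reports how much swap is in use system-wide. It assumes the usual "Key:   value kB" layout of that file, and it cannot tell whether the swapped pages belong to the JVM heap specifically. The function name swap_used_kb is made up for this example.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) given the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])  # first token is the number
    # Missing keys count as 0, so a box without swap reports 0.
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:  # Linux only
        print("swap in use (kB):", swap_used_kb(fh.read()))
```

If this prints a non-zero value while indexing is running, swap is at least a candidate for the slowdown.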

Min/max is set to 500/1000. I hadn't turned swap off so will give that a
try now...

No change, still very slow

Wondering, do you see heavy CPU load also on the other node?

On Thu, Aug 25, 2011 at 1:40 PM, Harry Waye hwaye@microwayes.net wrote:

No change, still very slow

No, just on the rivering node as I recall. I've disbanded the group so
can't verify easily. I'll have to arrange a reunion later to test.

On 25 August 2011 15:29, Shay Banon kimchy@gmail.com wrote:

Wondering, do you see heavy CPU load also on the other node?

We've noticed that the river is pulling in _attachments as well, is it meant
to be doing that?

On 25 August 2011 15:35, Harry Waye harry@arachnys.com wrote:

No, just on the rivering node as I recall. I've disbanded the group so
can't verify easily. I'll have to arrange a reunion later to test.

Yes.
I will submit a pull request to disable it with a parameter next week.
For now, you have to add a script like ctx.doc._attachments = null

Btw, you will have to add the javascript plugin.

Hope this helps
David :wink:
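
A sketch of what such a script amounts to, in Python for illustration only. The river script itself is JavaScript; the expression is spelled here as ctx.doc._attachments = null, and both that spelling and its placement under "couchdb" / "script" are assumptions based on the river README rather than something confirmed in this thread. my_db and the function names are placeholders.

```python
import copy

# Assumed script text: ctx.doc is taken to be the CouchDB document
# inside the river's script context.
STRIP_SCRIPT = "ctx.doc._attachments = null"


def river_meta_with_script(db, script=STRIP_SCRIPT):
    """River definition carrying a script (key placement is an assumption)."""
    return {"type": "couchdb", "couchdb": {"db": db, "script": script}}


def strip_attachments(doc):
    """Python analogue of the script: a copy of doc without _attachments.

    The script nulls the field; this drops the key, which has the same
    effect for indexing purposes in this illustration.
    """
    cleaned = copy.deepcopy(doc)
    cleaned.pop("_attachments", None)
    return cleaned


if __name__ == "__main__":
    doc = {"_id": "a", "title": "t", "_attachments": {"f.pdf": {"stub": True}}}
    print(strip_attachments(doc))
    print(river_meta_with_script("my_db"))
```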

On 25 August 2011 at 17:01, Harry Waye harry@arachnys.com wrote:

We've noticed that the river is pulling in _attachments as well, is it meant to be doing that?

Thanks David

On 25 August 2011 16:17, David Pilato david@pilato.fr wrote:

Yes.
I will submit a pull request to disable it with a parameter next week.
For now, you have to add a script like ctx.doc._attachments = null

Btw, you will have to add the javascript plugin.

Hope this helps
David :wink:

What does that mean, that it pulls attachments? Can we disable pulling
attachments on the _changes stream itself if one does not wish to have them?

Not that it pulls in attachments, just attachment metadata like
content_type, length etc. Each attachment is assigned a hash so you end up
with many, many fields, several for each attachment. I don't think you
can suppress the field; perhaps there's some way of using a view, but we're
just removing it on the Elasticsearch side for now.
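
To make the field growth concrete, here is a small sketch that counts the distinct leaf fields across a couple of documents with and without _attachments. The documents, attachment names and sizes are invented for the example; only the general stub shape (content_type, length, stub) follows what CouchDB returns.

```python
def field_paths(value, prefix=""):
    """Yield the dotted path of every leaf field in a nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix


def distinct_fields(docs):
    """Collect the distinct leaf paths seen across all documents."""
    return {path for doc in docs for path in field_paths(doc)}


def stub(content_type, length):
    # Shape of a CouchDB attachment stub; the values are made up.
    return {"content_type": content_type, "length": length, "stub": True}


docs = [
    {"title": "a", "_attachments": {"report.pdf": stub("application/pdf", 1200)}},
    {"title": "b", "_attachments": {"scan-01.png": stub("image/png", 900),
                                    "scan-02.png": stub("image/png", 950)}},
]

if __name__ == "__main__":
    # title, plus three fields for every distinct attachment name
    print(len(distinct_fields(docs)))
```

Because the attachment name is part of the path, every new file name adds its own set of fields, which is why the count keeps growing with the number of documents.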

On 26 August 2011 15:10, Shay Banon kimchy@gmail.com wrote:

What does that mean, that it pulls attachments? Can we disable pulling
attachments on the _changes stream itself if one does not wish to have them?
