I am trying to index the Wikipedia dumps (downloaded and uncompressed)
using the wikipedia river, but it takes about 6 hours in the cluster.
I have increased the number of nodes and primary shards from 5 to 8, but
performance does not improve.
I have set the refresh interval to -1, the number of replicas to 0,
increased the amount of memory from 256MB/1GB to 3GB/4GB, enabled
mlockall, and raised the buffer from 10% to 30%.
I have changed the river settings, increasing bulk_size from 100 to
10k; the refresh interval is set to 1s.
I have changed the mapping to index only title and text, and set
_source to false.
I do not know what else I can modify to increase the indexing rate; at
the moment it is indexing about 120 docs/sec.
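For reference, the index-level changes described above amount to settings and mapping bodies like the following (a sketch only; the index name and type name "page" are assumptions, and the bodies would be sent to the _settings and _mapping endpoints of the index):

```python
import json

# Settings described in the post: no refresh during the bulk load and no
# replicas (both can be restored once indexing finishes).
settings = {
    "index": {
        "refresh_interval": "-1",  # disable refresh while bulk indexing
        "number_of_replicas": 0,   # add replicas back after the load
    }
}

# Mapping that indexes only title and text, with _source disabled.
mapping = {
    "page": {
        "_source": {"enabled": False},
        "properties": {
            "title": {"type": "string"},
            "text": {"type": "string"},
        },
    }
}

print(json.dumps(settings))
print(json.dumps(mapping))
```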
It all depends on where the bottleneck is. A river, by design, runs on only
one node. Perhaps Elasticsearch is quickly indexing the content, but the
Wikipedia ingestion is the slowdown. Is the Wikipedia indexer threaded or
are all the bulks being executed on one thread? The BulkProcessor supports
multithreading.
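To illustrate the threading point: if every bulk is executed on the river's single thread, ingestion serializes behind indexing. A generic sketch of splitting documents into batches and sending them from several threads (send_bulk here is a stand-in for a real client bulk call, e.g. the Java BulkProcessor, not the river's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def send_bulk(batch):
    # Stand-in for a real bulk request to the cluster; returns the
    # number of documents it "indexed".
    return len(batch)

def index_in_parallel(docs, bulk_size=1000, workers=4):
    """Split docs into bulk_size batches and send them on several
    threads, instead of executing every bulk on one thread."""
    batches = [docs[i:i + bulk_size] for i in range(0, len(docs), bulk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_bulk, batches))

indexed = index_in_parallel([{"title": str(i)} for i in range(2500)])
# indexed == 2500 (three batches of 1000, 1000 and 500)
```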
Have you modified any of the default settings? In particular the merge
settings and throttling. Elasticsearch has low defaults for throttling out
of the box. If you have fast disks, you can easily raise the values. Are
you searching on the index while it is being built? If you are just
building the index, you can increase the number of segments and/or reduce
the amount of merging done at once. But it all depends on where the
bottleneck is. If your disk performance is fine, you might need to look
elsewhere.
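The throttle and merge knobs mentioned above look roughly like this (a sketch only; exact setting names and defaults vary across Elasticsearch versions, and the values shown are illustrative, not recommendations):

```python
# Store throttling: Elasticsearch ships with a conservative merge
# throttle; with fast disks it can be raised.
store_throttle = {
    "indices.store.throttle.max_bytes_per_sec": "100mb",
}

# Merge policy: during a pure bulk load (no concurrent searching) you
# can tolerate more segments and less aggressive merging.
merge_policy = {
    "index.merge.policy.segments_per_tier": 30,
    "index.merge.policy.max_merge_at_once": 30,
}
```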
That was something I was thinking, since the indexing time shown in the
indexing statistics is much lower than the real time the process has been
running. For every 5 seconds the process runs, only a bit more than a
second is spent indexing.
I was thinking of parsing the Wikipedia dumps manually and then using
the Bulk API to index them.
Just one question: does the indexing time shown in the cluster stats
take into account the time used to choose the shard, send the document
over the network, analyze the document, and write the terms, or only the
time spent writing the terms to the index? I am asking because I am
interested in the indexing performance of Elasticsearch itself, and I
could use this time and ignore the time the river spends parsing the
Wikipedia pages.
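For the manual route, the Bulk API body is just newline-delimited JSON: an action line followed by a source line per document, with a trailing newline. A small sketch (the index and type names here are made up):

```python
import json

def to_bulk_body(pages, index="wikipedia", doc_type="page"):
    """Build a _bulk request body: action line + source line for every
    page, each terminated by a newline (the API requires the final one)."""
    lines = []
    for page in pages:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({"title": page["title"], "text": page["text"]}))
    return "\n".join(lines) + "\n"

body = to_bulk_body([{"title": "Anarchism", "text": "..."}])
```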
On 18/03/2014 20:32, "Ivan Brusic" ivan@brusic.com wrote:
You can raise the bulk_size parameter from 100 to a higher value, raise
max_concurrent_bulk from 1 to a higher value, and disable flush_interval
(change it from 5s to -1) to increase the wikipedia river's bulk
performance. Also make sure the bzip2 file is downloading quickly enough
over the wire, which is often not the case (some KB/sec here). Otherwise,
download it to the local file system and point the url parameter at that
file.
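In river-JSON terms, those changes would look roughly like this (a sketch; the local dump path is made up, and the parameter names should be checked against the wikipedia river's README for your version):

```python
# Sketch of a wikipedia river configuration with the tuning described
# above. Values mirror the post; the file path is a placeholder.
river_config = {
    "type": "wikipedia",
    "wikipedia": {
        # Point at a local copy of the dump instead of streaming it
        # over the wire.
        "url": "file:///data/enwiki-latest-pages-articles.xml.bz2",
    },
    "index": {
        "bulk_size": 10000,        # up from the default of 100
        "max_concurrent_bulk": 4,  # up from the default of 1
        "flush_interval": "-1",    # disable the 5s time-based flush
    },
}
```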
The cluster stats show the Lucene indexing time. If you want the overall
indexing time, you should take measurements on the client side.
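Measuring on the client side can be as simple as wrapping the bulk calls in a timer; this captures routing, network, analysis and Lucene time together, unlike the per-shard indexing time in the stats (send_bulk below is a placeholder for the real client call):

```python
import time

def timed_bulk(batches, send_bulk):
    """Time bulk calls end to end at the client and return
    (total documents sent, elapsed wall-clock seconds)."""
    total_docs = 0
    start = time.perf_counter()
    for batch in batches:
        send_bulk(batch)  # real client call goes here
        total_docs += len(batch)
    elapsed = time.perf_counter() - start
    return total_docs, elapsed

docs, secs = timed_bulk([[1] * 100] * 5, send_bulk=lambda b: None)
# docs == 500; docs / secs is the overall client-side docs/sec rate
```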
I had changed the bulk size and the concurrent bulks, but I had changed
the flush interval the other way round... (/facepalm). I will try it,
thank you!!
On 19/03/2014 02:23, "joergprante@gmail.com" joergprante@gmail.com
wrote:
The wikipedia river has a really inefficient regex or two, and it spends
most of its time munging wikitext into plain text. If you know Java, you
might want to dig into it; I've never had the time, unfortunately.
Depending on what you need, you can get the job done with bash and perl.
If you just need the titles from the main namespace and you don't care if
you miss a few, you can use this: Load lots of titles · GitHub
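The linked script isn't reproduced here, but the idea (grab main-namespace titles without parsing wikitext at all) looks roughly like this in Python. Like the original, it is deliberately crude: one regex per line, no XML parser, and it skips anything with a namespace-style prefix, so it will miss the occasional legitimate title containing a colon:

```python
import re

# Titles in non-main namespaces look like "Talk:Foo", "User:Bar", etc.
NAMESPACE_PREFIX = re.compile(r"^[A-Za-z ]+:")
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def main_namespace_titles(dump_lines):
    """Yield main-namespace page titles from MediaWiki XML dump lines."""
    for line in dump_lines:
        m = TITLE_RE.search(line)
        if m and not NAMESPACE_PREFIX.match(m.group(1)):
            yield m.group(1)

sample = [
    "  <title>Anarchism</title>",
    "  <title>Talk:Anarchism</title>",
    "  <title>Albert Einstein</title>",
]
# list(main_namespace_titles(sample)) == ["Anarchism", "Albert Einstein"]
```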
On Wed, Mar 19, 2014 at 2:23 PM, Nikolas Everett nik9000@gmail.com wrote: