Some questions on wikipedia river and cluster config


(jorge canellas-2) #1

Hi!

I am trying to index the Wikipedia dumps (downloaded and uncompressed)
using the wikipedia river, but it takes about 6 hours on the cluster.
I have increased the number of nodes and primary shards from 5 to 8, but
the performance does not improve.

  • I have set the refresh time to -1 and the number of replicas to 0,
    increased the amount of memory from 256MB/1GB to 3GB/4GB, enabled
    mlockall, and raised the indexing buffer from 10% to 30%.
  • I have changed the river settings, increasing the bulk_size from 100 to
    10k; the refresh interval is set to 1s.
  • I have changed the mapping to index only title and text, and set
    _source to false.

I do not know what else I can modify to increase the indexing rate; at
the moment it is indexing about 120 docs/sec.

Any ideas?

Kind regards,

Jorge

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c9bd6437-3241-46a4-838e-5327067bc3cc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Ivan Brusic) #2

It all depends on where the bottleneck is. A river, by design, runs on only
one node. Perhaps Elasticsearch is quickly indexing the content, but the
Wikipedia ingestion is the slowdown. Is the Wikipedia indexer threaded or
are all the bulks being executed on one thread? The BulkProcessor supports
multithreading.

Have you modified any of the default settings? In particular the merge
settings and throttling. Elasticsearch has low defaults for throttling out
of the box. If you have fast disks, you can easily raise the values. Are
you searching on the index while it is being built? If you are just
building the index, you can increase the number of segments and/or reduce
the amount of merging done at once. But it all depends on where the
bottleneck is. If your disk performance is fine, you might need to look
elsewhere.

Cheers,

Ivan
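
As a rough illustration of the merge and throttling knobs mentioned in the
reply above, the sketch below builds two settings bodies in Python. The
setting names (indices.store.throttle.max_bytes_per_sec,
index.merge.policy.segments_per_tier, index.merge.policy.max_merge_at_once)
are assumed from the Elasticsearch 1.x era, and the values are placeholders,
so check both against the documentation for the version in use.

```python
import json


def bulk_load_settings(throttle="100mb", segments_per_tier=20, max_merge_at_once=5):
    """Return (cluster_settings, index_settings) bodies for a bulk-load phase.

    Setting names are assumed from Elasticsearch 1.x; values are placeholders.
    """
    # Raise the store throttle for fast disks (cluster-wide, transient).
    cluster = {
        "transient": {"indices.store.throttle.max_bytes_per_sec": throttle}
    }
    # Allow more segments per tier and merge fewer segments at once.
    index = {
        "index.merge.policy.segments_per_tier": segments_per_tier,
        "index.merge.policy.max_merge_at_once": max_merge_at_once,
    }
    return cluster, index


cluster_body, index_body = bulk_load_settings()
print(json.dumps(cluster_body, indent=2))
print(json.dumps(index_body, indent=2))
```

The two bodies are kept separate because the throttle is assumed to be a
cluster-level setting while the merge policy is per index.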



(jorge canellas-2) #3

Hi!

That is what I suspected, since the indexing time shown in the indexing
statistics is much lower than the wall-clock time the process has been
running: for every 5 seconds the process runs, only a bit more than one
second is spent indexing.
I was thinking of parsing the Wikipedia dumps manually and then using the
bulk API to index them.
Just one question: does the indexing time shown in the cluster stats
include the time used to choose the shard, send the document over the
network, analyze the document, and write the terms? Or only the time spent
writing the terms to the index? I am asking because I am interested in the
indexing performance of Elasticsearch itself, so I could use this figure
and ignore the time the river spends parsing the Wikipedia pages.
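
For the "parse manually, then use the bulk API" route, a minimal sketch of
building bulk request bodies might look like the following. The bulk format
(one action line followed by one source line, each newline-terminated) is
the documented one; the index name "wiki", the type "page", and the helper
names are hypothetical, and actually sending the body over HTTP is left out.

```python
import json


def bulk_body(docs, index, doc_type):
    """Serialize (doc_id, source) pairs into a bulk API request body.

    Each document becomes two lines: an action line and the source line.
    Every line, including the last one, ends with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "".join(line + "\n" for line in lines)


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


pages = [("1", {"title": "A", "text": "first page"}),
         ("2", {"title": "B", "text": "second page"})]
for batch in chunked(pages, 1):
    print(bulk_body(batch, "wiki", "page"), end="")
```

Batching on the client keeps the parsing cost separate from the indexing
cost, which is the separation the question above is after.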


(Jörg Prante) #4

To increase wikipedia river bulk performance, you can raise bulk_size from
100 to a higher value, raise max_concurrent_bulk from 1 to a higher value,
and disable flush_interval by changing it from 5s to -1. Also make sure
the bzip2 file is downloading quickly enough over the wire, which is often
not the case (some KB/sec here). Otherwise, download it to the local file
system and point the url parameter at this file.
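
A sketch of what such a river definition could look like, built in Python.
The parameter names bulk_size, max_concurrent_bulk, flush_interval and url
come from the text above; where they sit in the body (under "index" and
"wikipedia"), the chosen values, and the file path are assumptions, so
compare with the river's README before using it.

```python
import json


def wikipedia_river_meta(url, bulk_size=1000, max_concurrent_bulk=4,
                         flush_interval="-1"):
    """Build a river definition body (layout is an assumption, see above)."""
    return {
        "type": "wikipedia",
        "wikipedia": {"url": url},
        "index": {
            "bulk_size": bulk_size,
            "max_concurrent_bulk": max_concurrent_bulk,
            "flush_interval": flush_interval,  # -1 disables interval flushing
        },
    }


# Hypothetical local path; a local file avoids a slow download.
meta = wikipedia_river_meta("file:///data/wikipedia-dump.xml.bz2")
print(json.dumps(meta, indent=2))
```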

The cluster stats show the Lucene indexing time. If you want overall
indexing time, you should take measurements at client side.

Jörg
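
Taking the measurement at the client side, as suggested above, can be as
simple as timing the loop that sends the batches. This is a minimal sketch;
send_batch is a hypothetical callable standing in for whatever actually
posts a batch to the cluster.

```python
import time


def measure_throughput(send_batch, batches, clock=time.perf_counter):
    """Send every batch and return (docs, seconds, docs_per_sec).

    The elapsed time is wall-clock time seen by the client, so it includes
    parsing, network, and indexing, unlike the Lucene-only cluster stat.
    """
    docs = 0
    start = clock()
    for batch in batches:
        send_batch(batch)
        docs += len(batch)
    elapsed = clock() - start
    rate = docs / elapsed if elapsed > 0 else float("inf")
    return docs, elapsed, rate


# Stand-in sender that does nothing; replace with a real bulk request.
docs, seconds, rate = measure_throughput(lambda batch: None, [[1, 2], [3]])
print(docs, "docs sent")
```

Comparing this client-side rate with the indexing time in the cluster stats
shows how much of the run is spent outside Lucene.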



(jorge canellas-2) #5

I had changed the bulk size and the concurrent bulks, but I had changed the
flush interval the wrong way round... (/facepalm) I will try it. Thank
you!!


(Nik Everett) #6

The wikipedia river has a really inefficient regex or two, and it spends
most of its time munging wikitext into text. If you know Java, you might
want to dig into it. I've never had the time, unfortunately. Depending on
what you need, you can get the job done with bash and perl. If you just
need the titles from the main namespace and you don't care if you miss a
few, you can use this: https://gist.github.com/nik9000/9528106

It's much faster but indexes much less data.

Nik
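
For comparison, here is a Python sketch of the same idea: stream through the
dump and keep only the titles of main-namespace pages. This is not the
linked gist's bash/perl approach but a swapped-in alternative, and it
assumes the usual export layout of <page> elements containing <title> and
<ns>, with ns 0 being the main namespace. The sample XML is made up.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def main_namespace_titles(xml_stream):
    """Yield titles of pages whose <ns> is 0, without loading the whole dump."""
    title, ns = None, None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "page":
            if ns == "0" and title:
                yield title
            title, ns = None, None
            elem.clear()  # drop the finished page's children to limit memory


SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>Elasticsearch</title><ns>0</ns></page>
  <page><title>Talk:Elasticsearch</title><ns>1</ns></page>
</mediawiki>"""

for page_title in main_namespace_titles(io.BytesIO(SAMPLE)):
    print(page_title)
```

Like the gist, this indexes far less data than the river, since no wikitext
is converted at all.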



(Jörg Prante) #7

Interesting. The regex hell is real; maybe
https://code.google.com/p/java-wikipedia-parser would be useful for
improving the river, since it comes with an AST parser.

Jörg

On Wed, Mar 19, 2014 at 2:23 PM, Nikolas Everett nik9000@gmail.com wrote:

The wikipedia river has a really inefficient regex or two and it spends
most of its time munging wikitext into text. If you Java then you might
want to dig into it. I've never had the time unfortunately. Depending on
what you need you can get the job done with bash and perl. If you just
need the titles from the main namespace and you don't care if you miss a
few you can use this: https://gist.github.com/nik9000/9528106

Its much faster but indexes much less data.

Nik

On Wed, Mar 19, 2014 at 4:26 AM, jorge canellas <
jorge.canellas.urt@gmail.com> wrote:

I had changed the bulk size and the concurrent bulks, but I had changed
the flush interval in the other way round..(/facepalm) I will try it. thank
you!!
El 19/03/2014 02:23, "joergprante@gmail.com" joergprante@gmail.com
escribió:

You can change the parameters bulk_size from 100 to a higher value,
max_concurrent_bulk from 1 to a higher value, and disable flush_interval
from 5s to -1 to increase wikipedia river bulk performance. Also make sure
the bzip2 file is downloading quick enough over the wire, which is often
not the case (some KB/sec here). Otherwise download it to the local file
system and change url parameter to this file.

The cluster stats show the Lucene indexing time. If you want overall
indexing time, you should take measurements at client side.

Jörg

On Tue, Mar 18, 2014 at 10:59 PM, jorge canellas <
jorge.canellas.urt@gmail.com> wrote:

Hi!

That was something that I was thinking since the indexing time shown in
the indexing statistic is much lower than the real time the process is
running. Each 5 seconds that the process is running, only a bit more than a
second is spent indexing.
I was thinking on parsing the wikipedia dumps manually and then use the
bulk api to index them.
Only one question: the indexing time shown in the stats of the cluster
takes into account the time used to choose the shard, send the document
through the network, analyze the document and write the terms? or only the
time spent writting the terms in the index? I am asking this because I am
interested on the indexing performance of Elasricsearch, and I could use
this time and forget the time spent by the river parsing the pages of the
wikipedia.

(system) #8