I am trying to index the Wikipedia dumps (downloaded and uncompressed)
using the wikipedia river, but it takes about 6 hours in the cluster.
I have increased the number of nodes and primary shards from 5 to 8, but
performance does not improve.
I have set the refresh interval to -1, the number of replicas to 0,
increased the amount of memory from 256MB/1GB to 3GB/4GB, enabled
mlockall, and raised the buffer from 10% to 30%.
I have changed the river settings, increasing bulk_size from 100 to
10k; the refresh interval is set to 1s.
I have changed the mapping to index only title and text, and set
_source to false.
I do not know what else I can modify to increase the indexing rate; at
the moment it is indexing about 120 docs/sec.
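For reference, the index-level changes described above amount to settings and mapping bodies like the following (a sketch only; the index name and type name "page" are assumptions, and the bodies would be sent to the _settings and _mapping endpoints of the index):

```python
import json

# Settings described in the post: no refresh during the bulk load and no
# replicas (both can be restored once indexing finishes).
settings = {
    "index": {
        "refresh_interval": "-1",  # disable refresh while bulk indexing
        "number_of_replicas": 0,   # add replicas back after the load
    }
}

# Mapping that indexes only title and text, with _source disabled.
mapping = {
    "page": {
        "_source": {"enabled": False},
        "properties": {
            "title": {"type": "string"},
            "text": {"type": "string"},
        },
    }
}

print(json.dumps(settings))
print(json.dumps(mapping))
```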
It all depends on where the bottleneck is. A river, by design, runs on only
one node. Perhaps Elasticsearch is quickly indexing the content, but the
Wikipedia ingestion is the slowdown. Is the Wikipedia indexer threaded or
are all the bulks being executed on one thread? The BulkProcessor supports
multithreading.
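To illustrate the threading point: if every bulk is executed on the river's single thread, ingestion serializes behind indexing. A generic sketch of splitting documents into batches and sending them from several threads (send_bulk here is a stand-in for a real client bulk call, e.g. the Java BulkProcessor, not the river's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def send_bulk(batch):
    # Stand-in for a real bulk request to the cluster; returns the
    # number of documents it "indexed".
    return len(batch)

def index_in_parallel(docs, bulk_size=1000, workers=4):
    """Split docs into bulk_size batches and send them on several
    threads, instead of executing every bulk on one thread."""
    batches = [docs[i:i + bulk_size] for i in range(0, len(docs), bulk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_bulk, batches))

indexed = index_in_parallel([{"title": str(i)} for i in range(2500)])
# indexed == 2500 (three batches of 1000, 1000 and 500)
```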
Have you modified any of the default settings? In particular the merge
settings and throttling. Elasticsearch has low defaults for throttling out
of the box. If you have fast disks, you can easily raise the values. Are
you searching on the index while it is being built? If you are just
building the index, you can increase the number of segments and/or reduce
the amount of merging done at once. But it all depends on where the
bottleneck is. If your disk performance is fine, you might need to look
elsewhere.
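The throttle and merge knobs mentioned above look roughly like this (a sketch only; exact setting names and defaults vary across Elasticsearch versions, and the values shown are illustrative, not recommendations):

```python
# Store throttling: Elasticsearch ships with a conservative merge
# throttle; with fast disks it can be raised.
store_throttle = {
    "indices.store.throttle.max_bytes_per_sec": "100mb",
}

# Merge policy: during a pure bulk load (no concurrent searching) you
# can tolerate more segments and less aggressive merging.
merge_policy = {
    "index.merge.policy.segments_per_tier": 30,
    "index.merge.policy.max_merge_at_once": 30,
}
```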
That was something I was thinking, since the indexing time shown in the
indexing statistics is much lower than the real time the process has been
running. For every 5 seconds the process runs, only a bit more than a
second is spent indexing.
I was thinking of parsing the Wikipedia dumps manually and then using
the Bulk API to index them.
Just one question: does the indexing time shown in the cluster stats
take into account the time used to choose the shard, send the document
over the network, analyze the document, and write the terms, or only the
time spent writing the terms to the index? I am asking because I am
interested in the indexing performance of Elasticsearch itself, and I
could use this time and ignore the time the river spends parsing the
Wikipedia pages.
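For the manual route, the Bulk API body is just newline-delimited JSON: an action line followed by a source line per document, with a trailing newline. A small sketch (the index and type names here are made up):

```python
import json

def to_bulk_body(pages, index="wikipedia", doc_type="page"):
    """Build a _bulk request body: action line + source line for every
    page, each terminated by a newline (the API requires the final one)."""
    lines = []
    for page in pages:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({"title": page["title"], "text": page["text"]}))
    return "\n".join(lines) + "\n"

body = to_bulk_body([{"title": "Anarchism", "text": "..."}])
```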
On 18/03/2014 20:32, "Ivan Brusic" ivan@brusic.com wrote:
You can raise the bulk_size parameter from 100 to a higher value, raise
max_concurrent_bulk from 1 to a higher value, and disable flush_interval
(change it from 5s to -1) to increase the wikipedia river's bulk
performance. Also make sure the bzip2 file is downloading quickly enough
over the wire, which is often not the case (some KB/sec here). Otherwise,
download it to the local file system and point the url parameter at that
file.
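In river-JSON terms, those changes would look roughly like this (a sketch; the local dump path is made up, and the parameter names should be checked against the wikipedia river's README for your version):

```python
# Sketch of a wikipedia river configuration with the tuning described
# above. Values mirror the post; the file path is a placeholder.
river_config = {
    "type": "wikipedia",
    "wikipedia": {
        # Point at a local copy of the dump instead of streaming it
        # over the wire.
        "url": "file:///data/enwiki-latest-pages-articles.xml.bz2",
    },
    "index": {
        "bulk_size": 10000,        # up from the default of 100
        "max_concurrent_bulk": 4,  # up from the default of 1
        "flush_interval": "-1",    # disable the 5s time-based flush
    },
}
```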
The cluster stats show the Lucene indexing time. If you want the overall
indexing time, you should take measurements on the client side.
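Measuring on the client side can be as simple as wrapping the bulk calls in a timer; this captures routing, network, analysis and Lucene time together, unlike the per-shard indexing time in the stats (send_bulk below is a placeholder for the real client call):

```python
import time

def timed_bulk(batches, send_bulk):
    """Time bulk calls end to end at the client and return
    (total documents sent, elapsed wall-clock seconds)."""
    total_docs = 0
    start = time.perf_counter()
    for batch in batches:
        send_bulk(batch)  # real client call goes here
        total_docs += len(batch)
    elapsed = time.perf_counter() - start
    return total_docs, elapsed

docs, secs = timed_bulk([[1] * 100] * 5, send_bulk=lambda b: None)
# docs == 500; docs / secs is the overall client-side docs/sec rate
```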
I had changed the bulk size and the concurrent bulks, but I had changed
the flush interval the other way round... (/facepalm). I will try it,
thank you!!
On 19/03/2014 02:23, "joergprante@gmail.com" joergprante@gmail.com
wrote:
The wikipedia river has a really inefficient regex or two, and it spends
most of its time munging wikitext into plain text. If you know Java, you
might want to dig into it; I've never had the time, unfortunately.
Depending on what you need, you can get the job done with bash and perl.
If you just need the titles from the main namespace and you don't care if
you miss a few, you can use this: Load lots of titles · GitHub
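The linked script isn't reproduced here, but the idea (grab main-namespace titles without parsing wikitext at all) looks roughly like this in Python. Like the original, it is deliberately crude: one regex per line, no XML parser, and it skips anything with a namespace-style prefix, so it will miss the occasional legitimate title containing a colon:

```python
import re

# Titles in non-main namespaces look like "Talk:Foo", "User:Bar", etc.
NAMESPACE_PREFIX = re.compile(r"^[A-Za-z ]+:")
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def main_namespace_titles(dump_lines):
    """Yield main-namespace page titles from MediaWiki XML dump lines."""
    for line in dump_lines:
        m = TITLE_RE.search(line)
        if m and not NAMESPACE_PREFIX.match(m.group(1)):
            yield m.group(1)

sample = [
    "  <title>Anarchism</title>",
    "  <title>Talk:Anarchism</title>",
    "  <title>Albert Einstein</title>",
]
# list(main_namespace_titles(sample)) == ["Anarchism", "Albert Einstein"]
```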
On Wed, Mar 19, 2014 at 2:23 PM, Nikolas Everett nik9000@gmail.com wrote: