I am trying to index a Wikipedia dump with a Python script I wrote. I simply
read the Wikipedia dump and, for each article, send a curl request to index
the id, title and text properties of the page. Here is the mapping I use:
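In outline, it looks roughly like this (a simplified sketch rather than my exact setup: the index/type names, field types, and the use of the Python requests library in place of raw curl are placeholders):

# Sketch of the indexing script: create the index with an id/title/text
# mapping, then PUT one document per parsed article.
import json
import requests  # assumption: requests used here instead of shelling out to curl

ES = "http://localhost:9200"
INDEX = "wikipedia"  # placeholder index name

mapping = {
    "mappings": {
        "page": {  # placeholder type name
            "properties": {
                "id":    {"type": "long"},
                "title": {"type": "string"},
                "text":  {"type": "string"}
            }
        }
    }
}
requests.put("%s/%s" % (ES, INDEX), data=json.dumps(mapping))

def index_article(article):
    # article is a dict with 'id', 'title' and 'text' parsed from the dump
    doc = {"id": article["id"], "title": article["title"], "text": article["text"]}
    requests.put("%s/%s/page/%s" % (ES, INDEX, article["id"]),
                 data=json.dumps(doc))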
After my script finishes its work, I am left with a 55GB data index (the
extracted Wikipedia dump is 39GB). However, after I start Elasticsearch
again, it deletes all the indexed files and gives the following error:
"failed to start
shard org.elasticsearch.indices.recovery.RecoveryFailedException" (more
log: https://gist.github.com/4099519). Do you have any suggestions?
What could be the problem?
Note: I also tried to use the Wikipedia river plugin, but I was never able to
index more than 5GB (indexing usually stops when the data folder reaches a
5GB index, and even when I restart the server it reaches at most ~7GB of
data).
Are you running the same version of Elasticsearch on all 3 nodes?
The Wikipedia river runs much better when you use a local dump file instead
of downloading and indexing it at the same time.
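For example, registering the river against an already-downloaded dump looks roughly like this (a sketch; the river name, file path and bulk size are placeholders):

# Sketch: register the wikipedia river against a local dump file instead of
# letting it download from the default remote URL.
import json
import requests

river = {
    "type": "wikipedia",
    "wikipedia": {
        # local copy of the dump (placeholder path)
        "url": "file:///path/to/enwiki-latest-pages-articles.xml.bz2"
    },
    "index": {
        "index": "wikipedia",   # placeholder target index
        "type": "page",
        "bulk_size": 100        # placeholder bulk size
    }
}
requests.put("http://localhost:9200/_river/my_wikipedia_river/_meta",
             data=json.dumps(river))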