Elasticsearch fails to start shard and deletes all index files

Hi,

I am trying to index a Wikipedia dump with a Python script I wrote. I simply
read the dump and, for each article, send a curl request to index the id,
title, and text properties of the page. Here is the mapping I use:

{
  "article" : {
    "properties" : {
      "text" : {
        "type" : "string"
      },
      "title" : {
        "type" : "string"
      },
      "wid" : {
        "type" : "date",
        "format" : "dateOptionalTime"
      }
    }
  }
}
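
To give an idea of what the script does, here is a simplified sketch of the indexing loop (the index name and the dump parsing below are placeholders; only the per-article curl request and the field names follow the mapping above):

import json
import subprocess
import xml.etree.ElementTree as ET

def index_article(wid, title, text):
    # one curl request per article, with the fields from the mapping above
    doc = json.dumps({"wid": wid, "title": title, "text": text})
    subprocess.run(
        ["curl", "-s", "-XPUT",
         "http://localhost:9200/wikipedia/article/%s" % wid,
         "-d", doc],
        check=True,
    )

# iterate over <page> elements of the extracted XML dump;
# the {*} namespace wildcard in find paths needs Python 3.8+
for _, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
    if elem.tag.split("}")[-1] == "page":
        wid = elem.findtext("{*}id")
        title = elem.findtext("{*}title")
        text = elem.findtext("{*}revision/{*}text") or ""
        index_article(wid, title, text)
        elem.clear()  # release the element to keep memory bounded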

After my script finishes its work, I am left with a 55 GB data index (the
extracted Wikipedia dump is 39 GB). However, after I start Elasticsearch
again, it deletes all the indexed files and gives the following error:
"failed to start
shard org.elasticsearch.indices.recovery.RecoveryFailedException:" (more
log: https://gist.github.com/4099519). Do you have any suggestions?
What could be the problem?

Note: I also tried to use the Wikipedia river plugin, but I was never able to
index more than 5 GB (indexing usually stops when the data folder reaches
about 5 GB, and even when I restart the server it grows to at most ~7 GB of
data).

Thanks,
Pinar

--

Are you running the same version of Elasticsearch on all 3 nodes?

The Wikipedia river runs much better when you use a local dump file instead of
downloading and indexing it at the same time.

--

On Sunday, November 18, 2012 8:08:55 AM UTC-5, Igor Motov wrote:

Are you running the same version of Elasticsearch on all 3 nodes?

Yes, I am using the same version (0.19.11).

The Wikipedia river runs much better when you use a local dump file instead of
downloading and indexing it at the same time.

Actually, I have always tried to index with a local dump, like the following:

curl -XPUT localhost:9200/_river/my_river/_meta -d '{
  "type" : "wikipedia",
  "index" : {
    "name" : "my_index",
    "type" : "my_type",
    "url" : "/local/enwiki-latest-pages-articles.xml.bz2",
    "bulk_size" : 10
  }
}'

However, I am not sure whether Elasticsearch is actually using the local dump
I provided, because after I send the curl request above, I see the download
URL in the logs:
"[wikipedia][my_river] creating wikipedia stream river for
[http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2]".

Thanks,
Pinar

--

It actually should be:

{
  "type" : "wikipedia",
  "wikipedia" : {
    "url" : "file:///local/enwiki-latest-pages-articles.xml.bz2"
  },
  "index" : {
    "index" : "my_index",
    "type" : "my_type"
  },
  "bulk_size" : 10
}
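
For example, it can be registered the same way as before; here is a rough Python equivalent of your earlier curl command (urllib shown just for illustration):

import json
import urllib.request

river_meta = {
    "type": "wikipedia",
    "wikipedia": {
        # point the river at the local dump via a file:// URL
        "url": "file:///local/enwiki-latest-pages-articles.xml.bz2",
    },
    "index": {"index": "my_index", "type": "my_type"},
    "bulk_size": 10,
}

req = urllib.request.Request(
    "http://localhost:9200/_river/my_river/_meta",
    data=json.dumps(river_meta).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(urllib.request.urlopen(req).read().decode("utf-8"))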

--