Hi,
I have an index named 'news' and a mapping named 'document'.
'news' has 5 shards and 2 replicas.
Below is the mapping of 'document':
{
  "document" : {
    "_parent" : {
      "type" : "cluster"
    },
    "_routing" : {
      "required" : true
    },
    "_source" : {
      "enabled" : false
    },
    "properties" : {
      "clusterid" : {
        "type" : "string",
        "store" : "yes"
      },
      "company" : {
        "type" : "string"
      },
      "companyNum" : {
        "type" : "string"
      },
      "count" : {
        "type" : "integer"
      },
      "date" : {
        "type" : "date",
        "store" : "yes",
        "format" : "YYYY-MM-dd"
      },
      "text" : {
        "type" : "string",
        "analyzer" : "snowball",
        "term_vector" : "with_positions_offsets"
      },
      "title" : {
        "type" : "string",
        "boost" : 2.0,
        "analyzer" : "snowball",
        "store" : "yes",
        "term_vector" : "with_positions_offsets"
      },
      "url" : {
        "type" : "string"
      }
    }
  }
}
Don't mind the '_parent'. I don't think that is relevant.
I bulk-indexed 427410 JSON documents and got no errors. Everything went
fine.
Below is the (partial) code for bulk indexing:
public void insertDocumentBulk(ArrayList<String> lineList, String documentMapping) throws Exception
{
    BulkRequestBuilder brb = client.prepareBulk();
    for (String line : lineList)
    {
        JSONObject jobj = new JSONObject(line);
        String id = jobj.getString("docid");
        String clusterId = jobj.getString("clusterid");
        // Convert the integer date into the "YYYY-MM-dd" form the mapping expects.
        String dateFormat = convertDateFormat(jobj.getInt("date"));
        // Flatten the array of sentences into a single text field.
        JSONArray sentArray = jobj.getJSONArray("text");
        String text = jsonArrayToString(sentArray);
        jobj.put("text", text);
        jobj.put("date", dateFormat);
        // "docid" becomes the document id, so it is dropped from the body.
        jobj.remove("docid");
        brb.add(client.prepareIndex(index, documentMapping, id)
                .setParent(clusterId)
                .setSource(jobj.toString()));
    }
    BulkResponse bulkResponse = brb.execute().actionGet();
    // Count each failed item; the earlier version could only ever report 0 or 1.
    int count = 0;
    if (bulkResponse.hasFailures())
    {
        for (BulkItemResponse item : bulkResponse)
        {
            if (item.isFailed())
            {
                count++;
            }
        }
    }
    System.out.println("error count: " + count);
}
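The convertDateFormat helper is not shown above. As a rough sketch of what it does, assuming the 'date' field arrives as an integer such as 20110803 (yyyyMMdd): the class name DateFormatSketch is just for illustration, and this version uses java.time, which may not match the exact code I am running.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for the convertDateFormat helper used in the bulk code.
class DateFormatSketch {
    // Input is assumed to look like 20110803 (yyyyMMdd).
    private static final DateTimeFormatter IN = DateTimeFormatter.BASIC_ISO_DATE;
    // Output matches the "YYYY-MM-dd" format declared in the mapping.
    private static final DateTimeFormatter OUT = DateTimeFormatter.ISO_LOCAL_DATE;

    static String convertDateFormat(int date) {
        // Zero-pad to 8 digits so the basic ISO parser always sees yyyyMMdd.
        LocalDate parsed = LocalDate.parse(String.format("%08d", date), IN);
        return parsed.format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(convertDateFormat(20110803)); // prints 2011-08-03
    }
}
```

An integer that is not a real calendar date (for example 20111340) makes LocalDate.parse throw a DateTimeParseException, so bad input fails before it reaches the bulk request.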
But when I try to 'GET' some of the documents, I get nothing.
For example, if I run
curl -XGET 'http://etridorm.iptime.org:9200/news/document/20110803_0_96464759519569618?fields=title'
I get
{"_index":"news","_type":"document","_id":"20110803_0_96464759519569618","exists":false}
It's not that every 'GET' fails. Some of the documents are
accessible, but some of them aren't.
But the funny thing is, when I run
curl -XGET 'http://etridorm.iptime.org:9200/news/document/_count?q=*'
I get
{"count":427410,"_shards":{"total":5,"successful":5,"failed":0}}
which is exactly the number of documents I indexed.
I tried flushing:
curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_flush'
and refreshing:
curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_refresh'
but neither improved the situation.
Am I doing something wrong here?
I'd appreciate any help.
Ed