Documents indexed, but cannot 'GET' them

Hi,

I have an index named 'news' and a mapping named 'document'.
'news' has 5 shards and 2 replicas.
Below is the mapping of 'document'

{
    "document" : {
        "_parent" : {
            "type" : "cluster"
        },
        "_routing" : {
            "required" : true
        },
        "_source" : {
            "enabled" : false
        },
        "properties" : {
            "clusterid" : {
                "type" : "string",
                "store" : "yes"
            },
            "company" : {
                "type" : "string"
            },
            "companyNum" : {
                "type" : "string"
            },
            "count" : {
                "type" : "integer"
            },
            "date" : {
                "type" : "date",
                "store" : "yes",
                "format" : "YYYY-MM-dd"
            },
            "text" : {
                "type" : "string",
                "analyzer" : "snowball",
                "term_vector" : "with_positions_offsets"
            },
            "title" : {
                "type" : "string",
                "boost" : 2.0,
                "analyzer" : "snowball",
                "store" : "yes",
                "term_vector" : "with_positions_offsets"
            },
            "url" : {
                "type" : "string"
            }
        }
    }
}

Don't mind the '_parent'. I don't think that is relevant.

I bulk-indexed 427410 JSON documents. I got no errors; everything went fine.
Below is the (partial) code for bulk indexing:

public void insertDocumentBulk(ArrayList<String> lineList, String documentMapping) throws Exception
{
    BulkRequestBuilder brb = client.prepareBulk();

    for (String line : lineList)
    {
        JSONObject jobj = new JSONObject(line);
        String id = jobj.getString("docid");
        String clusterId = jobj.getString("clusterid");
        String dateFormat = convertDateFormat(jobj.getInt("date"));
        JSONArray sentArray = jobj.getJSONArray("text");
        String text = jsonArrayToString(sentArray);
        jobj.put("text", text);
        jobj.put("date", dateFormat);
        jobj.remove("docid");
        brb.add(client.prepareIndex(index, documentMapping, id)
                .setParent(clusterId)
                .setSource(jobj.toString()));
    }

    BulkResponse bulkResponse = brb.execute().actionGet();

    // Count individual item failures instead of incrementing once for the whole batch.
    int count = 0;
    for (BulkItemResponse item : bulkResponse)
    {
        if (item.isFailed())
        {
            count++;
        }
    }
    System.out.println("error count: " + count);
}

But when I try to 'GET' some of the documents, I get nothing.
For example, the query
curl -XGET 'http://etridorm.iptime.org:9200/news/document/20110803_0_96464759519569618?fields=title'
returns
{"_index":"news","_type":"document","_id":"20110803_0_96464759519569618","exists":false}

It's not that all the documents are inaccessible: some can be fetched, but some cannot.
But the funny thing is, when I run
curl -XGET 'http://etridorm.iptime.org:9200/news/document/_count?q=*'
I get {"count":427410,"_shards":{"total":5,"successful":5,"failed":0}}, which is exactly the number of documents I indexed.

I tried flushing (curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_flush')
and refreshing (curl -XPOST 'http://etridorm.iptime.org:9200/news/document/_refresh'),
but neither improved the situation.

Am I doing something wrong here?
I'd appreciate any help.

Ed

I've found the answer to my problem (as is often the case).
After you index a document with "_parent" set, you need to specify the "routing" parameter when you "GET" the document.
For example, if you index a document with an id "1234" and its parent
"5678", then the "GET" command should be,
"curl -XGET 'http://localhost:9200/index/mapping/1234?routing=5678'".
Hope this helps someone like me.

2012/4/9 mp2893 mp2893@gmail.com


Yeah, that's because the child document is routed based on the parent ID, so
parent and child end up in the same shard.
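The effect can be sketched with a toy shard-selection function (this is only an illustration: Elasticsearch uses its own hash, not String.hashCode, and the IDs below are the hypothetical ones from the earlier example). The point is that the routing value, not the document's own ID, picks the shard, so a GET without routing may look on the wrong shard:

```java
public class RoutingSketch {
    // Toy stand-in for routing-based shard selection: hash the routing
    // value and take it modulo the shard count.
    static int shardFor(String routing, int numShards) {
        return Math.floorMod(routing.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 5;
        String docId = "1234";     // hypothetical child document ID
        String parentId = "5678";  // hypothetical parent ID, used as routing

        // Indexing with _parent set: the child is stored on the parent's shard.
        int actualShard = shardFor(parentId, numShards);
        // A GET without a routing value would hash the document ID instead.
        int guessedShard = shardFor(docId, numShards);

        System.out.println("stored on shard " + actualShard
                + ", GET without routing looks on shard " + guessedShard);
    }
}
```

With these two IDs the two hashes land on different shards, which is exactly the "document exists but GET finds nothing" symptom from the original post.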

On Tue, Apr 10, 2012 at 3:14 AM, edward choi mp2893@gmail.com wrote:


So I can't GET a document with a known ID unless I also know its parent's ID?

This really throws a spanner in the works for me. I started getting the
same problem (random GETs failing) when I added parent-child documents to
my db. But I was relying on being able to get a document knowing only its
ID. Is there a workaround?
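One possible workaround (a sketch, not something stated in this thread): search by ID instead of using GET. Unlike a GET, a search request is broadcast to every shard, so it can find the document without a routing value, at the cost of a slower, index-wide query. The script below only assembles and prints the curl command (host, index, and document ID are the ones from this thread); run it against a live cluster to actually test it:

```shell
# Routing-free lookup sketch: an "ids" query search hits all shards,
# so no routing value is needed. This only builds and prints the command.
ID="20110803_0_96464759519569618"
BODY="{\"query\":{\"ids\":{\"values\":[\"$ID\"]}}}"
echo "curl -XGET 'http://etridorm.iptime.org:9200/news/document/_search' -d '$BODY'"
```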

thx
A

On Wednesday, April 11, 2012 9:31:45 PM UTC+10, kimchy wrote:

Yeah, that's because the child document is routed based on the parent ID, so
parent and child end up in the same shard.


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.