Unexpected result from field collapsing

I have an index containing emails. These documents have a type field, one value of which is "pending" and another is "sent", with the obvious meanings. They also have a "batchid" by which multiple emails are grouped. There is a UI which displays pending emails. When instructed in the UI, a batch (those emails with a common batchid) starts sending in the background (it is a slow process, only a few emails per second), and if the user stays on the page, the UI updates (by polling every 5s) to show how far sending has got.

The object which is retrieved from ES and passed to the UI is an array of batches, each of which is an array of email objects, so it is in a convenient form to display.

The problem is that just occasionally (and unreproducibly), it looks like one of the batches is an empty array. Since there isn't a structure for a batch, merely common batchids, in principle a batch can't exist without an email belonging to it, so this ought to be impossible.

The sending process saves each email with type="sent" as it completes, so that email is no longer found by my UI query once the change has been indexed. There is an explicit _refresh at the end of sending, so I think the change is most likely to become visible at the very end, which is when there would be no "pending" email left in a batch.

So what I think is happening (but currently only have circumstantial evidence for) is that the query (below) sees the batchid on which it is collapsing, but the sending process then removes the email from the query criteria while the query is in progress, so the UI query doesn't actually retrieve the email (and doesn't say it can't; it appears to work). So instead of yielding an array with one fewer batch than before, it gives an array with an empty batch.

Is that possible? And what is the status of the totals the query is providing in the result if it is changing under its feet? If it shouldn't be possible, what other way could this query produce an empty element in the top level array?

The query is roughly as follows (I've removed some of the _source array content for clarity):

{
"query": {
	"term": {
		"type": "pending"
	}
},
"sort": {
	"batchid": "desc"
},
"from": 0,
"size": 10,
"_source": ["toname", "toemail", "date", ...],
"seq_no_primary_term": true,
"collapse": {
	"field": "batchid",
	"inner_hits": {
		"sort": {
			"membershipnumber": "asc"
		},
		"from": 0,
		"size": 5,
		"_source": ["batchid", "toname", "toemail", "date", ...],
		"name": "na",
		"seq_no_primary_term": true
	}
}
}

And a typical result:

[
    [
        {"batchid": 1002, "toemail": "person1@example.com", ....},
        {"batchid": 1002, "toemail": "person2@example.com"...},
        ....
    ],
    [
        {"batchid": 1001, ....},
        {"batchid": 1001, ....},
        ....
    ]
]

And an anomalous result:

[
    [
    ],
    [
        {"batchid": 1001, ....},
        {"batchid": 1001, ....},
        ....
    ]
]

Actually more commonly there is only one pending batch, so the anomalous result is:

[
    [
    ]
]

(I've added some code to excise empty batch arrays if they happen and to instrument it further so I can see detail of what is actually happening. But it happens rarely enough that it may be some time before this shows me anything. I should probably also guard against the impossible happening in the UI and check for an empty array - it is currently assuming there is at least one email in a batch).
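For anyone wanting the same guard, here is a minimal sketch of that workaround in Python, not my actual code. It assumes the standard search response layout, where each collapsed hit carries its group under inner_hits keyed by the name given in the query ("na" above); the function name batches_from_response is made up for illustration.

```python
from typing import Any


def batches_from_response(
    response: dict[str, Any], inner_name: str = "na"
) -> list[list[dict[str, Any]]]:
    """Turn a collapsed search response into a list of batches.

    Hypothetical helper: groups whose inner hits came back empty are
    skipped, so the UI never receives a batch with no emails in it.
    """
    batches = []
    for group in response.get("hits", {}).get("hits", []):
        inner = group.get("inner_hits", {}).get(inner_name, {})
        emails = [hit["_source"] for hit in inner.get("hits", {}).get("hits", [])]
        if emails:  # drop groups that have no emails left
            batches.append(emails)
    return batches
```

With this in place the anomalous single-batch case comes back as an empty top-level array rather than an array containing one empty batch.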

What you describe here is entirely possible, since the extra round trip that retrieves the inner_hits for the collapsing field uses a new reader. We have an idea on how to fix this and we are currently working on a new feature to allow the same reader to be used in multiple requests:


Once that feature is in place, we'll be able to tackle this problem efficiently.
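To illustrate the race, here is a toy model in Python. It is not Elasticsearch code and says nothing about the real internals; it only assumes that the group keys and the inner hits are computed from two different views of the index. The function collapsed_search and the sample documents are invented for the example.

```python
def collapsed_search(snapshot_for_groups, snapshot_for_inner_hits):
    """Toy two-phase collapsed search over lists of email dicts."""
    # Phase 1: distinct batchids among the pending emails, newest first.
    batchids = sorted(
        {e["batchid"] for e in snapshot_for_groups if e["type"] == "pending"},
        reverse=True,
    )
    # Phase 2: fetch the pending emails of each batchid, possibly from a
    # newer view of the index than the one used in phase 1.
    return [
        [
            e
            for e in snapshot_for_inner_hits
            if e["type"] == "pending" and e["batchid"] == b
        ]
        for b in batchids
    ]


before = [
    {"batchid": 1002, "type": "pending"},
    {"batchid": 1001, "type": "pending"},
]
# The last email of batch 1002 is re-indexed as "sent" between the phases.
after = [
    {"batchid": 1002, "type": "sent"},
    {"batchid": 1001, "type": "pending"},
]

consistent = collapsed_search(before, before)  # both phases see one view
racy = collapsed_search(before, after)         # phase 2 sees the newer view
```

In this model, racy still contains an entry for batch 1002, but that entry is an empty list, which matches the anomalous result reported above.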

Thanks for confirming. That means the workaround I had already put in speculatively should indeed fix the problem next time it shows up.
