Hello,
I am currently running into an error while reindexing documents from an index that holds a subset of my data. During the reindex they run through an ingest pipeline that creates vector embeddings with ELSER and splits the content into passages, which also get vector embeddings. I use a pipeline similar to this one.
The error in question looks like this:
{
  "completed": true,
  "task": {
    "node": "nodeId",
    "id": 168105,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 10000,
      "updated": 0,
      "created": 59,
      "deleted": 0,
      "batches": 59,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 0,
      "requests_per_second": -1,
      "throttled_until_millis": 0
    },
    "description": "reindex from [1-subset] to [1-elser]",
    "start_time_in_millis": 1719323450496,
    "running_time_in_nanos": 1067795555450,
    "cancellable": true,
    "cancelled": false,
    "headers": {
      "trace.id": "3b984be232f7068413ddfb716b77fc02"
    }
  },
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": -1,
        "index": null,
        "reason": {
          "type": "search_context_missing_exception",
          "reason": "No search context found for id [14257]"
        }
      }
    ],
    "caused_by": {
      "type": "search_context_missing_exception",
      "reason": "No search context found for id [14257]"
    }
  }
}
I recognize this error: a while back I hit the same one in my local Docker container when reindexing documents that took too long. Back then I assumed it was my local hardware, but this time I am running on a cloud trial and still hit it.
Initially I thought the problem was that I had set the batch size too high, and lowering it did fix the issue at the time. Now the error is back even though I have set the size to 1.
My Reindex request:
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "Index1",
    "size": 1
  },
  "dest": {
    "index": "Index2",
    "pipeline": "chunker-elser-v2"
  }
}
The reindex request is pretty simple: set the batch size to 1 and add the pipeline.
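Since I run the reindex with wait_for_completion=false, I retrieve the result afterwards with the tasks API, which is where the error output above comes from. Roughly like this, using the node and task id returned by the reindex call:
GET _tasks/nodeId:168105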
My Pipeline:
PUT _ingest/pipeline/chunker-elser-v2
{
  "processors": [
    {
      "script": {
        "description": "Chunk content into sentences by looking for . followed by a space",
        "lang": "painless",
        "if": "ctx.content != null && !ctx.content.isEmpty()",
        "source": "\n String[] envSplit = /((?<!M(r|s|rs)\\.)(?<=\\.) |(?<=\\!) |(?<=\\?) )/.split(ctx['content']);\n ctx['passages'] = new ArrayList();\n int i = 0;\n boolean remaining = true;\n if (envSplit.length == 0) {\n return\n } else if (envSplit.length == 1) {\n Map passage = ['text': envSplit[0]];ctx['passages'].add(passage)\n } else {\n while (remaining) {\n Map passage = ['text': envSplit[i++]];\n while (i < envSplit.length && passage.text.length() + envSplit[i].length() < params.model_limit) {passage.text = passage.text + ' ' + envSplit[i++]}\n if (i == envSplit.length) {remaining = false}\n ctx['passages'].add(passage)\n }\n }\n ",
        "params": {
          "model_limit": 400
        }
      }
    },
    {
      "foreach": {
        "field": "passages",
        "processor": {
          "inference": {
            "model_id": ".elser_model_2",
            "input_output": {
              "input_field": "_ingest._value.text",
              "output_field": "_ingest._value.vector.predicted_value"
            },
            "on_failure": [
              {
                "append": {
                  "field": "_source._ingest.inference_errors",
                  "value": [
                    {
                      "message": "Processor 'inference' in pipeline 'chunker-elser-v2' failed with message '{{ _ingest.on_failure_message }}'",
                      "pipeline": "ml-inference-title-vector",
                      "timestamp": "{{{ _ingest.timestamp }}}"
                    }
                  ]
                }
              }
            ]
          }
        },
        "if": "ctx.passages != null"
      }
    },
    {
      "inference": {
        "if": "ctx.title != null && !ctx.title.isEmpty()",
        "model_id": ".elser_model_2",
        "input_output": {
          "input_field": "title",
          "output_field": "ml.title.vector.predicted_value"
        },
        "on_failure": [
          {
            "append": {
              "field": "_source._ingest.inference_errors",
              "value": [
                {
                  "message": "Processor 'inference' in pipeline 'ml-inference-title-vector' failed with message '{{ _ingest.on_failure_message }}'",
                  "pipeline": "ml-inference-title-vector",
                  "timestamp": "{{{ _ingest.timestamp }}}"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}
As mentioned before, the pipeline is based on this example from Elastic's blog, although I modified it slightly.
Would increasing the scroll time be a viable solution?
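For example, something like this, assuming I understand correctly that the scroll parameter on the reindex API controls how long each search context is kept alive (the default being 5 minutes):
POST _reindex?scroll=30m&wait_for_completion=false
{
  "source": {
    "index": "Index1",
    "size": 1
  },
  "dest": {
    "index": "Index2",
    "pipeline": "chunker-elser-v2"
  }
}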
Any ideas on how to fix this would be appreciated!
I have also seen this blog pop up, which is awesome!