Here is our situation. Our document structure is as follows:
{
  "key1": "some_val",
  "key2": "some_very_large_binary_value",
  ...
}
Because the value of 'key2' is a large binary blob and we don't need it to be accessible in Kibana or our various analytical jobs, we exclude it from '_source' and make it a stored field. There is only one type of analysis where we need it, and for that we retrieve it with the following query:
GET /_search
{
  "stored_fields": ["key2"],
  "query": {
    "term": { "key1": "some_value" }
  }
}
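For reference, the index mapping we use looks roughly like this (the index name and field types are illustrative; the relevant parts are that 'key2' is excluded from '_source' and has "store": true):

```
PUT /our_index
{
  "mappings": {
    "_source": {
      "excludes": ["key2"]
    },
    "properties": {
      "key1": { "type": "keyword" },
      "key2": { "type": "binary", "store": true }
    }
  }
}
```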
This works fine as long as we are not using Spark. With Spark, the value of 'key2' comes back as null in every document.
We use Spark for large-scale analysis and therefore the elasticsearch-hadoop connector. We inspected the query eventually generated by the connector; it looks like:
POST /_search?sort=_doc&scroll=5m&size=50&_source=key1,key2
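For context, here is roughly how we read the index through the connector (a minimal sketch; the node address, index name, and app name are illustrative, and this assumes the DataFrame API of elasticsearch-hadoop):

```python
# Sketch of our read path via the es-hadoop connector.
# Names here (node address, index name) are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read-sketch").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost:9200")
      .option("es.resource", "our_index")
      .load())

# 'key2' comes back as null in every row here,
# since the connector fetches it via _source.
df.select("key1", "key2").show()
```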
Why is the connector requesting 'key2' via '_source'? Since 'key2' is excluded from '_source', that is why every retrieved document has null for it.
Is there some configuration we are missing?