I'm trying to spin up a small working example of what we have in staging/production in a local dockerized setup. I've downloaded some representative data from my staging environment, and now I just need to inject it into my docker container. However, the _bulk API seems to be creating one of my indices as a data stream, and I don't know why. It shouldn't be doing that.
Here is my data seeding script, which should seed the elasticsearch container with my exported data:
# first, I push up the templates
for name in $(jq -r 'keys[]' "templates.json"); do
  template=$(jq -r --arg name "$name" '.[$name]' "templates.json")
  curl -s -XPUT "http://elasticsearch:9200/_template/$name" \
    -H 'Content-Type: application/json' \
    -d "$template"
done
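To confirm that step worked, the registered templates can be listed afterwards; a minimal sketch using the standard _cat API (assuming the same elasticsearch host):

# sanity check: list the registered legacy templates and their index patterns
curl -s "http://elasticsearch:9200/_cat/templates?v&h=name,index_patterns,order"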
bulk_insert_daily_metric_data() {
  local source_data="$1"
  local index_prefix="$2"
  local op_type="$3"

  rm -f "${source_data}_bulk_data.json"

  # emit each array element as two lines: its timestamp, then the document itself
  jq -rc 'keys[] as $k | "\(.[$k].timestamp)\n\(.[$k])"' "${source_data}.json" |
    while read -r timestamp; do
      read -r doc
      if [[ $timestamp != "null" ]]; then
        day=$(echo "$timestamp" | cut -f1 -dT)
        # add a header to EACH ROW to indicate the correct index and op_type
        printf '{"%s": {"_index": "%s-%s", "_type": "_doc"}}\n' \
          "$op_type" "${index_prefix}" "${day}" \
          >>"${source_data}_bulk_data.json"
        printf "%s\n" "$doc" >>"${source_data}_bulk_data.json"
      fi
    done

  curl -s -XPOST "http://elasticsearch:9200/_bulk" \
    -H "Content-Type: application/x-ndjson" \
    --data-binary "@${source_data}_bulk_data.json"
}
# Bulk insert metric and category-metric data
bulk_insert_daily_metric_data "elasticsearch_metrics" "metrics" "index"
bulk_insert_daily_metric_data "elasticsearch_category_metrics" "category-metrics" "index"
Here is my templates.json file (formatted for readability; the actual file is just one single line): templates.json · GitHub
As for the elasticsearch_metrics.json and elasticsearch_category_metrics.json data, the files contain data like this:
elasticsearch_metrics.json
[{"warehouse_uuid": "f40cc278-68c0-47ab-848d-b4a4cf201e3b", "full_table_id": "test_schema:public.sch_test_00007", "field": null, "metric": "total_byte_count", "value": 0.0, "timestamp": "2024-05-13T03:36:41", "measurement_timestamp": "2024-05-13T03:36:41", "dimensions": null, "context": {"is_bootstrap": null, "data_source": null, "data_provider": null, "consolidating_uuid": null, "latest_value": null, "baseline_value": 0.0}, "last_timestamp_in_bucket": null, "job_execution_uuid": "f9339f7c-d578-4728-99fd-1f191336b5dc", "pipeline_options": null, "mcon": "MCON++44a3f5f6-4015-44c1-8f98-3a5cbbaba2ff++f40cc278-68c0-47ab-848d-b4a4cf201e3b++table++test_schema:public.sch_test_00007", "segmented_fields": null, "thresholds": [{"type": "volume_change", "status": "inactive", "reason": null, "upper": null, "lower": null, "upper_high": null, "upper_medium": null, "upper_low": null, "lower_high": null, "lower_medium": null, "lower_low": null}]}, ...]
And the above script essentially translates each item in this JSON array into two lines in the bulk_data file. So the above JSON would become this:
{"index": {"_index": "metrics-2024-05-13", "_type": "_doc"}}
{"warehouse_uuid": "f40cc278-68c0-47ab-848d-b4a4cf201e3b" ...}
...
As you can see, I have added the line {"index": {"_index": "metrics-2024-05-13", "_type": "_doc"}} above the first item in the array. Each item in the array gets its own header line above it, with the correct _index value derived from its timestamp. This is just how the data is currently set up in our deployed environments.
elasticsearch_category_metrics.json
[{"warehouse_uuid":"f40cc278-68c0-47ab-848d-b4a4cf201e3b","full_table_id":"test_schema:smoke_tests.test_table","field":"name","metric":"category_dist","value":1,"timestamp":"2024-05-01T01:20:32","measurement_timestamp":"2024-05-01T01:20:32","dimensions":{"label":"one","monitor_uuid":"cf6fbd46-d85b-4c1a-a577-5872d60216e2"},"context":{"is_bootstrap":null,"data_source":null,"data_provider":null,"consolidating_uuid":"73a22b71-ffff-49fe-ae27-992f5d5b3aeb","latest_value":null,"baseline_value":null},"last_timestamp_in_bucket":null,"job_execution_uuid":"a01a3da3-63ed-495a-b605-ab9dde66f5ad","pipeline_options":null,"mcon":null,"segmented_fields":null,"thresholds":null}, ...]
And then the bulk_data version would look like:
{"index": {"_index": "category-metrics-2024-05-01", "_type": "_doc"}}
{"warehouse_uuid":"f40cc278-68c0-47ab-848d-b4a4cf201e3b", ...}
...
What Actually Happens
When I run my script, the category-metrics data is ingested correctly. However, when the script tries to ingest the metrics data, I get this error:
{
  "took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "metrics-2024-05-13",
        "_type": "_doc",
        "_id": null,
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "only write ops with an op_type of create are allowed in data streams"
        }
      }
    }
  ]
}
So why is metrics (apparently???) being created as a data stream, but category-metrics is not?
If it matters, I am using docker.elastic.co/elasticsearch/elasticsearch:7.10.1, since it most closely resembles the version in our staging environment.
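For reference, here is roughly how the template matching could be inspected on that version (a sketch; I believe the _index_template simulate API and the _data_stream API are both available in 7.10):

# show which templates would apply to this index name, and whether any of them defines a data stream
curl -s -XPOST "http://elasticsearch:9200/_index_template/_simulate_index/metrics-2024-05-13?pretty"

# list composable index templates and any data streams that already exist
curl -s "http://elasticsearch:9200/_index_template?pretty"
curl -s "http://elasticsearch:9200/_data_stream?pretty"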