I'm trying to spin up a small working example of what we have in staging/production in a local dockerized setup. I've downloaded some representative data from my staging environment, and now I just need to inject it into my docker container. However, the _bulk API seems to be creating one of my indices as a data stream, and I don't know why. It shouldn't be doing that.
Here is my data seeding script, which should seed the elasticsearch container with my exported data:
# first, I push up the templates
for name in $(jq -r 'keys[]' "templates.json"); do
  template=$(jq -r --arg name "$name" '.[$name]' "templates.json")
  curl -s -XPUT "http://elasticsearch:9200/_template/$name" \
    -H 'Content-Type: application/json' \
    -d "$template"
done
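To confirm that step worked, the registered templates can be listed afterwards; a minimal sketch using the standard _cat API (assuming the same elasticsearch host):

# sanity check: list the registered legacy templates and their index patterns
curl -s "http://elasticsearch:9200/_cat/templates?v&h=name,index_patterns,order"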
bulk_insert_daily_metric_data() {
  local source_data="$1"
  local index_prefix="$2"
  local op_type="$3"

  rm -f "${source_data}_bulk_data.json"

  # emit each array element as two lines: its timestamp, then the document itself
  jq -rc 'keys[] as $k | "\(.[$k].timestamp)\n\(.[$k])"' "${source_data}.json" |
    while read -r timestamp; do
      read -r doc
      if [[ $timestamp != "null" ]]; then
        day=$(echo "$timestamp" | cut -f1 -dT)
        # add a header to EACH ROW to indicate the correct index and op_type
        printf '{"%s": {"_index": "%s-%s", "_type": "_doc"}}\n' \
          "$op_type" "${index_prefix}" "${day}" \
          >>"${source_data}_bulk_data.json"
        printf "%s\n" "$doc" >>"${source_data}_bulk_data.json"
      fi
    done

  curl -s -XPOST "http://elasticsearch:9200/_bulk" \
    -H "Content-Type: application/x-ndjson" \
    --data-binary "@${source_data}_bulk_data.json"
}
# Bulk insert metric and category-metric data
bulk_insert_daily_metric_data "elasticsearch_metrics" "metrics" "index"
bulk_insert_daily_metric_data "elasticsearch_category_metrics" "category-metrics" "index"
Here is my templates.json file (formatted for readability; the actual file is just one single line): templates.json · GitHub
As for the elasticsearch_metrics.json and elasticsearch_category_metrics.json data, the files contain data like this:
elasticsearch_metrics.json
[{"warehouse_uuid": "f40cc278-68c0-47ab-848d-b4a4cf201e3b", "full_table_id": "test_schema:public.sch_test_00007", "field": null, "metric": "total_byte_count", "value": 0.0, "timestamp": "2024-05-13T03:36:41", "measurement_timestamp": "2024-05-13T03:36:41", "dimensions": null, "context": {"is_bootstrap": null, "data_source": null, "data_provider": null, "consolidating_uuid": null, "latest_value": null, "baseline_value": 0.0}, "last_timestamp_in_bucket": null, "job_execution_uuid": "f9339f7c-d578-4728-99fd-1f191336b5dc", "pipeline_options": null, "mcon": "MCON++44a3f5f6-4015-44c1-8f98-3a5cbbaba2ff++f40cc278-68c0-47ab-848d-b4a4cf201e3b++table++test_schema:public.sch_test_00007", "segmented_fields": null, "thresholds": [{"type": "volume_change", "status": "inactive", "reason": null, "upper": null, "lower": null, "upper_high": null, "upper_medium": null, "upper_low": null, "lower_high": null, "lower_medium": null, "lower_low": null}]}, ...]
And the above script essentially translates each item in this JSON array into two lines in the bulk_data file. So the above JSON would become this:
{"index": {"_index": "metrics-2024-05-13", "_type": "_doc"}}
{"warehouse_uuid": "f40cc278-68c0-47ab-848d-b4a4cf201e3b" ...}
...
As you can see, I have added the line {"index": {"_index": "metrics-2024-05-13", "_type": "_doc"}} above the first item in the array. Each item in the array gets its own header line above it, with the correct _index value derived from its timestamp. This is just how the data is currently set up in our deployed environments.
elasticsearch_category_metrics.json
[{"warehouse_uuid":"f40cc278-68c0-47ab-848d-b4a4cf201e3b","full_table_id":"test_schema:smoke_tests.test_table","field":"name","metric":"category_dist","value":1,"timestamp":"2024-05-01T01:20:32","measurement_timestamp":"2024-05-01T01:20:32","dimensions":{"label":"one","monitor_uuid":"cf6fbd46-d85b-4c1a-a577-5872d60216e2"},"context":{"is_bootstrap":null,"data_source":null,"data_provider":null,"consolidating_uuid":"73a22b71-ffff-49fe-ae27-992f5d5b3aeb","latest_value":null,"baseline_value":null},"last_timestamp_in_bucket":null,"job_execution_uuid":"a01a3da3-63ed-495a-b605-ab9dde66f5ad","pipeline_options":null,"mcon":null,"segmented_fields":null,"thresholds":null}, ...]
And then the bulk_data version would look like:
{"index": {"_index": "category-metrics-2024-05-01", "_type": "_doc"}}
{"warehouse_uuid":"f40cc278-68c0-47ab-848d-b4a4cf201e3b", ...}
...
What Actually Happens
When I run my script, the category-metrics data is ingested correctly. However, when the script tries to ingest the metrics data, I get this error:
{
  "took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "metrics-2024-05-13",
        "_type": "_doc",
        "_id": null,
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "only write ops with an op_type of create are allowed in data streams"
        }
      }
    }
  ]
}
So why is metrics (apparently???) being created as a data stream, but category-metrics is not?
If it matters, I am using docker.elastic.co/elasticsearch/elasticsearch:7.10.1, since it most closely resembles the version in our staging environment.
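For reference, here is roughly how the template matching could be inspected on that version (a sketch; I believe the _index_template simulate API and the _data_stream API are both available in 7.10):

# show which templates would apply to this index name, and whether any of them defines a data stream
curl -s -XPOST "http://elasticsearch:9200/_index_template/_simulate_index/metrics-2024-05-13?pretty"

# list composable index templates and any data streams that already exist
curl -s "http://elasticsearch:9200/_index_template?pretty"
curl -s "http://elasticsearch:9200/_data_stream?pretty"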