Elastic Engineer - Lab 4.3 Transforming data - target index empty

The traffic_stats index created by my pivot transform is empty. I've tried deleting the transform and repeating the steps from the lab answers, both in the UI and by pasting the commands into the Console, but the index still contains no documents.

GET traffic_stats/_search

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Not sure what is missing. Can you assist?

So that we can help, could you please supply an example doc from your web traffic source index (redacting anything confidential) and your transform config?

Hi Sophie, thanks for getting back to me, and excuse the delay in responding. The documents are from lab 4.3 of the Elastic Engineer on-demand training. Below are two documents from the web_traffic index, its mappings, and the transform config.

GET web_traffic/_search
{ "size": 2 }

{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "web_traffic",
        "_type" : "_doc",
        "_id" : "YC8lwnwBZfkXtLD3j3Jv",
        "_score" : 1.0,
        "_source" : {
          "user_Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36",
          "request" : "/blog/find-strings-within-strings-faster-with-the-new-elasticsearch-wildcard-field",
          "content_type" : "text/html; charset=utf-8",
          "is_https" : true,
          "response" : 200,
          "verb" : "GET",
          "geoip_location_lon" : -0.0961,
          "@timestamp" : "2021-04-21T13:46:30.000Z",
          "bytes_sent" : 40322,
          "geoip_location_lat" : 51.5132,
          "runtime_ms" : 390558
        }
      },
      {
        "_index" : "web_traffic",
        "_type" : "_doc",
        "_id" : "bi8lwnwBZfkXtLD3j3Jv",
        "_score" : 1.0,
        "_source" : {
          "user_Agent" : "got (https://github.com/sindresorhus/got)",
          "request" : "/blog/this-week-in-elasticsearch-and-apache-lucene-2017-12-18",
          "content_type" : "text/html; charset=utf-8",
          "is_https" : true,
          "response" : 200,
          "verb" : "GET",
          "geoip_location_lon" : -77.2481,
          "@timestamp" : "2021-04-16T06:06:57.000Z",
          "bytes_sent" : 25122,
          "geoip_location_lat" : 38.6583,
          "runtime_ms" : 304112
        }
      }
    ]
  }
}

GET web_traffic/_mapping

{
  "web_traffic" : {
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "bytes_sent" : {
          "type" : "long"
        },
        "content_type" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "geo" : {
          "properties" : {
            "location" : {
              "type" : "geo_point"
            }
          }
        },
        "geoip_location_lat" : {
          "type" : "float"
        },
        "geoip_location_lon" : {
          "type" : "float"
        },
        "http" : {
          "properties" : {
            "request" : {
              "properties" : {
                "method" : {
                  "type" : "keyword"
                }
              }
            },
            "response" : {
              "properties" : {
                "status_code" : {
                  "type" : "keyword"
                }
              }
            }
          }
        },
        "is_https" : {
          "type" : "boolean"
        },
        "request" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "response" : {
          "type" : "long"
        },
        "runtime_ms" : {
          "type" : "long"
        },
        "url" : {
          "properties" : {
            "original" : {
              "type" : "keyword",
              "fields" : {
                "text" : {
                  "type" : "text"
                }
              }
            }
          }
        },
        "user_Agent" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "user_agent" : {
          "properties" : {
            "device" : {
              "properties" : {
                "name" : {
                  "type" : "keyword"
                }
              }
            },
            "name" : {
              "type" : "keyword"
            },
            "original" : {
              "type" : "keyword",
              "fields" : {
                "text" : {
                  "type" : "text"
                }
              }
            },
            "version" : {
              "type" : "keyword"
            }
          }
        },
        "verb" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

GET _transform/traffic_stats

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "traffic_stats",
      "version" : "7.13.1",
      "create_time" : 1635420306615,
      "source" : {
        "index" : [
          "web_traffic"
        ],
        "query" : {
          "match_all" : { }
        }
      },
      "dest" : {
        "index" : "traffic_stats"
      },
      "frequency" : "1m",
      "pivot" : {
        "group_by" : {
          "url.original" : {
            "terms" : {
              "field" : "url.original"
            }
          }
        },
        "aggregations" : {
          "@timestamp.value_count" : {
            "value_count" : {
              "field" : "@timestamp"
            }
          },
          "runtime_ms.avg" : {
            "avg" : {
              "field" : "runtime_ms"
            }
          }
        }
      },
      "settings" : {
        "max_page_search_size" : 500
      }
    }
  ]
}


I think the transform not working correctly is linked to another question I have about the training course labs.

Lab 5.3: Scaling Elasticsearch contains the instruction

"Using the Reindex API, reindex the documents from web_traffic into temp1 where user_agent.os.name.keyword equals "Android" (which will be 69,630 documents). "

The user_agent.os.name.keyword field does not appear in the mappings for the web_traffic index supplied in Lab 4.1: Changing Data.

Should I go back and restructure web_traffic to include user_agent.os.name.keyword, perhaps using the user_agent processor? Or have I just missed something in the lab instructions? Up to this point, the information needed to build the lab has been supplied in full, but unless I'm missing something, this step seems to require troubleshooting and configuration that isn't specified. I'm wondering whether that's the way the labs are meant to work or whether I've just not followed the steps.
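One way to confirm which field paths a mapping actually defines is to flatten it into dotted names. The following is a minimal Python sketch, not part of the lab; the helper name `field_paths` is hypothetical, and the mapping is only the `user_agent` excerpt from the `GET web_traffic/_mapping` response above.

```python
from typing import Any


def field_paths(properties: dict[str, Any], prefix: str = "") -> list[str]:
    """Flatten a mapping's 'properties' into dotted field paths.

    Multi-fields declared under 'fields' (for example 'original.text')
    are included as well.
    """
    paths: list[str] = []
    for name, spec in properties.items():
        path = prefix + name
        if "type" in spec:
            paths.append(path)
        for sub_name in spec.get("fields", {}):
            paths.append(path + "." + sub_name)
        if "properties" in spec:
            paths.extend(field_paths(spec["properties"], path + "."))
    return paths


# Excerpt of the web_traffic mapping posted in this thread (user_agent only).
mapping_excerpt = {
    "user_agent": {
        "properties": {
            "device": {"properties": {"name": {"type": "keyword"}}},
            "name": {"type": "keyword"},
            "original": {
                "type": "keyword",
                "fields": {"text": {"type": "text"}},
            },
            "version": {"type": "keyword"},
        }
    }
}

paths = field_paths(mapping_excerpt)
print(paths)
print("user_agent.os.name.keyword" in paths)  # False
```

For this excerpt the lookup returns False, which matches the observation that the field named in Lab 5.3 is not mapped.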

@sophie_chang can you kindly assist with Michael's question?

Hi @micw

Sorry for the delay, I missed your update.

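Thanks for sending through the source docs and configs. The transform uses the field url.original in its group_by clause. However, although this field is mapped, it does not exist in the source docs that were supplied, so the terms grouping has nothing to bucket and the destination index stays empty.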
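The same check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helpers `has_field` and `transform_fields` are hypothetical names, the document is a trimmed copy of the second hit posted above, and the check looks at `_source` only, so multi-fields such as `verb.keyword` would not resolve with it.

```python
from typing import Any


def has_field(doc: dict[str, Any], path: str) -> bool:
    """Return True if a dotted path such as 'url.original' resolves in doc."""
    if path in doc:  # a literal top-level key, e.g. '@timestamp'
        return True
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return False
        current = current[part]
    return True


def transform_fields(pivot: dict[str, Any]) -> list[str]:
    """Collect every 'field' referenced by group_by and aggregations."""
    fields: list[str] = []
    for section in ("group_by", "aggregations"):
        for definition in pivot.get(section, {}).values():
            for body in definition.values():
                if isinstance(body, dict) and "field" in body:
                    fields.append(body["field"])
    return fields


# Trimmed copy of one web_traffic document from this thread.
sample_doc = {
    "request": "/blog/this-week-in-elasticsearch-and-apache-lucene-2017-12-18",
    "verb": "GET",
    "@timestamp": "2021-04-16T06:06:57.000Z",
    "runtime_ms": 304112,
}

# The pivot section of the traffic_stats transform config.
pivot = {
    "group_by": {"url.original": {"terms": {"field": "url.original"}}},
    "aggregations": {
        "@timestamp.value_count": {"value_count": {"field": "@timestamp"}},
        "runtime_ms.avg": {"avg": {"field": "runtime_ms"}},
    },
}

missing = [f for f in transform_fields(pivot) if not has_field(sample_doc, f)]
print(missing)  # ['url.original']
```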

The following transform, which substitutes verb.keyword as the grouping field, would work. It is shown using the _preview call, which lets you see what the transform would output without actually having to create it.

POST _transform/_preview
{
  "source": {
    "index": [
      "web_traffic"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "traffic_stats"
  },
  "pivot": {
    "group_by": {
      "verb": {
        "terms": {
          "field": "verb.keyword"
        }
      }
    },
    "aggregations": {
      "@timestamp.value_count": {
        "value_count": {
          "field": "@timestamp"
        }
      },
      "runtime_ms.avg": {
        "avg": {
          "field": "runtime_ms"
        }
      }
    }
  }
}
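To make the effect of the grouping field concrete, here is a rough Python sketch of what a terms pivot with a count and an average computes. It is not how the transform is implemented: `pivot_by` is a hypothetical helper, the count is simply the number of documents per group rather than a true value_count on @timestamp, and the input is just the verb and runtime_ms values of the two documents posted above.

```python
from collections import defaultdict
from typing import Any


def pivot_by(docs: list[dict[str, Any]], group_field: str, avg_field: str) -> list[dict[str, Any]]:
    """Group docs by group_field and report a count and an average per group."""
    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        if group_field not in doc:
            # Documents without the grouping field land in no bucket.
            continue
        groups[doc[group_field]].append(doc)

    rows = []
    for key in sorted(groups):
        members = groups[key]
        rows.append({
            group_field: key,
            "count": len(members),
            "avg_" + avg_field: sum(d[avg_field] for d in members) / len(members),
        })
    return rows


docs = [
    {"verb": "GET", "runtime_ms": 390558},
    {"verb": "GET", "runtime_ms": 304112},
]

print(pivot_by(docs, "verb", "runtime_ms"))
# [{'verb': 'GET', 'count': 2, 'avg_runtime_ms': 347335.0}]

print(pivot_by(docs, "url.original", "runtime_ms"))
# []
```

Grouping on a field that every document has yields one row per distinct value, while grouping on a field that no document has yields no rows at all, which is the empty-index symptom reported at the top of this thread.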

I am not familiar with these labs, but the source data examples from the web_traffic index seem to be missing fields. I will raise this with the education team. Is it possible that the earlier steps to ingest the web_traffic data threw errors? Best of luck with your engineer training.

Sophie

Thanks Sophie. I will try your suggestions. Will the education team let you know what they find? I would love to hear whether this troubleshooting is part of the course design.

Hello @micw !

Looking at the mapping you sent for the web_traffic index, it seems you missed some steps, in particular step 4.1.8, where you need to create an ingest pipeline.

Take a look at the ingest pipeline below:

PUT _ingest/pipeline/web_traffic_pipeline
{
  "processors": [
    {
      "remove": {
        "field": "is_https",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "request",
        "target_field": "url.original",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "verb",
        "target_field": "http.request.method",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "response",
        "target_field": "http.response.status_code",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "geoip_location_lat",
        "target_field": "geo.location.lat",
        "ignore_missing": true
      }
    },
    {
      "rename": {
        "field": "geoip_location_lon",
        "target_field": "geo.location.lon",
        "ignore_missing": true
      }
    },
    {
      "user_agent": {
        "field": "user_Agent",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "user_Agent",
        "ignore_missing": true
      }
    }
  ]
}

Once your documents have been indexed through this ingest pipeline, the request field is renamed to url.original, which is the field the transform groups by. The transform should work much better after this!
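As a rough illustration of what the rename and remove processors do to a document, here is a small Python sketch. It is not the ingest node's implementation: `set_path` and `apply_processors` are hypothetical helpers, only a subset of the processors above is used, every processor is treated as if ignore_missing were true, and the user_agent processor is left out entirely.

```python
import copy
from typing import Any


def set_path(doc: dict[str, Any], path: str, value: Any) -> None:
    """Write value at a dotted path, creating nested objects as needed."""
    parts = path.split(".")
    current = doc
    for part in parts[:-1]:
        current = current.setdefault(part, {})
    current[parts[-1]] = value


def apply_processors(doc: dict[str, Any], processors: list[dict[str, Any]]) -> dict[str, Any]:
    """Apply 'remove' and 'rename' processors to a copy of doc."""
    result = copy.deepcopy(doc)
    for processor in processors:
        name, config = next(iter(processor.items()))
        field = config["field"]
        if name == "remove":
            result.pop(field, None)
        elif name == "rename" and field in result:
            set_path(result, config["target_field"], result.pop(field))
        # Any other processor type (e.g. user_agent) is skipped in this sketch.
    return result


# Subset of the web_traffic_pipeline processors shown above.
processors = [
    {"remove": {"field": "is_https", "ignore_missing": True}},
    {"rename": {"field": "request", "target_field": "url.original", "ignore_missing": True}},
    {"rename": {"field": "verb", "target_field": "http.request.method", "ignore_missing": True}},
    {"rename": {"field": "response", "target_field": "http.response.status_code", "ignore_missing": True}},
]

# Trimmed copy of one web_traffic document from this thread.
raw_doc = {
    "request": "/blog/this-week-in-elasticsearch-and-apache-lucene-2017-12-18",
    "is_https": True,
    "response": 200,
    "verb": "GET",
    "runtime_ms": 304112,
}

result = apply_processors(raw_doc, processors)
print(result["url"]["original"])
# /blog/this-week-in-elasticsearch-and-apache-lucene-2017-12-18
print("request" in result)  # False
```

After the renames, the document carries url.original instead of request, so a group_by on url.original has values to bucket.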

Let me know if you are able to make it work.

Romain