Elasticsearch input missing [@metadata][_index]

I'm doing some work with an elasticsearch input, an elasticsearch filter for lookups, and an elasticsearch output into a new index... but the lookup relies on the [@metadata][_index] information, and I've been noticing a huge number of errors in the Logstash log about that field being missing. Any idea why that could be happening, or tips to troubleshoot it?

One of the things I tried was logger.info("Index:", event.get("[@metadata][_index]")), but in those cases it simply comes up empty...

We're running ES 6.4.2 and LS 6.8.

What does your input {} section look like? Do you have any filters such as aggregate that create events?

I have a simple, working elasticsearch input {} (tested with the stdout rubydebug option). I've also experimented with the http input; in that case I use add_field to set a fixed, known-good [@metadata][_index].

This is what I've been getting, from time to time:

[2020-03-13T22:26:13,041][WARN ][logstash.filters.elasticsearch] Failed to query elasticsearch for previous event {:index=>"%{[@metadata][_index]}", :error=>"Read timed out"}

Then it turns into

[2020-03-13T22:27:50,560][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{["LogStash::Filters::Elasticsearch", {"index"=>"%{[@metadata][_index]}", "id"=>"167a75afa10e6f0085a4f9f341c0beff25f1eb4c42c2f7e5621db8aa5e9374b9", "fields"=>{"MSRASClientName"=>"Source_MSRASClientName", "FQUser"=>"Source_FQUser", "UserName"=>"Source_UserName", "TunnelClientEndpt"=>"Source_TunnelClientEndpt", "CallingStationID"=>"Source_CallingStationID", "[@metadata][_id]"=>"Source_ID", "message"=>"Source_message"}, "query_template"=>"config/reindex_lookup.json", "hosts"=>["http://10.122.151.127:9200"]}]=>[{"thread_id"=>93, "name"=>"[VPN_Reindex]>worker0", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/manticore-0.6.4-java/lib/manticore/response.rb:50:in `call'"}]}}

Hi there,

put this docinfo => true as one of the elasticsearch input options and try again. Like:

input {
  elasticsearch {
    hosts => "your_host:your_port"
    index => "your_index"
    docinfo => true
  }
}
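One thing worth knowing while testing this: the rubydebug codec hides @metadata by default, so docinfo may already be working even if you don't see the field. A sketch of an output that makes the metadata visible:

```
output {
  stdout {
    # @metadata is stripped from rubydebug output unless metadata => true
    codec => rubydebug { metadata => true }
  }
}
```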

docinfo is already in there, and it works when I use rubydebug output.

Is it possible to see your conf pipeline file?

It's too big to insert here

I tried to install the sleep filter to work around the timeout and see whether that was what caused an empty [@metadata][_index], but due to our proxy configuration I can't reach rubygems.org.

Now to get worse...

[2020-03-16T23:34:20,644][ERROR][logstash.pipeline        ] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:VPN_Reindex
  Plugin: <LogStash::Inputs::Elasticsearch index=>"rras_vpn*", docinfo=>true, id=>"3c508463366f1442c1a7f5509dbb9e18e440d4947e37358036af9c68abd8c51f", hosts=>["http://10.122.151.127:9200"], query=>"\r\n    {\r\n      \"query\":\r\n        { \"range\" : { \r\n          \"@timestamp\" : { \r\n          \"gte\" : \"2018-11-01\", \"lte\" : \"now\"\r\n        }}},\r\n      \"sort\": [ \"@timestamp\" ] \r\n    }\r\n    ", enable_metric=>true, codec=><LogStash::Codecs::JSON id=>"json_6ffa364a-1f61-43d2-b908-93cba2cb7f31", enable_metric=>true, charset=>"UTF-8">, size=>1000, scroll=>"1m", docinfo_target=>"@metadata", docinfo_fields=>["_index", "_type", "_id"], ssl=>false>
  Error: [404] {"error":{"root_cause":[{"type":"search_context_missing_exception","reason":"No search context found for id [4881219442]"},{"type":"search_context_missing_exception","reason":"No search context found for id [4881219448]"},{"type":"search_context_missing_exception","reason":"No search context found for id [4881219457]"},

and that "No search context found" repeats for what seems to be every subsequent ID that passes through the pipeline. After that...

,{"shard":-1,"index":null,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [4881222701]"}}],"caused_by":{"type":"search_context_missing_exception","reason":"No search context found for id [4881222701]"}},"status":404}
  Exception: Elasticsearch::Transport::Transport::Errors::NotFound
  Stack: /logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/base.rb:202:in `__raise_transport_error'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/base.rb:319:in `perform_request'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/elasticsearch-transport-5.0.5/lib/elasticsearch/transport/client.rb:131:in `perform_request'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/elasticsearch-api-5.0.5/lib/elasticsearch/api/actions/scroll.rb:61:in `scroll'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/logstash-input-elasticsearch-4.3.3/lib/logstash/inputs/elasticsearch.rb:276:in `scroll_request'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/logstash-input-elasticsearch-4.3.3/lib/logstash/inputs/elasticsearch.rb:244:in `process_next_scroll'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/logstash-input-elasticsearch-4.3.3/lib/logstash/inputs/elasticsearch.rb:236:in `do_run_slice'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/logstash-input-elasticsearch-4.3.3/lib/logstash/inputs/elasticsearch.rb:209:in `do_run'
/logstash-6.8.6/vendor/bundle/jruby/2.5.0/gems/logstash-input-elasticsearch-4.3.3/lib/logstash/inputs/elasticsearch.rb:197:in `run'
/logstash-6.8.6/logstash-core/lib/logstash/pipeline.rb:426:in `inputworker'
/logstash-6.8.6/logstash-core/lib/logstash/pipeline.rb:420:in `block in start_input'
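A side note on the "No search context found" errors: as I understand it, that usually means the scroll context on the Elasticsearch side expired before the input asked for the next page, which can happen when a slow filter stage stalls the pipeline past the input's scroll keep-alive (the plugin dump above shows scroll=>"1m"). A sketch of what might help, assuming that's what is happening here:

```
input {
  elasticsearch {
    hosts   => "your_host:your_port"
    index   => "rras_vpn*"
    docinfo => true
    # keep each scroll page's context alive longer, so a slow
    # filter stage downstream doesn't let it expire (default "1m")
    scroll  => "5m"
  }
}
```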

Oh, but you are receiving that error from your elasticsearch FILTER plugin, not the input or output ones, aren't you?

At a first glance, you are referencing some metadata fields here in the filter section:

elasticsearch {
      hosts => ["http://<server_ip>:9200"]
      index => "%{[@metadata][_index]}"
      query_template => "config/reindex_lookup.json"
      fields => { 
        "MSRASClientName" => "Source_MSRASClientName"
        "FQUser" => "Source_FQUser"
        "UserName" => "Source_UserName"
        "TunnelClientEndpt" => "Source_TunnelClientEndpt"
        "CallingStationID" => "Source_CallingStationID"
        "[@metadata][_id]" => "Source_ID"
        "message" => "Source_message"
      }
    }

without using the proper docinfo_fields option.

Does anything change if you use it?

Also, does it work if you comment out your whole filter section except the fingerprint part used to build the document_id?

Finally, IMO your ruby code can be refactored. I'd sincerely recommend using either a translate filter or a ruby hash (if you want to keep everything in the conf file) to build a dictionary, because that case+elsif...elsif part is unbelievably long.
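For instance, the translate approach could look something like this (a sketch with made-up codes and descriptions; the real mappings would come from your RRAS documentation):

```
filter {
  translate {
    field       => "ReasonCode"
    destination => "ReasonCodeDescription"
    dictionary  => {
      "0"  => "Example description for code 0"
      "32" => "Example description for code 32"
    }
    # value used when the code isn't in the dictionary
    fallback => "Unknown reason code"
  }
}
```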


ok, I really didn't know that docinfo_fields was required to be explicit - every example I found while googling used the [@metadata][_index] field with docinfo => true and nothing else.

When using docinfo_fields I get

                             "@metadata" => {
              "\"_id\", \"document_id\"" => nil,
        "\"_index\", \"document_index\"" => nil,
                           "fingerprint" => 2279622448
    },

but when it's not set...

          "@metadata" => {
             "_index" => "rras_vpn-2018.11.24",
                "_id" => "d7DwEGkBOCcFUcpMwgAf",
              "_type" => "doc",
        "fingerprint" => 251235229
    },

and I simply copy/pasted from the documentation, as you can see in the updated gist's lines 16-19:

    docinfo_fields => {
      "_id" => "document_id"
      "_index" => "document_index"
    }
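If that form ends up populating @metadata correctly, the thing to keep in mind (as I understand it) is that docinfo_fields is a rename map, so the filter must then reference the renamed keys, not the original _index/_id:

```
filter {
  elasticsearch {
    hosts => ["http://<server_ip>:9200"]
    # must match the renamed key from docinfo_fields, not "_index"
    index => "%{[@metadata][document_index]}"
    query_template => "config/reindex_lookup.json"
    # remaining options (fields mapping etc.) unchanged, except that
    # [@metadata][_id] would likewise become [@metadata][document_id]
  }
}
```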

And I'm still getting failures:

[2020-03-17T23:24:04,352][INFO ][logstash.filters.elasticsearch] New ElasticSearch filter client {:hosts=>["<server>"]}
[2020-03-17T23:24:04,553][WARN ][logstash.filters.elasticsearch] Failed to query elasticsearch for previous event {:index=>"%{[@metadata][document_index]}", :error=>"[404] {\"error\":{\"root_cause\":[{\"type\":\"index_not_found_exception\",\"reason\":\"no such index\",\"resource.type\":\"index_or_alias\",\"resource.id\":\"%{[@metadata][document_index]}\",\"index_uuid\":\"_na_\",\"index\":\"%{[@metadata][document_index]}\"}],\"type\":\"index_not_found_exception\",\"reason\":\"no such index\",\"resource.type\":\"index_or_alias\",\"resource.id\":\"%{[@metadata][document_index]}\",\"index_uuid\":\"_na_\",\"index\":\"%{[@metadata][document_index]}\"},\"status\":404}"}
[2020-03-17T23:24:05,186][WARN ][logstash.filters.elasticsearch] Failed to query elasticsearch for previous event {:index=>"%{[@metadata][document_index]}", :error=>"[404] {\"error\":{\"root_cause\":[{\"type\":\"index_not_found_exception\",\"reason\":\"no such index\",\"resource.type\":\"index_or_alias\",\"resource.id\":\"%{[@metadata][document_index]}\",\"index_uuid\":\"_na_\",\"index\":\"%{[@metadata][document_index]}\"}],\"type\":\"index_not_found_exception\",\"reason\":\"no such index\",\"resource.type\":\"index_or_alias\",\"resource.id\":\"%{[@metadata][document_index]}\",\"index_uuid\":\"_na_\",\"index\":\"%{[@metadata][document_index]}\"},\"status\":404}"}

Just to make it clear, I'm using %{[@metadata][_index]} and not just [@metadata][_index] in the filter, so it should pick up the _index from the elasticsearch input and use it in the elasticsearch filter. The input reads from the whole rras_vpn-* pattern, but I want the lookup to hit only the document's specific index. Or did I misunderstand how %{} works?

But I know it's working, even if I get tons of failures, as this screenshot shows:
[screenshot]

If I comment the filter section out it might work, but then it loses its purpose, especially the elasticsearch lookup part, which is the main reason for all this.

The whole idea here is to 1) look up missing fields (if ![CallingStationID]) on PacketType:1 and 2) include some descriptive fields to make my team's work easier, so they don't have to keep checking what PacketType:3 or ReasonCode:32 means.
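A sketch of that first point, with the guard condition being my reading of when the lookup should fire (the PacketType:1 restriction presumably lives in the query template):

```
filter {
  # run the expensive lookup only for events missing the field
  if ![CallingStationID] {
    elasticsearch {
      hosts => ["http://<server_ip>:9200"]
      index => "%{[@metadata][_index]}"
      query_template => "config/reindex_lookup.json"
      # remaining options unchanged
    }
  }
}
```

Skipping the lookup for events that already have the field should also cut the query load on the cluster considerably.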

Yeah, I'm not really a ruby guy. I'm happy it worked, and I have no idea how it could be improved. I do now know that I can replace that huge piece with a case, like the one I use on line 157 - would that be better? (OK, the updated version changed the if to a case.)

Okay, I was looking into the hash thing you mentioned. Would it change lines 163..201 into something like this?

            TERMINATION_CAUSE = {
              1 => "Solicitação do Usuário",
              2 => "Perda de Sinal",
              3 => "Perda de Serviço",
              4 => "Tempo Limite de Inatividade",
              5 => "Tempo Limite da Sessão",
              6 => "Reset Administrativo",
              7 => "Reboot Administrativo",
              8 => "Erro de Porta",
              9 => "Erro do NAS",
              10 => "Solicitação do NAS",
              11 => "Reboot do NAS",
              12 => "Porta Desnecessária",
              13 => "Porta Preemptada",
              14 => "Porta Suspensa",
              15 => "Serviço Indisponível",
              16 => "Chamada de Retorno",
              17 => "Erro do Usuário",
              18 => "Solicitado pelo Host"
            }
            connectionInfo = "Fim da Conexão. Causa: #{TERMINATION_CAUSE[termCause]} | Tempo de Sessão: #{sessionTime}"

This worked, but looking at the logs I get

}
(ruby filter code):29: warning: already initialized constant TERMINATION_CAUSE
{

from time to time... why?
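A plausible explanation (my reading, not confirmed in the thread): the ruby filter's code block is re-evaluated for every event, so the constant is assigned again and again, and Ruby warns each time it is reassigned. Guarding the assignment (or moving it into the ruby filter's init option, which runs once at startup) makes it run only once. A minimal sketch with a truncated dictionary:

```ruby
# The ||= guard assigns the constant only if it is not already set,
# so re-running this code once per event no longer re-initializes it.
TERMINATION_CAUSE ||= {
  1 => "Solicitação do Usuário",
  6 => "Reset Administrativo"
}.freeze

# fetch with a default avoids a nil for unmapped cause codes
def termination_text(code)
  TERMINATION_CAUSE.fetch(code, "Causa Desconhecida")
end
```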

I'm starting to think this is all because of a sub-dimensioned Elasticsearch cluster that can't handle the input + lookup, so it keeps crashing and failing and timing out...

I'm starting to think this is all because of a sub-dimensioned Elasticsearch cluster that can't handle the input + lookup, so it keeps crashing and failing and timing out...

Well, this might definitely be the cause. You have to consider the elasticsearch filter one of the heaviest ones: if you have a throughput of, say, 500 events per second in Logstash and an index of 300k docs (very modest parameters), you're forcing your Elasticsearch to serve 500 queries per second, each scanning a 300k-doc index. If you have an underpowered ES cluster, it might not be up to this task.

Would it change lines 163..201 into something like this..?

You can see it is already better than a case statement. A case statement gives you more flexibility, since it allows more complex operations inside each branch and lets you match cases with regexes or other comparison methods. But for a simple mapping, a hash is the way to go: more compact, more readable, maybe faster.
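To illustrate the comparison (illustrative values only; both forms return the same results):

```ruby
# Two equivalent lookups: a case statement vs. a hash with a default.
# The hash version is more compact and easier to extend.
def cause_via_case(code)
  case code
  when 1 then "Solicitação do Usuário"
  when 2 then "Perda de Sinal"
  else "Desconhecida"
  end
end

CAUSES = { 1 => "Solicitação do Usuário", 2 => "Perda de Sinal" }.freeze

def cause_via_hash(code)
  CAUSES.fetch(code, "Desconhecida")
end
```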

if I comment the filter section out, it could work but it loses it's purpose. Specially the elasticsearch lookup part, that's the main reason for this.

I know, I simply wanted to understand where the problem resides. It is definitely in your elasticsearch filter part, and it is most likely a performance issue, as you stated. The fact that it works for many documents before crashing or returning errors only confirms it.
In fact, I'd guess that if you try a single, manual input, it'll always work, won't it? That means the syntax is correct. The problems only show up over a long run, which is almost certainly linked to performance.

Can you share here a sample of the documents you are working on?

Also, I see there's an index pattern in your input. Are all the indices matched by that pattern structured the same way? Have you tried applying your pipeline to one index at a time to see if a single guilty index is responsible for the failures?

I'll try later when I get back to work!

100% positive. This is VPN log data, so even though different packet types populate different fields, the structure is the same across the whole index, all fields considered.

Yes, when I switch to http input for testing and feed it one of the docs retrieved, it will always work.

I'm gonna try and see if I can convince our infra guys to improve things over there, otherwise I think I'll be stuck =\

There you go:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 8.916708,
    "hits": [
      {
        "_index": "rras_vpn-2018.11.24",
        "_type": "doc",
        "_id": "jbDwEGkBOCcFUcpMwgAf",
        "_score": 8.916708,
        "_source": {
          "ProxyPolicyName": "Microsoft Routing and Remote Access Service Policy",
          "ClientIPAddress": null,
          "AuthenticationType": "4",
          "SessionTimeout": null,
          "MSAcctAuthType": null,
          "EventTimestamp": null,
          "AcctTerminateCause": null,
          "ComputerName": "<server>",
          "CallbackNumber": null,
          "TunnelPvtGroupID": null,
          "MSMPPEEncryptionTypes": null,
          "AcctInputPackets": null,
          "AcctAuthentic": null,
          "MSRASVersion": null,
          "ProviderName": null,
          "ConnectInfo": null,
          "AcctLinkCount": null,
          "PortLimit": null,
          "AcctTunnelConn": null,
          "MSRASClientVersion": null,
          "message": """"<server>","RAS",11/24/2018,03:14:17,3,,"<user>",,,,,,,,,,,,,,,,,4,,48,"311 1 <ip> 05/26/2018 16:09:57 100364",,,,,,,,,"100365",,,,,,,,,,,,,,,,,,,,,,,,,"Microsoft Routing and Remote Access Service Policy",1,,,,
""",
          "FQUser": "<user>",
          "FramedIPAddress": null,
          "AcctInterimInterval": null,
          "AcctOutputPackets": null,
          "AcctStatusType": null,
          "NASIPAddress": null,
          "ReasonCode": "48",
          "host": "ccssvitrlx033",
          "UserName": null,
          "TunnelMediumType": null,
          "NASIdentifier": null,
          "@timestamp": "2018-11-24T05:14:17.000Z",
          "AcctDelayTime": null,
          "MSMPPEEncryptionPolicy": null,
          "PacketType": "3",
          "RemoteServerAddress": null,
          "ClientFriendlyName": null,
          "CallingStationID": null,
          "Time": "03:14:17",
          "MSRASVendor": null,
          "MSRASClientName": null,
          "ProviderType": "1",
          "ClientVendor": null,
          "@version": "1",
          "MSAcctEAPType": null,
          "CalledStationID": null,
          "AcctInputOctets": null,
          "AcctSessionTime": null,
          "ServiceName": "RAS",
          "type": "rras_log",
          "EAPFriendlyName": null,
          "TunnelAssignmentID": null,
          "TunnelServerEndpt": null,
          "TunnelPreference": null,
          "MSCHAPError": null,
          "MSCHAPDomain": null,
          "Class": "311 1 <ip> 05/26/2018 16:09:57 100364",
          "PolicyName": null,
          "ServiceType": null,
          "TunnelType": null,
          "path": "/var/leitura/entrada_vpn_full.csv",
          "TerminationAction": null,
          "AcctOutputOctets": null,
          "NASPort": null,
          "IdleTimeout": null,
          "AcctSessionID": "100365",
          "AcctMultiSsnID": null,
          "TunnelClientEndpt": null,
          "NASPortType": null,
          "Date": "11/24/2018",
          "FramedProtocol": null
        }
      },
      {
        "_index": "rras_vpn-2018.11.24",
        "_type": "doc",
        "_id": "jLDwEGkBOCcFUcpMwgAf",
        "_score": 8.1542425,
        "_source": {
          "ProxyPolicyName": "Microsoft Routing and Remote Access Service Policy",
          "ClientIPAddress": "10.222.174.4",
          "AuthenticationType": "4",
          "SessionTimeout": null,
          "MSAcctAuthType": null,
          "EventTimestamp": null,
          "AcctTerminateCause": null,
          "ComputerName": "<server>",
          "CallbackNumber": null,
          "TunnelPvtGroupID": null,
          "MSMPPEEncryptionTypes": null,
          "AcctInputPackets": null,
          "AcctAuthentic": null,
          "MSRASVersion": "MSRASV5.20",
          "ProviderName": null,
          "ConnectInfo": null,
          "AcctLinkCount": null,
          "PortLimit": null,
          "AcctTunnelConn": null,
          "MSRASClientVersion": null,
          "message": """"<server>","RAS",11/24/2018,03:14:17,1,"<user>","<user>","<ip>","<ip>",,,"<server>","10.222.174.4",129,,"10.222.174.4","<server>",,,5,,1,2,4,,0,"311 1 <ip> 05/26/2018 16:09:57 100364",,,,,,,,,"100365",,,,,,,,,1,1,"<ip>","<ip>",,,,,,,"MSRASV5.20",311,,,,,"Microsoft Routing and Remote Access Service Policy",1,,,,
""",
          "FQUser": "<user>",
          "FramedIPAddress": null,
          "AcctInterimInterval": null,
          "AcctOutputPackets": null,
          "AcctStatusType": null,
          "NASIPAddress": "10.222.174.4",
          "ReasonCode": "0",
          "host": "ccssvitrlx033",
          "UserName": "<user>",
          "TunnelMediumType": "1",
          "NASIdentifier": "<server>",
          "@timestamp": "2018-11-24T05:14:17.000Z",
          "AcctDelayTime": null,
          "MSMPPEEncryptionPolicy": null,
          "PacketType": "1",
          "RemoteServerAddress": null,
          "ClientFriendlyName": "<server>",
          "CallingStationID": "<ip>",
          "Time": "03:14:17",
          "MSRASVendor": "311",
          "MSRASClientName": null,
          "ProviderType": "1",
          "ClientVendor": null,
          "@version": "1",
          "MSAcctEAPType": null,
          "CalledStationID": "<ip>",
          "AcctInputOctets": null,
          "AcctSessionTime": null,
          "ServiceName": "RAS",
          "type": "rras_log",
          "EAPFriendlyName": null,
          "TunnelAssignmentID": null,
          "TunnelServerEndpt": "<ip>",
          "TunnelPreference": null,
          "MSCHAPError": null,
          "MSCHAPDomain": null,
          "Class": "311 1 <ip> 05/26/2018 16:09:57 100364",
          "PolicyName": null,
          "ServiceType": "2",
          "TunnelType": "1",
          "path": "/var/leitura/entrada_vpn_full.csv",
          "TerminationAction": null,
          "AcctOutputOctets": null,
          "NASPort": "129",
          "IdleTimeout": null,
          "AcctSessionID": "100365",
          "AcctMultiSsnID": null,
          "TunnelClientEndpt": "<ip>",
          "NASPortType": "5",
          "Date": "11/24/2018",
          "FramedProtocol": "1"
        }
      }
    ]
  }
}