Failed to connect to backoff(elasticsearch(http://es01:9200)): Connection marked as failed because the onConnect callback failed: resource 'apm-7.5.2-error' exists, but it is not an alias

Kibana version:
7.5.2
Elasticsearch version:
7.5.2
APM Server version:
7.5.2
APM Agent language and version:
python elastic-apm 5.5.2 elasticsearch 7.5.1

I have two server machines: one runs the Elastic stack and the other runs my Flask application. This is how we set up our Elastic server:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic

  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata02:/usr/share/elasticsearch/data
    networks:
      - elastic

  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata03:/usr/share/elasticsearch/data
    networks:
      - elastic

  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:$ELASTIC_VERSION
    environment:
      - ELASTICSEARCH_HOSTS=http://es01:9200
    ports:
      - "5600:5601"
    networks:
      - elastic 
    depends_on:
      - es01

  apm-server:
    container_name: apm-server
    image: store/elastic/apm-server:$ELASTIC_VERSION
    user: apm-server
    ports:
      - "7200:7200"
    depends_on: ["es01", "kibana"]
    #volumes:
    #  - ./apm-conf/apm-server.yml:/usr/share/apm-server/apm-server.yml
    command: /usr/share/apm-server/apm-server -e -c /usr/share/apm-server/apm-server.yml -E apm-server.host=apm-server:7200 --strict.perms=false -E output.elasticsearch.hosts=["es01:9200"] -E setup.kibana.host="kibana:5600"
    networks:
      - elastic

volumes:
  esdata01:
    external: true
  esdata02:
    external: true
  esdata03:
    external: true

networks:
  elastic:
    driver: bridge

We use it to index city names, locations, and so on. We added Elastic APM to our Flask app so our application logs are indexed in Elasticsearch too.
Everything was fine until last week, when both machines were shut down for about three days while they were relocated. When we turned the servers back on, I started the Docker containers and everything came back to normal except apm-server. I saw this error in the log of my Flask app:

services_1         | ERROR:elasticapm.transport:Failed to submit message: 'HTTP 503: {"accepted":0,"errors":[{"message":"queue is full"}]}\n'
services_1         | Traceback (most recent call last):
services_1         |   File "/usr/local/lib/python3.6/site-packages/elasticapm/transport/base.py", line 224, in _flush
services_1         |     self.send(data)
services_1         |   File "/usr/local/lib/python3.6/site-packages/elasticapm/transport/http.py", line 105, in send
services_1         |     raise TransportException(message, data, print_trace=print_trace)
services_1         | elasticapm.transport.base.TransportException: HTTP 503: {"accepted":0,"errors":[{"message":"queue is full"}]}

I searched for this error and found https://www.elastic.co/guide/en/apm/server/master/common-problems.html#queue-full, but I didn't understand how to solve the problem. I added output.elasticsearch.bulk_max_size=5120 to the command section of apm-server and restarted the apm-server container. After that, a new error appeared:
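For reference, the same setting expressed in apm-server.yml form (equivalent to the -E flag I used). The queue.mem.events key is an assumption taken from the generic Beats internal-queue settings that the linked page points at, not something I verified for this version:

```yaml
# equivalent of -E output.elasticsearch.bulk_max_size=5120
output.elasticsearch:
  bulk_max_size: 5120

# the linked common-problems page also mentions the internal
# queue size (assumed key, from the Beats queue settings):
queue.mem:
  events: 4096
```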

services_1         | WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f31517d53c8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /intake/v2/events
services_1         | ERROR:elasticapm.transport:Failed to submit message: 'Connection to APM Server timed out (url: http://192.168.49.37:7200/intake/v2/events, timeout: 5 seconds)'

Meanwhile, I see this in the log of apm-server:

apm-server    | 2020-06-07T13:06:49.514Z        INFO    [request]       middleware/log_middleware.go:76 request accepted        {"request_id": "f4e3fd78-169e-4c89-a288-d8f2f1c35ed5", "method": "POST", "URL": "/intake/v2/events", "content_length": 714, "remote_address": "192.168.49.35", "user-agent": "elasticapm-python/5.5.2", "response_code": 202}
apm-server    | 2020-06-07T13:07:10.023Z        INFO    [request]       middleware/log_middleware.go:76 request accepted        {"request_id": "66791de1-f896-42ce-bc99-99a2c389afd3", "method": "POST", "URL": "/intake/v2/events", "content_length": 1253, "remote_address": "192.168.49.35", "user-agent": "elasticapm-python/5.5.2", "response_code": 202}
apm-server    | 2020-06-07T13:07:30.378Z        ERROR   pipeline/output.go:100  Failed to connect to backoff(elasticsearch(http://es01:9200)): Connection marked as failed because the onConnect callback failed: resource 'apm-7.5.2-error' exists, but it is not an alias
apm-server    | 2020-06-07T13:07:30.378Z        INFO    pipeline/output.go:93   Attempting to reconnect to backoff(elasticsearch(http://es01:9200)) with 136 reconnect attempt(s)
apm-server    | 2020-06-07T13:07:30.379Z        INFO    [publisher]     pipeline/retry.go:196   retryer: send unwait-signal to consumer
apm-server    | 2020-06-07T13:07:30.379Z        INFO    [publisher]     pipeline/retry.go:198     done
apm-server    | 2020-06-07T13:07:30.379Z        INFO    [publisher]     pipeline/retry.go:173   retryer: send wait signal to consumer
apm-server    | 2020-06-07T13:07:30.379Z        INFO    [publisher]     pipeline/retry.go:175     done
apm-server    | 2020-06-07T13:07:30.381Z        INFO    elasticsearch/client.go:753     Attempting to connect to Elasticsearch version 7.5.2
apm-server    | 2020-06-07T13:07:30.426Z        INFO    [pipelines]     pipeline/register.go:53 Pipeline already registered: apm
apm-server    | 2020-06-07T13:07:30.435Z        INFO    [pipelines]     pipeline/register.go:53 Pipeline already registered: apm_user_agent
apm-server    | 2020-06-07T13:07:30.437Z        INFO    [pipelines]     pipeline/register.go:53 Pipeline already registered: apm_user_geo
apm-server    | 2020-06-07T13:07:30.437Z        INFO    [pipelines]     pipeline/register.go:56 Registered Ingest Pipelines successfully.
apm-server    | 2020-06-07T13:07:30.437Z        INFO    [index-management]      idxmgmt/manager.go:84   Overwrite ILM setup is disabled.
apm-server    | 2020-06-07T13:07:30.438Z        INFO    [index-management]      idxmgmt/manager.go:203  Set setup.template.name to 'apm-%{[observer.version]}'.
apm-server    | 2020-06-07T13:07:30.438Z        INFO    [index-management]      idxmgmt/manager.go:205  Set setup.template.pattern to 'apm-%{[observer.version]}*'.
apm-server    | 2020-06-07T13:07:30.446Z        INFO    template/load.go:89     Template apm-7.5.2 already exists and will not be overwritten.
apm-server    | 2020-06-07T13:07:30.446Z        INFO    [index-management]      idxmgmt/manager.go:211  Finished loading index template.
apm-server    | 2020-06-07T13:07:30.449Z        INFO    [index-management.ilm]  ilm/std.go:138  do not generate ilm policy: exists=true, overwrite=false
apm-server    | 2020-06-07T13:07:30.449Z        INFO    [index-management]      idxmgmt/manager.go:240  ILM policy apm-rollover-30-days successfully loaded.
apm-server    | 2020-06-07T13:07:30.456Z        INFO    template/load.go:89     Template apm-7.5.2-error already exists and will not be overwritten.
apm-server    | 2020-06-07T13:07:30.456Z        INFO    [index-management]      idxmgmt/manager.go:223  Finished template setup for apm-7.5.2-error.
apm-server    | 2020-06-07T13:07:39.848Z        ERROR   [request]       middleware/log_middleware.go:74 forbidden request       {"request_id": "30938466-b8a9-4e9e-9edc-3074ea2bd872", "method": "POST", "URL": "/config/v1/agents", "content_length": 40, "remote_address": "192.168.49.35", "user-agent": "elasticapm-python/5.5.2", "response_code": 403, "error": "forbidden request: Agent remote configuration is disabled. Configure the `apm-server.kibana` section in apm-server.yml to enable it. If you are using a RUM agent, you also need to configure the `apm-server.rum` section. If you are not using remote configuration, you can safely ignore this error."}
apm-server    | 2020-06-07T13:07:48.284Z        INFO    [request]       middleware/log_middleware.go:76 request accepted        {"request_id": "7b75214e-6ff4-401a-a2b8-4e0f415cf3a8", "method": "POST", "URL": "/intake/v2/events", "content_length": 556, "remote_address": "192.168.49.35", "user-agent": "elasticapm-python/5.5.2", "response_code": 202}

It should be noted that Elasticsearch itself is fine and I can search my index without any problem:

services_1         | INFO:elasticsearch:GET http://192.168.49.37:9200/cities/_search?ignore_unavailable=true&size=5 [status:200 request:0.008s]

I'm new to elastic and I don't know what to do. Do you have any idea where the problem is? What do you think I should do now?

@mac_71128 welcome to the forum!

This sounds like a known issue that occurs when an alias is deleted: https://github.com/elastic/apm-server/issues/3698#issuecomment-620865066. Perhaps someone deleted apm-* in Elasticsearch? The recovery steps are in the preceding comment on that issue.

We have some work underway that should prevent this from recurring, but hopefully that issue helps you resolve the problem in the short term.
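For readers skimming this thread, the gist of those recovery steps as a sketch, using the index names that appear in this thread; follow the linked issue for the authoritative sequence:

```
# 1. Copy the conflicting concrete index aside
POST _reindex
{
  "source": { "index": "apm-7.5.2-error" },
  "dest": { "index": "apm-7.5.2-error-original" }
}

# 2. Delete the concrete index that is blocking the alias
DELETE apm-7.5.2-error

# 3. Restart apm-server so it can recreate "apm-7.5.2-error"
#    as an alias with a fresh ILM-managed backing index
```

Repeat per affected event type (error, transaction, span, metric).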

Thanks. I don't remember deleting anything :sweat_smile:. Anyway, I followed all the steps in that comment for apm-7.5.2-transaction, apm-7.5.2-metric, apm-7.5.2-error, and apm-7.5.2-span. Now everything is back to normal, thanks to you. I just don't know why their new names look weird.

Is it possible to change their name? Is it possible to merge apm-*-original with these new ones?


How can I fix these lifecycle errors?
Thank you in advance

When you say they look weird, are you referring to the "-000001" suffix? This is due to index lifecycle management (ILM). This is normal: the alias (e.g. "apm-7.5.2-span") sends writes to the most recent backing index (e.g. "apm-7.5.2-span-000001"), which enables things like moving older data to slower, cheaper storage.
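You can see this mechanism directly by asking Elasticsearch which concrete indices back an alias; a sketch using the names from this thread:

```
GET _alias/apm-7.5.2-span
```

The response lists each backing index under the alias; with rollover-managed aliases, the newest one carries the write flag so new documents land there.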

Is it possible to change their name? Is it possible to merge apm-*-original with these new ones?

Yes. If you want to keep the data but get rid of the apm-*-original indices, then you can reindex into the aliases, e.g.

POST _reindex
{
  "source": {
    "index": "apm-7.5.2-span-original"
  },
  "dest": {
    "index": "apm-7.5.2-span"
  }
}

After that the docs will exist in two places, so you should delete the original index to prevent the APM data from being counted twice in aggregations.
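For example, once the reindex above completes and you have confirmed the documents are searchable through the alias:

```
DELETE apm-7.5.2-span-original
```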

How can I fix these lifecycle errors?

Not sure - what are the error details? They're probably related to the index names not conforming to the ILM rollover action's requirements. Once you delete those indices, I would expect the errors to disappear.
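If the errors persist after cleanup, the ILM explain API reports why a particular index is stuck in its lifecycle; a sketch using an index name from this thread:

```
GET apm-7.5.2-span-original/_ilm/explain
```

The response includes the current phase, step, and any failed-step details for the index.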

Okay, got it. Thanks a lot.
I saw these lifecycle errors before applying those steps; I'm sure of that. I took a picture of each one of them.




This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.