Rally Actors exit prematurely


#1

I've written a new track and been trying to use that to run benchmarks against existing aws elasticsearch cluster. Default rally tracks work fine however for the custom tracks i see something like this

018-04-05 18:32:29,14 PID:105 rally.track INFO Skipping 116156 lines took 9.139521 s.
2018-04-05 18:32:29,406 PID:99 rally.track INFO Skipping 16594 lines took 9.514135 s.
2018-04-05 18:32:29,881 PID:104 rally.driver INFO LoadGenerator[6] is executing (no samples).
2018-04-05 18:32:29,882 PID:101 rally.driver INFO LoadGenerator[3] is executing (no samples).
2018-04-05 18:32:29,910 PID:105 rally.driver INFO LoadGenerator[7] is executing (no samples).
2018-04-05 18:32:29,912 PID:100 rally.driver INFO LoadGenerator[2] is executing (no samples).
2018-04-05 18:32:29,914 PID:103 rally.driver INFO LoadGenerator[5] is executing (no samples).
2018-04-05 18:32:29,929 PID:102 rally.driver INFO LoadGenerator[4] is executing (no samples).
2018-04-05 18:32:29,936 PID:99 rally.driver INFO LoadGenerator[1] is executing (no samples).
2018-04-05 18:32:30,883 PID:98 rally.driver INFO LoadGenerator[0] is executing (no samples).
2018-04-05 18:32:36,813 PID:21 rally.racecontrol INFO Received a benchmark failure from [ActorAddr-(T|:43747)] and will forward it now.
2018-04-05 18:32:38,98 PID:1 rally.racecontrol ERROR A benchmark failure has occurred
2018-04-05 18:32:38,323 PID:1 rally.racecontrol INFO Telling benchmark actor to exit.
2018-04-05 18:32:38,477 PID:21 rally.racecontrol INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2018-04-05 18:32:38,543 PID:1 rally.main INFO Attempting to shutdown internal actor system.
2018-04-05 18:32:38,598 PID:83 rally.driver INFO Main driver received ActorExitRequest and will terminate all load generators.
2018-04-05 18:32:38,432 PID:99 rally.driver INFO LoadGenerator[1] is executing (no samples).
2018-04-05 18:32:38,832 PID:99 rally.driver INFO LoadGenerator[1] is exiting due to ActorExitRequest.
2018-04-05 18:32:39,714 PID:100 rally.driver INFO LoadGenerator[2] is exiting due to ActorExitRequest.
2018-04-05 18:32:39,695 PID:104 rally.driver INFO LoadGenerator[6] is exiting due to ActorExitRequest.
2018-04-05 18:32:39,735 PID:102 rally.driver INFO LoadGenerator[4] is exiting due to ActorExitRequest.
2018-04-05 18:32:39,756 PID:103 rally.driver INFO LoadGenerator[5] is exiting due to ActorExitRequest.
2018-04-05 18:32:39,810 PID:101 rally.driver INFO LoadGenerator[3] is exiting due to ActorExitRequest.
2018-04-05 18:32:40,560 PID:21 rally.racecontrol INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2018-04-05 18:32:40,367 PID:22 rally.mechanic INFO MechanicActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:45923)])
2018-04-05 18:32:44,835 PID:83 rally.driver INFO Load generator [4] has exited.
2018-04-05 18:32:45,27 PID:82 rally.mechanic INFO Flushing system metrics store on host [[{'host': '<es-cluster>', 'port': 80}]].
2018-04-05 18:32:45,477 PID:82 rally.mechanic INFO Stopping nodes [BrDhwV_, 7e_HqqT, rXspKn9, xQ5UagV, j3El_9e, ggT2QYD, 3ZZ72Vz, d5qVUCh] (due to ActorExitRequest)
2018-04-05 18:32:45,900 PID:22 rally.mechanic ERROR Child actor exited with [ChildActorExited:ActorAddr-(T|:39247)] while in status [benchmark_started].
2018-04-05 18:32:46,501 PID:21 rally.racecontrol INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:33391)] (ignoring).
2018-04-05 18:32:47,482 PID:103 rally.track INFO Skipping 82969 lines took 27.553190 s.
2018-04-05 18:32:49,88 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:32:51,946 PID:100 rally.driver INFO User cancelled execution.
2018-04-05 18:32:52,144 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:32:52,421 PID:83 rally.driver INFO Load generator [2] has exited.
2018-04-05 18:32:52,462 PID:83 rally.driver INFO Load generator [2] has exited.
2018-04-05 18:32:53,147 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:32:55,816 PID:104 rally.track INFO Skipping 99562 lines took 35.889850 s.
2018-04-05 18:32:56,153 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:32:58,419 PID:101 rally.track INFO Skipping 49781 lines took 38.468635 s.
2018-04-05 18:32:59,418 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:32:59,534 PID:104 rally.driver INFO User cancelled execution.
2018-04-05 18:33:00,56 PID:83 rally.driver INFO Load generator [6] has exited.
2018-04-05 18:33:00,65 PID:83 rally.driver INFO Load generator [6] has exited.
2018-04-05 18:33:00,69 PID:105 rally.driver INFO LoadGenerator[7] is executing (no samples).
2018-04-05 18:33:00,188 PID:105 rally.driver INFO LoadGenerator[7] is exiting due to ActorExitRequest.
2018-04-05 18:33:00,455 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:33:00,549 PID:101 rally.driver INFO User cancelled execution.
2018-04-05 18:33:00,660 PID:83 rally.driver INFO Load generator [3] has exited.
2018-04-05 18:33:00,690 PID:83 rally.driver INFO Load generator [3] has exited.
2018-04-05 18:33:01,460 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:33:02,464 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:33:03,469 PID:1 rally.main INFO Actor system is still running. Waiting...
2018-04-05 18:33:04,474 PID:1 rally.main WARNING Shutdown timed out. Actor system is still running.
2018-04-05 18:33:04,486 PID:1 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 560, in dispatch_sub_command
    race(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 490, in race
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 510, in with_actor_system
    runnable(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/rally.py", line 490, in <lambda>
    with_actor_system(lambda c: racecontrol.run(c), cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 377, in run
    raise e
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 374, in run
    pipeline(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 63, in __call__
    self.target(cfg)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 322, in benchmark_only
    return race(cfg, external=True)
  File "/usr/local/lib/python3.6/site-packages/esrally/racecontrol.py", line 276, in race
    raise exceptions.RallyError(result.message, result.cause)
esrally.exceptions.RallyError: ('Load generator [0] has exited prematurely.', None)

Elasticsearch is running version 6.0 and am running rally on a single container in ECS for benchmarking only


(Daniel Mitterdorfer) #2

The log output indicates that an error has occurred but unfortunately, the details are missing which seems rather odd. Can you please also check the other log files? Is it possible to share your track (just the track.json) without the data? If you cannot show it in public you can also send a PM on Discuss.

What is also a bit concerning is:

018-04-05 18:32:29,14 PID:105 rally.track INFO Skipping 116156 lines took 9.139521 s.

Each client that bulk-indexes, reads a portion of the data file(s) and needs to advance to the correct position first. We have taken precautions that this is quick with those offset files that Rally creates once. Then this skipping should be way less than a second but in your case it took 9 seconds which is alarming. Are you on a very slow disk? If yes, you should switch to an SSD otherwise you might introduce an accidental bottleneck in the load generator.


#3

@danielmitterdorfer

I am running the docker on t2.large instance and allocated 2048 MB RAM. That might explain why it is slow. Following is the track.json file

{% import "rally.helpers" as rally with context %}
{
    "version": 1,
    "description": "Benchmark for features type",
    "indices": [
      {
        "name": "features-benchmark",
        "auto-managed": false,
        "body": "index.json",
        "types": ["features"]
      }
    ],
    "corpora": [
      {
        "name": "features-benchmark",
        "documents": [
          {
            "source-file": "/root/documents.json.bz2",
            "document-count": 132750,
            "compressed-bytes": 94550166,
            "uncompressed-bytes": 7204426591
          }
        ]
      }
    ],
    "operations": [
        {{ rally.collect(parts="operations/*.json") }}
    ],
    "challenges": [
        {{ rally.collect(parts="challenges/*.json") }}
    ]
  }


(Daniel Mitterdorfer) #4

Hi,

Can you please place documents.json.bz2 either next to track.json (if you use the command line parameter --track-path) or alternatively in ~/.rally/benchmarks/data/features-benchmark? Then you just need to specify the file name:

"source-file": "documents.json.bz2"

Rally will automatically check whether the file is in the same directory as track.json or ~/.rally/benchmarks/data/features-benchmark. Specifying an absolute path seems to work but I'd like to make sure that this is not the problem here.

Apart from that, the file looks fine. Can you please also share one complete Rally log file? (i.e. ~/.rally/logs/rally_out_$RACE_TIMESTAMP_HERE.log)? Among other things it shows how the finally rendered track looks like after resolving includes etc.

Thanks

Daniel


(Daniel Mitterdorfer) #5

Hi,

I had a look at the log file that you have sent me in private. The track looks fine in general; it seems to be based on one of our standard tracks (I guess geonames).

Towards the end of your log file I see request timeouts (cluster URL redacted):

2018-04-06 14:32:23,307 PID:94 elasticsearch WARNING POST http://YOUR_CLUSTER_IP/_bulk [status:N/A request:60.331s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.6/http/client.py", line 1065, in _send_output
    self.send(chunk)
  File "/usr/local/lib/python3.6/http/client.py", line 986, in send
    self.sock.sendall(data)
socket.timeout: timed out

While I'd have expected a better error message in this case, the problem seems to be that you are injecting too much load and the cluster cannot cope with it. I'd start by:

  • Reducing the number of bulk indexing clients (start with 1 client and increase then gradually)
  • Reducing the bulk size (hard to come up with a proper value without knowing anything about your documents).

For now, I'd leave the request timeout at its default value (60 seconds).


#6

thanks @danielmitterdorfer. Will make the changes and try again!


#7

That didn't seem to work either. Gives the same error


(Daniel Mitterdorfer) #8

Hi,

  • What values did you choose for bulk size and number of clients?
  • Can you share one example document or at least its size which you can determine e.g. with head -n 1 YOUR_DOCUMENT_FILE.json | wc -c?
  • What specifications does the cluster have (especially heap size)?
  • Do you see any problems in the cluster logs (e.g. exceptions being thrown, cluster restarted, ...)?

Thanks

Daniel


#9

I set bulk size to 2500 and number of clients to 1.
Size of the document is 54311 bytes
I am actually trying to benchmark an AWS elasticsearch cluster with no reported issues.


(Daniel Mitterdorfer) #10

That results in one bulk request of roughly 130MB (54kB * 2500) which is quite large IMHO. I would definitely lower that to e.g. 200 documents per bulk. If that succeeds you can always increase the number.


#11

I decreased the bulk size from 2500 to 200 but still no luck. It did complete running the whole benchmark but when I checked the index in my ES it's actually empty.

HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 255
Content-Type: application/json; charset=UTF-8

{
    "features-benchmark": {
        "aliases": {},
        "mappings": {},
        "settings": {
            "index": {
                "creation_date": "1523546542961",
                "number_of_replicas": "1",
                "number_of_shards": "5",
                "provided_name": "features-benchmark",
                "uuid": "YboJlWcwQtSDI13RLYwKvA",
                "version": {
                    "created": "6000199"
                }
            }
        }
    }
}

Do I need to provide a mappings.json file. I have following in index.json

{
  "settings": {
    "index.number_of_replicas": 1
  },
    "mappings" : {
      "features": {
        "dynamic_templates": [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "keyword",
                        "index": true
                    }
                }
            }
        ],
        "properties": {
            "segments": {
                "type": "long"
            }
        }                
    }
    }
  }


Also how do I check if my track can work with ES version 6.2? Sorry for so many questions


(Daniel Mitterdorfer) #12

Hi,

The most drastic move would be to set the bulk size to 1. You could also try to run a benchmark against a local Elasticsearch cluster to get more insights. If you prepare a smaller data set of your track, then you could also try it with --test-mode (see docs for details).

To your index: according to the creation_date property, the features-benchmark index has been created on April 8. Does this match with your expectation? Did you maybe change the index.json file afterwards?

No, just providing index.json is fine for recent versions of Rally. mappings.json was only necessary for older versions of Rally.

No worries about the number of questions: The purpose of a forum is to ask questions. :slight_smile:

I suggest you run it with --on-error=abort to ensure that whenever you get an error, Rally aborts the benchmark immediately with a detailed error message. Apart from that I'd also check the logs of the targeted Elasticsearch cluster including the deprecation log to ensure you do not use features of Elasticsearch that have been deprecated.

Daniel


#13

Hey @danielmitterdorfer

I found the problem to be the number of fields in the mapping. Increasing the number of fields fixed the problem. One thing that I noticed that even thought the index was created successfully, mapping was not.
However rally still continued to run index append and other operations. It would be nice to atleast print a warning that mapping exceeds limit.


(Daniel Mitterdorfer) #14

Thanks for your feedback. I think I understand how this can happen:

The create index call (which also provides the mapping to Elasticsearch) fails. However, if you do not set --on-error=abort, Rally will continue by default. Afterwards, the index will be created (implicitly) by bulk-indexing. However, when --on-error=abort is set, Rally will abort the benchmark immediately (i.e. when the create index operation fails).


#15

Thanks!


(system) #16

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.