Cannot specify multiple documents in a single corpora

I have around 5,000 input JSON documents that I want to ingest using Rally. I have created a track.json based on those input documents, like this:

(Shows the first few lines of the file)


{
  "version": 2,
  "description": "HTTP server log data",
  "indices": [
    {
      "name": "test_index",
      "body": "index.json"
    }
  ],
  "corpora": [
    {
      "name": "destination_index_1",
      "documents": [
        {
          "target-index": "test_index",
          "source-file": "json/107886_17258263_2_1_7983564.json",
          "document-count": 107886,
          "uncompressed-bytes": 17258263
        }
      ]
    },
    {
      "name": "destination_index_2",
      "documents": [
        {
          "target-index": "test_index",
          "source-file": "json/15268676_2086967807_1_1_916120590.json",
          "document-count": 15268676,
          "uncompressed-bytes": 2086967807
        }
      ]
    },

The issue is that Rally fails to start (it hangs before even creating the index) when many input files are specified. For example, specifying 500 input JSON files works, but specifying 1,000 doesn't. Is there a hard limit on the number of input JSON files in track.json that I can change?
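For context, I don't write the track.json by hand; I generate it with a small script along these lines (the helper name and the shortened file list here are just illustrative, one corpus per source file):

```python
import json

# Builds a Rally track dict with one corpus per source file.
# `files` is a list of (source_file, doc_count, uncompressed_bytes) tuples.
def build_track(files, index_name="test_index"):
    return {
        "version": 2,
        "description": "HTTP server log data",
        "indices": [{"name": index_name, "body": "index.json"}],
        "corpora": [
            {
                "name": f"destination_index_{i}",
                "documents": [
                    {
                        "target-index": index_name,
                        "source-file": source_file,
                        "document-count": doc_count,
                        "uncompressed-bytes": size,
                    }
                ],
            }
            for i, (source_file, doc_count, size) in enumerate(files, start=1)
        ],
    }

# Shortened example list; the real one has thousands of entries.
files = [
    ("json/107886_17258263_2_1_7983564.json", 107886, 17258263),
    ("json/15268676_2086967807_1_1_916120590.json", 15268676, 2086967807),
]
track = build_track(files)
print(json.dumps(track, indent=2))  # written to track.json in practice
```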

Hello, I don't believe there's a specific hard limit, but I've never seen Rally used with more than a handful of documents, so I would not be surprised if there is some inefficiency.

To help us understand the issue (and ultimately fix it), it would be very useful to make it hang again and then record a flame graph using a tool like py-spy record (GitHub - benfred/py-spy: Sampling profiler for Python programs). This will tell us exactly what is slowing it down. Is doing that an option for you?
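Something like the following should work while Rally is hanging (the PID lookup and output path here are just examples; check py-spy's documentation for your version, and note that attaching may require sudo):

```shell
# Install py-spy, then sample the hung Rally process and write a flame graph.
pip install py-spy

# Attach to the first esrally process found; sample for 60 seconds.
# May need sudo depending on your platform's ptrace settings.
py-spy record --pid "$(pgrep -f esrally | head -n1)" \
    -o rally-profile.svg --duration 60
```

The resulting rally-profile.svg can then be attached here so we can see where the time is going.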

Hey, it's not slowing down; it's just not able to start the ingestion session. Also, I checked, and it is not the number of files but the total size of the files I am trying to upload with Rally:

  1. With 967 documents, total input size 1023671125454 bytes: runs
  2. With 968 documents, total input size 1023689541344 bytes: does not run

I just checked: the issue is definitely with the number of files specified; we cannot specify more than 967 files in track.json. Even with files of different sizes, this hard limit remains the same.

Can you please clarify "it is not able to start the ingestion session"?

  • Do you get an error? If yes, which error?
  • Can you please share the ~/.rally/logs/rally.log for one run? (Delete the log file first, then start a new run.)

No, I didn't receive any error; the Rally session stops after printing this:
(Just after creating the file offset tables)


    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/


************************************************************************
************** WARNING: A dark dungeon lies ahead of you  **************
************************************************************************

Rally does not have control over the configuration of the benchmarked
Elasticsearch cluster.

Be aware that results may be misleading due to problems with the setup.
Rally is also not able to gather lots of metrics at all (like CPU usage
of the benchmarked cluster) or may even produce misleading metrics (like
the index size).

************************************************************************
****** Use this pipeline only if you are aware of the tradeoffs.  ******
*************************** Watch your step! ***************************
************************************************************************

which normally does not appear.

I checked the logs; they do not show anything different from what a normal Rally session does.

That warning message was removed more than four years ago! What Rally version are you using? How did you install it?

esrally --version should give you 2.11.0
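If it reports something older, upgrading from PyPI is usually enough (this assumes Rally was installed with pip):

```shell
# Check the installed Rally version, then upgrade it from PyPI.
esrally --version
pip install --upgrade esrally
```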