Unless you have a few hundred paths/sub-paths (also accounting for files in sub-directories), I don't think scanning for files should affect throughput much.
One problem with tuning efforts like this is the inconsistency in read throughput, even when testing with console output. Anyway, you might consider increasing the spool size significantly, so you can buffer more events during read-throughput peaks. On the output side, set `loadbalance: true` and increase the number of workers (e.g. `worker: 3`).

A batch of events in the spooler is split into N = spool size / bulk_max_size sub-batches, which are then load-balanced onto the configured workers. Ideally, N should be a multiple of the number of workers. Only after all sub-batches have been processed will the next set of batches be forwarded to the outputs (lock-step load balancing). Filebeat 6.0 has better support for async publishing and load balancing.
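As a rough sketch, a Filebeat 5.x config along these lines would buffer more events and keep the sub-batch count aligned with the worker count (the hosts and all numeric values below are illustrative, not recommendations; tune them for your environment):

```yaml
# filebeat.yml (Filebeat 5.x style) -- illustrative values only
filebeat.spool_size: 6144        # larger spool to absorb read-throughput peaks

output.logstash:
  # example hosts; replace with your own
  hosts: ["logstash1:5044", "logstash2:5044"]
  loadbalance: true
  worker: 3                      # workers per configured host
  bulk_max_size: 2048            # N = 6144 / 2048 = 3 sub-batches,
                                 # a multiple of the 3 workers
```

With `loadbalance: false` (the default), events would instead go to only one host at a time, so both settings together are needed to actually fan out the sub-batches.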