To get everything in the correct order you have to process it all in a single thread, which, as you can see, dramatically limits throughput. The first question to ask is why you need to maintain the order at all. If you cannot relax that requirement, does the ordering have to apply to all of the data? It may be possible to partition the data on some criterion in the data itself, so that order only has to be preserved among items in the same partition, and send each partition through its own pipeline.
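As a rough illustration of the partitioning idea, here is a minimal Python sketch. It assumes each item is a `(key, payload)` tuple and that order only matters among items with the same key. The names `NUM_PIPELINES`, `partition_for`, `run_pipelines` and `handle` are made up for this example and do not come from your code. Each pipeline is one queue drained by one thread, so items that share a key are still handled in arrival order, while different keys can be processed concurrently.

```python
import queue
import threading
import zlib

NUM_PIPELINES = 4  # assumption: tune to your workload


def partition_for(key: str, n: int = NUM_PIPELINES) -> int:
    """Map a key to a pipeline index; the same key always gets the same index."""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % n


def run_pipelines(items, handle, n: int = NUM_PIPELINES) -> None:
    """Route (key, payload) items to n single-threaded pipelines.

    `handle` is called from several threads, so it must be thread-safe.
    """
    queues = [queue.Queue() for _ in range(n)]

    def worker(q: queue.Queue) -> None:
        while True:
            item = q.get()
            if item is None:  # sentinel: no more work for this pipeline
                break
            handle(item)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()

    # A single producer enqueues in arrival order, so per-key order is kept.
    for key, payload in items:
        queues[partition_for(key, n)].put((key, payload))

    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
```

You would call it as `run_pipelines(stream, handle)`, where `stream` yields the tuples in arrival order. Note that ordering across different keys is no longer guaranteed, which is exactly the relaxation being traded for throughput. In CPython, threads mostly help when `handle` is I/O-bound; the same layout should carry over to processes or to another runtime's thread pool.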