ILM is not optimized for such small conditions, and it will not roll over at an exact document count.
The policy triggers the move to the next phase, but not at the expense of all other workloads; it is intended to be graceful.
When you use realistic conditions such as gigabytes, days, or hours, the overrun will be a very small percentage of the threshold.
With such a tiny test policy, the overrun appears to be a large percentage; it will not be with real production values. You can use small numbers, as you did, just to test that your actions are correct, but you then also need to test with larger, more realistic values. I suspect you will then see the desired behavior.
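To make the difference between a test policy and a realistic one concrete, here is a minimal sketch in Python that builds the two request bodies side by side. The helper `rollover_policy` and all of the values (5 documents, 20gb, 7d) are illustrative assumptions, not your actual policy; `max_docs`, `max_size`, and `max_age` are standard rollover conditions.

```python
import json


def rollover_policy(**conditions):
    """Build an ILM policy body with a single hot-phase rollover action.

    Hypothetical helper: it only assembles the JSON structure; it does not
    talk to a cluster.
    """
    return {"policy": {"phases": {"hot": {"actions": {"rollover": conditions}}}}}


# Tiny values: fine for checking that the actions fire at all (assumed numbers).
test_policy = rollover_policy(max_docs=5)

# More realistic values: use these to judge the actual overrun (assumed numbers).
production_policy = rollover_policy(max_size="20gb", max_age="7d")

print(json.dumps(production_policy, indent=2))
```

Either body could then be sent with PUT _ilm/policy/<your-policy-name>.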
For example, if you set a policy to roll over at 20 GB, it is not going to roll over at 40 GB; it will roll over at 20 GB plus a very small amount.
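A rough way to see why the percentage shrinks: ILM checks its conditions periodically (the `indices.lifecycle.poll_interval` setting, 10 minutes by default), so the index keeps growing between checks. The sketch below is a simplified worst-case model, not how ILM computes anything internally; the function name and the ingest rates (1 document per second, 1 MB per second) are assumptions picked for illustration.

```python
def overrun_percent(threshold, ingest_per_second, poll_interval_seconds=600):
    """Worst-case overrun as a percentage of the rollover threshold.

    Assumes the threshold is crossed right after a check, so a full poll
    interval of ingest lands in the index before the next check.
    threshold and ingest_per_second must use the same unit (docs, MB, ...).
    """
    worst_case_extra = ingest_per_second * poll_interval_seconds
    return 100.0 * worst_case_extra / threshold


# Tiny test policy: roll over at 5 documents, ingesting 1 document per second.
tiny = overrun_percent(threshold=5, ingest_per_second=1)

# Realistic policy: roll over at 20 GB (20 * 1024 MB), ingesting 1 MB per second.
realistic = overrun_percent(threshold=20 * 1024, ingest_per_second=1)

print(f"tiny policy overrun:      {tiny:.1f}%")       # 12000.0%
print(f"realistic policy overrun: {realistic:.1f}%")  # 2.9%
```

The absolute overrun is the same in both cases (ten minutes of ingest); only the threshold it is compared against changes.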
Give it a try.