It's likely that they came from different shards. Term statistics such as document frequencies (and therefore IDF) are computed on a shard-local basis, which lets the search run without coordination: the shards don't have to talk to each other and can execute in parallel.
Generally, this works fine because there is "enough data" to smooth out the discrepancies in TF/IDF, so scores end up being similar across shards. But with few documents, or with documents that aren't randomly distributed (e.g. when using custom routing), you can run into more severe differences.
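To make the shard-local effect concrete, here is a minimal sketch with made-up numbers. It assumes the BM25-style IDF formula (the default similarity in recent Elasticsearch versions); the shard names and document counts are hypothetical and only there to show how the same term can get very different IDFs on two small shards.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shards: (documents on the shard, documents containing the term)
shards = {"shard_0": (5, 4), "shard_1": (5, 1)}

# Each shard scores with its own local statistics.
local_idf = {name: bm25_idf(n, df) for name, (n, df) in shards.items()}

# Statistics summed over all shards give one shared, "global" IDF.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = bm25_idf(total_docs, total_df)

for name, idf in local_idf.items():
    print(f"{name}: local idf = {idf:.3f}")
print(f"global idf = {global_idf:.3f}")
```

With only five documents per shard, the term looks common on one shard and rare on the other, so otherwise identical documents would score differently depending on where they live. With thousands of randomly distributed documents the local values would converge.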
If you use the DFS search type, `dfs_query_then_fetch` (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-search-type.html#dfs-query-then-fetch), the search runs an extra pre-query phase that collects term and document frequencies from all the shards, compiles a "global" set of statistics, and uses that on each shard for scoring.
Scoring will be more "accurate", at the cost of an extra round trip and more work per search. Generally it's not needed, though.
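If you want to try it, the search type is passed as the `search_type` URL parameter on the `_search` endpoint. Below is a small sketch that only builds the request URL and body, without sending anything; the host, the index name `my_index`, the field `title`, and the helper `build_search_request` are placeholders I made up for illustration.

```python
import json
from urllib.parse import urlencode

def build_search_request(host, index, query, dfs=False):
    """Return (url, body) for a _search call, optionally using the DFS search type."""
    url = f"{host}/{index}/_search"
    if dfs:
        # Ask Elasticsearch to gather global term statistics before scoring.
        url += "?" + urlencode({"search_type": "dfs_query_then_fetch"})
    body = json.dumps({"query": query})
    return url, body

# Hypothetical host, index, and field names.
url, body = build_search_request(
    "http://localhost:9200",
    "my_index",
    {"match": {"title": "elasticsearch"}},
    dfs=True,
)
print(url)
print(body)
```

Running the same query with and without `dfs=True` and comparing the `_score` values is a quick way to check whether shard-local statistics explain the difference you're seeing.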