Hey all, first post here. This is really a tip, not a question, that I hope saves someone else a little time. I read through the rules and didn't see anything about posting tips, so I hope I am not doing anything wrong. Here I go...
First, the search terms I tried while troubleshooting this: "elasticsearch scroll returns duplicate results" and "elasticsearch search returning too many documents". Neither pointed me to what turned out to be my problem. When all was said and done, my issue (in short) was that I had restored an index under a new name (so I could look at the old data without messing up my live data). After I did that, my live data (which I access via an alias) had too many records. This was because the restored index, even though it was renamed, still carried the same alias as the live index, so a search against that alias combined the results from the two indices.
Long version, in case it helps someone:
We had a users_v4 index, which had about 28k records.
We have an alias called "users" that points to the users_v4 index, so our code can always refer to "users". If we have to migrate or re-index under a new name (like users_v5), we just change the alias to point to the new version. This is how we do all of our index naming, and it is the recommended approach to avoid having to update code when you re-index.
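For anyone who hasn't done an alias swap before, here is a minimal sketch of the request body you would send to POST /_aliases. The helper name is made up for illustration, and users_v5 is just the hypothetical next version mentioned above.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build a body for POST /_aliases that moves `alias` from old_index to new_index.

    Both actions go in one request, so the alias should never be left
    pointing at nothing in between.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical migration from users_v4 to users_v5.
swap_body = build_alias_swap("users", "users_v4", "users_v5")
print(json.dumps(swap_body, indent=2))
```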
We had taken a snapshot of it a while back with something like:
POST http://localhost:9200/_snapshot/my_snapshots/s__user_backup
I needed to look through the backup data to check something without overwriting the live data. To do this, I had to restore the data from the backup. However, I didn't want to lose the changes between then and now so I restored to an index with a different name:
POST http://localhost:9200/_snapshot/my_snapshots/s__user_backup/_restore
{
"indices": "users_v4",
"rename_pattern": "users_v(.+)",
"rename_replacement": "restored_users_before_v$1"
}
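For reference, the restore API also accepts an `include_aliases` option, which defaults to true. Setting it to false should stop the snapshot's aliases from coming along with the restored index. Below is a minimal sketch of building the same request body with that option added. The helper name is made up, and you should check the option against the docs for your ES version.

```python
import json


def build_restore_body(indices, rename_pattern, rename_replacement, include_aliases=False):
    """Build a body for POST /_snapshot/<repo>/<snapshot>/_restore.

    include_aliases=False asks Elasticsearch not to restore the aliases
    that were attached to the index when the snapshot was taken.
    """
    return {
        "indices": indices,
        "rename_pattern": rename_pattern,
        "rename_replacement": rename_replacement,
        "include_aliases": include_aliases,
    }


restore_body = build_restore_body(
    "users_v4", "users_v(.+)", "restored_users_before_v$1"
)
print(json.dumps(restore_body, indent=2))
```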
All seemed fine and good, and I was able to compare the data in the two and run searches via HTTP to figure out what I needed. One point to add is that when I ran the queries, I was using "users_v4" and not the "users" alias to search the data.
After that, I ran our code which warms a cache of users and I noticed that warming the user cache was taking longer than normal. So, I stepped into it and saw that there were a lot more than 28k records. I double-checked the users_v4 index and it did have 28k records. I then threw in a little code to determine if there were duplicates, and there were. After various attempts to clear connections, clear the ES cache, check health, etc., I tried running it against the restored index and it had the correct number of records. I did some comparison of the number of results from "users" to some values in the index stats:
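A minimal sketch of that kind of duplicate check, assuming each search hit carries the usual `_id` and `_index` fields. The function name and the sample hits are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_ids(hits):
    """Map every _id that appears more than once to the indices it came from."""
    seen = defaultdict(list)
    for hit in hits:
        seen[hit["_id"]].append(hit["_index"])
    return {doc_id: indices for doc_id, indices in seen.items() if len(indices) > 1}


# Made-up hits, shaped like what a search against the alias might return.
sample_hits = [
    {"_index": "users_v4", "_id": "1"},
    {"_index": "users_v4", "_id": "2"},
    {"_index": "restored_users_before_v4", "_id": "1"},
]

duplicates = find_duplicate_ids(sample_hits)
print(duplicates)
```

Grouping by `_index` as well as `_id` is what gives the game away: the same ID coming back from two different indices means the name you are searching spans more than one index.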
`GET http://localhost:9200/users/_stats`
and found that the number of results matched the _all -> primaries -> docs -> count value, so for the first time I could see a correlation to a number in ES via HTTP. After that, I continued looking through the stats and found more than one index listed under the results ("users_v4", "restored_users_before_v4", and another test restore that I had done). This is strange, I thought, because the "users" alias, whose stats I was looking at, should only list the "users_v4" index, but it listed the others...
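If you want to do the same comparison in code, here is a minimal sketch that pulls the per-index and combined primary doc counts out of a `_stats` response. The sample response is heavily trimmed and the counts in it are made up.

```python
def doc_counts_by_index(stats):
    """Return ({index name: primary doc count}, combined primary doc count)."""
    per_index = {
        name: data["primaries"]["docs"]["count"]
        for name, data in stats["indices"].items()
    }
    total = stats["_all"]["primaries"]["docs"]["count"]
    return per_index, total


# Trimmed, made-up example of a /users/_stats response.
sample_stats = {
    "_all": {"primaries": {"docs": {"count": 56000}}},
    "indices": {
        "users_v4": {"primaries": {"docs": {"count": 28000}}},
        "restored_users_before_v4": {"primaries": {"docs": {"count": 28000}}},
    },
}

per_index, total = doc_counts_by_index(sample_stats)
for name, count in sorted(per_index.items()):
    print(name, count)
print("total", total)
```

More than one key under "indices" for what you thought was a single-index alias is the thing to look for.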
Then I realized what had happened. I confirmed it with a quick call to list all indices pointing to the "users" alias:
`GET http://localhost:9200/users/_alias`
I also confirmed by looking at ALL aliases:
GET http://localhost:9200/_alias
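A minimal sketch of checking that response in code, assuming the usual shape where each index name maps to an "aliases" object. The sample response and the "orders_v1" index are made up.

```python
def indices_for_alias(alias_response, alias):
    """Return the sorted names of all indices that carry the given alias."""
    return sorted(
        index
        for index, info in alias_response.items()
        if alias in info.get("aliases", {})
    )


# Made-up example of a GET /_alias response.
sample_aliases = {
    "users_v4": {"aliases": {"users": {}}},
    "restored_users_before_v4": {"aliases": {"users": {}}},
    "orders_v1": {"aliases": {}},
}

behind_users = indices_for_alias(sample_aliases, "users")
print(behind_users)
```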
Yep, sure enough, the restored indices still had their "users" alias intact. I had not expected this, as I didn't realize that restoring to a new name would keep the alias. I would argue that it shouldn't, since the whole point of restoring to a new name is to put the restored data somewhere that won't overwrite or interfere with existing data that is in use. At the least, the documentation should probably call it out to avoid confusion.
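If you end up in the same spot, one way out is to remove the alias from every index except the live one with a single POST /_aliases request. A minimal sketch of building that body (the helper name is made up, and it only builds the JSON, it doesn't send anything):

```python
import json


def build_alias_removals(alias, indices, keep):
    """Build a body for POST /_aliases that drops `alias` from every index except `keep`."""
    return {
        "actions": [
            {"remove": {"index": index, "alias": alias}}
            for index in indices
            if index != keep
        ]
    }


cleanup_body = build_alias_removals(
    "users",
    ["users_v4", "restored_users_before_v4"],
    keep="users_v4",
)
print(json.dumps(cleanup_body, indent=2))
```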