A little background: what follows is probably not what Elasticsearch is designed for, but it is the most efficient way I can think of to achieve what I want.
Setup and Goal: I have an index (ES 1.7) with 800k documents, under 500 MB in size (each document is pretty small). I have more than 1 million queries that I need to run against this index, returning only the top 25 documents each time. I want to finish all of them in under 1 hour.
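To put the goal in numbers, here is a quick back-of-the-envelope calculation. It assumes exactly 1,000,000 queries (the lower bound of "more than 1 million") and a full hour of wall-clock time, so the real requirement is somewhat higher than the roughly 278 queries per second it prints.

```python
# Rough throughput needed to meet the goal stated above.
TOTAL_QUERIES = 1000000      # lower bound: "more than 1 million" queries
TIME_BUDGET_S = 60 * 60      # 1 hour, in seconds

required_qps = TOTAL_QUERIES / float(TIME_BUDGET_S)
print("need at least %.0f queries/second sustained" % required_qps)
```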
Current Failed Attempt (I have tried many other ways, but all of them failed and ended up with this same problem):
- I am using the Python requests package;
- The queries are split into 60 partitions, each with around 20,000 queries;
- I created 15 identical indexes (2 shards, 0 replicas each);
- I loop through the partitions doing the following:
4a. I am using the msearch endpoint;
4b. I am using multiprocessing with 15 processes (each process uses one of the 15 identical indexes);
4c. In each process, I send 10 queries at a time through msearch.
- One partition (at least each of the first 3) actually takes only around 20 seconds to finish.
- When I get to the 4th partition, the problem arises: I get empty responses (response.text is empty).
- I waited a little before rerunning the 4th partition, and it finished successfully.
- The 5th partition failed again.
- I waited for a longer time. I could then run the 5th, 6th and 7th, but the 8th failed.
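For reference, the batching in steps 4a to 4c can be sketched roughly as follows. This is a minimal sketch, not the exact code: the helper names (chunks, build_msearch_body, run_batch), the retry count and the pause length are all made up for illustration, and the retry loop only mirrors the observation that waiting seems to help; it is not a fix. The HTTP call is passed in as the post argument so that the sketch does not depend on any particular host.

```python
import json
import time


def chunks(items, n):
    """Yield successive lists of at most n items (10 queries per request in 4c)."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


def build_msearch_body(queries, size=25):
    """Build the newline-delimited body that the msearch endpoint expects.

    Each search is two lines: a header line and a body line.
    """
    lines = []
    for query in queries:
        lines.append(json.dumps({}))  # empty header: index is taken from the URL
        lines.append(json.dumps({"query": query, "size": size}))
    return "\n".join(lines) + "\n"  # the body has to end with a newline


def run_batch(post, url, queries, retries=3, pause=5.0):
    """Send one msearch request; pause and retry if the response is empty.

    post is any callable with the shape of requests.post(url, data=...).
    retries and pause are hypothetical values, not tuned settings.
    """
    body = build_msearch_body(queries)
    for attempt in range(retries):
        response = post(url, data=body)
        if response.text:  # an empty response.text is the failure described above
            return response.json()["responses"]
        time.sleep(pause * (attempt + 1))  # wait a bit longer after each failure
    raise RuntimeError("empty response after %d attempts" % retries)
```

With the requests package, each process would then do something like `for batch in chunks(partition, 10): run_batch(requests.post, "http://localhost:9200/my_index_01/_msearch", batch)`, where the host and index name are placeholders.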
My suspicion is that there might be some caching going on. When I run many queries in a short time period, the cache fills up; when I stop querying, more and more of the cache is released the longer the quiet time lasts.
I am not an engineer, just a user, so I don't really know what is going on under the hood. Can someone point out what the problem is? Is there anything I can do to stop that "caching"?
Thanks in advance,
Wei