I was assuming we could use Elasticsearch as a sort of document store for high-performance reads by id, in addition to using it as the amazing search engine it is.
However, I am finding that fetching a single random document by id from our cluster tends to take about 50ms. Our documents are fairly complex, with a number of nested documents. The 50ms figure comes from profiling the async call to Elasticsearch from within the C# NEST client. For comparison, we typically search our entire index in that same amount of time, and our smallest single-field searches usually take about 2-3ms, measured by profiling the same method.
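For reference, the two requests behind those measurements look roughly like this (the index, type, field, and id below are placeholders, not our real names):

```
# Single-document lookup timed through NEST (~50ms observed)
GET /myindex/mydoc/123

# Small single-field search for comparison (~2-3ms observed)
GET /myindex/mydoc/_search
{
  "query": {
    "term": { "someField": "somevalue" }
  }
}
```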
Is this expected behavior, or should I expect results comparable to Redis when accessing by id (~2ms)? We are currently running Elasticsearch 5.5.
Don't expect it to compare with Redis (Redis should be even faster than 2ms).
But a GET is a simple term query under the hood, so it should be faster than that (somewhere in the single-digit ms range). Maybe your data is too big for in-memory caching? What about repeated calls to the same _id?
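To see roughly what that lookup amounts to, you can run it as an explicit search yourself; in 5.x the indexed _uid field is "type#id" (index, type, and id below are placeholders):

```
GET /myindex/mydoc/_search
{
  "query": {
    "term": { "_uid": "mydoc#123" }
  }
}
```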
@ddorian43 Repeated calls do come back faster, but we evidently have enough churn in production for our average to be that high. We have two 3GB shards and about 16GB of system memory with 6GB JVM heaps. There are 350K top-level documents, or about 15M documents in total once nested documents are counted. It makes sense that if it has to read from disk, even on very fast SSDs, it's going to experience some latency. I just wasn't expecting that much.
I am very interested to learn that a GET to /index/type/id is actually a term query under the hood.
You are right that Redis can be sub-millisecond; I was counting data marshaling in that figure as well.
@ddorian43 Since we have millions of other queries running, wouldn't those also be fighting for the cache? If GETting a document is simply a term query, then I'd think this is just a matter of cache churn on lesser-used IDs.
I've profiled queries, but not the GET API (profiling only seems to cover search requests), so I switched to a basic ids query; the profile shows about 99% of the time spent in next_doc.
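For reference, the profiled query looked roughly like this (placeholder index, type, and id), issued against the _search endpoint rather than the document endpoint:

```
GET /myindex/mydoc/_search
{
  "profile": true,
  "query": {
    "ids": { "values": ["123"] }
  }
}
```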
Also, today I learned that issuing a GET to /index/type/id with {"profile":true} as the request body actually overwrites the document at that ID with that payload.
Elasticsearch does not keep all data in memory, so it will read from disk, although those reads may naturally be served from the file system cache. The document source is stored compressed on disk, so there is also the additional overhead of decompressing the document before it can be returned.
I guess I could try disabling compression as well. Is there anything I could do to keep these "under the hood id queries" in memory, or even eagerly load the documents into memory, assuming memory were not an issue? I'd like to avoid duplicating my documents in Redis.
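The only compression-related knob I've found so far is the index.codec setting, which as I understand it only switches between the default LZ4 and best_compression (DEFLATE) rather than turning compression off entirely. A sketch of what I'd try (placeholder index name; it's a static setting, so the index apparently has to be closed first, and existing segments only pick up the change once they are rewritten by merges):

```
POST /myindex/_close

PUT /myindex/_settings
{
  "index": { "codec": "default" }
}

POST /myindex/_open
```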