What would be the lowest-cost, highest-impact change I can make to decrease response times?

(Asked on SO, but thought this place is probably better.)

Unguided beginners in any field often find themselves barking up the wrong tree when trying to solve a problem — I'm asking this question in the hope that it'll steer my approach onto a more direct path towards a solution.


On to the question:

I'm about a month into working with ES and so far it's been awesome. I've been incrementally indexing into ES from a set of data I've got in CSV, and I'm beginning to encounter slow response times. I want to bring the response times down, but I don't know the best way to approach it.

My research thus far tells me that it really depends on a number of variables. So, listed below are details on the ES variables which might help you with writing an answer:

  • Shards & Stuff
    • I say "& Stuff" because I don't know enough to know what's significant here.
    • Running the default ES settings, 5 shards, 1 node.
    • Running index-time-search-as-you-type, exactly as-is from the ES guide. There's a bit in there which PUTs settings for the indices: "number_of_shards": 1. I'm not sure how that affects things.
  • Index
    • 2 indices with similar mappings (mirror a DB, so don't want to combine them)
    • Multi-language, but at the moment I only care about English.
    • As mentioned above, configured for index-time-search-as-you-type (min: 3, max: 20).
  • Documents
    • Have currently indexed ~1mil documents.
    • Have total of ~4mil documents to index.
    • Very short documents: around 5 fields of ~10 English words per doc.
    • Total CSV filesize of all ~4mil rows is only ~400MB.
  • Queries
    • Main query is run as a bool (should) query.
    • Heavy on score scripting.
    • Heavy on script-sorted aggregations.
    • Fuzzy search (fuzziness: 1).
  • Hardware
    • 2GB of RAM on the entire system, on a single node.
  • Response Time
    • Queries with very high frequencies (typically a single English word) in the index are taking forever (~7,000-9,000 ms) to return results.
    • More specific queries (>=2 English words) return more acceptable response times (~2,000-3,000 ms).
    • Ideally, all response times should be <2s.
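For reference, the search-as-you-type settings I'm running are essentially the guide's edge_ngram example, just with my min/max of 3/20 (the index, analyzer, and filter names below are the guide's placeholders, not my real ones):

```json
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
```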

If there are other variables which are important, and I've missed out, let me know and I'll edit them in.

Thank you!

I turned caching on (kind of a stop-gap measure) and it has helped a bunch. Meets my needs for now.
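For the record, the cache I enabled is the shard query cache, roughly like this (index name is illustrative; I believe the setting is `index.cache.query.enable` in the 1.x docs, so double-check for your version):

```json
PUT /my_index/_settings
{
  "index.cache.query.enable": true
}
```

Note that, at least in 1.x, it only helps certain request shapes, so it really is a stop-gap rather than a fix.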

With only 2GB of RAM on the entire system, the answer would be: upgrade the hardware.

I would recommend actually getting comfortable profiling Elasticsearch yourself; it's a great way to learn and to start digging into the code, and it'll point at your problems most directly.

That being said, when I see this:

Queries with very high frequencies (typically a single English word) in the index are taking forever (~7-9000ms) to return results.

I wonder if you use stemming and stopwords? You probably do. But anything you can do to decrease the size of the term dictionary can help here.
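If not, something along these lines (index and analyzer names are just placeholders) adds stopword removal and light stemming, which shrinks the term dictionary considerably:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "light_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "kstem"]
        }
      }
    }
  }
}
```

`stop` and `kstem` are built-in token filters; a smaller term dictionary means fewer and shorter postings lists to walk for common words.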

By

"typically a single English word"

do you mean the query is a single word? A single term query takes 7-9 seconds. I'd definitely be interested to see what else you're doing at query time like scripting, etc. A single term query should be really fast even on modest hardware.

profiling Elasticsearch yourself

@softwaredoug I began digging around my cluster for information, and found out that it's only got 1 shard (well, duh, I set it to 1 shard; see first post). That should actually be the most optimised case (as compared to having more primary shards), considering that I've only got ES running on 1 node; increasing the number of primary shards shouldn't decrease the response times. But I'll probably do it anyway, in order to over-allocate shards.

Also, I tested queries with varying numbers of documents in the index. I found that the response time is proportional to the number of docs in the index (500k docs might take ~3s, and 1M docs might take ~6-7s). I didn't use any tools to measure the response times, but my observations seem to be pretty consistent.

I'm guessing this is happening because of the combination of 2 factors: (1) my scores are scripted (not TF/IDF), and (2) I'm querying using a bool (should, constant_score) query. So what might be happening is that the ES shard matches all the docs which contain that single, common term, resulting in a large set of docs to score; it then evaluates the script for every one of those docs, sorts them, and returns the results.
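To illustrate, my query shape is roughly this (the field names and script here are made up, but the structure is the same): a function_score wrapping a bool of constant_score clauses, so every doc matching the common term gets the script run against it:

```json
GET /my_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            { "constant_score": { "query": { "match": { "title": "word" } } } },
            { "constant_score": { "query": { "match": { "body":  "word" } } } }
          ]
        }
      },
      "script_score": {
        "script": "doc['popularity'].value"
      }
    }
  }
}
```

With a single common term, the should clauses match a huge portion of the index, and the script cost scales with the match count — which would explain the linear relationship with doc count.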

I wonder if you use stemming and stopwords?

No stopwords and no stemming; in fact, I'm querying with "analyzer": "simple". It's (probably) what I wrote about in the previous paragraph that's causing all this bloat.

With only 2GB of RAM on the entire system, the answer would be: upgrade the hardware.

@warkolm Yeah, I just read the ES guide on hardware. I really should be working with at least 8GB of RAM. I wonder if Response Time/RAM/Number of Documents can be looked at as 3 factors in a balanced equation, though? Meaning, if I double the RAM and keep the no. of docs constant, can I reasonably expect response times to be cut in half?

Not necessarily.

That's gunna play a BIG part; scripting is slow.

And in relation to your comment on mine, the thing you also need to factor into cost is your own time. What's the point in spending N hours with a fine tooth comb on this when increasing available resources could get you better results for 20% of your hourly rate?
It's not that simple, I know, but keep it in mind.

Gotcha.

What do you think about this part:

If I upgrade the RAM on the node, I probably can expect that to directly impact response times... Right?

It will, yes.

Nice. Thanks!

Enabling doc_values on fields in the mapping has greatly improved our query response times. Especially if you are aggregating and sorting. Note though that it results in 1.5-2 times the storage requirement.
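To be clear, doc_values is enabled per field in the mapping, e.g. (index, type, and field names here are illustrative; on 1.x, string fields need to be not_analyzed for it):

```json
PUT /my_index/_mapping/my_type
{
  "properties": {
    "price":    { "type": "long",   "doc_values": true },
    "category": { "type": "string", "index": "not_analyzed", "doc_values": true }
  }
}
```

This moves fielddata for sorting and aggregations from the JVM heap onto disk (served via the OS cache), which is why it helps so much on small-heap boxes.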

That seems very high given the small size of doc values?

Sorry, I'd like to clarify: it resulted in 1.5-2 times the storage in my case, where document size is ~1.7 KB and each document has ~40 fields of various types, mostly long values.