How fast can we achieve in theory using "match_phrase" query on 30M document index


(Luke Yang) #1

We are doing a project to use ElasticSearch as searching engine for the large amount of text content. The goal is to offer Full text search capacity to retrieve first 10 document based on phrases user provides within 30M documents with index data size around 50GB. Is it possible in theory we can achieve 100ms per query by using whatever the features ElasticSearch can offer, such as pre-caching to allow almost all time in-memory data access, clusters so query can be split into multiple servers with each server searching on < 10M documents?

Initial testing we did on 4GB trial cloud server gave us 3s per query, which is way too slow.

Anyone has any use case like this ? The best performance you have achieved?

Thanks in advance.

Luke


(David Pilato) #2

I think you need to find what is the typical limit of a single shard before it reaches that limit.
Then test by multiplying by 2 the number of shards and the number of documents and check.

Then increase again the number of shards and test again.

You should be able to determine the right number of shards for your use case.

Note that you need to run some typical queries multiple times for this.

ESRally might help you to automate hopefully the tests.

BTW check all the time what is the took time in elasticsearch response. May be you are measuring something else (like the network) which makes you think that elasticsearch is slow?


(Luke Yang) #3

Thanks very much for the heads up. As I can understand correctly here is that the performance is basically determined by the shard (number of documents it holds determines how fast query can be executed ). We will carry out the test as you suggested here to see how we can achieve our goals.

BTW, the 4GB trial cloud server testing is set as 5 shards, "took" showed ~3s on the 30M documents index we created, guess that is something normal counting the available RAM and number of documents per shard?

Thanks,

Luke


(David Pilato) #4

BTW, the 4GB trial cloud server testing is set as 5 shards

Just to be clear: that's not "cloud" settings. That's just the default settings for elasticsearch indices if you don't set any.

Also how many data centers are you using for your trial? 1, 2 or 3?

"took" showed ~3s on the 30M documents index we created, guess that is something normal counting the available RAM and number of documents per shard?

It depends... What does a typical document looks like? What is the mapping? What is the search you are running?

Could you share all that?


(Luke Yang) #5

Yes. The default index shard setting was 5. The trial used 2 data centers. The document is simple, mapping looks like:

"field_1": {"type": "text"},
"field_2": {"type": "keyword" },
... additional fields not indexed

The content in "field_1" is quite long, normally 100 generic English words at least. "field_2" : is quite short, converted from integer, single numeric word.

Query looks like:

{ "query " : { "bool" : { "must" : [
{"match_phrase": { "field_1": "xxxxxxx" } },
{ "term": { "field_2": "142254" }} ]
}}}

Luke


(David Pilato) #6

My naive guess is that it should not take that long.

Is the query you shared absolutely complete? No size involved?

BTW a wild guess is that field2 is not use to compute a score but more used as a filter. Am I right?

In which case it will be much better to add it to the bool filter clause.


(Luke Yang) #7

The size is set as 100 and "terminate_after" was set as 5000. "field_2" is used as a filter.


(David Pilato) #8

That might explain.

"field_2" is used as a filter.

Not in the example you shared.

So it'd better if you share exactly what you are doing including sample documents. Otherwise it's hard to diagnose.


(Luke Yang) #9

I have generated multiple testing cases to evaluate the actual "match_phrase" query on 30M document (nested fields, approximately 100M documents generated in the index), the fastest performance I have achieved so far with single 128GB server is: with Java heap size set at 64GB, indices data file pre-loaded, data stored into 20 shards, it is around 20 ms - 30ms average per query. (query phrases are randomly generated, so no extreme case like hitting > thousands; just pure single "match_phrase" query)

Is that normal? Any faster I can do? (Since I have all data in-memory with above settings, guess that is the practical limit on FTS query? )


(ddorian43) #10

You can try these::::

  1. get a cpu with more GHz
  2. number of shards == number of cores/threads (all cores work in parallel for each query, minimum overhead)
  3. Heap=30GB (less overhead)
  4. sort the index by "field_2", will help you not look into certain segments at all (slower writing)

Also, you have to look at "how many documents where matched" ! If there were many, even though data in ram, it has to score them (which takes time). So if they have a lot, you have to find a way to fast cancel.
Can you sort by another field ?


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.