Performance impact for searching high frequency word

Devon_Yoo · May 1, 2019, 7:02pm

Hi, I have a fundamental question about the basics of ES. I can find plenty of material on how to use it and what the queries look like, but I can hardly find any documentation about the backend structures.

Let's say I have indexed 10 million email bodies, each associated with an email_id.
Every word in the bodies will be split out and indexed into ES in a certain way.

Of those, 9 million emails contain the word "ABC" one or more times,
and only 100 emails contain the word "DEF" one or more times.

I want to run a query that returns the top 10 emails with the most occurrences of the input word, in descending order.

In that case, will the processing time differ between searching for "ABC" and searching for "DEF"? If it differs, what should I do to get better performance? If it is the same, why is that?

Thanks,

polyfractal · May 6, 2019, 5:46pm

Generally, yes. The more "exclusive" (selective) a query is, the faster it runs. It also depends a bit on whether you are scoring or just filtering.

Elasticsearch uses an inverted index. This means a term dictionary is created, which maps each individual token to the documents that contain it. So our term dictionary here is:

Term  | Documents
------+----------------------------------------
ABC   | 2,3,4,5,6,7,8,9,10,11,12,13,14,15,...
DEF   | 1,128202

So when we search for "DEF", we go to the "DEF" part of the term dictionary and extract the list of matching documents: 1 and 128202. We then score those documents with the query.

For "ABC", we do the same thing, but the list is about 9 million entries long in your example, which means we have many more documents to score.
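
To make that concrete, here is a minimal Python sketch of a toy inverted index. It is only an illustration, not how Elasticsearch or Lucene is actually implemented: the names `build_index` and `top_k` are made up for this example, tokenization is a plain whitespace split, and ranking uses the raw occurrence count (Elasticsearch scores with BM25 by default, not a raw count).

```python
from collections import Counter, defaultdict
import heapq

def build_index(docs):
    """Map each token to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, body in docs.items():
        for token, count in Counter(body.lower().split()).items():
            index[token][doc_id] = count
    return index

def top_k(index, term, k=10):
    """Return up to k (doc_id, count) pairs, highest count first."""
    postings = index.get(term.lower(), {})
    # Every posting is visited once, so the cost grows with the list length.
    return heapq.nlargest(k, postings.items(), key=lambda item: item[1])

# Tiny hypothetical corpus: "abc" is common, "def" is rare.
docs = {
    1: "def abc def",
    2: "abc abc abc",
    3: "abc",
    4: "abc abc",
}
index = build_index(docs)
print(top_k(index, "DEF"))  # only one posting to look at
print(top_k(index, "ABC"))  # four postings to look at
```

The lookup of the term itself is cheap either way. What differs is how many postings have to be walked and scored afterwards.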

It also depends on how exclusive the entire query is, not just the individual terms. Elasticsearch will try to execute the least expensive portions of the query first (e.g. the parts that exclude the largest number of documents). So even if you are searching for "ABC", if another part of the query is exclusive (like a time filter), it will cut the list of matching docs way down.
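
As a rough illustration of why leading with the most selective part helps, here is a toy Python sketch that intersects sets of document IDs starting from the smallest one. The function name `intersect_cheapest_first` and the sample data are hypothetical, and the real engine is considerably more sophisticated than plain set intersection.

```python
def intersect_cheapest_first(postings_lists):
    """Intersect sets of doc IDs, starting from the smallest set."""
    ordered = sorted(postings_lists, key=len)
    if not ordered:
        return set()
    candidates = set(ordered[0])
    for postings in ordered[1:]:
        # Only the surviving candidates are checked against the larger set.
        candidates = {doc_id for doc_id in candidates if doc_id in postings}
        if not candidates:
            break  # nothing can match any more, stop early
    return candidates

abc_docs = set(range(2, 1000))   # stands in for the very common term
recent_docs = {5, 42, 128202}    # stands in for a narrow time filter
matches = intersect_cheapest_first([abc_docs, recent_docs])
print(sorted(matches))  # [5, 42]
```

Here only three candidates are ever checked against the large "ABC" set, instead of walking the whole set.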

But at the end of the day, more docs matched == more docs that have to be loaded and scored and processed, so queries that match many documents will be slower than queries that only match a handful.
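
If it helps, here is what a query combining a scored term with a narrowing filter could look like, built as a Python dict. The field names `body` and `sent_at` and the helper `build_query` are assumptions for this example, since the actual mapping was not posted. Clauses under `filter` restrict the matching documents without contributing to the score.

```python
import json

def build_query(word, since, size=10):
    """Build a bool query: scored match on the word, unscored date filter."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored clause ("body" is a hypothetical field name).
                "must": [{"match": {"body": word}}],
                # Unscored clause ("sent_at" is a hypothetical date field).
                "filter": [{"range": {"sent_at": {"gte": since}}}],
            }
        },
    }

query = build_query("ABC", "now-7d")
print(json.dumps(query, indent=2))
```

The printed JSON is the request body you would send to the `_search` endpoint of your index.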

Hope that helps! You can read a bit more about the inverted index stuff here: https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
