Query response time not scaling well with large result sets

We have a simple setup with one node, one index, and five shards, containing
about 110 million documents in total. The total index size is about 30 GB. All
documents are indexed by a single field named ES_VALUE and store a value named
ES_SUMMARY; in addition, we have some other ID fields indexed. We index
several fields from our entities, but they are all put into ES_VALUE at
indexing time, i.e. there are far more distinct values for ES_VALUE than there
are documents in the index.

A simple search using only a wildcard query on ES_VALUE performs very
differently depending on how many documents it targets: when the result has a
small total hit count it returns in a few milliseconds, but when the result
has a large total hit count it takes about 10-15 seconds. Size is set to 15,
so only 15 documents are actually returned to the client.

Any suggestions on how to get the queries to perform better with large
result sets?

--

Wildcard queries are generally slow, so do not expect too much. Are you
querying a "cold" index? If you repeat the query, you will likely see faster
turnaround times.
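
For instance, the "took" field (milliseconds) that comes back with every
search response makes this easy to verify; just run the same query twice
(the URL reuses the example host from this thread):

  curl 'http://ourbox:9200/fruit/_search?q=ES_VALUE:201*&size=15'

If the index was cold, the "took" value of the second run should be
noticeably lower.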

As a side note: with the upcoming ES 0.21 and Lucene 4, with the new
AutomatonQuery, there is a good chance that wildcard queries will perform
faster.

Jörg

--

Are you able to share your mappings with us, and maybe an example of some of
the queries you're executing?

--

Just another quick remark on what you observe. A wildcard search works by
traversing the term list, expanding the query with every term that matches
the given pattern, and OR'ing that set of terms together before the query is
run. The complexity of the wildcard search operation is therefore not
directly related to the number of documents in your index or to the number
of hits in your result set; it is related to the total number of terms in
your index that the pattern can expand to.
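
To make this concrete, here is a sketch of what happens before execution
(the expanded terms are invented for illustration). A wildcard query such as

  { "query" : { "wildcard" : { "ES_VALUE" : "201*" } } }

is first expanded against the term list into an OR over every matching term,
roughly equivalent to

  { "query" : { "terms" : { "ES_VALUE" : [ "20120101AA000001", "20120101AA000002", "20121231ZZ999999" ] } } }

except that a short prefix like "201*" may expand to millions of terms, and
building that expansion is where the time goes.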

In Lucene 4, with the finite state automaton queries, it is possible to
"precompile" the terms in your index into an automaton, which reduces the
time required to find and match all the word forms given by a regex to a
minimum.

Best regards,

Jörg

--

Use conflations instead of wildcards. Of course, you'll have to build the
support for that yourself.

The cost of a wildcard match on the so-called term list is negligible, but
very large postings lists are not optimized. Some of the Lucene 4 indexing
scheme addresses this, if it's been done right, i.e. by allowing skipping.

Lastly, OR is used to enhance recall; if you're only pulling 15 documents,
you should be after precision, so change your default operator to AND. This
will also speed things up, probably quite noticeably.
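
As a sketch of that last point (the query text is hypothetical;
default_operator is a standard query_string option), requiring every
component of the key to match instead of any of them:

  {
    "query" : {
      "query_string" : {
        "query" : "ES_VALUE:2012 ES_VALUE:XY123456",
        "default_operator" : "AND"
      }
    },
    "size" : 15
  }

The same switch is available on the URL form as &default_operator=AND.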

Ignore the "auto-warm" statement; it's not relevant to query performance,
it's a caching issue.

--

Thanks for your answer. What does it really mean to change to AND as the
default operator?

--

Thanks for the answers so far. If I understand this correctly, the main
reason for the time-consuming queries is the large number of terms targeted,
not the actual hit count, correct?

I understand ES 0.21 will have some new features thanks to Lucene 4, but what
can be done now? More nodes, a different number of shards, splitting up the
index, updating the mappings?

Here are some queries and their execution times, and our mappings:

http://ourbox:9200/fruit/_search?q=ES_VALUE:201* - 14s
http://ourbox:9200/fruit/_search?q=ES_VALUE:201206* - 600ms
http://ourbox:9200/fruit/_search?q=ES_VALUE:20120625XY123456* - 100ms
{
  "fruit" : {
    "Banana" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Apple" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Pineapple" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Melon" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Grapefruit" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Eggplant" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Kiwi" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Orange" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Lemon" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    },
    "Lime" : {
      "_all" : { "enabled" : false },
      "_source" : { "excludes" : [ "ES_VALUE" ] },
      "properties" : {
        "ES_DISTRIBUTOR" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_OWNER" : { "type" : "string", "index" : "not_analyzed", "omit_norms" : true, "omit_term_freq_and_positions" : true },
        "ES_SUMMARY" : { "type" : "string", "index" : "no", "store" : "yes" },
        "ES_VALUE" : { "type" : "string", "analyzer" : "fruitSimple" }
      }
    }
  }
}

--

http://ourbox:9200/fruit/_search?q=ES_VALUE:201* - 14s
http://ourbox:9200/fruit/_search?q=ES_VALUE:201206* - 600ms
http://ourbox:9200/fruit/_search?q=ES_VALUE:20120625XY123456* - 100ms
I would suggest reconsidering how you query by the key "ES_VALUE". Right now
you use "ES_VALUE" as a compound key with the most expensive method
available, i.e. a wildcard pattern.

Estimate the total number of distinct "ES_VALUE" values in your index (the
cardinality of the field). I predict it is very high. The higher the
cardinality, the worse the performance of a wildcard search. Note that ES is
not a database; it has no B-tree index for fast wildcard prefix key lookups.

One option is a decomposed key. To me, it looks like the key can be
segmented into components such as year, month, day, and a numbering scheme.
With a separate field for each component, you can combine them into a very
fast boolean search, even over integer values (something like 2012 AND 06
AND 25 AND "XY123456" ...). A (cacheable) filter query over integer values
of low cardinality is very fast in ES.
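
A minimal sketch of that idea (the field names and the exact decomposition
are my assumptions, based on keys like "20120625XY123456"): the mapping would
gain a few low-cardinality fields,

  "ES_YEAR"  : { "type" : "integer" },
  "ES_MONTH" : { "type" : "integer" },
  "ES_DAY"   : { "type" : "integer" },
  "ES_ID"    : { "type" : "string", "index" : "not_analyzed" }

and the wildcard query would become a filtered query whose filters can be
cached:

  {
    "query" : {
      "filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {
          "and" : [
            { "term" : { "ES_YEAR" : 2012 } },
            { "term" : { "ES_MONTH" : 6 } },
            { "term" : { "ES_DAY" : 25 } },
            { "prefix" : { "ES_ID" : "XY123456" } }
          ]
        }
      }
    },
    "size" : 15
  }

Each term filter matches exactly one term, so there is no expansion step at
all, and the resulting bitsets can be reused across queries.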

Jörg

--