Igor,
Ash and I are working on this problem together and we wanted to thank you
for your help. Anecdotal test results on 96 million docs across 5 shards look
promising: we are seeing sub-second response times (even with caching
turned off). We are now running this in our load testing
environment to get a better picture of the performance.
We have a follow-up question for you. We noticed that you used
nested objects in your example. Is there a reason why we could not simply
use top-level document properties?
On Monday, March 25, 2013 1:47:22 PM UTC-4, hanaf...@gmail.com wrote:
The general questions that we are trying to answer:
- What is the best (fastest) way to filter documents based on user
controls and documents' fields?
- What is the best (fastest) way to get all the terms in an index for a
field?
Our specific details follow:
We are developing an application that will use elasticsearch to index
200M+ documents spread across 16 nodes. When a user searches, we need to
filter out documents based upon the intersection of a user's tokens
(gathered by our application) and the document's tokens (indexed values).
We have evaluated a number of ways to do this and we are looking for
feedback from the elasticsearch community on our approaches and any other
methods that can be tried.
The performance tests quoted below were run on a stack with 5 nodes, 5
shards, 1 replica, and 150M documents of about 15K each, using
elasticsearch version 0.20.5 (we briefly tested 0.90.0RC1 and found it
slower than 0.20.5).
- Native Script filter - We are submitting the user's tokens via params
to a Native Script that compares them with documents' controls:
public class CustomScript extends AbstractSearchScript {
    ...
    @Override
    public Object run() {
        // Profiled elasticsearch during a single query
        // 60% of CPU time during CustomScript.run
        Set<String> docControls = Sets.newHashSet(
            ((StringDocFieldData) doc().field(DOCUMENT_CONTROL_FIELD_NAME)).getValues());
        // 30% of CPU time during CustomScript.run
        return shouldBeAllowedToSeeDocument(this.userControls, docControls);
    }
    ...
}
As you can see, most of the time is spent pulling the fields from the
documents. If the query does not hit many documents, the filter is quick
enough (< 1 second). But if the query hits millions of documents, the
filter gets much slower (~80s).
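For readers following along, the per-document check could look roughly like the sketch below. This is not the actual shouldBeAllowedToSeeDocument code, which is elided above. The class name ControlCheck, the idea that controls fall into groups (controlA*, controlB*, and so on), and the rule itself are all assumptions, inferred from the boolean filter in the second approach: a document is allowed unless it carries controls from some group and shares none of them with the user.

```java
import java.util.Collection;
import java.util.Set;

/** Hypothetical stand-in for the elided access check; not the original code. */
final class ControlCheck {

    /**
     * @param controlGroups assumed grouping of controls, e.g. all "controlA*" together
     * @param userControls  controls the user holds
     * @param docControls   controls indexed on the document
     * @return false if the document uses some group without sharing a control with the user
     */
    static boolean isAllowed(Collection<Set<String>> controlGroups,
                             Set<String> userControls,
                             Set<String> docControls) {
        for (Set<String> group : controlGroups) {
            boolean docUsesGroup = false;
            boolean sharesControl = false;
            for (String control : docControls) {
                if (group.contains(control)) {
                    docUsesGroup = true;
                    if (userControls.contains(control)) {
                        sharesControl = true;
                        break;
                    }
                }
            }
            // Deny as soon as one group is used by the document but not matched by the user.
            if (docUsesGroup && !sharesControl) {
                return false;
            }
        }
        return true;
    }
}
```

The check itself is only a few set lookups per document, which is consistent with the profile above: the expensive part is loading the field values, not comparing them.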
- Nested Boolean filter - Because pulling document fields was the
bottleneck, we sought a way to avoid that step. We construct a (complex)
filter query made up of nested boolean filters that enumerates all of the
tokens from the index that the user does not have and combines them with
the tokens that the user does have:
{
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlA2"
] } } } },
{
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlA0",
"controlA1"
] } } ] }
},
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlB0",
"controlB2",
"controlB3"
] } } } },
{
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlB1"
] } } ] }
},
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlC",
"controlD"
] } } } } ] } }
],
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlE"
] } } } }
}
This was much faster. Most queries with this filter returned in
around 0.8 seconds, using a match_all query.
However, this requires all the terms from the index for this field to
construct the boolean filter. Retrieving them was slow with both of the
following approaches:
2a. Faceting - Retrieving all terms for a field:
{
"facets": {
"DOCUMENT_CONTROL_FIELD_NAME": {
"terms": {
"field": "DOCUMENT_CONTROL_FIELD_NAME",
"size": 100000,
"all_terms": true
},
"global": true
} }
}
Times for this query varied depending on the number of unique terms for
the field chosen, but were anywhere between 20 seconds and 140 seconds.
2b. Termlist plugin (GitHub - jprante/elasticsearch-index-termlist: Elasticsearch Index Termlist)
This approach took longer than the facet query (> 200 seconds).
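To make the filter construction in approach 2 concrete, here is a minimal sketch of how the nested boolean filter might be assembled once the full term list is known. The class name ControlFilterBuilder, the helper names, and the grouping of terms into lists are assumptions for illustration, not our production code. The result is a plain nested Map that would still need to be serialized with a JSON library before being sent to elasticsearch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical builder for the nested boolean filter; a sketch only. */
final class ControlFilterBuilder {
    static final String FIELD = "DOCUMENT_CONTROL_FIELD_NAME";

    /** Wraps a single key/value pair in an insertion-ordered map. */
    static Map<String, Object> single(String key, Object value) {
        Map<String, Object> map = new LinkedHashMap<String, Object>();
        map.put(key, value);
        return map;
    }

    /** Builds {"terms": {FIELD: [values...]}}. */
    static Map<String, Object> terms(List<String> values) {
        return single("terms", single(FIELD, new ArrayList<String>(values)));
    }

    /**
     * @param groups       all index terms for the field, split into assumed control groups
     * @param userControls controls the user holds
     */
    static Map<String, Object> build(List<List<String>> groups, Set<String> userControls) {
        List<Object> must = new ArrayList<Object>();
        for (List<String> group : groups) {
            List<String> held = new ArrayList<String>();
            List<String> missing = new ArrayList<String>();
            for (String control : group) {
                (userControls.contains(control) ? held : missing).add(control);
            }
            if (missing.isEmpty()) {
                continue; // the user holds every control in this group, so nothing to restrict
            }
            List<Object> should = new ArrayList<Object>();
            // Either the document has none of the controls the user lacks...
            should.add(single("bool", single("must_not", terms(missing))));
            // ...or it has at least one control the user holds.
            if (!held.isEmpty()) {
                should.add(terms(held));
            }
            must.add(single("bool", single("should", should)));
        }
        return single("filter", single("bool", single("must", must)));
    }
}
```

In this sketch a one-element group such as controlE ends up as a should clause containing only a must_not, rather than the top-level must_not shown in the hand-written filter; the two should match the same documents. The builder still depends on having the complete term list for the field, which is the slow step described in 2a and 2b.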
Any feedback or thoughts would be much appreciated. Thanks!
- Ash