Efficiently filtering documents based on user controls and documents' fields

The general questions that we are trying to answer:

  • What is the best (fastest) way to filter documents based on user controls
    and documents' fields?
  • What is the best (fastest) way to get all the terms in an index for a
    field?

Our specific details follow:

We are developing an application that will use elasticsearch to index 200M+
documents spread across 16 nodes. When a user searches, we need to filter
out documents based upon the intersection of a user's tokens (gathered by
our application) and the document's tokens (indexed values). We have
evaluated a number of ways to do this and we are looking for feedback from
the elasticsearch community on our approaches and any other methods that
can be tried.

The performance tests quoted below were run on a stack with 5 nodes, 5
shards, 1 replica, and 150M documents of about 15K each, on elasticsearch
version 0.20.5 (we briefly tested 0.90.0RC1 and found it slower than
0.20.5).

  1. Native Script filter - We are submitting the user's tokens via params to
    a Native Script that compares them with documents' controls:

public class CustomScript extends AbstractSearchScript {
    ...
    @Override
    public Object run() {
        // Profiled elasticsearch during a single query

        // 60% of CPU time during CustomScript.run
        Set<String> docControls = Sets.newHashSet(
            ((StringDocFieldData) doc().field(DOCUMENT_CONTROL_FIELD_NAME)).getValues());

        // 30% of CPU time during CustomScript.run
        return shouldBeAllowedToSeeDocument(this.userControls, docControls);
    }
    ...
}

As you can see, most of the time is spent pulling the fields from the
documents. If the query does not hit many documents, the filter is quick
enough (< 1 second). But if the query hits millions of documents, the
filter gets much slower (~80s).

  2. Nested Boolean filter - Because pulling document fields was the
    bottleneck, we sought a way to avoid that step. We construct a (complex)
    filter query made up of nested boolean filters that enumerates all of the
    tokens from the index that the user does not have and combines them with
    the tokens that the user does have:

{
  "filter": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlA2"]
                    }
                  }
                }
              },
              {
                "terms": {
                  "DOCUMENT_CONTROL_FIELD_NAME": ["controlA0", "controlA1"]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlB0", "controlB2", "controlB3"]
                    }
                  }
                }
              },
              {
                "terms": {
                  "DOCUMENT_CONTROL_FIELD_NAME": ["controlB1"]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlC", "controlD"]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "must_not": {
        "terms": {
          "DOCUMENT_CONTROL_FIELD_NAME": ["controlE"]
        }
      }
    }
  }
}

This was much faster: most queries with this filter returned in around 0.8
seconds, with match_all as the query.

However, constructing the boolean filter requires all the terms in the
index for this field. Retrieving them was slow with both of the approaches
we tried:

2a. Faceting - Retrieving all terms for a field:

{
  "facets": {
    "DOCUMENT_CONTROL_FIELD_NAME": {
      "terms": {
        "field": "DOCUMENT_CONTROL_FIELD_NAME",
        "size": 100000,
        "all_terms": true
      },
      "global": true
    }
  }
}

Times for this query varied with the number of unique terms for the field
chosen, but were anywhere between 20 seconds and 140 seconds.
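
Once the facet response arrives, the term list still has to be extracted on the client. A minimal Python sketch of that step follows. The sample response is made up for illustration, and its shape (a "terms" list of objects carrying "term" and "count" keys) is an assumption based on the terms facet's usual output, so it should be checked against the version in use.

```python
def terms_from_facet(response, facet_name):
    """Return the sorted unique terms reported by a terms facet."""
    facet = response["facets"][facet_name]
    return sorted(entry["term"] for entry in facet["terms"])


# Hypothetical response body, trimmed to the parts used here.
sample_response = {
    "facets": {
        "DOCUMENT_CONTROL_FIELD_NAME": {
            "_type": "terms",
            "terms": [
                {"term": "controlB1", "count": 7},
                {"term": "controlA0", "count": 12},
            ],
        }
    }
}

all_terms = terms_from_facet(sample_response, "DOCUMENT_CONTROL_FIELD_NAME")
print(all_terms)
```

Because the term list only changes when documents with new controls are indexed, the result could be held by the application and refreshed periodically rather than fetched per search; whether that is acceptable depends on how fresh the list needs to be.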

2b. Termlist plugin
(https://github.com/jprante/elasticsearch-index-termlist)

This approach took longer than the facet query (> 200 seconds).

Any feedback or thoughts would be much appreciated. Thanks!

- Ash

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Could you elaborate a little bit more on the requirements? A few examples
of the documents and user tokens, and how they should and shouldn't match,
would be really helpful.

On Monday, March 25, 2013 1:47:22 PM UTC-4, hanaf...@gmail.com wrote:

Hey Igor,
Thanks for getting back to us. Hopefully we can clear up our problem a
little bit. Please note that the description below details a simplified
version of the problem, but the concepts are essentially the same.

So say we write the following documents to an empty ES index:
{
  "doc1": {
    "controlled_field": ["A", "B", "C"]
  },
  "doc2": {
    "controlled_field": ["X", "Y", "Z"]
  },
  "doc3": {
    "controlled_field": ["A", "Y", "D"]
  }
}

And consider a user with the following controls:
["A", "B", "D", "X", "Y"]

In order for a user to be allowed access to a document, the user must have
ALL of the controls listed in the document. So:

user DENIED access to doc1 => user has "A" and "B" but is missing "C"
user DENIED access to doc2 => user has "X" and "Y" but is missing "Z"
user ALLOWED access to doc3 => user has all of doc3's controls,
                               ["A", "Y", "D"]

The logic above can be expressed in set math:
((values from document's "controlled_field") - (user's controls)) MUST be
empty
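
As a plain-Python illustration of that set rule (not Elasticsearch code), the check can be sketched as follows. The helper name `is_allowed` is made up for this example; the documents and user controls are the ones listed above.

```python
def is_allowed(doc_controls, user_controls):
    """Allow access only if the document has no control the user lacks."""
    return not (set(doc_controls) - set(user_controls))


docs = {
    "doc1": ["A", "B", "C"],
    "doc2": ["X", "Y", "Z"],
    "doc3": ["A", "Y", "D"],
}
user_controls = ["A", "B", "D", "X", "Y"]

allowed = [name for name, controls in docs.items()
           if is_allowed(controls, user_controls)]
print(allowed)  # ['doc3']
```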

But this method requires us to pull the "controlled_field" values from each
document, which we find slow for hundreds of millions of documents.

So we tried an approach using boolean filters, which follows.

Given the documents above, all the terms for "controlled_field" in the ES
index are:
index terms = ["A", "B", "C", "D", "X", "Y", "Z"]

We can filter out documents that contain controls that the user does not
have, as follows:
terms_documents_must_not_have = (index terms - user's controls)
                              = ["A", "B", "C", "D", "X", "Y", "Z"]
                                - ["A", "B", "D", "X", "Y"]
                              = ["C", "Z"]
So the filter becomes:
{
  "filter": {
    "bool": {
      "must_not": {
        "terms": {
          "controlled_field": ["C", "Z"]
        }
      }
    }
  }
}
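
The computation above can be sketched in Python as follows. The function name `build_must_not_filter` is hypothetical; the sketch only builds the request body as a dict and assumes the full term list has already been retrieved by some means.

```python
def build_must_not_filter(index_terms, user_controls, field="controlled_field"):
    """Build the must_not filter from (index terms - user's controls)."""
    missing = sorted(set(index_terms) - set(user_controls))
    return {"filter": {"bool": {"must_not": {"terms": {field: missing}}}}}


index_terms = ["A", "B", "C", "D", "X", "Y", "Z"]
user_controls = ["A", "B", "D", "X", "Y"]

body = build_must_not_filter(index_terms, user_controls)
print(body)
```

If the user holds every term in the index, the difference is empty, and the caller may prefer to omit the filter altogether rather than send an empty terms list.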

This approach does not require us to pull the "controlled_field" from each
document and is still correct. But the problem becomes: how do we get all
terms for a field from the ES index quickly?

Please let me know if this is still not clear. Thanks again.

- Ash

On Tuesday, March 26, 2013 10:41:20 AM UTC-4, Igor Motov wrote:

I can think of only one clean way to do it. Unfortunately, it's somewhat
complicated, and I am not sure how performant it's going to be on 200M+
documents; that's something you would need to test. However, filters can
be cached, so it may be possible to limit the impact to only the user's
first query.

You can do something like this:

Index controlled_fields as nested objects. So the first document would look
like this:

{
  "controlled_fields": [
    { "controlled_field": "A" },
    { "controlled_field": "B" },
    { "controlled_field": "C" }
  ]
}

and the mapping would look like this:

"properties": {
  "controlled_fields": {
    "type": "nested",
    "properties": {
      "controlled_field": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

With this mapping, we can easily find all controlled fields that are
assigned to the user:

"terms": {
  "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
}

Then we can wrap it in a not filter and thereby find all controlled
fields that are not assigned to the user:

"not": {
  "terms": {
    "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
  }
}

And then we can use a nested filter to find all documents that contain
control fields that are not assigned to the user:

"nested": {
  "path": "controlled_fields",
  "filter": {
    "not": {
      "terms": {
        "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
      }
    }
  }
}

And finally we can use a not filter to filter all these documents out:

"not": {
  "nested": {
    "path": "controlled_fields",
    "filter": {
      "not": {
        "terms": {
          "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
        }
      }
    }
  }
}
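
Since the user's controls change per request, the application has to assemble this filter each time. A small Python sketch of that assembly follows, mirroring the four steps above. The function name `build_access_filter` and its default arguments are assumptions for illustration; it produces only the filter fragment, not a full search request.

```python
def build_access_filter(user_controls,
                        path="controlled_fields",
                        field="controlled_fields.controlled_field"):
    """Build the not(nested(not(terms))) filter for one user's controls."""
    # Step 1: nested objects whose control the user has.
    has_user_control = {"terms": {field: list(user_controls)}}
    # Step 2: nested objects whose control the user does not have.
    lacks_user_control = {"not": has_user_control}
    # Step 3: documents containing at least one such object.
    doc_has_foreign_control = {
        "nested": {"path": path, "filter": lacks_user_control}
    }
    # Step 4: exclude those documents.
    return {"not": doc_has_foreign_control}


access_filter = build_access_filter(["A", "B", "D", "X", "Y"])
print(access_filter)
```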

Here is the final solution:
https://github.com/imotov/elasticsearch-test-scripts/blob/master/controlled_field.sh

By default, the not filter is not cached. So you might need to set the
"_cache" attribute of the outermost not filter to true in order to get
decent performance out of it.
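
A sketch of what setting that attribute might look like when the filter is built in Python. It assumes the long form of the not filter, in which the inner filter moves under a "filter" key so that "_cache" can sit beside it; that form is recalled from the filter documentation of that era and should be verified against the version in use. The helper name `with_cache` is made up.

```python
import json


def with_cache(not_filter):
    """Rewrite {"not": inner} as {"not": {"filter": inner, "_cache": True}}."""
    inner = not_filter["not"]
    return {"not": {"filter": inner, "_cache": True}}


# A short stand-in for the outermost not filter from the example above.
outer = {"not": {"nested": {"path": "controlled_fields", "filter": {}}}}

cached = with_cache(outer)
print(json.dumps(cached, indent=2))
```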

On Tuesday, March 26, 2013 2:47:11 PM UTC-4, hanaf...@gmail.com wrote:

Igor,

Ash and I are working on this problem together and we wanted to thank you
for your help. Anecdotal test results on 96 million docs on 5 shards look
promising: we are seeing sub-second response times (even with caching
turned off). We are in the process of running this in our load testing
environment to get a better picture of the performance.

We have a follow-up question for you. We noticed that you used nested
objects in your example. Is there a reason why we could not simply use
top-level document properties?

Ariel,

Thanks for the update.

With top-level properties, it's going to break on the very first level:

"terms": {
  "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
}

If controlled_field is nested, this query returns the controlled_field
objects that the user has. If controlled_field is not nested, it returns
the documents that have at least one of the user's controlled fields.
Therefore, when we negate this query on the next level, in the nested case
we find all controlled fields that the user doesn't have, while in the
non-nested case we find all documents that don't have any of the user's
controlled fields. So, in the non-nested case, the outermost not simply
reverses the inner not, and you get all documents that have at least one
of the user's control fields. In other words, the only reason the nested
case works is that the inner not works on fields and not on documents.
Does that make sense?
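
The difference can be modelled in a few lines of plain Python. This is only a simulation of the matching logic described above, not Elasticsearch itself; the documents and user controls are the ones from Ash's earlier example, and the function names are made up.

```python
USER = {"A", "B", "D", "X", "Y"}
DOCS = {
    "doc1": ["A", "B", "C"],
    "doc2": ["X", "Y", "Z"],
    "doc3": ["A", "Y", "D"],
}


def nested_case(controls, user):
    # The inner "not terms" is evaluated per nested object (one control each).
    objects_user_lacks = [c for c in controls if c not in user]
    # "nested" matches the document if any of its objects matched.
    doc_matches_nested = bool(objects_user_lacks)
    # The outer "not" keeps only documents with no such object.
    return not doc_matches_nested


def flat_case(controls, user):
    # "terms" matches the whole document if it has at least one user control.
    doc_matches_terms = any(c in user for c in controls)
    # The inner and outer "not" both act on documents, so they cancel out.
    return not (not doc_matches_terms)


nested_hits = [name for name, c in DOCS.items() if nested_case(c, USER)]
flat_hits = [name for name, c in DOCS.items() if flat_case(c, USER)]
print(nested_hits)  # ['doc3']
print(flat_hits)    # ['doc1', 'doc2', 'doc3']
```

In the flat model every document is returned because each one shares at least one control with the user, which is exactly the failure described above.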

On Friday, March 29, 2013 11:20:56 AM UTC-4, Ariel Valentin wrote:

Igor,

Ash and I are working on this problem together and we wanted to thank you
for your help. Anecdotal test results on 96 Million docs on 5 shards look
promising; we are seeing sub-second response times (even with caching
turned off), we are in the process of running this in our load testing
environment to get a better picture of the performance.

We have a follow up question for you. First, we noticed that you used
nested objects in your example, is there a reason why we could not simply
use top level document properties?

On Monday, March 25, 2013 1:47:22 PM UTC-4, hanaf...@gmail.com wrote:

The general questions that we are trying to answer:

  • What is the best (fastest) way to filter documents based on user
    controls and documents' fields?
  • What is the best (fastest) way to get all the terms in an index for a
    field?

Our specific details follow:

We are developing an application that will use elasticsearch to index
200M+ documents spread across 16 nodes. When a user searches, we need to
filter out documents based upon the intersection of a user's tokens
(gathered by our application) and the document's tokens (indexed values).
We have evaluated a number of ways to do this and we are looking for
feedback from the elasticsearch community on our approaches and any other
methods that can be tried.

The performance tests quoted below were run on a stack with 5 nodes, 5
shards, 1 replica, 150M documents, each document is about 15K each,
elasticsearch version - 0.20.5 (we briefly tested 0.90.0RC1 and found it
slower than 0.20.5)

  1. Native Script filter - We are submitting the user's tokens via params
    to a Native Script that compares them with documents' controls:

public class CustomScript extends AbstractSearchScript {
...
@Override
public Object run() {
// Profiled elasticsearch during a single query

    // 60% of CPU time during CustomScript.run
    Set<String> docControls= Sets.newHashSet(((StringDocFieldData) 

doc().field(DOCUMENT_CONTROL_FIELD_NAME)).getValues());

    // 30% of CPU time during CustomScript.run
    return shouldBeAllowedToSeeDocument(this.userControls, 

docControls);
}
...
}

As you can see, most of the time is spent pulling the fields from the
documents. If the query does not hit many documents, the filter is quick
enough (< 1 second). But if the query hits millions of documents, the
filter gets much slower (~80s).

  1. Nested Boolean filter - Because pulling document fields was the
    bottleneck, we sought a way to avoid that step. We construct a (complex)
    filter query made up of nested boolean filters that enumerates all of the
    tokens from the index that the user does not have and combines them with
    the token that the user does have:

{
"filter": {
"bool": {
"must": [
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlA2"
] } } } },
{
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlA0",
"controlA1"
] } } ] }
},
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlB0",
"controlB2",
"controlB3"
] } } } },
{
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlB1"
] } } ] }
},
{
"bool": {
"should": [
{
"bool": {
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlC",
"controlD"
] } } } } ] } }
],
"must_not": {
"terms": {
"DOCUMENT_CONTROL_FIELD_NAME": [
"controlE"
] } } } }
}

This was much faster. We were seeing most queries with this filter
return around 0.8 seconds. And the query was a match_all docs query.

However, this requires all the terms from the index for this field to
construct the boolean filter. This was slow when we tried the following
approaches for this:

2a. Faceting - Retrieving all terms for a field:

{
  "facets": {
    "DOCUMENT_CONTROL_FIELD_NAME": {
      "terms": {
        "field": "DOCUMENT_CONTROL_FIELD_NAME",
        "size": 100000,
        "all_terms": true
      },
      "global": true
    }
  }
}

Times for this query varied depending on the number of unique terms for
the field chosen, but were anywhere between 20 seconds and 140 seconds.

2b. Termlist plugin (jprante/elasticsearch-index-termlist on GitHub)

This approach took longer than the facet query (> 200 seconds).

Any feedback or thoughts would be much appreciated. Thanks!

  • Ash

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Igor,

Thanks again for your prompt reply. It's not entirely clear to me what the
limitations are because I have not experimented with it yet, but if I
understand your description correctly, the nested approach behaves more
like a subquery, which is not possible with flat documents.

I have one more follow-up question for you. I see that in your example you
start by using a filtered query:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "isabella"
        }
      },
      "filter": {
        //... yada yada yada
      }
    }
  }
}

Compared to what Tire generates:
{
  "query": {
    "query_string": {
      "query": "isabella"
    }
  },
  "filter": {
    //... yada yada yada
  }
}

I would like to confirm with you that we should be using filtered queries,
because the filter is applied on the results of the hits returned from the
query and not on the entire corpus of documents. Is that correct?

On Friday, March 29, 2013 12:15:07 PM UTC-4, Igor Motov wrote:

Ariel,

Thanks for the update.

With top-level properties, it's going to break on the very first level:

"terms": {
  "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
}

If controlled_field is nested, this query returns the controlled fields
that the user has. If controlled_field is not nested, it returns the
documents that have at least one of those controlled fields. Therefore,
when we negate this query on the next level, in the nested case we find all
controlled fields that the user doesn't have, while in the non-nested case
we find all documents that don't have any of the user's controlled fields.
So, in the non-nested case, the outermost not simply reverses the inner
not, and you get all documents that have at least one of the user's
controlled fields. In other words, the only reason the nested case works is
that the inner not works on fields and not on documents. Does that make
sense?
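
The difference can be modelled with plain sets. This is only a sketch of the
semantics described above, not Elasticsearch code: it assumes the query has
the shape not(not(terms)), with the inner not evaluated per nested object in
the nested case, and the class and method names are made up for
illustration.

```java
import java.util.Set;

public class NestedNegationDemo {
    // Flat mapping: terms matches a document if ANY of its values is in the
    // user's set, so the outer not just undoes the inner not.
    public static boolean flatDoubleNot(Set<String> docFields, Set<String> userFields) {
        boolean terms = docFields.stream().anyMatch(userFields::contains);
        boolean innerNot = !terms;
        return !innerNot;
    }

    // Nested mapping: the inner not is evaluated per controlled field, so it
    // selects the fields the user lacks; the outer not then rejects every
    // document that owns such a field.
    public static boolean nestedDoubleNot(Set<String> docFields, Set<String> userFields) {
        boolean hasFieldUserLacks = docFields.stream().anyMatch(f -> !userFields.contains(f));
        return !hasFieldUserLacks;
    }

    public static void main(String[] args) {
        Set<String> user = Set.of("A", "B", "D", "X", "Y");
        Set<String> doc = Set.of("A", "C"); // "C" is not held by the user
        System.out.println("flat:   " + flatDoubleNot(doc, user));   // true
        System.out.println("nested: " + nestedDoubleNot(doc, user)); // false
    }
}
```

In this model the flat version lets the document through because it shares
"A" with the user, while the nested version blocks it because of "C".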

On Friday, March 29, 2013 11:20:56 AM UTC-4, Ariel Valentin wrote:

Igor,

Ash and I are working on this problem together and we wanted to thank you
for your help. Anecdotal test results on 96 million docs on 5 shards look
promising: we are seeing sub-second response times (even with caching
turned off). We are in the process of running this in our load testing
environment to get a better picture of the performance.

We have a follow-up question for you. We noticed that you used nested
objects in your example. Is there a reason why we could not simply use
top-level document properties?

On Monday, March 25, 2013 1:47:22 PM UTC-4, hanaf...@gmail.com wrote:

The general questions that we are trying to answer:

  • What is the best (fastest) way to filter documents based on user
    controls and documents' fields?
  • What is the best (fastest) way to get all the terms in an index for a
    field?

Any feedback or thoughts would be much appreciated. Thanks!

  • Ash


Although both queries will return exactly the same results, you should use
the filtered query. There are only a few special cases in which you need
the top-level filter. It is typically used when you want to exclude the
filter from facet calculations, or when the filter performs a very heavy
calculation for every record that is passed to it. In your case, you will
be able to achieve better performance with the filtered query.
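
For a client that currently emits the top-level form, the change amounts to
moving the filter inside a "filtered" query. A small sketch of the two
request bodies side by side; the class and method names are hypothetical,
and the terms filter in main is only a stand-in for the real filter.

```java
public class FilteredQueryBody {
    // Recommended form: the filter sits inside the "filtered" query.
    public static String filtered(String queryJson, String filterJson) {
        return "{\"query\":{\"filtered\":{\"query\":" + queryJson
                + ",\"filter\":" + filterJson + "}}}";
    }

    // Top-level form: the filter is a sibling of "query".
    public static String topLevelFilter(String queryJson, String filterJson) {
        return "{\"query\":" + queryJson + ",\"filter\":" + filterJson + "}";
    }

    public static void main(String[] args) {
        String query = "{\"query_string\":{\"query\":\"isabella\"}}";
        // Stand-in filter, for illustration only.
        String filter = "{\"terms\":{\"DOCUMENT_CONTROL_FIELD_NAME\":[\"controlA0\"]}}";
        System.out.println(filtered(query, filter));
        System.out.println(topLevelFilter(query, filter));
    }
}
```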

On Friday, March 29, 2013 1:34:45 PM UTC-4, Ariel Valentin wrote:

I would like to confirm with you that we should be using filtered queries,
because the filter is applied on the results of the hits returned from the
query and not on the entire corpus of documents. Is that correct?
