I can think of only one clean way to do it. Unfortunately, it's somewhat
complicated, and I am not sure how well it will perform on 200M+
documents; that's something you would need to test. However, filters
can be cached, so it may be possible to limit the cost to a user's
first query.
You can do something like this:
Index controlled_fields as nested objects. So the first document would look
like this:
{
  "controlled_fields": [{
    "controlled_field": "A"
  }, {
    "controlled_field": "B"
  }, {
    "controlled_field": "C"
  }]
}
and the mapping would look like this:
"properties": {
  "controlled_fields": {
    "type": "nested",
    "properties": {
      "controlled_field": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
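If the documents currently hold the controls as a flat array, a small client-side conversion is enough to produce the nested layout shown above. A minimal sketch in Python; to_nested_source is a made-up helper name, not part of any client library, and the field names are the ones used in the mapping:

```python
def to_nested_source(controls):
    """Convert a flat list of controls into the nested-object layout.

    to_nested_source is a hypothetical helper used only for illustration.
    """
    return {
        "controlled_fields": [
            {"controlled_field": control} for control in controls
        ]
    }


# The first example document, ready to be indexed with the nested mapping.
doc1 = to_nested_source(["A", "B", "C"])
```

Each control becomes its own nested object, which is what lets a nested filter test the controls one at a time.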
With this mapping, we can easily find all controlled fields that are
assigned to the user:
"terms": {
  "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
}
Then we can wrap it in a not filter and thereby find all controlled
fields that are not assigned to the user:
"not": {
  "terms": {
    "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
  }
}
Then we can use a nested filter to find all documents that contain
controlled fields that are not assigned to the user:
"nested": {
  "path": "controlled_fields",
  "filter": {
    "not": {
      "terms": {
        "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
      }
    }
  }
}
Finally, we can use a not filter to filter all of these documents out:
"not": {
  "nested": {
    "path": "controlled_fields",
    "filter": {
      "not": {
        "terms": {
          "controlled_fields.controlled_field": ["A", "B", "D", "X", "Y"]
        }
      }
    }
  }
}
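To see why the double negation gives the right answer, here is a small pure-Python simulation of the filter's logic on the three example documents from the quoted message. It is only a sketch of the semantics (a nested filter matches a document when at least one of its nested objects matches), not of how elasticsearch executes it; nested_any, is_visible and nested_doc are names made up for this illustration:

```python
def nested_any(nested_objects, predicate):
    """Mimic a nested filter: true if at least one nested object matches."""
    return any(predicate(obj) for obj in nested_objects)


def is_visible(doc, user_controls):
    """Apply not(nested(not(terms))) to a single document."""
    allowed = set(user_controls)

    def lacks(obj):
        # Inner "not terms": this nested object's control is not one of
        # the user's controls.
        return obj["controlled_field"] not in allowed

    # Outer "not nested": keep the document only if no such object exists.
    return not nested_any(doc["controlled_fields"], lacks)


def nested_doc(controls):
    """Build a document in the nested layout (illustrative helper)."""
    return {"controlled_fields": [{"controlled_field": c} for c in controls]}


user = ["A", "B", "D", "X", "Y"]
docs = {
    "doc1": nested_doc(["A", "B", "C"]),
    "doc2": nested_doc(["X", "Y", "Z"]),
    "doc3": nested_doc(["A", "Y", "D"]),
}
visible = sorted(name for name, doc in docs.items() if is_visible(doc, user))
print(visible)  # prints ['doc3']
```

Note that a document with no controlled fields at all has no nested object for the inner filter to match, so it stays visible to everyone, which agrees with the "set difference must be empty" rule.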
Here is the final solution:
https://github.com/imotov/elasticsearch-test-scripts/blob/master/controlled_field.sh
By default, the not filter is not cached, so you might need to set the
"_cache" attribute of the outermost not filter to true in order to get
decent performance out of it.
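As a rough illustration, the request body for the complete filter, including the "_cache" flag, could be assembled on the client like this. This is a sketch under assumptions: build_access_filter is a made-up helper name, and it uses the wrapped form of the not filter (the inner filter under a "filter" key, next to "_cache"), which is how I recall the filter documentation of that era describing it; please check it against the version you run:

```python
import json


def build_access_filter(user_controls, cache=True):
    """Build not(nested(not(terms))) for a user's controls.

    Hypothetical helper; the wrapped {"not": {"filter": ..., "_cache": ...}}
    form is an assumption to verify against your elasticsearch version.
    """
    inner = {
        "not": {
            "terms": {
                "controlled_fields.controlled_field": list(user_controls)
            }
        }
    }
    nested = {"nested": {"path": "controlled_fields", "filter": inner}}
    outer = {"filter": nested}
    if cache:
        # Cache only the outermost filter, as suggested above.
        outer["_cache"] = True
    return {"not": outer}


access_filter = build_access_filter(["A", "B", "D", "X", "Y"])
print(json.dumps(access_filter, indent=2))
```

Since the cached entry depends on the exact list of controls, users who share the same controls would presumably share the cached filter, while every distinct list would pay the cost once.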
On Tuesday, March 26, 2013 2:47:11 PM UTC-4, hanaf...@gmail.com wrote:
Hey Igor,
Thanks for getting back to us. Hopefully we can clear up our problem a
little bit. Please note that the description below details a simplified
version of the problem, but the concepts are essentially the same.
So say we write the following documents to an empty ES index:
{
  "doc1": {
    "controlled_field": ["A", "B", "C"]
  },
  "doc2": {
    "controlled_field": ["X", "Y", "Z"]
  },
  "doc3": {
    "controlled_field": ["A", "Y", "D"]
  }
}
And consider a user with the following controls:
["A", "B", "D", "X", "Y"]
In order for a user to be allowed access to a document, the user must have
ALL of the controls listed in the document. So:
user DENIED access to doc1 => user has "A" and "B" but is missing "C"
user DENIED access to doc2 => user has "X" and "Y" but is missing "Z"
user ALLOWED access to doc3 => user has all of doc3's controls, ["A", "Y",
"D"]
The logic above can be expressed in set math:
((values from document's "controlled_field") - (user's controls)) MUST be
empty
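For reference, the set-math rule above is short to state in code. A minimal sketch in Python (the function name is_allowed is made up for illustration); it only restates the rule on the three example documents and says nothing about how it would run inside elasticsearch:

```python
def is_allowed(doc_controls, user_controls):
    """Allow access only if the document has no control the user lacks."""
    missing = set(doc_controls) - set(user_controls)
    return not missing


user_controls = ["A", "B", "D", "X", "Y"]
documents = {
    "doc1": ["A", "B", "C"],
    "doc2": ["X", "Y", "Z"],
    "doc3": ["A", "Y", "D"],
}

for name, controls in documents.items():
    verdict = "ALLOWED" if is_allowed(controls, user_controls) else "DENIED"
    print(name, verdict)
```

Evaluating this per document requires the document's own values at query time.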
But this method requires us to pull the "controlled_field" values from
each document, which we find slow for hundreds of millions of documents.
So we tried an approach using boolean filters, which follows.
Given the documents above, all the terms for "controlled_field" in the ES
index are:
index terms = ["A", "B", "C", "D", "X", "Y", "Z"]
We can filter out documents that contain controls that the user does not
have, as follows:
terms_documents_must_not_have = (index terms - user's controls)
                              = ["A", "B", "C", "D", "X", "Y", "Z"] - ["A", "B", "D", "X", "Y"]
                              = ["C", "Z"]
So the filter becomes:
{
  "filter": {
    "bool": {
      "must_not": {
        "terms": {
          "controlled_field": ["C", "Z"]
        }
      }
    }
  }
}
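The term subtraction above can be sketched as follows. This is a minimal Python illustration; build_must_not_filter is a hypothetical helper name, and index_terms is assumed to have been fetched from the index already:

```python
def build_must_not_filter(index_terms, user_controls, field="controlled_field"):
    """Exclude documents carrying any indexed control the user lacks.

    Hypothetical helper; index_terms must already hold every term
    indexed for the field.
    """
    forbidden = sorted(set(index_terms) - set(user_controls))
    return {
        "filter": {
            "bool": {
                "must_not": {
                    "terms": {field: forbidden}
                }
            }
        }
    }


index_terms = ["A", "B", "C", "D", "X", "Y", "Z"]
user_controls = ["A", "B", "D", "X", "Y"]
request_body = build_must_not_filter(index_terms, user_controls)
print(request_body["filter"]["bool"]["must_not"]["terms"])
# prints {'controlled_field': ['C', 'Z']}
```

If the user holds every indexed control, the forbidden list comes out empty, and the caller would probably want to skip the filter altogether in that case.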
This approach does not require us to pull the "controlled_field" from each
document and is still correct. But the problem becomes: how do we get all
terms for a field from the ES index quickly?
Please let me know if this is still not clear. Thanks again.
On Tuesday, March 26, 2013 10:41:20 AM UTC-4, Igor Motov wrote:
Could you elaborate a little bit more on requirements? A few examples of
the documents and user tokens and how they should and shouldn't match would
be really helpful.
On Monday, March 25, 2013 1:47:22 PM UTC-4, hanaf...@gmail.com wrote:
The general questions that we are trying to answer:
- What is the best (fastest) way to filter documents based on user
controls and documents' fields?
- What is the best (fastest) way to get all the terms in an index for a
field?
Our specific details follow:
We are developing an application that will use elasticsearch to index
200M+ documents spread across 16 nodes. When a user searches, we need to
filter out documents based upon the intersection of a user's tokens
(gathered by our application) and the document's tokens (indexed values).
We have evaluated a number of ways to do this and we are looking for
feedback from the elasticsearch community on our approaches and any other
methods that can be tried.
The performance tests quoted below were run on a stack with 5 nodes, 5
shards, 1 replica, and 150M documents of about 15K each, using
elasticsearch version 0.20.5 (we briefly tested 0.90.0RC1 and found it
slower than 0.20.5).
- Native Script filter - We are submitting the user's tokens via params
to a Native Script that compares them with documents' controls:
public class CustomScript extends AbstractSearchScript {
    ...
    @Override
    public Object run() {
        // Profiled elasticsearch during a single query
        // 60% of CPU time during CustomScript.run
        Set<String> docControls = Sets.newHashSet(
                ((StringDocFieldData) doc().field(DOCUMENT_CONTROL_FIELD_NAME)).getValues());
        // 30% of CPU time during CustomScript.run
        return shouldBeAllowedToSeeDocument(this.userControls, docControls);
    }
    ...
}
As you can see, most of the time is spent pulling the fields from the
documents. If the query does not hit many documents, the filter is quick
enough (< 1 second). But if the query hits millions of documents, the
filter gets much slower (~80s).
- Nested Boolean filter - Because pulling document fields was the
bottleneck, we sought a way to avoid that step. We construct a (complex)
filter query made up of nested boolean filters that enumerates all of the
tokens from the index that the user does not have and combines them with
the tokens that the user does have:
{
  "filter": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlA2"]
                    }
                  }
                }
              },
              {
                "terms": {
                  "DOCUMENT_CONTROL_FIELD_NAME": ["controlA0", "controlA1"]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlB0", "controlB2", "controlB3"]
                    }
                  }
                }
              },
              {
                "terms": {
                  "DOCUMENT_CONTROL_FIELD_NAME": ["controlB1"]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "bool": {
                  "must_not": {
                    "terms": {
                      "DOCUMENT_CONTROL_FIELD_NAME": ["controlC", "controlD"]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "must_not": {
        "terms": {
          "DOCUMENT_CONTROL_FIELD_NAME": ["controlE"]
        }
      }
    }
  }
}
This was much faster. We were seeing most queries with this filter
return in around 0.8 seconds, and that was with a match_all query.
However, this requires all the terms from the index for this field to
construct the boolean filter. This was slow when we tried the following
approaches for this:
2a. Faceting - Retrieving all terms for a field:
{
  "facets": {
    "DOCUMENT_CONTROL_FIELD_NAME": {
      "terms": {
        "field": "DOCUMENT_CONTROL_FIELD_NAME",
        "size": 100000,
        "all_terms": true
      },
      "global": true
    }
  }
}
Times for this query varied depending on the number of unique terms for
the field chosen, but were anywhere between 20 seconds and 140 seconds.
2b. Termlist plugin (jprante/elasticsearch-index-termlist on GitHub)
This approach took longer than the facet query (> 200 seconds).
Any feedback or thoughts would be much appreciated. Thanks!