Facet counts

Hi,

During our development, we ran into the same issue documented in these reports:

Inconsistent facet counts · Issue #1832 · elastic/elasticsearch · GitHub
Terms Facet: gives different results depending on size value · Issue #667 · elastic/elasticsearch · GitHub
In our application, accuracy is the most important requirement. We modified the class
org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetCollector
to return all facets in sorted order on a per-shard basis; the facet merge
step then returns the top N facets based on the combined counts. We are
running performance tests to evaluate the impact of this change, as well as
regression tests to verify overall functionality. Any comments or guidance
on other potential impacts of this change would be greatly appreciated.

The modified code segment is listed below. The original top-N block is
commented out; the BoundedTreeSet that replaces it is sized by configured_max:

    /*if (size < EntryPriorityQueue.LIMIT) {
        // optimize to use priority size
        EntryPriorityQueue ordered = new EntryPriorityQueue(size, comparatorType.comparator());

        while (queue.size() > 0) {
            ReaderAggregator agg = queue.top();
            String value = agg.current;
            int count = 0;
            do {
                count += agg.counts[agg.position];
                if (agg.nextPosition()) {
                    agg = queue.updateTop();
                } else {
                    // we are done with this reader
                    queue.pop();
                    agg = queue.top();
                }
            } while (agg != null && value.equals(agg.current));

            if (count > minCount) {
                if (excluded != null && excluded.contains(value)) {
                    continue;
                }
                if (matcher != null && !matcher.reset(value).matches()) {
                    continue;
                }
                InternalStringTermsFacet.StringEntry entry = new InternalStringTermsFacet.StringEntry(value, count);
                ordered.insertWithOverflow(entry);
            }
        }

        InternalStringTermsFacet.StringEntry[] list = new InternalStringTermsFacet.StringEntry[ordered.size()];
        for (int i = ordered.size() - 1; i >= 0; i--) {
            list[i] = (InternalStringTermsFacet.StringEntry) ordered.pop();
        }

        for (ReaderAggregator aggregator : aggregators) {
            CacheRecycler.pushIntArray(aggregator.counts);
        }

        return new InternalStringTermsFacet(facetName, comparatorType, size, Arrays.asList(list), missing, total);
    }*/

    BoundedTreeSet<InternalStringTermsFacet.StringEntry> ordered = new BoundedTreeSet<InternalStringTermsFacet.StringEntry>(comparatorType.comparator(), configured_max);
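
For context on why per-shard truncation loses accuracy, here is a small,
self-contained illustration (the shard data and class name are made up for
the example): with size = 2, each shard returns only its local top 2, so a
term that ranks third on every shard never reaches the merge step, even when
its global count is the highest.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ShardTruncationDemo {

        // What each shard sends back: its local top-N terms by count.
        static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
            List<Map.Entry<String, Integer>> entries =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();
                }
            });
            return entries.subList(0, Math.min(n, entries.size()));
        }

        public static void main(String[] args) {
            Map<String, Integer> shard1 = new HashMap<String, Integer>();
            shard1.put("apple", 10);
            shard1.put("pear", 9);
            shard1.put("plum", 8);   // only 3rd on this shard

            Map<String, Integer> shard2 = new HashMap<String, Integer>();
            shard2.put("kiwi", 10);
            shard2.put("fig", 9);
            shard2.put("plum", 8);   // 3rd here too, but 16 globally

            // With size = 2, "plum" is dropped by both shards before the
            // merge, even though its global count beats every other term.
            System.out.println("shard1 sends: " + topN(shard1, 2));
            System.out.println("shard2 sends: " + topN(shard2, 2));

            // Correct result when shards return everything: merge the full
            // counts, then take the global top 2.
            Map<String, Integer> merged = new HashMap<String, Integer>(shard1);
            for (Map.Entry<String, Integer> e : shard2.entrySet()) {
                Integer prev = merged.get(e.getKey());
                merged.put(e.getKey(), (prev == null ? 0 : prev) + e.getValue());
            }
            System.out.println("global top 2: " + topN(merged, 2));
        }
    }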

Thanks,
KJ

As a further optimization, we now dynamically set the configured_max value
based on the total number of unique terms found on a per-shard basis; the
added lines are marked with "// added" comments below. Doing this reduced
the cache size by 90% and improved response times.

public Facet facet() {

    .....

    Set<String> totalTerms = new HashSet<String>(); // added

    for (ReaderAggregator aggregator : aggregators) {
        if (aggregator.nextPosition()) {
            totalTerms.addAll(Arrays.asList(aggregator.values)); // added: collect this reader's terms
            queue.add(aggregator);
        }
    }

    // added: bound the set by the shard's actual unique-term count instead of configured_max
    BoundedTreeSet<InternalStringTermsFacet.StringEntry> ordered = new BoundedTreeSet<InternalStringTermsFacet.StringEntry>(comparatorType.comparator(), totalTerms.size());

    .....

}
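
For readers without the Elasticsearch source handy, the behavior assumed of
BoundedTreeSet above is a TreeSet capped at max elements in comparator
order. A minimal sketch of those semantics (illustrative only, not the
actual Elasticsearch implementation) might look like this:

    import java.util.Comparator;
    import java.util.TreeSet;

    // Sketch: a TreeSet that never holds more than `max` elements, evicting
    // whichever element sorts last once the cap is exceeded.
    public class BoundedTreeSetSketch<E> extends TreeSet<E> {

        private final int max;

        public BoundedTreeSetSketch(Comparator<? super E> comparator, int max) {
            super(comparator);
            this.max = max;
        }

        @Override
        public boolean add(E e) {
            boolean added = super.add(e);
            if (size() > max) {
                // Evict the element that now sorts last; if that is the one
                // just inserted, the add is effectively rejected.
                pollLast();
            }
            return added;
        }
    }

Under these semantics, sizing the set at totalTerms.size() means no term is
ever evicted on the shard, so every term survives to the merge step; the
bound then only serves to pre-size the structure rather than to truncate.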
