Finding unique values of a field from the returned search results


(apanimesh061) #1

When I send a search request, I get a document of this format:

        {
            "_index": "texts_cluster",
            "_type": "documents",
            "_id": "5732",
            "_score": 1.8980495,
            "_source": {
                "bucket_id": "12345",
                "date": "2009-06-03",
                "text": "some large text"
            }
        }

I send a request so that ~1000 documents are returned. I want the unique bucket_ids of all the returned documents, and then to send a follow-up search query for each collected bucket_id, like:

POST /texts_cluster/_search
{
  "query": {
    "match": {
      "bucket_id": "12345"
    }
  },
  "sort": [
    {
      "date": {
        "order": "asc"
      }
    }
  ]
}

I thought of using aggregations, but they don't return results in relevance order, and I only need the bucket_ids that occur in the 1000 documents returned by the search query.

One way I can think of is to iterate through the 1000 documents and append each bucket_id to a list only if it is not already there, which preserves the sequence. Is there an easier method?
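
A minimal sketch of that client-side approach, assuming the search response has already been parsed into a Python dict. The function name `unique_bucket_ids` and the trimmed-down `sample_response` are hypothetical and only illustrate the idea; the real hits would come from whatever client is used to run the search.

```python
def unique_bucket_ids(hits):
    """Return bucket_ids in first-seen order, skipping duplicates.

    Hits arrive sorted by relevance, so first-seen order is relevance order.
    """
    seen = set()      # fast membership test
    ordered = []      # preserves the order of first appearance
    for hit in hits:
        bucket_id = hit["_source"]["bucket_id"]
        if bucket_id not in seen:
            seen.add(bucket_id)
            ordered.append(bucket_id)
    return ordered


# Hypothetical response, shaped like the document shown above.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "5732", "_score": 1.9, "_source": {"bucket_id": "12345"}},
            {"_id": "5733", "_score": 1.7, "_source": {"bucket_id": "67890"}},
            {"_id": "5734", "_score": 1.2, "_source": {"bucket_id": "12345"}},
        ]
    }
}

bucket_ids = unique_bucket_ids(sample_response["hits"]["hits"])
print(bucket_ids)
```

Using a set alongside the list keeps the duplicate check cheap even when many documents are returned.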

Thanks!


(Mark Harwood) #2

In the 2.0 release (RC1 is available now) we introduce the sampler aggregation that allows you to focus aggs on only the top-matching docs. Example below:

DELETE test
POST test/doc
{
	"bucket":1,
	"text":"we are one"
}
POST test/doc
{
	"bucket":1,
	"text":"one love"
}
POST test/doc
{
	"bucket":1,
	"text":"one time"
}
POST test/doc
{
	"bucket":2,
	"text":"two is company"
}
POST test/doc
{
	"bucket":2,
	"text":"takes two to tango"
}
GET test/doc/_search
{
   "query": {
	  "match": {
		 "text": "one"
	  }
   },
   "size": 0,
   "aggs": {
	  "bestDocs": {
		 "sampler": {
			"field": "bucket",
			"shard_size": 1000
		 },
		 "aggs": {
			"bestBuckets": {
			   "terms": {
				  "field": "bucket",
				  "size": 10
			   }
			}
		 }
	  }
   }
}

(apanimesh061) #3

Thanks for that, but I am currently using ES 1.7.3. Is there a solution for this version?


(Mark Harwood) #4

In <2.0 you can look at either hits (the top-ranked docs) or aggs (summaries performed on all matching docs). So if you want a summary of only the best docs, your client code would have to trawl through the hits (expensive), or you may be able to resort to a hacky Groovy script [1] that tries to use PriorityQueues etc. to limit stats to the top-matching docs.

In 2.0 the sampler agg makes this trivial.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html


(apanimesh061) #5

I upgraded to ES 2.0.0.

The query always returns the bucket ids in ascending order rather than in the order in which they appeared in the search results.


(Mark Harwood) #6

Try this then:

GET test/doc/_search
{
   "query": {
	  "match": {
		 "text": "one"
	  }
   },
   "size": 0,
   "aggs": {
	  "bestDocs": {
		 "sampler": {
			"field": "bucket",
			"shard_size": 1000
		 },
		 "aggs": {
			"bestBuckets": {
			   "terms": {
				  "field": "bucket",
				  "size": 10,
				  "order":{
					  "bestScore":"desc"
				  }
			   },
			   "aggs":{
				   "bestScore":{
					   "max":{
						   "script":"_score"
					   }
				   }
			   }
			}
		 }
	  }
   }
}
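
To read the ordered bucket ids back out of the response, something like the following sketch may help. It assumes the response has been parsed into a Python dict and that the aggregation names match the request above (`bestDocs`, `bestBuckets`). The helper name `bucket_keys_in_order` and the values in `sample_response` are made up for illustration and are not real output.

```python
def bucket_keys_in_order(response, outer="bestDocs", inner="bestBuckets"):
    """Return the terms-bucket keys in the order the aggregation returned them."""
    buckets = response["aggregations"][outer][inner]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hypothetical response fragment; the scores and counts are invented.
sample_response = {
    "aggregations": {
        "bestDocs": {
            "doc_count": 3,
            "bestBuckets": {
                "buckets": [
                    {"key": 2, "doc_count": 1, "bestScore": {"value": 0.9}},
                    {"key": 1, "doc_count": 2, "bestScore": {"value": 0.4}},
                ]
            },
        }
    }
}

keys = bucket_keys_in_order(sample_response)
print(keys)
```

Because the terms aggregation is ordered by the `bestScore` sub-aggregation, the list comes back with the best-scoring bucket first and no further sorting is needed on the client.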
