Finding unique values of a field from the returned search results

When I send a search request, I get a document of this format:

        {
            "_index": "texts_cluster",
            "_type": "documents",
            "_id": "5732",
            "_score": 1.8980495,
            "_source": {
               "bucket_id": "12345",
               "date": "2009-06-03",
               "text": "some large text"
            }
        }

I send a request that returns about 1000 documents. I want to collect the unique bucket_ids of all the returned documents and then send search queries based on the collected bucket_ids, like:

POST /texts_cluster/_search
{
  "query": {
    "match": {
      "bucket_id": "12345"
    }
  },
  "sort": [
    {
      "date": {
        "order": "asc"
      }
    }
  ]
}
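
For illustration, the follow-up request body above could be generated once per collected bucket_id. This is only a minimal Python sketch: `build_bucket_query` is a hypothetical helper name, "67890" is a made-up second id, and actually sending the bodies to the cluster is left out.

```python
def build_bucket_query(bucket_id):
    """Return a search body matching one bucket_id, sorted by date ascending."""
    return {
        "query": {"match": {"bucket_id": bucket_id}},
        "sort": [{"date": {"order": "asc"}}],
    }


# "67890" is an invented id, used only to show more than one body being built.
bodies = [build_bucket_query(b) for b in ["12345", "67890"]]
```

Each element of `bodies` has the same shape as the request shown above and could be posted to /texts_cluster/_search.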

I thought of using aggregations, but they don't return the results in relevance order, and I only need the bucket_ids that occur in the 1000 documents returned by the search query.

One way I can think of is to iterate through the 1000 documents and append each bucket_id to a list, but only if it is not already there, so that the sequence is maintained. Is there an easier method?

Thanks!

In the 2.0 release (RC1 is available now) we introduce the sampler aggregation that allows you to focus aggs on only the top-matching docs. Example below:

DELETE test
POST test/doc
{
	"bucket":1,
	"text":"we are one"
}
POST test/doc
{
	"bucket":1,
	"text":"one love"
}
POST test/doc
{
	"bucket":1,
	"text":"one time"
}
POST test/doc
{
	"bucket":2,
	"text":"two is company"
}
POST test/doc
{
	"bucket":2,
	"text":"takes two to tango"
}
GET test/doc/_search
{
   "query": {
	  "match": {
		 "text": "one"
	  }
   },
   "size": 0,
   "aggs": {
	  "bestDocs": {
		 "sampler": {
			"field": "bucket",
			"shard_size": 1000
		 },
		 "aggs": {
			"bestBuckets": {
			   "terms": {
				  "field": "bucket",
				  "size": 10
			   }
			}
		 }
	  }
   }
}

Thanks for that, but I am currently using ES 1.7.3. Is there a solution for this version?

In <2.0 you can look at either hits (the top-ranked docs) or aggs (summaries computed over all matching docs). If you want a summary of only the best docs, your client code would have to trawl through the hits (expensive), or you may be able to resort to a hacky Groovy script [1] that uses PriorityQueues etc. to limit stats to the top-matching docs.
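
The client-side option could look roughly like the sketch below. It assumes the hits have already been fetched (the list found under hits.hits in the search response) and that each hit carries `_source.bucket_id` as in the document format from the question. `unique_bucket_ids` is a hypothetical helper name and the sample hits are made up.

```python
def unique_bucket_ids(hits):
    """Return bucket_ids in first-seen (relevance) order, without duplicates."""
    seen = set()      # fast membership test
    ordered = []      # preserves the order in which ids first appear
    for hit in hits:
        bucket_id = hit["_source"]["bucket_id"]
        if bucket_id not in seen:
            seen.add(bucket_id)
            ordered.append(bucket_id)
    return ordered


# Made-up hits, already sorted by descending _score as a search would return them.
sample_hits = [
    {"_id": "1", "_score": 1.9, "_source": {"bucket_id": "12345"}},
    {"_id": "2", "_score": 1.5, "_source": {"bucket_id": "67890"}},
    {"_id": "3", "_score": 1.2, "_source": {"bucket_id": "12345"}},
]
ids_in_order = unique_bucket_ids(sample_hits)  # ["12345", "67890"]
```

The set keeps each membership check cheap, so the cost is dominated by transferring the 1000 hits rather than by the de-duplication itself.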

In 2.0 the sampler agg makes this trivial.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-scripted-metric-aggregation.html

I upgraded to ES 2.0.0.

The query always returns the bucket ids in ascending order rather than in the order in which they appeared in the search results.

Try this then. It orders the terms by the best _score found among the sampled docs of each bucket:

GET test/doc/_search
{
   "query": {
	  "match": {
		 "text": "one"
	  }
   },
   "size": 0,
   "aggs": {
	  "bestDocs": {
		 "sampler": {
			"field": "bucket",
			"shard_size": 1000
		 },
		 "aggs": {
			"bestBuckets": {
			   "terms": {
				  "field": "bucket",
				  "size": 10,
				  "order":{
					  "bestScore":"desc"
				  }
			  
			   },
			   "aggs":{
				   "bestScore":{
					   "max":{
						   "script":"_score"
					   }
				   }
			   }
			}
		 }
	  }
   }
}
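
On the client, the ordered bucket values can then be read out of the aggregation part of the response. The Python sketch below assumes the usual nesting of aggregation results by the names used in the request (`bestDocs`, then `bestBuckets`). `bucket_keys` is a hypothetical helper name, and the sample response is hand-written for illustration, not real output.

```python
def bucket_keys(response):
    """Return the terms bucket keys in the order the aggregation returned them."""
    buckets = response["aggregations"]["bestDocs"]["bestBuckets"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written stand-in for a search response; the scores are invented.
sample_response = {
    "aggregations": {
        "bestDocs": {
            "doc_count": 3,
            "bestBuckets": {
                "buckets": [
                    {"key": 1, "doc_count": 3, "bestScore": {"value": 0.7}},
                ]
            },
        }
    }
}
keys = bucket_keys(sample_response)  # [1]
```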