How to get Distinct results? (actual records, not just counters)


(Steve) #1

In relational db our data looks like this: Company -> Department -> Office

Elasticsearch version of the same data (flattened):

{
     "officeID": 123,
     "officeName": "office 1",
     "state": "CA",
     "department": {
          "departmentID": 456,
          "departmentName": "Department 1",
          "company": {
                "companyID": 789,
                "companyName": "Company 1",
          }
     } 
},{
     "officeID": 124,
     "officeName": "office 2",
     "state": "CA",
     "department": {
          "departmentID": 456,
          "departmentName": "Department 1",
          "company": {
              "companyID": 789,
              "companyName": "Company 1",
          }
      }}

We need to find department (or company) by providing office information (such as state).

For example, since all I need is a department info, I can specify it like this (we are using Nest)

searchDescriptor = searchDescriptor.Source(x => x.Include("department"));

and get all departments with qualifying offices.

The problem is - I am getting multiple "department" records with the same id (one for each office).

We are using paging and sorting.

Would it be possible to get paged and sorted Distinct results?

I have spent a few days trying to find an answer (exploring options like facets, aggregations, top_hits etc) but so far the only working option I see would be a manual one - get results from Elasticsearch, group data manually and pass to the client.

The problem with this approach is obvious - every time I grab next portion, I'll have to get X extra records just in case some of the records will be duplicate; since I don't know X in advance (and number of such records could be huge) will be forced either to get lots of data unnecessarily (every time I do the search) or to hit our search engine several times until I get required number of records.

So far I was unable to achieve my goal using aggregations (all I am getting is document count, but I want actual data; when I try to use top_hits, I am getting data, but those are really top hits (sorted by number of offices per department, ignoring sorting I have specified in the query); here is an example of the code I tried:

searchDescriptor = searchDescriptor.Aggregations(a => a
.Terms("myunique",
    t =>
        t.Field("department.departmentID")
        .Size(10)
        .Aggregations(
            x=>x.TopHits("mytophits", 
                y=>y.Source(true)
                    .Size(1)
                    .Sort(k => k.OnField("department.departmentName").Ascending())
                    )
                )
            )
);

Does anyone know if Elasticsearch can perform operations like Distinct and get unique records?


(system) #2