What is the best way to store unique values of a field

Hi,

I have a raw data index, every document has the fields: device, country, browser, utm and many more...

I need a way to extract the unique values of each field and store all of them in an index lets call it 'lookups'.

what is the best way to do that ? I think it can be done with transforms or ML, but would like to hear from you :slight_smile:

Thanks

I have been doing this but problem here is you will have new value of them country,browser or utm comes along every day.

for example you have 100 document and 10 uniq value of browser.
tomorrow you will have 100 new document and 15 uniq value of browser and from yesterday's 10 uniq value you might have 7 uniq then yesterday. that means you have to add 7 new vaule in to lookup index.

this means your search to elk index is going to grow daily. if you doing this once a day it might not be a problem depending on count/size of your docuements/index

there is a sql/REST api you can run against your index, test it out

"select device.keyword from myindex group by device.keyword" this will give you uniq device in index

This is exactly my problem.
my question is if there any way that I can do the job automatically, using any service like ML job, transform, rollups... ?

what I have done is wrote python script to run REST query against existing index, collect uniq value from last 7 days of data and put it in separate index you can call lookup.
which is automatic. runs once a day.

but your case might be complete different. you have to think through some logic.

but no, I don't think there is service like that to retrieve uniq value from index.

I suggest to look into transform. You could e.g. pivot to group by each field you are interested in.

Can you provide some data example? Are you looking for 1 document per unique combination of field values? or are you looking for 1 document that contains all possible values (like a terms aggregation)? Both is possible.

Continuous mode provides a way to keep the data up to date.

if I group by the field, I will need to create transform for each field (I think), which means every transform will look for new data every time interval (lets say 1h), suppose I have 5 fields, then its 5 transforms => 5 queries every 1h. Am I right ?

here is a data example:

the raw data:

let's call it client_data index

{
          "utm_campaign" : "none",
          "utm_medium" : "none",
          "utm_source" : "none",
          "fullUrl" : "....",
          "clientId" : "100",
          "user_id" : "JD1GODDi85x1kH1uG32j7e71iEz630",
          "country" : "Israel",
          "wz_session" : "JD1GODi85x1kH1u32j7e71iEz630",
          "date_time" : "2021-02-09T11:52:02",
          "protocol" : "https",
          "domain" : "elastic.co",
          "page" : "....",
          "raw_page" : "....",
          "operating_system" : "Android 11",
          "browser" : "Chrome Mobile",
          "browser_version" : "88.0.4324.152",
          "is_bot" : false,
          "device" : "Phone",
        }

I'm looking for a document per unique combination, because if I keep all the unique values in 1 document I could end up with a document with a field containing thousands of unique values.
so I'm looking for something like this:

let's call it client_lookups index

{
          "clientId" : 100,
          "domain" : "elastic.co"
          "page" : "/this_this_page_path", // or device, country, browser, browser version, utm...
          "count" : 3, // not mandatory, but nice to have it
          "lastHitDate" : "2021-07-07T07:30:00", // not mandatory, but nice to have it
        }

I thought about it again now, and I think there is no way to do that with 1 transform (if I pivot to group by: clientId, doman, page, value). the question is if there is a job that looks for a new data in an index and check if any of the fields [page, country,...] has a new unique value.

Every transform would group by clientID and domain. Now you can decide, I see 3 options:

  1. you can create a transform which in addition to the 2 fields above group by one of [page, device, country, browser, browser_version]. This gives you the mentioned 5 transforms and means you have 5 lookup indexes.

But I think there are 2 better options:

  1. You could group on all 5, that returns every unique combination. So you can differ between e.g. a Chrome desktop user vs a Chrome mobile user. But that means that the number of Chrome users is not available straight from the index. However, aggregations are available on the dest index, so you can query for the number of Chrome users by using an aggregation query on the dest index. This solution seems the most flexible to me and I would choose that if you still plan to use elasticsearch for querying the data.

I would go even further and in addition add a date_histogram to group e.g. monthly. That creates a couple more documents but you get more fine granular data. Aggregations on the dest index can be used to get the overall count. LBNL this helps transform to perform better: Without the date histogram it has to lookup on all (old) indexes, but with the date histogram it only looks for data of the last month/week/day.

  1. You group by clientID and domain only and create 5 terms aggregations in the aggregation part of pivot. This returns a flat list with counts:
"operating_system" : { 
    "Android_11": 24,
    "Android_10": 14,
    ...
}

The downside of this approach: The output of terms is mapped to a flattened field by default, that means you can't aggregate on it. With other words, you can't query/aggregate for the number of Android users across all clients. However, if you create the destination index yourself and map e.g. operating_system.Android_11 to numeric field like long you can workaround that.

This solution seems best if you plan to download the created data to e.g. feed it into another system and strictly want to avoid running aggregations on the destination index (if that's the case the mentioned mapping problem does not apply).

Regarding the output:

"count" : 3, // not mandatory, but nice to have it

You can get this count by using value_count on either clientId or domain (or any other group by field).

"lastHitDate" : "2021-07-07T07:30:00", // not mandatory, but nice to have it

This is a max on date_time.

1 Like