Finding distinct values from a field

Suhas_K · May 23, 2018, 9:33am

Hi,

I have made a filter in Logstash which divides log lines into different fields.

    {
      "_index": "XX_DATE",
      "_type": "doc",
      "_id": "vBhbcghstjdJgpojNbs",
      "_version": 1,
      "_score": null,
      "_source": {
        "m": "Data: ###InData###: |SId[12]|CId[6652888]|X[BC]-->SId[12]|TC=9565|RId=XXXX|CStatus=P|NoOfRec=1|TRec=1|Er=0|x=BC|CustomerID=xxxxxx|DID=yyyyyy|S=Fat|xCode=SSS|",
        "s": "XXXXXXXX",
        "@version": "1",
        "@timestamp": "2018-05-22T01:52:06.671Z"
       }
     }

In the above JSON, the CustomerID in _source.m is unique for every customer. However, since the CustomerID is not a separate field, it is not possible to get the number of distinct customers using the unique count metric in Visualize.

Is there any way to get the number of unique customers without changing the Logstash filters? (as we have already processed a huge amount of data using the same Logstash grok filters)

Thanks in advance.

Stacey_Gammon · May 23, 2018, 1:36pm

You can try parsing the string with a painless scripted field to extract out the customer id and aggregate on that.

https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-examples.html

Suhas_K · May 24, 2018, 12:35pm

Thanks for the reply, @Stacey_Gammon

I used the following query to parse the string, extract the CustomerID and insert it back as a new field into the document.

POST xxxx-DATE/doc/_update_by_query
{
  "query": {
    "match_phrase": {
      "m": "CustomerID"
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.CustomerID = /.*CustomerID=(\\w{0,50})\\|.*/.matcher(ctx._source.m).replaceAll('$1')"
  }
}

Using this approach, I am now able to get the unique number of customers. However, in order to get this result, I had to make the following change in elasticsearch.yml in all my Elasticsearch nodes. This also involved restarting all the elastic nodes. As the restart didn't affect my local test environment, is it recommended to add the below change and restart the nodes in production? Also, will it affect the existing Elasticsearch data in any way?

script.painless.regex.enabled: true

My logs are indexed based on the date, with a new index being created each day. Hence, I would have to execute this query manually everyday in order to achieve the same result.

Is there a better way to approach the issue?

I have also faced a problem when I tried executing the POST request without the "query" field, where it went through all the documents (~1,300,000). This resulted in a timeout error 503. Whereas, when I added the query field, it only had to parse through and update around 200 documents. My concern is that there might be a possibility of a timeout even with the query field due to a large number of documents in production. Is there any way to overcome this issue?

Thanks

system · June 21, 2018, 12:45pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.