How to join values from 2 different fields to one and manage it further?

Hello,

I am curious if anybody faced the issue I have

I have 5 documents with the following fields and values in table view:

+++++++++++++++++++++++++++++++++
| src_ch | tgt_ch | extra_field |
+++++++++++++++++++++++++++++++++
|   a    |   b    |      -      |
|   a    |   b    |      -      |
|   b    |   a    |      -      |
|   b    |   a    |      -      |
|   -    |   -    |    blabla   |
+++++++++++++++++++++++++++++++++

My goal is:

  1. Combine values from both "src_ch" and "tgt_ch" fields
  2. Get a list with unique values
  3. Calculate a number of unique values
  4. Be able to display results in a dashboard

So, the expected result is getting these values in the dashboard:

result: [a, b]
count:  2

1.Initially I tried to build it via Visualize Library but failed to do it

2.Then I tried to get it at least via Search API request:

DELETE /my-index-000001/

POST /my-index-000001/_bulk?refresh
{"index":{}}
{"src_ch" : "a", "tgt_ch" : "a", "extra_field": "blabla"}
{"index":{}}
{"src_ch" : "a", "tgt_ch" : "b", "extra_field": "blabla"}
{"index":{}}
{"src_ch" : "b", "tgt_ch" : "a", "extra_field": "blabla"}
{"index":{}}
{"src_ch" : "b", "tgt_ch" : "b", "extra_field": "blabla"}
{"index":{}}
{"extra_field": "blabla"}


GET my-index-000001/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "src_ch.keyword"
          }
        },
        {
          "exists": {
            "field": "tgt_ch.keyword"
          }
        }
      ]
    }
  },
  "_source": [
    "src_ch",
    "tgt_ch"
  ],
  "from" : 0, 
  "size" : 0,
  "aggs": {
    "src_channels": {
      "terms": {
        "field": "src_ch.keyword"
      }
    },
    "tgt_channels": {
      "terms": {
        "field": "tgt_ch.keyword"
      }
    }
  }
}

and I almost got what I was looking for:

{
  <...>
  "aggregations" : {
    "src_channels" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "a",
          "doc_count" : 2
        },
        {
          "key" : "b",
          "doc_count" : 2
        }
      ]
    },
    "tgt_channels" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "a",
          "doc_count" : 2
        },
        {
          "key" : "b",
          "doc_count" : 2
        }
      ]
    }
  }
}

but still no

3.Finally, I took a look at runtime fields but noticed that it requires having all data in one document while my data was distributed among 5 different docs

I changed the initial data structure to the nested one:

DELETE my-index-000001

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "docs": {
        "type": "nested"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "docs": [
      { "src_ch" : "a", "tgt_ch" : "a", "extra_field": "blabla"},
      { "src_ch" : "a", "tgt_ch" : "b", "extra_field": "blabla" },
      { "src_ch" : "b", "tgt_ch" : "a", "extra_field": "blabla"},
      { "src_ch" : "b", "tgt_ch" : "b", "extra_field": "blabla"},
      {"extra_field": "blabla"}
  ]
}

After this, I was able to achieve the required count value by using the following script for the runtime field:

def channels = [];
String element;
if (params['_source']['docs'] != null) {
	if (params['_source']['docs'].size() != 0) { 
		for (def i = 0; i < params['_source']['docs'].length; i++) {  
			element = params['_source']['docs'][i]['src_ch'];
			if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
			element = params['_source']['docs'][i]['dst_ch'];
            if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
		}
	}
}
emit(channels.length)

however, I failed to get a list the same way:

def channels = [];
String element;
if (params['_source']['docs'] != null) {
	if (params['_source']['docs'].size() != 0) { 
		for (def i = 0; i < params['_source']['docs'].length; i++) {  
			element = params['_source']['docs'][i]['src_ch'];
			if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
			element = params['_source']['docs'][i]['dst_ch'];
            if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
		}
	}
}
emit(channels)

due to the following error:

cannot implicitly cast def [java.util.ArrayList] to java.lang.String

I did not find how to explicitly specify in Runtime field settings that I expect that a list is returned but scripted fields helped to workaround the issue here this way:

def channels = [];
String element;
if (params['_source']['docs'] != null) {
	if (params['_source']['docs'].size() != 0) { 
		for (def i = 0; i < params['_source']['docs'].length; i++) {  
			element = params['_source']['docs'][i]['src_ch'];
			if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
			element = params['_source']['docs'][i]['dst_ch'];
            if ((element != null) && (!channels.contains(element))) {
				channels.add(element)
			}
		}
	}
}
return channels

So, briefly, my questions are the next:

  1. Am I correct that we cannot perform any actions with fields located in 2 different documents?
  2. Am I correct that having all data in one document there is no way to join values from two different fields except by creating scripted or/and runtime fields?
  3. Am I correct that runtime and scripted fields will be consuming a lot of memory when a number of docs will be significant? Do we have some examples of how bad everything may be?
  4. Is there the correct way to resolve my initial problem?

Thank you!

You're correct - a runtime field is scoped at the document level. Why not emit the two values to combine them and then run the "Unique Count" (cardinality under the hood) aggregation on this new field?

emit(doc['src_ch'] + ' - ' + doc['tgr_ch']);

2 Likes

Hello @ghudgins,

Thank you for the fast response!

So, briefly:

  1. Your way is correct for SQL case "SELECT COUNT(*) FROM (SELECT DISTINCT pencil_1, pencil_2 FROM data)"
  2. I am looking for the solution for SQL case "SELECT COUNT(*) FROM (SELECT pencil_1 FROM data UNION SELECT pencil_2 FROM data)"

And now, in detail:

You are correct for the case when we compare pairs, however, I am looking for a way to union values from both fields to one common array and calculate a number of unique values in it

Let me show one more example

Let's assume that we have students at university and each of them may have up to 2 colorful pencils:

DELETE /students/

POST /students/_bulk?refresh
{"index":{}}
{"name": "John_1", "pencil_1" : "red", "pencil_2" : "yellow"}
{"index":{}}
{"name": "John_2", "pencil_1" : "green", "pencil_2" : "blue"}
{"index":{}}
{"name": "John_3", "pencil_1" : "red", "pencil_2" : "green"}
{"index":{}}
{"name": "John_4", "pencil_1" : "red", "pencil_2" : "yellow"}
{"index":{}}
{"name": "John_5", "pencil_1" : "", "pencil_2" : "brown"}
{"index":{}}
{"name": "John_6", "pencil_1" : "", "pencil_2" : ""}
{"index":{}}
{"name": "John_7", "pencil_1" : "yellow", "pencil_2" : "red"}
{"index":{}}
{"name": "John_8", "pencil_1" : "blue", "pencil_2" : "green"}
{"index":{}}
{"name": "John_9", "pencil_1" : "green", "pencil_2" : "red"}

Your way answers to the question "How many unique pairs of colors do students have (an order matters meaning that 'red-yellow' pair != 'yellow-red' one)?"

So, after adding a runtime field with this syntax:

emit(doc['pencil_1.keyword'] + ' - ' + doc['pencil_2.keyword'])

we will indeed receive the correct answer "8" to your question (see screenshot.png and screenshot_2.png)

However, I would like to get an answer to the question "How many unique colors do students have together in total?"

In this case, I need to take all values from "pencil_1" field:

"red"
"green"
"red"
"red"
""
""
"yellow"
"blue"
"green"

and all values from "pencil_2" field:

"yellow"
"blue"
"green"
"yellow"
"brown"
""
"red"
"green"
"red"

union them together:

"red"
"green"
"red"
"red"
""
""
"yellow"
"blue"
"green"
"yellow"
"blue"
"green"
"yellow"
"brown"
""
"red"
"green"
"red"

take unique values only:

"red"
"green"
""
"yellow"
"blue"
"brown"

and calculate the number of unique values:

6

So, in terms of SQL it will be something like this:

SELECT COUNT(*) FROM (SELECT pencil_1 FROM data UNION SELECT pencil_2 FROM data)

I did NOT find a way how to calculate this value while my data is distributed among different documents, however, I can do this if I have all data in one document and the data is stored in a nested field:

DELETE /students/

PUT students
{
  "mappings": {
    "properties": {
      "students": {
        "type": "nested"
      }
    }
  }
}

PUT students/_doc/1
{
  "students": [
      {"name": "John_1", "pencil_1" : "red", "pencil_2" : "yellow"},
      {"name": "John_2", "pencil_1" : "green", "pencil_2" : "blue"},
      {"name": "John_3", "pencil_1" : "red", "pencil_2" : "green"},
      {"name": "John_4", "pencil_1" : "red", "pencil_2" : "yellow"},
      {"name": "John_5", "pencil_1" : "", "pencil_2" : "brown"},
      {"name": "John_6", "pencil_1" : "", "pencil_2" : ""},
      {"name": "John_7", "pencil_1" : "yellow", "pencil_2" : "red"},
      {"name": "John_8", "pencil_1" : "blue", "pencil_2" : "green"},
      {"name": "John_9", "pencil_1" : "green", "pencil_2" : "red"}
  ]
}

My runtime field has to have the next syntax:

def pencils = [];
String color;
if (params['_source']['students'] != null) {
	if (params['_source']['students'].size() != 0) { 
		for (def i = 0; i < params['_source']['students'].length; i++) {  
			color = params['_source']['students'][i]['pencil_1'];
			if ((color != null) && (!pencils.contains(color))) {
				pencils.add(color)
			}
			color = params['_source']['students'][i]['pencil_2'];
            if ((color != null) && (!pencils.contains(color))) {
				pencils.add(color)
			}
		}
	}
}
emit(pencils.length)

Finally, in Visualize Library in the metric, I get the correct result "6" (see screenshot_3.png)

So, just to summarize:

  1. Your way is correct for SQL case "SELECT COUNT(*) FROM (SELECT DISTINCT pencil_1, pencil_2 FROM data)"
  2. I am looking for the solution for SQL case "SELECT COUNT(*) FROM (SELECT pencil_1 FROM data UNION SELECT pencil_2 FROM data)"

If I missed something, just let me know

Thanks!



makes sense. not sure I have too much else to propose...we don't yet support this sort of subquerying / pipelining in the user interface editors like lens. you can definitely do a lot more with querying Elasticsearch directly and binding the result to a visualization in the Custom Visualization.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.