Terms aggregation split by whitespace

EliuFlorez · March 8, 2019, 3:19pm

I have a bunch of Elasticsearch documents that contain information about job ads. I'm trying to aggregate on the **title** field to count the "experience" terms in the job postings, e.g. Junior, Senior, Lead, etc. Instead, I'm getting buckets that match the title as a whole rather than each word in the title field, e.g. "Junior Java Developer", "Senior .NET Analyst", etc.

How can I tell Elasticsearch to split the aggregation on each word in the title, as opposed to matching the value of the whole field?

I would later like to expand the query to also extract the "skill level" and "role", but it should also be fine if the buckets contain all the words in the field as long as they are split into separate buckets.

I do not want to enable "fielddata = true" since it consumes a lot of memory and is already established as a bad practice. How else can I implement it?

Current query:

GET /jobs/_search
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "group_by_state": {
            "terms": {
                "field": "title"
            }
        }
    }
}

Unwanted Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior Java Tester",
          "doc_count": 6
        },{
          "key": "Senior Java Lead",
          "doc_count": 6
        },{
          "key": "Intern Java Tester",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

Desired Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior",
          "doc_count": 12
        },{
          "key": "Senior",
          "doc_count": 8
        },{
          "key": "Tester",
          "doc_count": 5
        },{
          "key": "Intern",
          "doc_count": 5
        },{
          "key": "Analyst",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

dadoonet · March 9, 2019, 6:17pm

There's no other way than using field data or splitting the text before it gets indexed, either in your application or with a Painless script processor.
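
As a rough sketch of the application-side option: assume the application writes a hypothetical `title_words` array next to `title`, and that dynamic mapping gives that array a `.keyword` sub-field (the default for string values). The `_doc` type name and the document ID are also assumptions. The indexed document and the aggregation could then look something like this:

```
PUT /jobs/_doc/1
{
  "title": "Junior Java Tester",
  "title_words": ["Junior", "Java", "Tester"]
}

GET /jobs/_search
{
  "size": 0,
  "aggs": {
    "title_words": {
      "terms": {
        "field": "title_words.keyword"
      }
    }
  }
}
```

Because `title_words.keyword` is a keyword field, the terms aggregation reads doc values and does not need fielddata. Each array element ends up in its own bucket.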

EliuFlorez · March 10, 2019, 2:05pm

Could you help me with an example of how it would be done with a Painless script processor, please?

dadoonet · March 10, 2019, 2:41pm

You have some examples of script processor here: https://www.elastic.co/guide/en/elasticsearch/reference/6.6/script-processor.html

I don't have an example of splitting text with a Painless script, but it should not be hard to write one with something like https://docs.oracle.com/javase/8/docs/api/java/util/StringTokenizer.html#StringTokenizer-java.lang.String-. This class is supported by Painless. See https://www.elastic.co/guide/en/elasticsearch/painless/6.6/painless-api-reference.html
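
A minimal, untested sketch of what such a pipeline might look like. The pipeline name `split_title` and the target field `title_words` are made up for illustration, and the script assumes every document has a non-null `title`:

```
PUT _ingest/pipeline/split_title
{
  "description": "Split title into words on whitespace (illustrative)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "List words = new ArrayList(); StringTokenizer st = new StringTokenizer(ctx.title); while (st.hasMoreTokens()) { words.add(st.nextToken()); } ctx.title_words = words;"
      }
    }
  ]
}

POST _ingest/pipeline/split_title/_simulate
{
  "docs": [
    { "_source": { "title": "Junior Java Tester" } }
  ]
}
```

The `_simulate` call only previews the transformed document. To apply the pipeline, pass `?pipeline=split_title` on the index request, then run the terms aggregation on the new field instead of `title`.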

system · April 7, 2019, 2:41pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
