Terms aggregation split by whitespace

I have a number of Elasticsearch documents that contain information about job ads. I'm trying to aggregate on the **title** field to count the "experience" terms in the job postings, e.g. Junior, Senior, Lead, etc. Instead, I'm getting buckets that match the title as a whole rather than each word in the title field, e.g. "Junior Java Developer", "Senior .NET Analyst", etc.

How can I tell Elasticsearch to split the aggregation on each word in the title, as opposed to matching the value of the whole field?

I would later like to expand the query to also extract the "skill level" and "role", but it should also be fine if the buckets contain all the words in the field as long as they are split into separate buckets.

I do not want to enable "fielddata = true" since it consumes a lot of memory and is already established as a bad practice. How else can I implement it?

Current query:

GET /jobs/_search
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "group_by_state": {
            "terms": {
                "field": "title"
            }
        }
    }
}

Unwanted Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior Java Tester",
          "doc_count": 6
        },{
          "key": "Senior Java Lead",
          "doc_count": 6
        },{
          "key": "Intern Java Tester",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

Desired Output:

{
  ...
  "hits": {
    "total": 63,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "group_by_state": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 14,
      "buckets": [{
          "key": "Junior",
          "doc_count": 12
        },{
          "key": "Senior",
          "doc_count": 8
        },{
          "key": "Tester",
          "doc_count": 5
        },{
          "key": "Intern",
          "doc_count": 5
        },{
          "key": "Analyst",
          "doc_count": 5
        },
        ...
      ]
    }
  }
}

There's no other way than using fielddata or splitting the text before it gets indexed, either in your application or with a Painless script processor.
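
For the application-side option, a minimal sketch in Python could look like the following. It only shows the splitting step, before the document is sent to Elasticsearch. The function name `split_title` and the target field name `title_words` are made up for illustration and do not come from this thread.

```python
def split_title(doc, source_field="title", target_field="title_words"):
    """Return a copy of doc with the title split on whitespace.

    target_field ("title_words") is a hypothetical field name chosen
    for this sketch; use whatever fits your mapping.
    """
    enriched = dict(doc)                    # shallow copy, leave the input untouched
    title = doc.get(source_field) or ""     # tolerate a missing or null title
    enriched[target_field] = title.split()  # split on any run of whitespace
    return enriched


job = {"title": "Junior Java Developer"}
print(split_title(job))
```

If the new field is mapped as `keyword`, a terms aggregation on it should produce one bucket per word without enabling fielddata, since keyword fields use doc values.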

Could you help me with an example of how it would look with a Painless script processor, please?

There are some examples of the script processor here: https://www.elastic.co/guide/en/elasticsearch/reference/6.6/script-processor.html

I don't have an example of splitting text with a Painless script, but it should not be hard to write one with something like https://docs.oracle.com/javase/8/docs/api/java/util/StringTokenizer.html#StringTokenizer-java.lang.String-. This class is supported by Painless. See https://www.elastic.co/guide/en/elasticsearch/painless/6.6/painless-api-reference.html
