Due to restrictions in edit, adding a separate topic for providing additional details. Sorry for inconvenience.
Original post link:
We have used the _type field to store the Hive Table Names for the example index movies.
movies/movie_intrnl
movies/movie_shows
Data used for Hive Tables stored in Elasticsearch Index movies:
movie_intrnl:
Title1,1,Action1,2003,Action
Title2,2,dire2,2007,Crime
Title3,3,dire3,2004,CrimeThriller
Title4,4,Drama1,2003,Drama
Title5,5,Action2,2005,Action
Title6,6,Drama4,2007,BiographyDrama
movie_shows:
Action1,Title1,1,2017-04-03 00:00:00,Action
Theatre2,Title2,2,2016-05-07 00:00:00,Crime
Theatre3,Title2,3,2015-06-04 00:00:00,CrimeThriller
Drama4,Title4,4,2014-08-03 00:00:00,Drama
Action5,Title1,5,2019-09-05 00:00:00,Action
Theatre6,Title6,6,2017-10-07 00:00:00,BiographyDrama
Elasticsearch Query to get the distinct table names (_type field in Elasticsearch):
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"aggs": {
"distinct_tables": {
"terms": {
"field": "_type"
}
}
}
}
We got the response given below for getting the distinct tables using the following tags:
aggregations -> buckets -> key
movie_intrnl
movie_shows
Response for the ES Query to get distinct table names:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0,
"hits": []
},
"aggregations": {
"distinct_tables": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "movie_intrnl",
"doc_count": 2
},
{
"key": "movie_shows",
"doc_count": 2
}
]
}
}
}
We are using the highlight feature to identify matching field names for the search query.
ES Query to get matching table names and column names for a search pattern:
GET /movies/_search?pretty
{
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match": false,
"fragment_size": 2147483647
}
}
We got the response below for getting the matching table names and column names.
Response - 1st matching document:
"_type": "movie_shows",
"highlight": {
"theatre": [
"Drama4"
],
"genres": [
"Drama"
]
Response - 2nd matching document:
"_type": "movie_shows",
"highlight": {
"genres": [
"BiographyDrama"
]
}
Response - 3rd matching document
"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama1"
],
"genres": [
"Drama"
]
Response - 4th matching document
"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama4"
],
"genres": [
"BiographyDrama"
]
But this approach does not give the distinct table names and column names across all the matched documents.
Expected response as per the requirement is given below.
"_type": "movie_shows"
"theatre"
"genres"
"_type": "movie_intrnl"
"director"
"genres"
Response for search query to get matching column names and table names:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1,
"hits": [
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoEcEMrxAEXKBamIeYk",
"_score": 1,
"highlight": {
"theatre": [
"<em>Drama4</em>"
],
"genres": [
"<em>Drama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_shows",
"_id": "AVoEcEMrxAEXKBamIeYm",
"_score": 1,
"highlight": {
"genres": [
"<em>BiographyDrama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYe",
"_score": 1,
"highlight": {
"director": [
"<em>Drama1</em>"
],
"genres": [
"<em>Drama</em>"
]
}
},
{
"_index": "movies",
"_type": "movie_intrnl",
"_id": "AVoEbkPFxAEXKBamIeYg",
"_score": 1,
"highlight": {
"director": [
"<em>Drama4</em>"
],
"genres": [
"<em>BiographyDrama</em>"
]
}
}
]
}
}
In the query given below, we have tried aggregation on _type column to get distinct table names and with sub-aggregation by using a sample static field “genres”. However, since columns from the search result is dynamic, we are looking for a mechanism to use sub-aggregation on top of highlight field results to get the distinct column names within each identified distinct table name.
ES Query tried to get distinct table names and column names:
GET /movies/_search?pretty
{
"size": 0,
"_source": false,
"query": {
"query_string": {
"analyze_wildcard": true,
"query": "*Drama*"
}
},
"aggs": {
"distinct_tables": {
"terms": {
"field": "_type"
},
"aggs" : {
"unique_set_2": {
"terms": {
"field": "genres.keyword"
}
}
}
}
}
}