Adding more details - ElasticSearch query to identify relationships between Hive table fields using search for keywords


#1

Due to restrictions in edit, adding a separate topic for providing additional details. Sorry for inconvenience.

Original post link:

We have used the _type field to store the Hive Table Names for the example index movies.

movies/movie_intrnl
movies/movie_shows

Data used for Hive Tables stored in ElasticSearch Index movies:

movie_intrnl:

Title1,1,Action1,2003,Action
Title2,2,dire2,2007,Crime
Title3,3,dire3,2004,CrimeThriller
Title4,4,Drama1,2003,Drama
Title5,5,Action2,2005,Action
Title6,6,Drama4,2007,BiographyDrama

movie_shows:

Action1,Title1,1,2017-04-03 00:00:00,Action
Theatre2,Title2,2,2016-05-07 00:00:00,Crime
Theatre3,Title2,3,2015-06-04 00:00:00,CrimeThriller
Drama4,Title4,4,2014-08-03 00:00:00,Drama
Action5,Title1,5,2019-09-05 00:00:00,Action
Theatre6,Title6,6,2017-10-07 00:00:00,BiographyDrama

ElasticSearch Query to get the distinct table names (_type field in ElasticSearch):

GET /movies/_search?pretty
{
  "size": 0,
  "_source": false,
  "query": {
    "query_string": {
      "analyze_wildcard": true,
      "query": "*Drama*"
    }
  },
  "aggs": {
    "distinct_tables": {
      "terms": { 
        "field": "_type"
      }
    }
  }
}

We got the response given below for getting the distinct tables using the following tags:

aggregations -> buckets -> key

movie_intrnl
movie_shows

Response for the ES Query to get distinct table names:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "distinct_tables": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "movie_intrnl",
          "doc_count": 2
        },
        {
          "key": "movie_shows",
          "doc_count": 2
        }
      ]
    }
  }
}

We are using the highlight feature to identify matching field names for the search query.

ES Query to get matching table names and column names for a search pattern:

GET /movies/_search?pretty
{
  "_source": false,  
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "*Drama*"
      }
  },
  "highlight": {
        "fields": {
              "*": {}
      },
      "require_field_match": false,
      "fragment_size": 2147483647
  }
}

We got the response below for getting the matching table names and column names.

Response - 1st matching document:

"_type": "movie_shows",
"highlight": {
"theatre": [
"Drama4"
],
"genres": [
"Drama"
]

Response - 2nd matching document:

"_type": "movie_shows",
"highlight": {
"genres": [
"BiographyDrama"
]
}

Response - 3rd matching document

"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama1"
],
"genres": [
"Drama"
]

Response - 4th matching document

"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama4"
],
"genres": [
"BiographyDrama"
]

But this approach does not give the distinct table names and column names across all the matched documents.

Expected response as per the requirement is given below.

"_type": "movie_shows"

"theatre"
"genres"

"_type": "movie_intrnl"

"director"
"genres"

Response for search query to get matching column names and table names:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "movies",
        "_type": "movie_shows",
        "_id": "AVoEcEMrxAEXKBamIeYk",
        "_score": 1,
        "highlight": {
          "theatre": [
            "<em>Drama4</em>"
          ],
          "genres": [
            "<em>Drama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_shows",
        "_id": "AVoEcEMrxAEXKBamIeYm",
        "_score": 1,
        "highlight": {
          "genres": [
            "<em>BiographyDrama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_intrnl",
        "_id": "AVoEbkPFxAEXKBamIeYe",
        "_score": 1,
        "highlight": {
          "director": [
            "<em>Drama1</em>"
          ],
          "genres": [
            "<em>Drama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_intrnl",
        "_id": "AVoEbkPFxAEXKBamIeYg",
        "_score": 1,
        "highlight": {
          "director": [
            "<em>Drama4</em>"
          ],
          "genres": [
            "<em>BiographyDrama</em>"
          ]
        }
      }
    ]
  }
}

In the query given below, we have tried aggregation on _type column to get distinct table names and with sub-aggregation by using a sample static field “genres”. However, since columns from the search result is dynamic, we are looking for a mechanism to use sub-aggregation on top of highlight field results to get the distinct column names within each identified distinct table name.

ES Query tried to get distinct table names and column names:

GET /movies/_search?pretty
{
  "size": 0,
  "_source": false,
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "*Drama*"
      }
  },
  "aggs": {
    "distinct_tables": {
      "terms": { 
        "field": "_type"
      },
       "aggs" : { 
        "unique_set_2": {
        "terms": { 
        "field": "genres.keyword"
          }
        }
      }
    }
  }
}

ElasticSearch query to identify relationships between Hive table fields using search for keywords
(James Baiera) #2

Please post this as a reply on the original thread.


(James Baiera) #3