ElasticSearch query to identify relationships between Hive table fields using search for keywords


#1

We are new to ElasticSearch and Kibana. We are using ElasticSearch 5.0.0 to identify relationships between Hive table fields in a database by searching across all columns for specific keywords.

We are open to use queries or ElasticSearch APIs or any other solutions to meet the requirement.

We have uploaded details into a single index in ElasticSearch by installing elasticsearch-hadoop-5.1.1 library and creating external Hive tables in ElasticSearch. We are using _type column for storing Hive table names.

Please find the initial query used along with highlight feature to identify matching Hive table and field names with values.

GET /index_1/_search?pretty
{
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "Search_string"
      }
  },
  "highlight": {
        "fields": {
              "*": {}
      },
      "require_field_match": false,
      "fragment_size": 2147483647
  }
}

The issue with this approach is that only individual documents are returned and distinct table names and field names are not returned.

We tried using the following terms aggregation query to get the distinct table names (_type) along with some static field, say field_1.

GET /index_1/_search?pretty
{
  "size": 0,
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "index_1"
      }
  },
  "aggs": {
    "distinct_tables": {
      "terms": { 
        "field": "_type"
      },
       "aggs" : { 
        "unique_set_2": {
        "terms": { 
        "field": "field_1"
          }
        }
      }
    }
  }
}

Then we thought of extending it by combining the two queries to get a dynamic response of matched fields from highlight and do a distinct on those to get the distinct table names and distinct column names for each table name. But we are not able to achieve this through queries and other means.

Request you to help us address the requirement. Please suggest if there are simpler or alternative approaches as well. Thanks.

Please find the ElasticSearch query link below for additional details. Added a separate topic due to body size restrictions. Sorry for the inconvenience.


#2

Adding more details to the original query.

We have used the _type field to store the Hive Table Names for the example index movies.

movies/movie_intrnl
movies/movie_shows

Data used for Hive Tables stored in ElasticSearch Index movies:

movie_intrnl:
Title1,1,Action1,2003,Action
Title2,2,dire2,2007,Crime
Title3,3,dire3,2004,CrimeThriller
Title4,4,Drama1,2003,Drama
Title5,5,Action2,2005,Action
Title6,6,Drama4,2007,BiographyDrama

movie_shows:
Action1,Title1,1,2017-04-03 00:00:00,Action
Theatre2,Title2,2,2016-05-07 00:00:00,Crime
Theatre3,Title2,3,2015-06-04 00:00:00,CrimeThriller
Drama4,Title4,4,2014-08-03 00:00:00,Drama
Action5,Title1,5,2019-09-05 00:00:00,Action
Theatre6,Title6,6,2017-10-07 00:00:00,BiographyDrama

ElasticSearch Query to get the distinct table names (_type field in ElasticSearch):

GET /movies/_search?pretty
{
  "size": 0,
  "_source": false,
  "query": {
    "query_string": {
      "analyze_wildcard": true,
      "query": "*Drama*"
    }
  },
  "aggs": {
    "distinct_tables": {
      "terms": { 
        "field": "_type"
      }
    }
  }
}

We got the response given below for getting the distinct tables using the following tags:

aggregations -> buckets -> key

movie_intrnl
movie_shows

Response for the ES Query to get distinct table names:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "distinct_tables": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "movie_intrnl",
          "doc_count": 2
        },
        {
          "key": "movie_shows",
          "doc_count": 2
        }
      ]
    }
  }
}

We are using the highlight feature to identify matching field names for the search query.

ES Query to get matching table names and column names for a search pattern:

GET /movies/_search?pretty
{
  "_source": false,  
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "*Drama*"
      }
  },
  "highlight": {
        "fields": {
              "*": {}
      },
      "require_field_match": false,
      "fragment_size": 2147483647
  }
}

We got the response below for getting the matching table names and column names.

Response - 1st matching document:

"_type": "movie_shows",
"highlight": {
"theatre": [
"Drama4"
],
"genres": [
"Drama"
]

Response - 2nd matching document:

"_type": "movie_shows",
"highlight": {
"genres": [
"BiographyDrama"
]
}

Response - 3rd matching document

"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama1"
],
"genres": [
"Drama"
]

Response - 4th matching document

"_type": "movie_intrnl",
"highlight": {
"director": [
"Drama4"
],
"genres": [
"BiographyDrama"
]

But this approach does not give the distinct table names and column names across all the matched documents.

Expected response as per the requirement is given below.

"_type": "movie_shows"

"theatre"
"genres"

"_type": "movie_intrnl"

"director"
"genres"

Response for search query to get matching column names and table names:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "movies",
        "_type": "movie_shows",
        "_id": "AVoEcEMrxAEXKBamIeYk",
        "_score": 1,
        "highlight": {
          "theatre": [
            "<em>Drama4</em>"
          ],
          "genres": [
            "<em>Drama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_shows",
        "_id": "AVoEcEMrxAEXKBamIeYm",
        "_score": 1,
        "highlight": {
          "genres": [
            "<em>BiographyDrama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_intrnl",
        "_id": "AVoEbkPFxAEXKBamIeYe",
        "_score": 1,
        "highlight": {
          "director": [
            "<em>Drama1</em>"
          ],
          "genres": [
            "<em>Drama</em>"
          ]
        }
      },
      {
        "_index": "movies",
        "_type": "movie_intrnl",
        "_id": "AVoEbkPFxAEXKBamIeYg",
        "_score": 1,
        "highlight": {
          "director": [
            "<em>Drama4</em>"
          ],
          "genres": [
            "<em>BiographyDrama</em>"
          ]
        }
      }
    ]
  }
}

In the query given below, we have tried aggregation on _type column to get distinct table names and with sub-aggregation by using a sample static field “genres”. However, since columns from the search result is dynamic, we are looking for a mechanism to use sub-aggregation on top of highlight field results to get the distinct column names within each identified distinct table name.

ES Query tried to get distinct table names and column names:

GET /movies/_search?pretty
{
  "size": 0,
  "_source": false,
  "query": {
    "query_string": {
        "analyze_wildcard": true,
        "query": "*Drama*"
      }
  },
  "aggs": {
    "distinct_tables": {
      "terms": { 
        "field": "_type"
      },
       "aggs" : { 
        "unique_set_2": {
        "terms": { 
        "field": "genres.keyword"
          }
        }
      }
    }
  }
}

#3

Adding the response to earlier query in reply due to size restrictions in body field.

Response to ES Query tried to get distinct table names and column names:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "distinct_tables": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "movie_intrnl",
          "doc_count": 2,
          "unique_set_2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "BiographyDrama",
                "doc_count": 1
              },
              {
                "key": "Drama",
                "doc_count": 1
              }
            ]
          }
        },
        {
          "key": "movie_shows",
          "doc_count": 2,
          "unique_set_2": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "BiographyDrama",
                "doc_count": 1
              },
              {
                "key": "Drama",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.