Hive overwhelming Elasticsearch

I'm new to ES, so bear with me :slight_smile:

I have a Hadoop cluster running Hive, with a table of about 441 million rows, 10 columns wide (string and bigint).

I set up an external table in Hive, created an index in ES (single node), and inserted the data.

It worked fine, but took a while to insert. After that, I added sharding, broke the index into 6 shards, and set the JVM heap to 10 GB. Again, it worked great! This time it only took about an hour to insert the data.
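
For the 6-shard run, I set the index up with explicit shard settings before loading, something along these lines with the Python client (the index name here is just a placeholder):

# rough sketch of the index setup -- 6 primary shards, no replicas
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="hive_data",  # placeholder name
    body={
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 0
        }
    }
)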

To scale out, I then created a cluster (I need to span data across multiple nodes because I have a lot of other data I'd like to add to ES). First I tried 3 nodes and couldn't get the data to insert, then 5 nodes (3 dedicated master and 2 dedicated data), and it still wouldn't work.

I dropped the shard count down to 1, and I can insert fine, but it takes half the day!

It seems like once I start the insert process, Hive just absolutely SOAKS the ES cluster and starts knocking nodes out of service. After enough failures, the job fails.

Any pointers on getting this data to load into a sharded index on the cluster?

Here's the error I get:

org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.4.64:9200, 192.168.4.63:9200]]

Any suggestions would be greatly appreciated!
Thanks!

Welcome to our community! :smiley:

What do your Elasticsearch nodes show?
What version are you on?
What is the output from the _cluster/stats?pretty&human API?

Thanks!

When it starts going sideways, the nodes show as dropping off.
I should have 5 nodes total: 2 data and 3 master.

      "nodes" : {
        "count" : {
          "total" : 4,
          "coordinating_only" : 0,
          "data" : 1,
          "data_cold" : 1,
          "data_content" : 1,
          "data_frozen" : 1,
          "data_hot" : 1,
          "data_warm" : 1,
          "ingest" : 4,
          "master" : 3,
          "ml" : 4,
          "remote_cluster_client" : 4,
          "transform" : 1,
          "voting_only" : 0
        },

I'm running 7.12

The full output please.

What do your Elasticsearch logs show?
What does Monitoring show?

We're looking for evidence of GC or high CPU/memory/disk use, something in the logs to show disconnects and why, etc.

Here's the full output:

{
    "_nodes": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "cluster_name": "my-cluster",
    "cluster_uuid": "pDNXQqzAQMa5lUFLU8iQHA",
    "timestamp": 1618265129485,
    "status": "red",
    "indices": {
        "count": 1,
        "shards": {
            "total": 3,
            "primaries": 3,
            "replication": 0.0,
            "index": {
                "shards": {
                    "min": 3,
                    "max": 3,
                    "avg": 3.0
                },
                "primaries": {
                    "min": 3,
                    "max": 3,
                    "avg": 3.0
                },
                "replication": {
                    "min": 0.0,
                    "max": 0.0,
                    "avg": 0.0
                }
            }
        },
        "docs": {
            "count": 0,
            "deleted": 0
        },
        "store": {
            "size_in_bytes": 8163250,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 0,
            "memory_in_bytes": 0,
            "terms_memory_in_bytes": 0,
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 0,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory_in_bytes": 71870340,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "keyword",
                    "count": 5,
                    "index_count": 1
                },
                {
                    "name": "long",
                    "count": 4,
                    "index_count": 1
                },
                {
                    "name": "text",
                    "count": 5,
                    "index_count": 1
                }
            ]
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        },
        "versions": [
            {
                "version": "7.12.0",
                "index_count": 1,
                "primary_shard_count": 6,
                "total_primary_bytes": 8163250
            }
        ]
    },
    "nodes": {
        "count": {
            "total": 4,
            "coordinating_only": 0,
            "data": 1,
            "data_cold": 1,
            "data_content": 1,
            "data_frozen": 1,
            "data_hot": 1,
            "data_warm": 1,
            "ingest": 4,
            "master": 3,
            "ml": 4,
            "remote_cluster_client": 4,
            "transform": 1,
            "voting_only": 0
        },
        "versions": [
            "7.12.0"
        ],
        "os": {
            "available_processors": 16,
            "allocated_processors": 16,
            "names": [
                {
                    "name": "Linux",
                    "count": 4
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 18.04.5 LTS",
                    "count": 4
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 4
                }
            ],
            "mem": {
                "total_in_bytes": 67276931072,
                "free_in_bytes": 40482111488,
                "used_in_bytes": 26794819584,
                "free_percent": 60,
                "used_percent": 40
            }
        },
        "process": {
            "cpu": {
                "percent": 2
            },
            "open_file_descriptors": {
                "min": 371,
                "max": 627,
                "avg": 435
            }
        },
        "jvm": {
            "max_uptime_in_millis": 18633690,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 4
                }
            ],
            "mem": {
                "heap_used_in_bytes": 3623540848,
                "heap_max_in_bytes": 8589934592
            },
            "threads": 151
        },
        "fs": {
            "total_in_bytes": 836973527040,
            "free_in_bytes": 770803392512,
            "available_in_bytes": 728002207744
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 4
            },
            "http_types": {
                "security4": 4
            }
        },
        "discovery_types": {
            "zen": 4
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "deb",
                "count": 4
            }
        ],
        "ingest": {
            "number_of_pipelines": 2,
            "processor_stats": {
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                }
            }
        }
    }
}

And again, I'm new at this; I have no monitoring yet, just Elasticsearch. I'm trying to set up a proof of concept, so I haven't worked out monitoring yet (any recommendations for something open source, or a tutorial on how to set it up, would be great!).

As for logs, where should I be looking? On the node that dropped off, or on the master?
Should I look at my-cluster.log or something else?

If you are running 7.12 with Kibana, then you have Monitoring included for free - Stack Monitoring | Kibana Guide [8.11] | Elastic

Both of those, yes.

All I see on the data node is timeouts (in my-cluster.log), and about the same on the master node. Is there a particular thing I should search for in the log?

(A specific phrase?)

I'm going to try adding Kibana and see if that shines a light on anything. I'm not sure it's a GC issue so much as my Hadoop cluster DDoSing the ES cluster. If I do a continuous ping when I start my Hive query, after a couple of minutes one of the data nodes goes from 30 ms to 3000 ms (I'm pinging over a VPN, which is why it's 30 ms at rest).

The Hadoop cluster and the ES cluster are both on the same network.

I'll report back once I get Kibana going. If there's anything else I can dig up, let me know.

It's useful if you can post your logs; you may miss something we would otherwise catch.

Any particular way to post a log file without blowing up the page?
I don't see an attach option.

Use gist/pastebin/etc. if they are too large to post.

Sorry for the delay; I got Kibana and Filebeat set up. It seems like Filebeat isn't pulling the ES log entries, though. But here's what I get when the node drops off:

Here are the log entries from the elected master and from the node that dropped off:
master-gc.log
master-my-cluster.log
datanode-gc.log
datanode-my-cluster.log

Hopefully this helps. Let me know if I should adjust anything for Kibana/Filebeat, or anything else that would help.

Thanks!

Your cluster is not healthy by the looks of things. There are tonnes of "master not discovered or elected yet" messages in your logs, which would suggest you have connection and/or configuration issues.

What do your elasticsearch.yml files look like?

I think those were from before I bootstrapped the cluster. Is there a way I can roll the logs and re-run the insert?

Here are the elasticsearch.yml files:


::MASTERNODE::
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
cluster.name: my-cluster
node.name: "es-m01"
node.data: false
node.master: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-m01","es-m02","es-m03","es-d01","es-d02"]
discovery.zen.minimum_master_nodes: 2
xpack.monitoring.collection.enabled: true
cluster.initial_master_nodes: ["es-m01"]

::DATANODES::
cluster.name: my-cluster
node.name: "es-d02"
node.data: true
node.master: false
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-m01","es-m02","es-m03","es-d01","es-d02"]
discovery.zen.minimum_master_nodes: 2
xpack.monitoring.collection.enabled: true

The only difference on the other nodes is that the other master nodes do not have the cluster.initial_master_nodes: ["es-m01"] setting, and the node names differ.

What's the output from the _cat/nodes?v API? Just want to check that your cluster is ok :slight_smile:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role  master name
192.168.4.63           75          33   0    0.20    0.05     0.02 cdfhilrstw -      es-d01
192.168.4.60           56          39   1    0.00    0.05     0.01 ilmr       -      es-m01
192.168.4.64           36          33   0    0.02    0.03     0.00 cdfhilrstw -      es-d02
192.168.4.62           61          28   0    0.00    0.00     0.00 ilmr       -      es-m03
192.168.4.61           20          29   0    0.00    0.00     0.01 ilmr       *      es-m02

But when I do the insert, I usually drop one of the data nodes.

The odd thing is, if I use just one shard it works fine; it's only when I use more than one that it breaks.

Hi Mark, just following up...

Does anyone have thoughts here? I'm kinda lost...

Any other solutions for loading several hundred million records quickly?

Is Hive using the _bulk API, and if so, what are the bulk request sizes that it is sending over?

I'm not sure; whatever is built into the elasticsearch-hadoop library.

But I have started playing with loading pipe-delimited files using the Elasticsearch Python library, and I seem to be running into the same problem.

For example, a file I'm trying to load now is about 11 million rows (a pipe-delimited text file, only about 8 columns).

I'm doing a parallel_bulk, and it's periodically timing out as well.

The last run I tried was 4 threads at 10,000 docs per chunk.
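
Here's roughly what the load script looks like (simplified; the index and file names are placeholders):

# rough sketch of the load script -- index/file names are placeholders
import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["192.168.4.63:9200", "192.168.4.64:9200"])

def actions(path):
    # read the pipe-delimited file (assumes a header row) and yield bulk actions
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="|"):
            yield {"_index": "my-index", "_source": row}

# 4 worker threads, 10,000 docs per chunk -- the combination that times out
for ok, info in parallel_bulk(es, actions("data.psv"),
                              thread_count=4, chunk_size=10000):
    if not ok:
        print(info)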

In an Elasticsearch cluster it is the data nodes that do most of the work, so this is the node type you need to scale out in order to improve write performance. Based on your monitoring graph it looks like you have configured only 2 GB of heap, which is not very much at all. If you are indexing a lot, I am not surprised you are having issues.

What is the specification of the nodes in terms of hardware? What type of storage are you using? SSDs? Indexing is very I/O intensive, so for good performance you will need fast storage.