Hive overwhelming Elasticsearch

I'm new to ES, so bear with me :slight_smile:

I have a Hadoop cluster running Hive, with a table of about 441 million rows, 10 columns wide (string and bigint).

I set up an external table in Hive, created an index in ES (single node), and inserted the data.

It worked fine, but took a while to insert. After that, I added sharding, broke the index into 6 shards, and set the JVM heap to 10 GB. Again, it worked great! This time it only took about an hour to insert the data.
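
For the 6-shard run, I set the index up with explicit shard settings before loading, something along these lines with the Python client (the index name here is just a placeholder):

# rough sketch of the index setup -- 6 primary shards, no replicas
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(
    index="hive_data",  # placeholder name
    body={
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 0
        }
    }
)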

To scale out, I then created a cluster (I need to span data across multiple nodes because I have a lot of other data I'd like to add to ES). First I tried 3 nodes and couldn't get the data to insert, then 5 nodes (3 dedicated master and 2 dedicated data), and it still wouldn't work.

I dropped the shard count down to 1, and I can insert fine, but it takes half the day!

It seems like once I start the insert process, Hive just absolutely SOAKS the ES cluster and starts knocking nodes out of service. After enough failures, the job fails.

Any pointers on getting this data to load into a sharded index on the cluster?

Here's the error I get:

org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.4.64:9200, 192.168.4.63:9200]]

Any suggestions would be greatly appreciated!
Thanks!

Welcome to our community! :smiley:

What do your Elasticsearch nodes show?
What version are you on?
What is the output from the _cluster/stats?pretty&human API?

Thanks!

When it starts going sideways, the nodes show as dropping off.
I should have 5 nodes total: 2 data and 3 master.

      "nodes" : {
        "count" : {
          "total" : 4,
          "coordinating_only" : 0,
          "data" : 1,
          "data_cold" : 1,
          "data_content" : 1,
          "data_frozen" : 1,
          "data_hot" : 1,
          "data_warm" : 1,
          "ingest" : 4,
          "master" : 3,
          "ml" : 4,
          "remote_cluster_client" : 4,
          "transform" : 1,
          "voting_only" : 0
        },

I'm running 7.12

The full output please.

What do your Elasticsearch logs show?
What does Monitoring show?

We're looking for evidence of GC or high CPU/memory/disk use, something in the logs to show disconnects and why, etc.

Here's the full output:

{
    "_nodes": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "cluster_name": "my-cluster",
    "cluster_uuid": "pDNXQqzAQMa5lUFLU8iQHA",
    "timestamp": 1618265129485,
    "status": "red",
    "indices": {
        "count": 1,
        "shards": {
            "total": 3,
            "primaries": 3,
            "replication": 0.0,
            "index": {
                "shards": {
                    "min": 3,
                    "max": 3,
                    "avg": 3.0
                },
                "primaries": {
                    "min": 3,
                    "max": 3,
                    "avg": 3.0
                },
                "replication": {
                    "min": 0.0,
                    "max": 0.0,
                    "avg": 0.0
                }
            }
        },
        "docs": {
            "count": 0,
            "deleted": 0
        },
        "store": {
            "size_in_bytes": 8163250,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 0,
            "memory_in_bytes": 0,
            "terms_memory_in_bytes": 0,
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 0,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory_in_bytes": 71870340,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "keyword",
                    "count": 5,
                    "index_count": 1
                },
                {
                    "name": "long",
                    "count": 4,
                    "index_count": 1
                },
                {
                    "name": "text",
                    "count": 5,
                    "index_count": 1
                }
            ]
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        },
        "versions": [
            {
                "version": "7.12.0",
                "index_count": 1,
                "primary_shard_count": 6,
                "total_primary_bytes": 8163250
            }
        ]
    },
    "nodes": {
        "count": {
            "total": 4,
            "coordinating_only": 0,
            "data": 1,
            "data_cold": 1,
            "data_content": 1,
            "data_frozen": 1,
            "data_hot": 1,
            "data_warm": 1,
            "ingest": 4,
            "master": 3,
            "ml": 4,
            "remote_cluster_client": 4,
            "transform": 1,
            "voting_only": 0
        },
        "versions": [
            "7.12.0"
        ],
        "os": {
            "available_processors": 16,
            "allocated_processors": 16,
            "names": [
                {
                    "name": "Linux",
                    "count": 4
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 18.04.5 LTS",
                    "count": 4
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 4
                }
            ],
            "mem": {
                "total_in_bytes": 67276931072,
                "free_in_bytes": 40482111488,
                "used_in_bytes": 26794819584,
                "free_percent": 60,
                "used_percent": 40
            }
        },
        "process": {
            "cpu": {
                "percent": 2
            },
            "open_file_descriptors": {
                "min": 371,
                "max": 627,
                "avg": 435
            }
        },
        "jvm": {
            "max_uptime_in_millis": 18633690,
            "versions": [
                {
                    "version": "15.0.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "15.0.1+9",
                    "vm_vendor": "AdoptOpenJDK",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 4
                }
            ],
            "mem": {
                "heap_used_in_bytes": 3623540848,
                "heap_max_in_bytes": 8589934592
            },
            "threads": 151
        },
        "fs": {
            "total_in_bytes": 836973527040,
            "free_in_bytes": 770803392512,
            "available_in_bytes": 728002207744
        },
        "plugins": [],
        "network_types": {
            "transport_types": {
                "security4": 4
            },
            "http_types": {
                "security4": 4
            }
        },
        "discovery_types": {
            "zen": 4
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "deb",
                "count": 4
            }
        ],
        "ingest": {
            "number_of_pipelines": 2,
            "processor_stats": {
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                }
            }
        }
    }
}

And again, I'm new at this; I have no monitoring yet, just Elasticsearch. I'm trying to set up a proof of concept, so I haven't worked out monitoring yet (any recommendations for something open source, or a tutorial on how to set it up, would be great!).

As for logs, where should I be looking? On the node that dropped off, or on the master?
Should I look at my-cluster.log or something else?

If you are running 7.12 with Kibana, then you have Monitoring included for free - Stack Monitoring | Kibana Guide [8.11] | Elastic

Both of those, yes.

All I see on the data node is timeouts (in my-cluster.log), and about the same on the master node. Is there a particular thing I should search for in the log?

(A specific phrase?)

I'm going to try adding Kibana and see if that shines a light on anything. I'm not sure it's a GC issue so much as my Hadoop cluster DDoSing the ES cluster. If I do a continuous ping when I start my Hive query, after a couple of minutes one of the data nodes goes from 30 ms to 3000 ms (I'm pinging over a VPN, which is why it's 30 ms at rest).

The Hadoop cluster and the ES cluster are both on the same network.

I'll report back once I get Kibana going. If there's anything else I can dig up, let me know.

It's useful if you can post your logs; you may miss something we would otherwise catch.

Any particular way to post a log file without blowing up the page?
I don't see an attach option.

Use gist/pastebin/etc. if they are too large to post.

Sorry for the delay; I got Kibana and Filebeat set up. It seems like Filebeat isn't pulling the ES log entries, though. But here's what I get when the node drops off:

Here are the log entries from the elected master and from the node that dropped off:
master-gc.log
master-my-cluster.log
datanode-gc.log
datanode-my-cluster.log

Hopefully this helps. Let me know if I should adjust anything for Kibana/Filebeat, or anything else that would help.

Thanks!

Your cluster is not healthy by the looks of things. There are tonnes of "master not discovered or elected yet" messages in your logs, which would suggest you have connection and/or configuration issues.

What do your elasticsearch.yml files look like?

I think those were from before I bootstrapped the cluster. Is there a way I can roll the logs and re-run the insert?

Here are the elasticsearch.yml files:


::MASTERNODE::
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
cluster.name: my-cluster
node.name: "es-m01"
node.data: false
node.master: true
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-m01","es-m02","es-m03","es-d01","es-d02"]
discovery.zen.minimum_master_nodes: 2
xpack.monitoring.collection.enabled: true
cluster.initial_master_nodes: ["es-m01"]

::DATANODES::
cluster.name: my-cluster
node.name: "es-d02"
node.data: true
node.master: false
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["es-m01","es-m02","es-m03","es-d01","es-d02"]
discovery.zen.minimum_master_nodes: 2
xpack.monitoring.collection.enabled: true

The only difference on the other nodes is that the other master nodes do not have the cluster.initial_master_nodes: ["es-m01"] setting, and the node names differ.

What's the output from the _cat/nodes?v API? Just want to check that your cluster is ok :slight_smile:

ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role  master name
192.168.4.63           75          33   0    0.20    0.05     0.02 cdfhilrstw -      es-d01
192.168.4.60           56          39   1    0.00    0.05     0.01 ilmr       -      es-m01
192.168.4.64           36          33   0    0.02    0.03     0.00 cdfhilrstw -      es-d02
192.168.4.62           61          28   0    0.00    0.00     0.00 ilmr       -      es-m03
192.168.4.61           20          29   0    0.00    0.00     0.01 ilmr       *      es-m02

But when I do the insert, I usually drop one of the data nodes.

The odd thing is, if I use just one shard it works fine; it's only when I use more than one that it breaks.

Hi Mark, just following up...

Does anyone have thoughts here? I'm kinda lost...

Any other solutions for loading several hundred million records quickly?

Is Hive using the _bulk API, and if so, what are the bulk request sizes that it is sending over?

I'm not sure; whatever is built into the elasticsearch-hadoop library.

But I have started playing with loading pipe-delimited files using the Elasticsearch Python library, and I seem to be running into the same problem.

For example, a file I'm trying to load now is about 11 million rows (a pipe-delimited text file, only about 8 columns).

I'm doing a parallel_bulk, and it's periodically timing out as well.

The last run I tried was 4 threads at 10,000 docs per chunk.
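
Here's roughly what the load script looks like (simplified; the index and file names are placeholders):

# rough sketch of the load script -- index/file names are placeholders
import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["192.168.4.63:9200", "192.168.4.64:9200"])

def actions(path):
    # read the pipe-delimited file (assumes a header row) and yield bulk actions
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="|"):
            yield {"_index": "my-index", "_source": row}

# 4 worker threads, 10,000 docs per chunk -- the combination that times out
for ok, info in parallel_bulk(es, actions("data.psv"),
                              thread_count=4, chunk_size=10000):
    if not ok:
        print(info)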

In an Elasticsearch cluster it is the data nodes that do most of the work, so this is the node type you need to scale out in order to improve write performance. Based on your monitoring graph it looks like you have configured only 2 GB of heap, which is not very much at all. If you are indexing a lot, I am not surprised you are having issues.

What is the specification of the nodes in terms of hardware? What type of storage are you using? SSDs? Indexing is very I/O intensive, so for good performance you will need fast storage.