Keyword field treated as date during ingestion - ES 8.4.2, Spark 3.2.0 on EMR 6.6.0

I'm migrating our solution from Elasticsearch 7.1.0 to 8.4.2.
Previously we used Spark 2.4 and Scala 2.11; we have now upgraded to the latest and greatest versions.

The mapping is strict:

{
"mappings":
  {
    "dynamic":"strict",
    "properties":{
      ...
      "formerPrimaryNames":{
        "properties":{
          "endDate":{"type":"keyword","index":false},
          "name":{"type":"text","index":false,"copy_to":["searchTerm"]},
          "startDate":{"type":"keyword","index":false}
        }
      }
      ...

}

I ran the Spark job with mock data to run our regression suite.
I hit the following error in the Spark logs:

0.0 (TID 26) (ip-10-246-37-218.ec2.internal executor 1): org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [1/282]. Error sample (first [5] error messages):
	org.elasticsearch.hadoop.rest.EsHadoopRemoteException: mapper_parsing_exception: failed to parse field [formerPrimaryNames.startDate] of type [date] in document with id '99999'. Preview of field's value: 'dtbprodsvq';org.elasticsearch.hadoop.rest.EsHadoopRemoteException: illegal_argument_exception: failed to parse date field [dtbprodsvq] with format [strict_date_optional_time||epoch_millis];org.elasticsearch.hadoop.rest.EsHadoopRemoteException: date_time_parse_exception: Failed to parse with all enclosed parsers

After a bit of googling I added the following lines to the Spark configuration:
sparkSession.config("es.index.auto.create", "false")
.config("es.mapping.date.rich", "false")

But that did not help.
The structure of the Spark job looks like this:

createIndexWithMapping()
readFile()
  .flatMap(process)   // here the field is mapped as Option[String]
  .map(toJson)
  .saveJsonToEs(index, idMapping)

To be honest, I did not expect that, when passing JSON, anything would try to convert, transform, or validate the values before sending them on to Elasticsearch.
We don't need to process or index these fields, so they are mapped as keyword with "index": false. That lets us put generated values in them during the test phase, since Elasticsearch should not analyze them.

Can you please share your thoughts?

I'm a little confused. Your mapping does not have any date fields, but your error message is complaining about parsing dates. If I change the type of endDate in your mapping to date, I get the same error. It's not coming from any validation in es-hadoop (Spark): that error comes from Elasticsearch when es-hadoop sends the request. And that's expected, right? The data you are sending over is not a date.
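
Since the error reports formerPrimaryNames.startDate as type [date], one way to see what the target cluster actually holds is the get field mapping API (GET /<index>/_mapping/field/<field>). The following is only a rough Scala sketch of that check: the object name MappingCheck, the host es-host, and the index name my-index are placeholders that do not come from this thread, and the type check is a crude string match rather than real JSON parsing.

```scala
import scala.io.Source

object MappingCheck {
  // Builds the URL of the get-field-mapping API for one field of one index.
  def fieldMappingUrl(host: String, port: Int, index: String, field: String): String =
    s"http://$host:$port/$index/_mapping/field/$field"

  // Fetches the raw response body; needs network access to the node.
  def fetch(url: String): String = {
    val source = Source.fromURL(url, "UTF-8")
    try source.mkString finally source.close()
  }

  // Crude check without a JSON parser: does the response declare this type?
  def declaresType(json: String, expected: String): Boolean =
    json.replaceAll("\\s", "").contains("\"type\":\"" + expected + "\"")
}

// Hypothetical usage against the node the Spark job writes to:
// val url = MappingCheck.fieldMappingUrl("es-host", 9200, "my-index", "formerPrimaryNames.startDate")
// println(MappingCheck.declaresType(MappingCheck.fetch(url), "keyword"))
```

If this reports anything other than keyword, or the index is missing on that node, the job is probably not talking to the index the mapping was created on.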

I'm confused as well. I don't have any date in my mapping.
I do not expect any validation errors coming from elastic.
The test data contains a rubbish string for that field, because we won't index it and won't search on it.

Any suggestions on how to proceed with this?

Sample data used here:

"formerPrimaryNames":[
{"name":"pfnuatyfes","startDate":"bfzhsiytmj","endDate":"tqsifkgkqt"},{"name":"jrdsyiqsnh","startDate":"wxxrtbhbbq","endDate":"emhwxgfarv"}
]

and the Scala class:

case class FormerPrimaryName(name: String, startDate: Option[String], endDate: Option[String])

Could you post the full code (including setup and data to reproduce this)? Maybe you're not hitting the index or cluster you think you are? I'm not able to reproduce it with the following (it runs fine):

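import org.elasticsearch.hadoop.util.StringUtils
import org.elasticsearch.spark._   // adds saveJsonToEs to RDDs

// sc (SparkContext) and sqc (SQLContext) are assumed to be in scope.
// Same strict mapping as in the question, without copy_to.
val mapping = """{
                |  "dynamic": "strict",
                |  "properties": {
                |    "formerPrimaryNames": {
                |      "properties": {
                |        "endDate": {"type": "keyword", "index": false},
                |        "name": {"type": "text", "index": false},
                |        "startDate": {"type": "keyword", "index": false}
                |      }
                |    }
                |  }
                |}
    """.stripMargin

val index = "test"
// Create the index and apply the mapping before writing.
org.elasticsearch.hadoop.rest.RestUtils.touch(index)
org.elasticsearch.hadoop.rest.RestUtils.putMapping(index, null, mapping.getBytes(StringUtils.UTF_8))
// Non-date strings in the keyword fields, as in the failing document.
val document = """{ "formerPrimaryNames": {"endDate": "dtbprodsvq", "name": "asdf", "startDate": "9qh0kemfy5k3" }}"""
sc.makeRDD(Seq(document)).saveJsonToEs(index)
org.elasticsearch.hadoop.rest.RestUtils.refresh(index)
// Read the document back to confirm it was written.
val df = sqc.read.format("es").load(index)
println(df.first())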
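
If the job might be writing to a different node or cluster than intended, it can help to pin the connector to an explicit node list instead of relying on discovery. A minimal sketch of such settings, using the es-hadoop options es.nodes, es.port, es.nodes.discovery, and es.index.auto.create; the object name EsTargetSettings and the way the values are supplied are illustrative assumptions, not part of the job described above.

```scala
object EsTargetSettings {
  // Connector settings that restrict writes to the listed nodes only.
  def settings(nodes: Seq[String], port: Int = 9200): Map[String, String] = Map(
    "es.nodes" -> nodes.mkString(","),   // explicit list of nodes to contact
    "es.port" -> port.toString,
    "es.nodes.discovery" -> "false",     // do not add nodes found via discovery
    "es.index.auto.create" -> "false"    // fail instead of creating a missing index
  )
}

// Hypothetical usage when building the session:
// val builder = org.apache.spark.sql.SparkSession.builder()
// EsTargetSettings.settings(Seq("10.246.38.22")).foreach { case (k, v) => builder.config(k, v) }
```

With auto-create disabled, a write to a node that does not know the index should fail outright rather than silently creating an index with a dynamic mapping.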

Hi Keith,

Your response prompted me to re-validate everything.
Previously I had checked every config file but not the Elasticsearch API, and I have now figured out that I have a split-brain situation.
I double-checked everything:

  • discovery-ec2 plugin is installed
  • repository-s3 is a module so does not require any intervention
  • node and node.lock files are removed from /var/lib/elasticsearch
  • I applied the following configuration:
http.host: 0.0.0.0
cluster.name: v15-us-east-1-es-cluster
discovery.seed_providers: ec2
discovery.ec2.tag.es_cluster_aws_env: v15-us-east-1-es-cluster
cluster.initial_master_nodes: ["10.2.3.54","10.2.3.230","10.2.3.107"]
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
network.host: [_local_, _site_]
network.publish_host: _ec2:privateIp_
logger.org.elasticsearch.discovery: INFO
indices.query.bool.max_clause_count: 2048
search.max_buckets: 200000000
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com

  • xpack.security is completely disabled for now

But the node is still not finding the other nodes.
Any suggestions?

I created another environment to have a clean start.
API response from http://10.246.38.22:9200/:

name	"ip-10-246-38-22.ec2.internal"
cluster_name	"search-devqa-j17-v16-us-east-1-es-cluster"
cluster_uuid	"fY7X3_5kTESZna90s97BBw"
version	number	"8.4.2"
build_flavor	"default"
build_type	"rpm"
build_hash	"89f8c6d8429db93b816403ee75e5c270b43a940a"
build_date	"2022-09-14T16:26:04.382547801Z"
build_snapshot	false
lucene_version	"9.3.0"
minimum_wire_compatibility_version	"7.17.0"
minimum_index_compatibility_version	"7.0.0"
tagline	"You Know, for Search"

The other one, http://10.246.37.230:9200/:

name	"ip-10-246-37-230.ec2.internal"
cluster_name	"search-devqa-j17-v16-us-east-1-es-cluster"
cluster_uuid	"fY7X3_5kTESZna90s97BBw"
version	number	"8.4.2"
build_flavor	"default"
build_type	"rpm"
build_hash	"89f8c6d8429db93b816403ee75e5c270b43a940a"
build_date	"2022-09-14T16:26:04.382547801Z"
build_snapshot	false
lucene_version	"9.3.0"
minimum_wire_compatibility_version	"7.17.0"
minimum_index_compatibility_version	"7.0.0"
tagline	"You Know, for Search"

Both cluster name and uuid match.
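
A matching name and UUID do not by themselves show that the nodes have joined each other. A rough way to compare each node's own view of the membership is to call GET /_cat/nodes?h=ip on every node and look for peers that are missing. The Scala sketch below only covers the comparison step; the object name ClusterMembership and its helpers are made up for illustration, and fetching the response bodies is left to the caller.

```scala
object ClusterMembership {
  // Parses the plain-text body of GET /_cat/nodes?h=ip into a set of IPs.
  def parseCatNodes(body: String): Set[String] =
    body.split("\n").map(_.trim).filter(_.nonEmpty).toSet

  // For each queried node, lists the expected peers that are absent from its view.
  // Nodes that see every expected peer are left out of the result.
  def missingPeers(views: Map[String, Set[String]],
                   expected: Set[String]): Map[String, Set[String]] =
    views
      .map { case (node, seen) => node -> (expected -- seen) }
      .filter { case (_, missing) => missing.nonEmpty }
}

// Hypothetical usage, with each body fetched from http://<ip>:9200/_cat/nodes?h=ip:
// val expected = Set("10.246.38.22", "10.246.37.230", "10.246.38.215")
// val views = Map("10.246.38.22" -> ClusterMembership.parseCatNodes(bodyFromFirstNode))
// println(ClusterMembership.missingPeers(views, expected))
```

An empty result means every queried node sees all expected peers; a node that lists only itself has formed its own single-node cluster.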

The log from the first node:

[2022-11-24T14:14:05,955][INFO ][o.e.e.NodeEnvironment    ] [ip-10-246-38-22.ec2.internal] using [1] data paths, mounts [[/ (/dev/nvme0n1p1)]], net usable_space [26gb], net total_space [29.9gb], types [xfs]
[2022-11-24T14:14:05,956][INFO ][o.e.e.NodeEnvironment    ] [ip-10-246-38-22.ec2.internal] heap size [2gb], compressed ordinary object pointers [true]
[2022-11-24T14:14:06,043][INFO ][o.e.n.Node               ] [ip-10-246-38-22.ec2.internal] node name [ip-10-246-38-22.ec2.internal], node ID [nHg6tHnpSceEspaWkQItLw], cluster name [search-devqa-j17-v16-us-east-1-es-cluster], roles [remote_cluster_client, master, data_warm, data_content, transform, data_hot, ml, data_frozen, ingest, data_cold, data]
[2022-11-24T14:14:11,261][INFO ][o.e.x.s.Security         ] [ip-10-246-38-22.ec2.internal] Security is disabled
[2022-11-24T14:14:11,416][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ip-10-246-38-22.ec2.internal] [controller/4910] [Main.cc@123] controller (64 bit): Version 8.4.2 (Build 158cee2efc2db4) Copyright (c) 2022 Elasticsearch BV
[2022-11-24T14:14:12,007][INFO ][o.e.t.n.NettyAllocator   ] [ip-10-246-38-22.ec2.internal] creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, suggested_max_allocation_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=4mb}]
[2022-11-24T14:14:12,044][INFO ][o.e.i.r.RecoverySettings ] [ip-10-246-38-22.ec2.internal] using rate limit [40mb] with [default=40mb, read=0b, write=0b, max=0b]
[2022-11-24T14:14:12,094][INFO ][o.e.d.DiscoveryModule    ] [ip-10-246-38-22.ec2.internal] using discovery type [multi-node] and seed hosts providers [settings, ec2]
[2022-11-24T14:14:14,144][INFO ][o.e.n.Node               ] [ip-10-246-38-22.ec2.internal] initialized
[2022-11-24T14:14:14,145][INFO ][o.e.n.Node               ] [ip-10-246-38-22.ec2.internal] starting ...
[2022-11-24T14:14:14,172][INFO ][o.e.x.s.c.f.PersistentCache] [ip-10-246-38-22.ec2.internal] persistent cache index loaded
[2022-11-24T14:14:14,173][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [ip-10-246-38-22.ec2.internal] deprecation component started
[2022-11-24T14:14:14,354][INFO ][o.e.t.TransportService   ] [ip-10-246-38-22.ec2.internal] publish_address {10.246.38.22:9300}, bound_addresses {10.246.38.22:9300}, {[::1]:9300}, {127.0.0.1:9300}
[2022-11-24T14:14:14,864][INFO ][o.e.b.BootstrapChecks    ] [ip-10-246-38-22.ec2.internal] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2022-11-24T14:14:14,868][WARN ][o.e.c.c.ClusterBootstrapService] [ip-10-246-38-22.ec2.internal] this node is locked into cluster UUID [fY7X3_5kTESZna90s97BBw] but [cluster.initial_master_nodes] is set to [10.246.37.230, 10.246.38.22, 10.246.38.215]; remove this setting to avoid possible data loss caused by subsequent cluster bootstrap attempts
[2022-11-24T14:14:14,998][INFO ][o.e.c.s.MasterService    ] [ip-10-246-38-22.ec2.internal] elected-as-master ([1] nodes joined)[_FINISH_ELECTION_, {ip-10-246-38-22.ec2.internal}{nHg6tHnpSceEspaWkQItLw}{gUvy4v04TOmLZTiANhj0fA}{ip-10-246-38-22.ec2.internal}{10.246.38.22}{10.246.38.22:9300}{cdfhilmrstw} completing election], term: 2, version: 28, delta: master node changed {previous [], current [{ip-10-246-38-22.ec2.internal}{nHg6tHnpSceEspaWkQItLw}{gUvy4v04TOmLZTiANhj0fA}{ip-10-246-38-22.ec2.internal}{10.246.38.22}{10.246.38.22:9300}{cdfhilmrstw}]}
[2022-11-24T14:14:15,086][INFO ][o.e.c.s.ClusterApplierService] [ip-10-246-38-22.ec2.internal] master node changed {previous [], current [{ip-10-246-38-22.ec2.internal}{nHg6tHnpSceEspaWkQItLw}{gUvy4v04TOmLZTiANhj0fA}{ip-10-246-38-22.ec2.internal}{10.246.38.22}{10.246.38.22:9300}{cdfhilmrstw}]}, term: 2, version: 28, reason: Publication{term=2, version=28}
[2022-11-24T14:14:15,218][INFO ][o.e.r.s.FileSettingsService] [ip-10-246-38-22.ec2.internal] starting file settings watcher ...
[2022-11-24T14:14:15,240][INFO ][o.e.h.AbstractHttpServerTransport] [ip-10-246-38-22.ec2.internal] publish_address {10.246.38.22:9200}, bound_addresses {[::]:9200}
[2022-11-24T14:14:15,242][INFO ][o.e.n.Node               ] [ip-10-246-38-22.ec2.internal] started {ip-10-246-38-22.ec2.internal}{nHg6tHnpSceEspaWkQItLw}{gUvy4v04TOmLZTiANhj0fA}{ip-10-246-38-22.ec2.internal}{10.246.38.22}{10.246.38.22:9300}{cdfhilmrstw}{xpack.installed=true, aws_availability_zone=us-east-1c, ml.max_jvm_size=2147483648, ml.allocated_processors=2, ml.machine_memory=4072448000}
[2022-11-24T14:14:15,238][INFO ][o.e.r.s.FileSettingsService] [ip-10-246-38-22.ec2.internal] file settings service up and running [tid=48]
[2022-11-24T14:14:15,779][INFO ][o.e.l.LicenseService     ] [ip-10-246-38-22.ec2.internal] license [aec6cce7-5f05-4fa8-9f60-86085d30ccdc] mode [basic] - valid
[2022-11-24T14:14:15,784][INFO ][o.e.g.GatewayService     ] [ip-10-246-38-22.ec2.internal] recovered [1] indices into cluster_state
[2022-11-24T14:14:15,864][ERROR][o.e.i.g.GeoIpDownloader  ] [ip-10-246-38-22.ec2.internal] exception during geoip databases update
org.elasticsearch.ElasticsearchException: not all primary shards of [.geoip_databases] index are active
        at org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:134) ~[?:?]
        at org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:274) ~[?:?]
        at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:102) ~[?:?]
        at org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:48) ~[?:?]
        at org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:42) ~[elasticsearch-8.4.2.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:769) ~[elasticsearch-8.4.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.4.2.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
[2022-11-24T14:14:16,702][INFO ][o.e.c.r.a.AllocationService] [ip-10-246-38-22.ec2.internal] current.health="GREEN" message="Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.geoip_databases][0]]])." previous.health="RED" reason="shards started [[.geoip_databases][0]]"
[2022-11-24T14:14:17,768][INFO ][o.e.i.g.DatabaseNodeService] [ip-10-246-38-22.ec2.internal] successfully loaded geoip database file [GeoLite2-Country.mmdb]
[2022-11-24T14:14:17,867][INFO ][o.e.i.g.DatabaseNodeService] [ip-10-246-38-22.ec2.internal] successfully loaded geoip database file [GeoLite2-ASN.mmdb]
[2022-11-24T14:14:20,676][INFO ][o.e.i.g.DatabaseNodeService] [ip-10-246-38-22.ec2.internal] successfully loaded geoip database file [GeoLite2-City.mmdb]

To be honest, I have no clue what's wrong. I have not had any issues running it in Docker, but here ...
Can you help, please?

I solved my issue by deleting the whole directory under /var/lib/elasticsearch.

Thanks for the help, I really appreciate it.
Cheers
