Why doesn’t a dense_vector field show up in the Spark schema when using Elasticsearch-Hadoop?

Hi everyone,

I created an Elasticsearch index with a dense_vector field, along with some text fields. The mapping looks like this (simplified):

{"mappings": {"properties": {"embedding": {"type": "dense_vector","dims": 3,"index": true,"index_options": { "type": "int8_hnsw" }},"title": { "type": "text" },"text": { "type": "text" }}}}

When I read this index in Spark using the Elasticsearch for Apache Hadoop connector:

df = (spark.read.format("es")
      .option("es.nodes", "172.22.10.20")
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")
      .load("bbb"))
df.printSchema()

the output only shows:

root
|-- text: string (nullable = true)
|-- title: string (nullable = true)

The embedding (dense_vector) field is missing from the schema entirely.

My Questions

  • Is dense_vector officially unsupported in the ES-Hadoop connector?

  • The documentation on supported field mappings doesn’t mention vector types. Does that mean they are silently ignored?

  • Is there any workaround to read these fields into Spark (e.g., as arrays of floats), or is duplicating the field into a regular float array the only option?

Thanks in advance for clarifying!

Unfortunately, dense_vector is one of the many field types the connector doesn't support: Support for all Elasticsearch field types · Issue #1813 · elastic/elasticsearch-hadoop · GitHub. Your best option might be to set es.output.json to true to dump out the raw JSON and parse the embedding yourself.
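
A minimal, untested sketch of that workaround in PySpark. Since the DataFrame reader drops the field during schema discovery, this goes through the lower-level org.elasticsearch.hadoop.mr.EsInputFormat with es.output.json enabled so each hit arrives as a raw JSON string; the Text key/value classes and the explicit Spark schema are my assumptions, so verify against your ES-Hadoop version:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Read each hit as raw JSON via ES-Hadoop's MapReduce input format,
# bypassing the Spark SQL schema discovery that drops dense_vector.
conf = {
    "es.nodes": "172.22.10.20",
    "es.port": "9200",
    "es.nodes.wan.only": "true",
    "es.resource": "bbb",
    "es.output.json": "true",  # values arrive as JSON text, not writables
}
rdd = spark.sparkContext.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.Text",    # document _id (assumed)
    valueClass="org.apache.hadoop.io.Text",  # raw _source JSON (assumed)
    conf=conf,
)

# Parse the JSON with an explicit schema that declares the embedding
# as an array of floats (dims: 3 in the mapping above).
schema = StructType([
    StructField("title", StringType()),
    StructField("text", StringType()),
    StructField("embedding", ArrayType(FloatType())),
])
df = spark.read.json(rdd.values(), schema=schema)
df.printSchema()

The trade-off is that you give up schema inference entirely and have to keep the explicit schema in sync with the index mapping yourself.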