Does Elasticsearch for Hadoop work with highlighting?

We observed that EsRDD returns only the _source field instead of the whole hit, so the highlight field is missing from the result.

Can you explain what you are seeing and what you would like to see? Currently only the source is used, since that's the data that can be both read from and written to Spark.
Metadata can be returned as well, and potentially we can expand this to include additional fields.
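
As a rough sketch of the metadata option, assuming a PySpark job and the connector's `es.resource`, `es.query` and `es.read.metadata` settings, the read configuration could be built like this. The helper name `build_es_read_conf` and the simplified query are illustrative only, and note that this returns hit metadata such as the id, not the highlight section.

```python
import json


def build_es_read_conf(resource, query):
    """Build connector settings for a read that also returns hit metadata.

    `build_es_read_conf` is a hypothetical helper, not part of es-hadoop.
    """
    return {
        "es.resource": resource,          # index/type to read from
        "es.query": json.dumps(query),    # query DSL passed as a JSON string
        "es.read.metadata": "true",       # ask for metadata next to _source
    }


conf = build_es_read_conf(
    "fts-english/Document",
    {"query": {"match": {"_all": "asia"}}},
)

# Assumption: with PySpark and the es-hadoop jar on the classpath, the dict
# would be handed to the Hadoop input format roughly like this:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=conf,
# )
```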

Hi Costin,

Thank you for answering (BTW, I spoke with you at Elastic[ON]15 after your lecture :) ).

What we are trying to do here is to run a query with a highlighting request.

When we run the following request in Elastic Head:


{
  "from" : 0,
  "size" : 50,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "_all" : {
            "query" : "asia",
            "type" : "boolean"
          }
        }
      },
      "must_not" : {
        "match" : {
          "_all" : {
            "query" : "goat",
            "type" : "boolean"
          }
        }
      }
    }
  },
  "post_filter" : {
    "term" : {
      "_type" : "Document"
    }
  },
  "highlight" : {
    "pre_tags" : [ "<b>" ],
    "post_tags" : [ "</b>" ],
    "fragment_size" : 0,
    "number_of_fragments" : 0,
    "fields" : {
      "*" : { }
    }
  }
}

We get:


{
  "took" : 37,
  "timed_out" : false,
  "_shards" : {
    "total" : 165,
    "successful" : 165,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.73102,
    "hits" : [ {
      "_index" : "fts-english",
      "_type" : "Document",
      "_id" : "id4",
      "_score" : 0.73102,
      "_source":{"_analyzer":"english","streamId":3,"postDate":"2013-01-30","language":"English","message":"Mongolians migrated from Mid-Asia to the asian shores around 15,000 years ago.","user":"American"},
      "highlight" : {
        "message" : [ "Mongolians migrated from Mid-<b>Asia</b> to the asian shores around 15,000 years ago." ]
      }
    }, {
      "_index" : "fts-english",
      "_type" : "Document",
      "_id" : "id2",
      "_score" : 0.6265886,
      "_source":{"_analyzer":"english","streamId":1,"postDate":"2013-01-30","language":"English","message":"Paleoindians migrated from Asia to what is now the helloworld@gmail.com mainland around 15,000 years ago.","user":"me"},
      "highlight" : {
        "message" : [ "Paleoindians migrated from <b>Asia</b> to what is now the helloworld@gmail.com mainland around 15,000 years ago." ]
      }
    }, {
      "_index" : "fts-english",
      "_type" : "Document",
      "_id" : "id8",
      "_score" : 0.6265886,
      "_source":{"_analyzer":"english","streamId":1,"postDate":"2013-01-30","language":"English","message":"Indians migrated from Asia to North America long time ago. Many years before Columbus.","user":"me"},
      "highlight" : {
        "message" : [ "Indians migrated from <b>Asia</b> to North America long time ago. Many years before Columbus." ]
      }
    } ]
  }
}

The response includes a highlight section for each hit, in which the matching words from the original source text are wrapped in bold <b> </b> HTML tags.

But when we submit the same query through Elasticsearch for Hadoop, the returned EsRDD contains only the "_source" section of each hit and not its "highlight" section.

We hoped that there would be a configurable option to get the whole result including the highlighting field.
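
To make the request concrete, here is a minimal Python sketch of the per-hit record we would like to end up with. It works on a plain search response parsed into a dict, like the one above, and does not involve es-hadoop at all; the function name `hits_with_highlight` and the layout of the flattened record are our own illustrative choices.

```python
def hits_with_highlight(response):
    """Flatten a search response into one record per hit.

    Each record holds the _source fields plus the hit's _id and its
    highlight section (an empty dict when the hit has no highlight).
    """
    records = []
    for hit in response["hits"]["hits"]:
        record = dict(hit.get("_source", {}))   # copy, so the input is untouched
        record["_id"] = hit["_id"]
        record["highlight"] = hit.get("highlight", {})
        records.append(record)
    return records


# Trimmed-down version of the first hit from the response above.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "fts-english",
                "_type": "Document",
                "_id": "id4",
                "_source": {
                    "message": "Mongolians migrated from Mid-Asia to the "
                               "asian shores around 15,000 years ago.",
                    "user": "American",
                },
                "highlight": {
                    "message": [
                        "Mongolians migrated from Mid-<b>Asia</b> to the "
                        "asian shores around 15,000 years ago."
                    ]
                },
            }
        ],
    }
}

records = hits_with_highlight(sample_response)
```

With EsRDD today we only get the `_source` part of such a record; the `highlight` part is what is missing.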

Thank you for your help!

Doron

This issue is still not solved, even in later versions. This is very basic functionality. I wonder why nobody else is complaining. Is nobody using es-hadoop with Spark?
