Updating documents with excluded fields


(Stefano Parmesan) #1

Hello everybody,

I hope my question is not too trivial: I have an ES index with a mapping similar to the following (only the relevant part is shown):

{
  "mappings": {
    "person": {
      "_source": {
        "excludes": [
          "websites.body",
          "websites.title",
          "websites.description"
        ]
      },
      "properties": {
        "name": {
          "type": "string"
        },
        "websites": {
          "properties": {
            "url": {
              "type": "string",
              "index": "not_analyzed"
            },
            "body": {
              "type": "string",
              "store": true
            },
            "title": {
              "type": "string",
              "store": true
            },
            "description": {
              "type": "string",
              "store": true
            }
          }
        }
      }
    }
  }
}

Basically, a person may have multiple websites, for which we have some text fields (title, description, body) which are needed only for searching (no need to retrieve them in the _source when querying) and are therefore marked as excluded.

I wrote an Apache Spark application that reads the index, transforms the documents, and writes them back to the same index. I'm using elasticsearch-hadoop 2.1.0.Beta4 for this. Everything works as expected, except that the fields marked as excluded in the mapping are no longer present in the index after I run the job.

At first I thought the cause was the default write operation performed by ES, index, which means (from the documentation): "new data is added while existing data (based on its id) is replaced (reindexed)."

I then tried setting es.write.operation=update in my Spark job, which (again from the documentation) "updates existing data (based on its id). If no data is found, an exception is thrown". Before running my job I made sure the websites field was not set on the documents I write, so that, since it is not pushed to the index, the old values would be left untouched. Unfortunately this still removes the excluded fields from my documents.
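For reference, the two write modes described above differ only in the connector settings. Here is a minimal sketch of those settings as a plain Python dict; the helper name build_es_write_conf, the resource "people/person" and the id field "id" are hypothetical placeholders, while es.resource, es.write.operation and es.mapping.id are elasticsearch-hadoop setting names.

```python
# Write operations accepted by es.write.operation in elasticsearch-hadoop.
VALID_OPERATIONS = {"index", "create", "update", "upsert"}


def build_es_write_conf(resource, id_field=None, operation="index"):
    """Build a settings dict for an elasticsearch-hadoop write job.

    `resource` and `id_field` are placeholders for the real index/type
    and the document field that holds the id.
    """
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported es.write.operation: {operation!r}")
    # update and upsert address existing documents, so they need an id.
    if operation in {"update", "upsert"} and id_field is None:
        raise ValueError(f"{operation} requires es.mapping.id to be set")

    conf = {
        "es.resource": resource,
        "es.write.operation": operation,
    }
    if id_field is not None:
        conf["es.mapping.id"] = id_field
    return conf


# Hypothetical index/type and id field, for illustration only.
conf = build_es_write_conf("people/person", id_field="id", operation="update")
```

The resulting dict would then be handed to the connector as the job configuration; switching between the two modes only changes the es.write.operation entry.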

How can I update my index so that I can add and modify fields without altering the value of the excluded fields?

Thanks in advance


(Costin Leau) #2

I think you misunderstand what exclude does - namely, it eliminates the field completely from what the connector writes. In other words, whatever operation is used (index vs update), the data sent to Elasticsearch will not contain the excluded fields.
If your source data also changes, it's likely you are reindexing the data - namely, writing the data back to the same index. In that case the excluded data will be lost, since it is not written at all and the new document overwrites the existing one.

P.S. Thanks for formatting the post


(Stefano Parmesan) #3

Thank you Costin,

You're right. What happens is that, since those fields are marked as store=true, they are available for search when I insert my documents for the first time; but since they are excluded from the _source, they are lost once a document is updated.
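To make the behaviour above concrete, here is a toy model in plain Python. It is not Elasticsearch code: ToyIndex and its methods are hypothetical, and it only assumes that an update is applied by merging the partial document into the stored _source and reindexing the result.

```python
from copy import deepcopy

# Sub-fields of "websites" that the mapping excludes from _source.
EXCLUDED = ("body", "title", "description")


def stored_source(doc):
    """Return the _source the toy index keeps: the doc minus excluded sub-fields."""
    kept = deepcopy(doc)
    for site in kept.get("websites", []):
        for field in EXCLUDED:
            site.pop(field, None)
    return kept


def searchable_terms(doc):
    """Collect the lowercase words the toy index makes searchable."""
    terms = set()
    for site in doc.get("websites", []):
        for field in EXCLUDED:
            terms.update(site.get(field, "").lower().split())
    return terms


class ToyIndex:
    def __init__(self):
        self.sources = {}  # doc id -> filtered _source
        self.terms = {}    # doc id -> searchable words

    def index(self, doc_id, doc):
        # The full incoming document is analyzed, but only the
        # filtered _source is kept for later retrieval.
        self.terms[doc_id] = searchable_terms(doc)
        self.sources[doc_id] = stored_source(doc)

    def update(self, doc_id, partial):
        # The partial doc is merged into the stored _source and reindexed,
        # so anything missing from _source cannot be recovered.
        merged = {**self.sources[doc_id], **partial}
        self.index(doc_id, merged)


idx = ToyIndex()
idx.index("1", {
    "name": "Ada",
    "websites": [{"url": "http://example.com", "body": "hello world"}],
})
terms_before = idx.terms["1"]  # words from the excluded body field
idx.update("1", {"name": "Ada L."})
terms_after = idx.terms["1"]   # empty: the body text is gone
```

In this model the update never touches websites, yet the searchable text disappears anyway, because the only copy available for reindexing is the filtered _source.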

I did this because I needed a way to have long texts indexed for full-text search without having to retrieve them with every query. Storing and excluding the fields did the trick, but it has this update limitation, so I'll work on finding a better way to achieve the same goal.

All the best

