How to reindex, to remove unwanted nested structure

Hi,

I'm trying to reindex an index of nested records into an index that is more flat and cleaner. Is there a way to handle painless scripts with using the root field name as my column and the nested value as its value?

My records can vary and have multiple nested fields, but I'm only interested in the "fieldname" to "fieldname.data.value" mappings.

This is in Elasticsearch 6.7

Example records that exist in the index

{
  "_source" : {
    "status": "complete",
    "name" : {
      "data" : {
        "value" : "Jane Doe",
        "gender" : "female"
      }
    },
    "state" : {
      "data" : {
        "value" : "OH",
        "prior": "N/A",
      }
    },
    "school" : {
      "data" : {
        "value" : "wolf high school",
        "graduated" " true
      }
    },
    "region" : {
     "data" : {
        "value" : "IV",
        "category" : "Division 4"
      }
    }
  }
},
{
  "_source" : {
    "status": "complete",
    "name" : {
      "data" : {
        "value" : "John Doe",
        "gender" : "male"
      }
    },
    "state" : {
      "data" : {
        "value" : "PA",
        "prior" : "LA"
      }
    },
    "school" : {
      "data" : {
        "value" : "timber high school",
        "graduated" " true
      }
    },
    "region" : {
      "data" : {
        "value" : "III",
        "category" : "Division 3"
      }
    }
    "schoolType" : {
      "data" : {
        "value" : "public",
        "county" : False
      }
    }
  }
}

What my goal is to reindex to a new index with the records being clean up like this:

 {
    "name" : "Jane Doe",
    "state" : "OH",
    "school" : "wolf high school",
    "region" : "IV",
},
{
    "name" :  "John Doe",
    "state" : "PA",
    "school" : "timber high school",
    "region" : "III",
    "schoolType.data.value" : "Public"
}

My reindex looks currently looks like below. Please bear in mind, I'm just getting started with painless. So I don't know if there is a regex pattern or something better I can use.

There must be a cleaner way to handle this. Because, as the number of fields grow in the first index, it would be preferred not to have spell out each filed in the current script.

Is there a regex or a way to iterate over all the top fields, so I can perform the same call each field? I know I will always want the path fieldname.data.value from each field.

POST _reindex
{
  "source": {
    "index": "test-student-info",
    "_source": {
        "includes": [ "*.data.value" ],
        "excludes": [ "*.sources" ]
    },
    "query" : {
      "term" : { "status.keyword" : "complete" }
    }
  },
  "dest": {
    "index": "test-studen-info2"
  },
  "script": {
    "inline": "ctx._source.name = ctx._source.name.data.value; ctx._source.state = ctx._source.state.data.value; ctx._source.school = ctx._source.school.data.value; ctx._source.region = ctx._source.region.data.value",
    "lang": "painless"
  }
}

Hey,

your painless script looks good so far, nice work. So what you could do here is this

DELETE my_index

PUT my_index/_doc/1?refresh=true
{
  "name": {
    "data": {
      "value": "Jane Doe",
      "gender": "female"
    }
  },
  "whatever": {
    "data": {
      "value": "whatever"
    }
  },
  "will_not_be_imported": {
    "foo" :"bar"
  }
}

POST _reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_other_index"
  },
  "script": {
    "lang": "painless",
    "source": """
def keys = ctx._source.keySet();    
for (String key : new HashSet(keys)) {
  if (ctx._source[key]['data'] != null && ctx._source[key]['data']['value'] != null) {
    ctx._source[key] = ctx._source[key]['data']['value'];
  } else {
    ctx._source.remove(key);
  }
}
"""
  }
}

GET my_other_index/_doc/1
1 Like

i couldn't figure out how to get the keys from the map and now I understand it.

The results are being populated the way we need them and I marked your answer as the solution.

Thanks,

Dean

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.