How can I handle duplicate data in Elasticsearch?

I have used parent/child mapping to normalize my data, but as far as I
understand there is no way to get any fields from the _parent document.

Here is the mapping of my index:

{
  "mappings": {
    "building": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    },
    "flat": {
      "_parent": {
        "type": "building"
      },
      "properties": {
        "name": {
          "type": "string"
        }
      }
    },
    "room": {
      "_parent": {
        "type": "flat"
      },
      "properties": {
        "name": {
          "type": "string"
        },
        "floor": {
          "type": "long"
        }
      }
    }
  }
}

Now I'm trying to find the best way to store flat_name and building_name
in the room type. I won't query these fields, but I need to get them back
when I query other fields such as floor. There will be millions of rooms and
I don't have much memory, so I suspect that these duplicate values may cause
out-of-memory errors. I found the _source field in the Elasticsearch
documentation, and it seems like a good way to store these kinds of values.
Do you have any efficient suggestions for avoiding duplicate values, such as
running multiple queries or some hacky way to get fields from the _parent
document? Or is denormalizing the data the only way to handle this kind of
problem?

--

Hello,

Assuming you don't have to update rooms and flats too often, you might be
better off with nested documents.
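
To make the nested option concrete, here is a minimal sketch of what it could look like. The field names "flats" and "rooms" and the helper build_building_doc are hypothetical and only for illustration; the idea is that each building becomes one document that carries its flats and rooms, so the names are stored once per building rather than once per room.

```python
from collections import defaultdict

# Hypothetical nested mapping: one building document carries its flats,
# and each flat carries its rooms. "flats" and "rooms" are assumed names.
NESTED_MAPPING = {
    "mappings": {
        "building": {
            "properties": {
                "name": {"type": "string"},
                "flats": {
                    "type": "nested",
                    "properties": {
                        "name": {"type": "string"},
                        "rooms": {
                            "type": "nested",
                            "properties": {
                                "name": {"type": "string"},
                                "floor": {"type": "long"},
                            },
                        },
                    },
                },
            }
        }
    }
}


def build_building_doc(building_name, rooms):
    """Fold (flat_name, room_name, floor) tuples into one nested document."""
    by_flat = defaultdict(list)
    for flat_name, room_name, floor in rooms:
        by_flat[flat_name].append({"name": room_name, "floor": floor})
    return {
        "name": building_name,
        "flats": [
            {"name": flat, "rooms": flat_rooms}
            for flat, flat_rooms in by_flat.items()
        ],
    }


doc = build_building_doc(
    "building1",
    [("flat1", "room1", 2), ("flat1", "room2", 2), ("flat2", "room3", 3)],
)
```

The trade-off is the one mentioned above: changing a single room means reindexing the whole building document, which is why this fits best when rooms and flats are rarely updated.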

Back to your original question, you could store the ID of the parent in the
child when you insert it. Then, you could use the GET or Multi Get[0] API
to retrieve the parents once you have the children.

Another possibility, which is just as hackish, is to use the IDs of
children that you search for, and use the has_child filter[1] to get the
parent with that ID. You might want to use an IDs query[2] within that.
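
As a rough sketch of that second approach, the request body could be built like this. It assumes the has_child query form with an ids query inside it (rather than the filter form), and the helper name parents_of_children is made up for the example.

```python
import json


def parents_of_children(child_type, child_ids):
    """Build a search body matching parents that have any of the given children."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"ids": {"values": list(child_ids)}},
            }
        }
    }


# Flats that are parents of rooms "1" and "2". The body would be sent to
# the _search endpoint of the parent type (flat), not the child type.
body = parents_of_children("room", ["1", "2"])
print(json.dumps(body, indent=2))
```

Note that the response only contains the matching parents; it does not say which child belongs to which parent, so that link would have to be tracked separately.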

[0] http://www.elasticsearch.org/guide/reference/api/multi-get.html
[1] http://www.elasticsearch.org/guide/reference/query-dsl/has-child-filter.html
[2] http://www.elasticsearch.org/guide/reference/query-dsl/ids-query.html

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Thu, Dec 27, 2012 at 11:40 PM, Burak Emre Kabakcı
<emrekabakci@gmail.com> wrote:

--

Hi again :)

Actually, you don't need to filter to find the parent, since you have the
parent IDs when you search for children. You just have to add "_parent" to
the list of fields you want to be returned when you search. For example:

curl -XPOST 'localhost:9200/test/room/_search?pretty=true' -d '{
  "fields": ["_source", "_parent"],
  "query": {
    "match_all": {}
  }
}'

May return hits like this:
{
  "_index" : "test",
  "_type" : "room",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {"name": "room1", "floor": 2},
  "fields" : {
    "_parent" : "1"
  }
}

More details here:
http://www.elasticsearch.org/guide/reference/api/search/fields.html

So once you have the "child" results, you can use the Multi Get API to get
the needed parents. No need to duplicate the parent ID in another field or
other such hacks.
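
Here is a minimal sketch of that last step: collecting the distinct "_parent" values from the hits and turning them into a Multi Get request body. The helper name mget_body_for_parents and the sample hits are made up for illustration; the hits follow the shape shown above.

```python
def mget_body_for_parents(hits, index, parent_type):
    """Collect distinct parent IDs from search hits into a Multi Get body."""
    seen = set()
    docs = []
    for hit in hits:
        parent_id = hit.get("fields", {}).get("_parent")
        if parent_id is None or parent_id in seen:
            continue  # skip hits without a parent and parents already listed
        seen.add(parent_id)
        docs.append({"_index": index, "_type": parent_type, "_id": parent_id})
    return {"docs": docs}


# Sample hits in the shape returned by the search above (made-up data).
hits = [
    {"_index": "test", "_type": "room", "_id": "1", "fields": {"_parent": "1"}},
    {"_index": "test", "_type": "room", "_id": "2", "fields": {"_parent": "1"}},
    {"_index": "test", "_type": "room", "_id": "3", "fields": {"_parent": "7"}},
]
body = mget_body_for_parents(hits, "test", "flat")
```

The resulting body would be posted to the _mget endpoint. One thing this sketch does not handle: flat documents are themselves children of building, so fetching them by ID may also need routing information, and getting building names takes a second round of the same kind.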

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Fri, Dec 28, 2012 at 5:05 PM, Radu Gheorghe
<radu.gheorghe@sematext.com> wrote:
