I'm looking at keeping two documents in the backing database (CouchDB) for
some items - one would be small, storing the main information, and the
other larger (text content). The reason is to allow revisions to be
stored for the small document without duplicating the large document each
time. I'm using the CouchDB _changes based river.
I'd like to be able to run a query that also matches against a set of nested
ids, e.g. something like: name:Bill nested: { ids: [42,53,35], term: "books" }
The small document would know which large documents it referred to at the
time of indexing (but not the other way round, probably).
This seems to map quite closely to the nested document concept
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html) as
the nested content is stored as a separate document. However, that
implementation indexes each nested document in the same "block" as its
parent, which would cause trouble here as the large documents could
be stored anywhere.
I imagine there is no way to do this at the moment, but perhaps one of:
a) it's an interesting idea that could be implemented by someone at some
point
b) there is another way to achieve a similar result
c) it's a stupid idea, go away
If you can index the "small" document as parent, and large document as
child, then you can use the parent child support and has_child query /
filter. Will that help?
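
For illustration, a search body combining a name match on the parent with a
has_child clause might be built roughly as below. This is only a sketch: the
child type name "large_doc" and the field names "name" and "content" are
assumptions made up for the example, and the has_child syntax shown is the
type/query form used by Elasticsearch releases of this era.

```python
import json


def has_child_search(name_query, child_type, child_term):
    """Build a search body: parents matching name_query that also have
    at least one child of child_type whose content contains child_term."""
    return {
        "query": {
            "bool": {
                "must": [
                    # Match on the small (parent) document itself.
                    {"query_string": {"query": name_query}},
                    # Match on the large (child) document.
                    {
                        "has_child": {
                            "type": child_type,  # hypothetical type name
                            "query": {"term": {"content": child_term}},
                        }
                    },
                ]
            }
        }
    }


body = has_child_search("name:Bill", "large_doc", "books")
print(json.dumps(body, indent=2))
```

Note that this only works if each child was indexed with its parent id, which
is the requirement discussed in the next message.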
Ah, I missed the parent/child feature - I read http://www.elasticsearch.org/guide/reference/mapping/parent-field.html but
it sounded like a way to relate a mapping inside another mapping (rather
than a document inside another document). Unfortunately, though, it's the
wrong way round - when indexing the child document I wouldn't know the
parent it related to. It's just a blob of content, and multiple parent
documents could refer to it after the child document has been indexed (like
revisions of the parent document).
I'm not sure it's going to be solvable though - the routing would seem
tricky as the parent/child relationship is actually many-to-many (revisions
of a parent all point to the same child, and each parent could have
multiple children). Perhaps if it was limited to ONE child per parent (like
in China) the routing could be based on the child id alone, however this
doesn't seem a good policy.
Another approach would be to insert the child content into the
parent in the river process - it would have to go and fetch the child
content and pop it into a field in the parent. This would work fine for me
as I'm more concerned about not storing multiple copies of unchanging child
content in the backing database than about re-indexing it when the parent
changes.
Does the lang-javascript plugin allow me to do things like make an HTTP
request and pull in more content? (using the "script" field in the couchdb
river).
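
Outside the river, the "fetch the child content and pop it into the parent"
step could be sketched as follows. This is an assumption-laden illustration,
not river code: the field names "child_ids" and "child_content" are invented
here, and fetch_child stands in for whatever actually retrieves the large
document from CouchDB.

```python
def inline_children(parent, fetch_child, field="child_content"):
    """Return a copy of the parent document with the text of every
    referenced child copied into `field`, ready to be indexed.

    The parent stored in CouchDB is left untouched, so the large
    content is still only stored once in the backing database.
    """
    merged = dict(parent)
    merged[field] = [fetch_child(cid) for cid in parent.get("child_ids", [])]
    return merged


# In-memory stand-in for the large documents held in CouchDB.
large_docs = {"42": "a long text about books", "53": "another long text"}

parent = {"_id": "p1", "name": "Bill", "child_ids": ["42", "53"]}
to_index = inline_children(parent, large_docs.__getitem__)
```

The trade-off is the one described above: the child text is re-indexed with
every parent revision, but it is never duplicated in the database.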
Another approach that might work for me is if I could change the "id"
inside the river script - it looks like it would just involve getting the
"id" from the ctx variable after the script has run in this bit of code
inside the couchdb river: https://gist.github.com/1510453.
I made a version of the couchdb river which lets you update the id inside
the script. However, ultimately I've concluded using the river causes more
difficulties than just indexing directly (sending documents to es and
couchdb at the same time) so I have abandoned it.
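
A rough sketch of what indexing directly could look like: the application
builds one PUT for CouchDB and one for Elasticsearch from the same document.
The host names, the database name "items", and the index/type "items/item"
are all assumptions for the example, and the requests are only constructed
here, not sent.

```python
import json
from urllib.request import Request

COUCH_DB = "http://localhost:5984/items"      # hypothetical CouchDB database
ES_TYPE = "http://localhost:9200/items/item"  # hypothetical index and type


def build_writes(doc_id, doc):
    """Build the two PUT requests that store `doc` in CouchDB and
    index it in Elasticsearch under the same id."""
    payload = json.dumps(doc).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return [
        Request(f"{base}/{doc_id}", data=payload, headers=headers, method="PUT")
        for base in (COUCH_DB, ES_TYPE)
    ]


writes = build_writes("p1", {"name": "Bill", "child_ids": ["42"]})
# To send them: urllib.request.urlopen(w) for each w in writes.
# Updating an existing CouchDB document also needs its current _rev.
```

Since the application controls both writes, it can choose the id and the
content sent to Elasticsearch freely, which is what the river made awkward.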