Hey Jörg,
Thanks a lot for your response. It's as helpful as many of your ES plugins.
The solution of using multi-get, with different type mappings and the same ID
for different docs, is interesting. However, it will only work when the two
types of documents have a 1:1 mapping between them. I am specifically looking
for a many:1 mapping here, so assigning the same ID to differently typed docs
does not help me.
Generic merging on arbitrary fields, as you mentioned, is what I am looking for.
In the absence of pre-defined graph layouts/frames, what options do I have?
Please note that my current solution is needed for a data export utility,
not real-time search, so a minor delay is acceptable.
I am thinking of doing a scroll on Index1, extracting the required field from
each returned batch, and then running a terms query on Index2. I am thinking
of implementing this logic as a generic plugin. Any idea how to keep a scroll
open on Index2? The search terms will keep changing; is there any way to open
a scroll with a match_all query and then pull only the data matching the
current search terms from it?
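Roughly, this is what I have in mind (the index names index1/index2 and the
join field "ref_id" below are just placeholders):

# open a scroll on Index1 with match_all, 1 minute keep-alive
curl -XGET 'localhost:9200/index1/_search?scroll=1m&size=100' -d '
{
  "query" : { "match_all" : {} },
  "fields" : [ "ref_id" ]
}
'
# continue the scroll with the scroll_id returned by the previous call
curl -XGET 'localhost:9200/_search/scroll?scroll=1m&scroll_id=<scroll_id>'
# for each batch, collect the ref_id values and look them up in Index2
curl -XGET 'localhost:9200/index2/_search' -d '
{
  "query" : { "terms" : { "ref_id" : [ "value1", "value2" ] } }
}
'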
Also, any hint on whether the parent/child document concept could be helpful
in such a scenario? Index1 is the primary index; Index2 is a secondary index
on unique values of some field of Index1. Index1:Index2 = many:1.
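For reference, if I understand the docs correctly, the mapping for such a
parent/child setup would look something like this (the index and type names
are hypothetical, and both types would have to live in the same index):

# declare child_type docs as children of parent_type
curl -XPUT 'localhost:9200/test2/child_type/_mapping' -d '
{
  "child_type" : {
    "_parent" : { "type" : "parent_type" }
  }
}
'
# index a child doc against a given parent ID
curl -XPUT 'localhost:9200/test2/child_type/1?parent=42' -d '
{
  "some_field" : "some value"
}
'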
Thanks and Regards
Jagdeep
On Monday, December 24, 2012 6:14:50 PM UTC+5:30, Jörg Prante wrote:
With your strategy, your required search time will grow more than linearly.
If you don't have strict upper limits on the number of involved indexes,
fields, and join operations, it will not scale.
Currently, there is a static, doc-ID-based merge, called multi get, with a
strict upper limit. If your resources are organized by ID, you can perform a
multi get like this:
curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test/data/1' -d '
{
  "sentence" : "Hello World"
}
'
curl -XPUT 'localhost:9200/test/control/1' -d '
{
  "done" : true
}
'
curl -XGET 'localhost:9200/test/_refresh'
echo
echo "mget"
curl -XPOST 'localhost:9200/test/_mget?pretty' -d '
{
  "docs" : [
    { "_type" : "data", "_id" : "1" },
    { "_type" : "control", "_id" : "1" }
  ]
}
'
For more generic merging on arbitrary fields, I am working on graph-based
merging.
This graph-based approach should work by indexing graphs with resource IDs
(IRIs) known from the W3C Resource Description Framework.
In JSON-LD, you can define IRI-based IDs and graphs in JSON. In
Elasticsearch, graphs can be represented by document sets. It is up to you
how to organize index/type coordinates for your document sets.
A graph contains resources that are identifiable by IDs. Such IDs could
also serve as Elasticsearch doc IDs and could be indexed in Elasticsearch
beforehand.
By defining so-called graph layouts or "frames", JSON-LD allows a rule-based
expansion of graphs in JSON documents. In Elasticsearch, such a framing
algorithm could be performed at index time.
For example, when indexing a library / book / chapter hierarchy, you first
index each entity on its own, and then you index a hierarchy document
containing only the IDs as the graph layout. See also
https://github.com/ruby-rdf/json-ld (Ruby JSON-LD reader/writer for RDF.rb).
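As a rough sketch (the index/type names and IRIs below are made up for
illustration), such pre-framed documents could be indexed like this:

# each entity is indexed on its own, carrying its IRI as "@id"
curl -XPUT 'localhost:9200/library/book/1' -d '
{
  "@id" : "http://example.org/book/1",
  "title" : "Example Book"
}
'
curl -XPUT 'localhost:9200/library/chapter/1' -d '
{
  "@id" : "http://example.org/chapter/1",
  "title" : "Chapter One"
}
'
# the hierarchy document contains only the IDs, acting as the graph layout
curl -XPUT 'localhost:9200/library/hierarchy/1' -d '
{
  "@id" : "http://example.org/library/1",
  "books" : [
    {
      "@id" : "http://example.org/book/1",
      "chapters" : [ "http://example.org/chapter/1" ]
    }
  ]
}
'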
If you lay out your JSON-LD docs in the Elasticsearch index already at index
time, and not later at search time, search remains scalable: no additional
merge queries and no additional joins, which would otherwise hit you very hard
once you have to cope with a growing number of searches. Don't expect that
adding more ES nodes or more CPU/memory will help then, because it is the
algorithm that will get out of control.
Best regards,
Jörg