Merging multi-search results from two Elasticsearch indexes

Hi,

I have been using ES for various scenarios and use cases, some of which
even include text mining. In almost all of these scenarios there is some
need for a data join or for merging responses.

In my current requirement I am processing the data through multiple
processing pipelines. For example: Index 1 has an email id; processing
pipeline 1 takes the email id and puts data into Index 2. Its structure is
somewhat like this:
{
"email":"abc@abc.com",
"isProcessedByPipeLine1":"false"
}

I know it's not advisable to update the index, but these processes are
running continuously, so I have to mark these records as processed. To do
this I am not updating a single field, i.e. "isProcessedByPipeLine1", but
overwriting the whole document with:
{
"email":"abc@abc.com",
"isProcessedByPipeLine1":"true"
}

I am keeping separate indexes, and not just overwriting data in the same
index, because I have multiple data processing pipelines running
continuously and in parallel.
I also want to export the data to CSV or another structured format. While
exporting, I want to merge the responses from multiple indexes on email id.

Here is what I am planning to do:

  1. A REST interface which accepts an array of indexes (all the indexes
    whose results need to be merged) and the field on which the join
    condition is to be performed
  2. Get the data from one index, extract the join field, and then search
    the remaining indexes
  3. Merge the JSON responses from all searches and deliver the result

I don't think this is a good/optimized solution, because it is not just a
simple search; I am trying to dump the whole data set from all the indexes.
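To make step 3 concrete, here is a minimal client-side sketch of the merge (function and variable names are hypothetical, not from any ES client library): given the `_source` documents already fetched from each index, group them by the join field and emit one merged record per key.

```python
from collections import defaultdict

def merge_by_field(responses, join_field):
    """Merge hit lists from several indexes into one record per join key.

    `responses` maps an index name to a list of `_source` dicts; the result
    maps each join-field value to a single dict of merged fields.
    """
    merged = defaultdict(dict)
    for index_name, hits in responses.items():
        for source in hits:
            key = source.get(join_field)
            if key is None:
                continue  # skip documents that lack the join field
            merged[key].update(source)
    return dict(merged)

# Two indexes sharing the "email" field, as in the pipeline example above
responses = {
    "index1": [{"email": "abc@abc.com", "name": "abc"}],
    "index2": [{"email": "abc@abc.com", "isProcessedByPipeLine1": "true"}],
}
rows = merge_by_field(responses, "email")
# rows["abc@abc.com"] now carries fields from both indexes
```

Note that on a field-name collision the last index wins; a real export utility would probably prefix fields with the index name. Each merged record can then be written out as one CSV row.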

Any suggestions about overall approach or about data merge will be really
helpful.

Thanks and regards
Jagdeep

--

With your strategy, your required search time will grow more than linearly.
If you don't have strict upper limits on the number of involved indexes and
fields and join ops, it will not scale.

Currently, there is a static doc ID based merge, called multi get, with a
strict upper limit. If your resources are organized by ID, you can perform
multi get like this:

curl -XDELETE 'localhost:9200/test'

curl -XPUT 'localhost:9200/test/data/1' -d '
{
"sentence" : "Hello World"
}
'

curl -XPUT 'localhost:9200/test/control/1' -d '
{
"done" : true
}
'

curl -XGET 'localhost:9200/test/_refresh'

echo
echo "mget"
curl -XPOST 'localhost:9200/test/_mget?pretty' -d '
{
"docs" : [
{ "_type" : "data", "_id" : "1"},
{ "_type" : "control", "_id" : "1"}
]
}
'

For more generic merging on arbitrary fields, I am working on graph-based
merging.

This graph-based approach should work by indexing graphs with resource IDs
(IRIs) known from the W3C Resource Description Framework.

In JSON-LD, you can define IRI-based IDs and graphs in JSON. In
Elasticsearch, graphs can be represented by document sets. It is up to you
how to organize index/type coordinates for your document sets.

A graph contains resources that are identifiable by IDs. Such IDs could
also serve as Elasticsearch doc IDs and could be indexed in Elasticsearch
beforehand.

By defining so-called graph layouts or "frames", JSON-LD allows a
rule-based expansion of graphs in JSON documents. In Elasticsearch, such a
framing algorithm could be performed at index time.

For example, indexing a library / book / chapter hierarchy, you first index
each entity for itself, and then you index a hierarchy document containing
only the IDs as the graph layout. See also

https://github.com/gkellogg/json-ld/#frame-a-document
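A rough sketch of that idea in code (the entity names and the `frame` helper are invented for illustration, not actual JSON-LD tooling): each entity is indexed as its own document, the hierarchy document carries only IDs, and framing splices the entities back in.

```python
# Each entity indexed for itself, keyed by an IRI-style ID.
entities = {
    "lib:1":  {"@id": "lib:1", "name": "City Library"},
    "book:1": {"@id": "book:1", "title": "Hello World"},
    "ch:1":   {"@id": "ch:1", "heading": "Intro"},
}

# The hierarchy document: the graph layout, containing only IDs.
layout = {
    "@id": "lib:1",
    "books": [{"@id": "book:1", "chapters": [{"@id": "ch:1"}]}],
}

def frame(node):
    """Expand a layout node by splicing in the referenced entity fields."""
    doc = dict(entities.get(node["@id"], {}))
    for field, children in node.items():
        if field != "@id":
            doc[field] = [frame(child) for child in children]
    return doc

framed = frame(layout)
```

Running the framing at index time means the fully expanded document is what gets searched, so no per-query merge is needed.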

If you lay out your JSON-LD docs in the Elasticsearch index already, and
not later at search time, the search stays scalable: no additional merge
queries and no additional joins, which would otherwise challenge you very
hard once you need to cope with a growing number of searches. Don't think
adding more ES nodes or more CPU/mem will help then, because it's the
algorithm that will get out of control.

Best regards,

Jörg

--

Hey Jörg, thanks a lot for your response. It's as helpful as many of your
ES plugins :slight_smile:

The solution of using multi get, with different type mappings and the same
ID for different docs, is interesting. However, it will only work when the
two types of documents have a 1:1 mapping between them. I am specifically
looking for a many:1 mapping here, so assigning the same ID to differently
typed docs does not help me.

Generic merging on arbitrary fields, as you described, is what I am looking
for. In the absence of pre-defined graph layouts/frames, what options do I
have? Please note that my current solution is needed for a data export
utility, not real-time search, so a minor delay is acceptable.

I am thinking of doing a scroll on Index1, extracting the required field
from each returned batch, and doing a terms query on Index2. I am thinking
of implementing the logic as a generic plugin. Any idea on how to keep a
scroll open on Index2? The search terms will keep changing; is there any
way to make a scroll with a match_all query and get the data corresponding
to the search terms from it?
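The scroll-batch-to-terms-query step might be sketched like this (field names and document shapes are placeholders); each generated body would then be POSTed to Index2's `_search` endpoint:

```python
def batch_to_terms_query(batch_hits, join_field):
    """Build a terms query on Index2 from one scroll batch of Index1 hits."""
    values = sorted({hit["_source"][join_field]
                     for hit in batch_hits
                     if join_field in hit["_source"]})
    return {"query": {"terms": {join_field: values}}}

# One scroll batch from Index1
batch = [
    {"_source": {"email": "abc@abc.com"}},
    {"_source": {"email": "xyz@xyz.com"}},
    {"_source": {"email": "abc@abc.com"}},  # duplicate collapses to one term
]
body = batch_to_terms_query(batch, "email")
```

This avoids keeping a scroll open on Index2 at all: Index2 is hit with a fresh terms query per batch, which sidesteps the changing-search-terms problem.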

Also, any hint on whether the parent/child doc concept could be helpful in
such a scenario? Index1 is the primary index; Index2 is a secondary index
on the unique values of some field of Index1. Index1:Index2 = many:1.

Thanks and Regards
Jagdeep

On Monday, December 24, 2012 6:14:50 PM UTC+5:30, Jörg Prante wrote:


--

Yes, of course you can cascade queries: the first one is a key lookup, and
the second one is a multi get query, or just another query. This is
application dependent, no special ES solution. Merging docs happens only on
the client side. Except for multi get, there is no server-side merge in ES.

You can also scroll. But be aware that the scroll approach is costly if you
scan through a result set with only a few hits.
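As a sketch of that cascade (pure client-side; the helper name is hypothetical): the first query returns the keys, which are turned into the body of the second, multi get, request, and merging the returned docs stays with the client.

```python
def build_mget_body(ids, doc_type):
    """Second step of the cascade: an _mget body from the looked-up keys."""
    return {"docs": [{"_type": doc_type, "_id": doc_id} for doc_id in ids]}

# Suppose the first (key lookup) query returned these IDs:
keys = ["1", "2"]
body = build_mget_body(keys, "control")
# body is ready to POST to /index/_mget
```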

There is no parent/child doc merge support on the ES side, and there is no
parent/child "join".

Parent/child is rather limited to modeling data in a strict hierarchy. I
found that a use case suited to parent/child should have the following
characteristics:

  • you have a number of "main documents" and want to assign "sub
    documents" to them later. It's a 1:N relationship; a child can have at
    most one parent doc

  • addressing: you want to retrieve the parent doc even when searching in
    children-only fields

  • you may need to run queries against parents and children independently

  • it is convenient for a child document to have only the few fields that
    carry a specific value for this child

  • the parent should carry all child fields which are not child-specific
    (for easy doc construction)

  • you mainly want to process children when the parent is involved
    ("top_children", for example), not as standalone documents
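The 1:N shape in the first point can be pictured client-side: each child carries at most one parent reference, so grouping children under parents is a plain dictionary (document shapes invented for illustration).

```python
from collections import defaultdict

# Each child references exactly one parent; a parent can have many children.
children = [
    {"_id": "c1", "_parent": "p1"},
    {"_id": "c2", "_parent": "p1"},
    {"_id": "c3", "_parent": "p2"},
]

by_parent = defaultdict(list)
for child in children:
    by_parent[child["_parent"]].append(child["_id"])
```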

If you want a generic parent/child doc solution, such as a tree, there is
overhead from the additional queries. But because ES keeps child docs in
the same shard as the parent, this overhead is kept as small as possible.
The standard ES operations on parent/child hide the complexity of the
queries from the user.

Jörg

--