Hi All,
Found a nice solution to the problem stated here, hence reviving this lost
thread. Just thought of sharing so that it might help someone.
Problem Statement:
We process huge sets of documents from one index (primary) and push the
processed documents (with new fields, or even existing fields with a new
mapping) into a new index (secondary).
We use scroll to retrieve docs, but a problem arises when the scroll breaks
due to a network problem or any other issue.
The problem we were trying to solve is finding, at any given point in time,
the unprocessed docs, that is, docs that exist in the primary index but not
in the processed secondary index, and retrieving them via scroll.
ES does not have any direct MINUS / EXCEPT query to find the difference of
two indexes.
We were not comfortable overwriting the primary index itself with the
processing changes, as we often needed mapping or analyzer changes.
Solution:
Instead of creating a secondary index, we create a secondary type in the
same index, with a _parent mapping pointing to the primary type.
We put the necessary mapping changes in the secondary type, start a scroll on
the primary type, and push the processed data into the secondary type via
bulk insert.
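A minimal Python sketch of the mapping and bulk body this describes. The helper names, and the choice to reuse the primary doc's id as the child id, are assumptions for illustration only. The `_parent` key in the bulk action line is how parent/child bulk indexing was expressed in Elasticsearch versions of this era, so check it against the version in use.

```python
import json


def secondary_mapping(primary_type, secondary_type):
    """Mapping body that declares secondary_type as a child of primary_type."""
    return {secondary_type: {"_parent": {"type": primary_type}}}


def bulk_body(index, secondary_type, processed_docs):
    """Build a newline-delimited bulk body from (parent_id, doc) pairs.

    Assumption: the child reuses the parent's id, so reprocessing a doc
    overwrites its earlier child instead of adding a second one.
    """
    lines = []
    for parent_id, doc in processed_docs:
        action = {
            "index": {
                "_index": index,
                "_type": secondary_type,
                "_id": parent_id,
                "_parent": parent_id,
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The mapping body would be sent to the secondary type's _mapping endpoint before any child docs are indexed, and the bulk body to _bulk.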
In the scroll, we use the following query to get docs which are in the
primary type but not in the secondary type:
POST localhost:9200/# index/# primary_type/_search?search_type=scan
{
  "filter": {
    "not": {
      "has_child": {
        "type": "# secondary_type",
        "query": {
          "filtered": {
            "query": {
              "match_all": {}
            }
          }
        }
      }
    }
  }
}
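For illustration, a rough Python sketch of driving a scan/scroll with this query. The two callables `start_scan` and `next_page` are hypothetical wrappers around the HTTP calls (the initial _search request and the follow-up scroll requests) and are not part of any client library. Only the query body comes from the request above.

```python
def unprocessed_query(secondary_type):
    """Body matching primary docs that have no child of secondary_type."""
    return {
        "filter": {
            "not": {
                "has_child": {
                    "type": secondary_type,
                    "query": {"filtered": {"query": {"match_all": {}}}},
                }
            }
        }
    }


def iter_unprocessed(start_scan, next_page, secondary_type):
    """Yield hits for unprocessed docs until the scroll is exhausted.

    start_scan(body) -> scroll_id            (hypothetical HTTP wrapper)
    next_page(scroll_id) -> (scroll_id, hits) (hypothetical HTTP wrapper)
    """
    scroll_id = start_scan(unprocessed_query(secondary_type))
    while True:
        scroll_id, hits = next_page(scroll_id)
        if not hits:
            break  # an empty page means the scroll is finished
        for hit in hits:
            yield hit
```

If the scroll breaks midway, calling iter_unprocessed again starts a fresh scan that skips every doc whose child has already been indexed.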
Finding the difference between two ES data sets at any point in time
(something like the SQL MINUS or EXCEPT clause) is thus possible.
It also lets us reprocess selected documents, if required, by modifying
the has_child filter accordingly.
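One way such a modified filter might look, as a hedged Python sketch: drop the "not" and give has_child a narrower child query, so the scroll returns primary docs whose processed child matches some condition. The field name `processing_version` is hypothetical and only stands in for whatever marks a child as needing reprocessing.

```python
def reprocess_query(secondary_type, child_query):
    """Body matching primary docs that have a child matching child_query."""
    return {
        "filter": {
            "has_child": {
                "type": secondary_type,
                "query": child_query,
            }
        }
    }


# Hypothetical example: primary docs whose child was written by version 1
# of the processing code.
outdated = reprocess_query("processed", {"term": {"processing_version": 1}})
```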
-- Sujoy.
On Tuesday, October 23, 2012 8:01:19 AM UTC+5:30, Igor Motov wrote:
No, elasticsearch cannot calculate difference between two indices.
Actually, I cannot think of any operation (except union) that elasticsearch
can perform with two or more indices.
You don't have to modify twitter river code. You can simply create a new
index with desired mapping before creating twitter river.
On Monday, October 22, 2012 9:48:41 PM UTC-4, Sujoy Sett wrote:
Thanks Igor, that can be quite useful. In fact, I had thought of
assigning IDs to all the docs in sequential rather than random order, as in
a DB, to achieve the same. My first river is the twitter river, and adding a
timestamp means editing its code for mapping.
It's really not a problem to modify the river, but just curious: doesn't
elasticsearch have any built-in logic to find such a difference?
Thanks,
-- Sujoy.
On Tuesday, October 23, 2012 1:40:10 AM UTC+5:30, Igor Motov wrote:
The first thing that comes to mind is to add a timestamp field to the
index A records and retrieve only records that were added/modified after
the last synchronisation time.
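As a rough illustration of that suggestion, a Python sketch that builds a range query for records changed after the last sync. The field name `timestamp` and the ISO date string are assumptions; the field would have to be added to index A's mapping and populated at index time.

```python
def modified_since_query(last_sync, field="timestamp"):
    """Body matching docs whose timestamp field is later than last_sync.

    Both the field name and the value format are assumptions here.
    """
    return {"query": {"range": {field: {"gt": last_sync}}}}


body = modified_since_query("2012-10-22T20:00:00")
```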
On Monday, October 22, 2012 4:02:54 PM UTC-4, Sujoy Sett wrote:
Hi All ....
I have two rivers working simultaneously .... twitter river fetching
data and indexing in index A , and another custom river fetching data from
index A, doing necessary processing and indexing it to index B.
At a certain instant, let's say, index A has x docs, and index B has y
docs, and now I want to fetch those (x-y) docs ..... those which are in
index A and not in index B. The doc counts are in the millions.
I have tried using scroll to fetch ids from destination index B and
matching them against a scroll of source index A, but it is quite time
consuming.
Is there any better way to achieve this?
Thanks.
Sujoy.
On Monday, September 17, 2012 5:20:29 PM UTC+5:30, Sujoy Sett wrote:
Hi Clint,
Thanks again.
Too bad ... It seems I have to fall back on good old scroll once again.
Regards,
Sujoy.
On Monday, September 17, 2012 4:46:06 PM UTC+5:30, Clinton Gormley
wrote:
> I am assuming that ordering is getting done first on both the indexes
> separately, and then merging done, instead of merging results from
> both index first and ordering on combined results.
Correct.
https://github.com/elasticsearch/elasticsearch/issues/1305
> The query I stated above is giving expected result up to a certain point
> of time after which the ordering goes wrong.
The only way around it currently is to ask for many more terms than you
actually need... which will also use more RAM
clint
> Any help?
> Thanks in advance,
> Sujoy.
On Monday, September 17, 2012 4:33:59 PM UTC+5:30, Clinton Gormley
wrote:
> I will surely try that. I presume there is no way other than
> re-indexing to change the mapping of _id field to {"store": "yes"} for
> already existing docs? Currently nothing is mapped as such, so by
> default it is probably not stored.
Correct. You have to reindex
clint
>
>
> Regards,
> Sujoy.
>
> On Monday, September 17, 2012 2:02:56 PM UTC+5:30, Clinton Gormley
> wrote:
> Hi Sujoy
>
> > Sorry if I am asking a too obvious question, but is term facet
> > possible on the _id field of an index?
>
> It is possible, but not by default. You would have to reindex your
> indices and map the _id field to { "store": "yes" }
>
> > I can do this by running a facet query simultaneously on both indices
> > with reverse_count on any unique field belonging to both the indices,
> > and the responses with count 1 are my result. I am currently doing
> > this by indexing the _id also as a field in the _source of the
> > documents, but the easier way would be a facet on _id.
>
> That's rather a nice approach. One warning though: your IDs are unique
> values, which mean that you have to load a LOT of unique terms to facet
> on the _id field. You may well run out of memory in the future, when
> you try the same thing with millions of docs.
>
> clint
You received this message because you are subscribed to the Google Groups "elasticsearch" group.