Faceting on _id field

sujoysett · September 17, 2012, 6:19am

Hi,

Sorry if this is too obvious a question, but is a terms facet possible on
the _id field of an index?

The reason I am trying to facet an already unique field is this:
I want to find the documents that are in one index but not in another.
That is, Docs(index2) is a subset of Docs(index1).
And I want to find Docs(index1) MINUS Docs(index2).

I can do this by running a terms facet query across both indices at once,
ordered by reverse_count, on any unique field present in both indices; the
terms that come back with count 1 are my result. I currently do this by
also indexing the _id as a field in the _source of the documents, but a
facet on _id itself would be easier.

Is it possible?

Thanks in advance,
Sujoy.
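
The counting idea in the post above can be sketched in plain Python. This is only an illustration of the logic, not an Elasticsearch call: the ID lists and the function name are made up for the example, and it assumes IDs are unique within each index and that index2 is a subset of index1.

```python
from collections import Counter


def ids_in_only_one_index(ids_index1, ids_index2):
    """Return the IDs that occur exactly once across both ID lists."""
    counts = Counter(ids_index1)
    counts.update(ids_index2)
    # An ID present in both indices has count 2; count 1 means one index only.
    return sorted(term for term, count in counts.items() if count == 1)


# Hypothetical sample data: index2 holds a subset of index1's documents.
index1_ids = ["doc1", "doc2", "doc3", "doc4"]
index2_ids = ["doc2", "doc4"]

missing_from_index2 = ids_in_only_one_index(index1_ids, index2_ids)
print(missing_from_index2)
```

Because index2 is assumed to be a subset of index1, every ID with count 1 belongs to index1 only, which is the MINUS result asked for.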

--

sujoysett · September 17, 2012, 6:24am

Or can there be any simpler approach to my basic objective?
(For identifying documents that are in one index but not in another.)

Thanks,
Sujoy.

--

Clinton_Gormley · September 17, 2012, 8:32am

Hi Sujoy

> Sorry if I am asking a too obvious question, but is term facet
> possible on the _id field of an index?

It is possible, but not by default. You would have to reindex your
indices and map the _id field to { "store": "yes" }

> I can do this by running a facet query simultaneously on both indices
> with reverse_count on any unique field belonging to both the indices,
> and the responses with count 1 are my result. I am currently doing
> this by indexing the _id also as a field in the _source of the
> documents, but the easier way would be a facet on _id.

That's rather a nice approach. One warning though: your IDs are unique
values, which means that you have to load a LOT of unique terms to facet
on the _id field. You may well run out of memory in the future, when
you try the same thing with millions of docs.

clint

--

sujoysett · September 17, 2012, 10:20am

Thanks a lot Clint.

I will surely try that. I presume there is no way other than re-indexing to
change the mapping of _id field to {"store": "yes"} for already existing
docs? Currently nothing is mapped as such, so by default it is probably not
stored.

Regards,
Sujoy.

--

Clinton_Gormley · September 17, 2012, 11:03am

> I will surely try that. I presume there is no way other than
> re-indexing to change the mapping of _id field to {"store": "yes"} for
> already existing docs? Currently nothing is mapped as such, so by
> default it is probably not stored.

Correct. You have to reindex

clint

--

sujoysett · September 17, 2012, 11:12am

Another strange problem with the approach above. I am running a terms
facet on a field that stores unique values in both indexes.

My query looks like this:

http://localhost:9200/index1,index2/_search
{
    "from": 0,
    "size": 0,
    "query": {
        "match_all": {}
    },
    "facets": {
        "temporaryFacetName": {
            "terms": {
                "field": "fieldName",
                "order": "reverse_count",
                "size": 100
            }
        }
    }
}

But the reverse_count ordering is not working correctly, and neither is
count when used in its place (I tried that too, as a wild guess).

I assume that the ordering is done first on each index separately and the
results are then merged, instead of the results from both indexes being
merged first and the combined results ordered.

The query above gives the expected result up to a certain point, after
which the ordering goes wrong.

Any help?

Thanks in advance,
Sujoy.

--

Clinton_Gormley · September 17, 2012, 11:16am

> I am assuming that ordering is getting done first on both the indexes
> separately, and then merging done, instead of merging results from
> both index first and ordering on combined results.

Correct.

https://github.com/elasticsearch/elasticsearch/issues/1305

terms facet gives wrong count with n_shards > 1
Opened by jmchambers on 06 Sep 11, 09:32AM UTC; closed on 14 Jul 15, 08:29PM UTC.
Labels: >enhancement, high hanging fruit

I'm working with nested documents and have noticed that my faceted search interface is giving the wrong counts when I have more than one shard. To be more specific, I'm working with RDF triples (entity > attribute > value) and I'm nesting the attributes (called predicates in my example):

```
{
  "_id" : "512a2c022f0b4e3daa341e6c8bcf6c2f",
  "url": "http://dbpedia.org/resource/Alan_Shepard",
  "predicates": [
    { "type": "type", "string_value": ["thing", "person", "astronaut"] },
    { "type": "label", "string_value": ["Alan Shepard"] },
    { "type": "time in space", "float_value": [216.950] },
    ... lots more
  ]
}
```

I've created a shell script (https://gist.github.com/1196986) that recreates the problem with a fresh index. The created data set has these totals:

- thing (30)
- creative work (20)
- video game (10)
- tv show (10)
- people (10)

With only **one shard** the following query gives the correct counts no matter what the size parameter is set to:

```
{
  "size": 0,
  "query": { "match_all": {} },
  "facets": {
    "type_counts": {
      "terms": { "field": "string_value", "size": 5 },
      "nested": "predicates",
      "facet_filter": { "term": { "type": "type" } }
    }
  }
}
```

However, with **more than one shard** the size parameter affects the accuracy of the counts. If it is equal to or greater than the number of terms returned by the facet query (5 in this case) then it works fine. However, the terms at the bottom of the list start to display low counts as you reduce the size parameter:

With "size" : 4
- thing (30)
- creative work (20)
- video game (10)
- **tv show (9)**

With "size" : 3
- thing (30)
- **creative work (15)**
- **video game (9)**

With "size" : 2
- thing (30)
- **creative work (15)**

So it looks like the sub-totals from some of the shards aren't being included for some reason. BTW I'm on ubuntu and the problem seems to affect all versions of ES I've tried (17.0, 17.1 and 17.6). Any ideas...?

P.S. absolutely loving ES - it's made my life a lot easier :)

> The query I stated above is giving expected result up to a certain point
> of time, after which the ordering goes wrong.

The only way around it currently is to ask for many more terms than you
actually need... which will also use more RAM

clint
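
The effect confirmed above can be reproduced with a small simulation. This is a sketch in plain Python, not Elasticsearch code: it assumes each index (or shard) returns only its own `size` lowest-count terms, that ties are broken alphabetically, and that the coordinating node simply sums whatever it receives. The function names and sample IDs are invented for the illustration.

```python
from collections import Counter


def shard_top(terms, size):
    """Return the `size` lowest-count (term, count) pairs of one shard."""
    counts = Counter(terms)
    # reverse_count ordering: lowest count first; ties broken by term (assumed).
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return ordered[:size]


def merged_facet(shards, size):
    """Merge the truncated per-shard lists, then order and truncate again."""
    merged = Counter()
    for terms in shards:
        for term, count in shard_top(terms, size):
            merged[term] += count
    ordered = sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))
    return ordered[:size]


index1 = ["a", "b", "c", "d"]   # hypothetical unique IDs in index1
index2 = ["c", "d"]             # index2 is a subset of index1

truncated = merged_facet([index1, index2], size=3)
full = merged_facet([index1, index2], size=4)
print(truncated)
print(full)
```

With `size=3`, index1 never reports "d", so "d" shows up with count 1 even though it exists in both indexes. With `size=4`, every term is reported by every shard and the counts are right, which is the "ask for many more terms" workaround at the cost of more memory.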

--

sujoysett · September 17, 2012, 11:50am

Hi Clint,

Thanks again.
Too bad ... It seems I have to fall back on good old scroll once again.

Regards,
Sujoy.

--

sujoysett · October 22, 2012, 8:02pm

Hi All ....

I have two rivers working simultaneously: the Twitter river fetching data
and indexing it into index A, and a custom river fetching data from index
A, doing the necessary processing, and indexing the result into index B.

At a given instant, let's say index A has x docs and index B has y docs,
and I want to fetch those (x-y) docs: the ones that are in index A but
not in index B. The document counts are in the millions.

I have tried using scroll to fetch the IDs from destination index B and
matching them against a scroll of source index A, but it is quite
time-consuming. Is there a better way to achieve this?

Thanks.
Sujoy.
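
For reference, the scroll-and-match approach described above might look roughly like the sketch below. It uses no Elasticsearch client: `fake_scroll` is a hypothetical stand-in for paging through an index, and `missing_ids` is an invented helper. It assumes all IDs of index B fit in memory, which may not hold for every deployment with millions of docs.

```python
def fake_scroll(ids, page_size):
    """Stand-in for a scroll: yields IDs one page at a time."""
    for start in range(0, len(ids), page_size):
        for doc_id in ids[start:start + page_size]:
            yield doc_id


def missing_ids(source_ids, destination_ids):
    """Yield IDs from source_ids that never appear in destination_ids."""
    seen = set(destination_ids)      # consumes the whole destination stream
    for doc_id in source_ids:
        if doc_id not in seen:
            yield doc_id


index_a = ["a", "b", "c", "d", "e"]  # hypothetical IDs in source index A
index_b = ["b", "d"]                 # hypothetical IDs already in index B

unprocessed = list(missing_ids(fake_scroll(index_a, 2), fake_scroll(index_b, 2)))
print(unprocessed)
```

Both indexes are read in full on every run, which is consistent with the post's observation that this gets slow at scale.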

--

Igor_Motov · October 22, 2012, 8:10pm

The first thing that comes to mind is to add a timestamp field to the index
A records and retrieve only the records that were added or modified after
the last synchronisation time.
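
A request body for this suggestion could be built along the following lines. It is a sketch only: the field name `indexed_at` and the helper `changed_since_query` are hypothetical, and the exact range-query syntax accepted may depend on the Elasticsearch version in use.

```python
import json


def changed_since_query(timestamp_field, last_sync):
    """Build a search body matching docs whose timestamp is after last_sync."""
    return {
        "query": {
            "range": {
                # "gt" means strictly greater than the last synchronisation time
                timestamp_field: {"gt": last_sync}
            }
        }
    }


# Hypothetical field name and last synchronisation time.
body = changed_since_query("indexed_at", "2012-10-22T20:00:00")
print(json.dumps(body, indent=2))
```

The processing river would store the newest timestamp it has handled and pass it as `last_sync` on its next run.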

--

sujoysett · October 23, 2012, 1:48am

Thanks Igor, that could be quite useful. In fact, I had thought of assigning
IDs to the docs in sequential order, as in a DB, rather than at random, to
achieve the same thing. My first river is the Twitter river, and adding a
timestamp means editing its code for the mapping.
It's really not a problem to modify the river, but I am just curious:
doesn't Elasticsearch have any built-in logic to find such a difference?

Thanks,
-- Sujoy.

--

Igor_Motov · October 23, 2012, 2:31am

No, elasticsearch cannot calculate the difference between two indices.
Actually, I cannot think of any operation (except union) that elasticsearch
can perform with two or more indices.

You don't have to modify the twitter river code. You can simply create a new
index with the desired mapping before creating the twitter river.

--

sujoysett · May 15, 2013, 10:54am

Hi All,

I found a nice solution to the problem stated here, hence I am reviving
this lost thread. I thought I would share it in case it helps someone.

Problem Statement:

We process huge sets of documents from one index (primary) and push the
processed documents (with new fields, or even existing fields with a new
mapping) into a new index (secondary).
We use scroll to retrieve the docs, but a problem arises when the scroll
breaks because of a network problem or some other issue.
The problem we were trying to solve is finding, at any given point in time
and via scroll, the unprocessed docs: those that exist in the primary index
but not in the processed secondary index.
ES does not have a direct MINUS / EXCEPT query to find the difference
between two indexes.
We were not comfortable overwriting the primary index itself with the
processing changes, as we often needed mapping or analyzer changes.

Solution:

Instead of creating a secondary index, we create a secondary type in the
same index, with a _parent mapping that points to the primary type.
We put the necessary mapping changes in the secondary type, start a scroll
on the primary type, and push the processed data into the secondary type
via bulk insert.
In the scroll, we use the following query to get the docs which are in the
primary type but not in the secondary type:
POST localhost:9200/# index/# primary_type/_search?search_type=scan
{
  "filter": {
    "not": {
      "has_child": {
        "type": "# secondary_type",
        "query": {
          "filtered": {
            "query": {
              "match_all": {}
            }
          }
        }
      }
    }
  }
}
Finding the difference between two ES data sets at any point in time
(something like the SQL MINUS or EXCEPT clause) is thus possible.
It also lets us reprocess selected documents, if required, by
modifying the has_child filter accordingly.
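
For anyone who wants to script this, here is a minimal Python sketch of the three request bodies involved: the _parent mapping, the bulk insert into the secondary type, and the "not has_child" search. It only builds the JSON and does not talk to a cluster. The function names and the index/type names in the usage note are hypothetical. Two things are my assumptions rather than part of the setup described above: the bulk action carries the parent id as "_parent", and each child reuses its parent's id as its own "_id".

```python
import json


def parent_mapping(secondary_type, primary_type):
    """Mapping body declaring secondary_type as a child of primary_type."""
    return {secondary_type: {"_parent": {"type": primary_type}}}


def unprocessed_query(secondary_type):
    """Search body matching primary docs that have no child of secondary_type."""
    return {
        "filter": {
            "not": {
                "has_child": {
                    "type": secondary_type,
                    "query": {"filtered": {"query": {"match_all": {}}}},
                }
            }
        }
    }


def bulk_lines(index, secondary_type, docs):
    """Build a newline-delimited bulk body from (parent_id, source) pairs.

    Assumption: the child reuses the parent's id and names its parent
    through the "_parent" key of the action line.
    """
    lines = []
    for parent_id, source in docs:
        action = {
            "index": {
                "_index": index,
                "_type": secondary_type,
                "_id": parent_id,
                "_parent": parent_id,
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Usage would look something like bulk_lines("myindex", "processed", [("42", {"text": "..."})]) for each scroll page, with unprocessed_query("processed") as the scroll's search body. After a broken scroll, restarting with the same search body should only return the primary docs that still have no child.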

-- Sujoy.

On Tuesday, October 23, 2012 8:01:19 AM UTC+5:30, Igor Motov wrote:

No, elasticsearch cannot calculate difference between two indices.
Actually, I cannot think of any operation (except union) that elasticsearch
can perform with two or more indices.

You don't have to modify twitter river code. You can simply create a new
index with desired mapping before creating twitter river.

On Monday, October 22, 2012 9:48:41 PM UTC-4, Sujoy Sett wrote:
Thanks Igor, that can be quite useful. In fact, I had thought of
assigning IDs to all the docs in sequential rather than random order, as in a
DB, to achieve the same. My first river is the twitter river, and adding a
timestamp means editing its code for the mapping.
It's really not a problem to modify the river, but just curious: doesn't
elasticsearch have any such logic built in to find the difference?

Thanks,
-- Sujoy.

On Tuesday, October 23, 2012 1:40:10 AM UTC+5:30, Igor Motov wrote:
The first thing that comes to mind is to add a timestamp field to the
index A records and retrieve only the records that were added/modified after
the last synchronisation time.
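
A minimal sketch of the timestamp idea in Python, assuming a hypothetical field name such as "indexed_at" that holds ISO-8601 strings. The first function builds a range query body; the second applies the same condition to in-memory dicts, which is only there to check the logic without a cluster.

```python
def modified_since_query(timestamp_field, last_sync):
    """Search body selecting docs whose timestamp is after last_sync."""
    return {"query": {"range": {timestamp_field: {"gt": last_sync}}}}


def pending_docs(docs, timestamp_field, last_sync):
    """Same condition applied to plain dicts.

    ISO-8601 strings in one fixed format and time zone compare correctly
    as plain strings, which is what this relies on.
    """
    return [d for d in docs if d[timestamp_field] > last_sync]
```

The caller has to persist the last synchronisation time itself, and docs written at exactly that instant are skipped by "gt" (use "gte" and tolerate re-processing if that matters).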

On Monday, October 22, 2012 4:02:54 PM UTC-4, Sujoy Sett wrote:
Hi All ....

I have two rivers working simultaneously .... twitter river fetching
data and indexing in index A , and another custom river fetching data from
index A, doing necessary processing and indexing it to index B.

At a certain instant, let's say, index A has x docs and index B has y
docs, and now I want to fetch those (x-y) docs ..... those which are in
index A and not in index B. The doc counts are in the millions.

I have tried using scroll to fetch ids from destination index B and
matching against scroll of source index A, but it is quite time consuming.
Is there any better way to achieve this?

Thanks.
Sujoy.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jagdeep · May 15, 2013, 11:03am

It's awesome, Sujoy. We have been struggling with this for a long time, and
now it will empower our thinking for an ES-based processing engine.
Cheers to Elasticsearch!!!

Regards
Jagdeep
