With a massive number of filter values, which is faster: filtering by _id or by a long field?


(Pradeep-2) #1

In my query I have to pass many values to filter on, like:

"ids" : {
    "values" : [1, 4, 100, ... (up to 300k such values) ]
}

Since _id is of type string, I am wondering: if I store the value of _id as a long field as well and filter on that, will it be faster?
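The two variants being compared might be sketched as request bodies like these (a sketch only: the numeric field name `id_long` is hypothetical and would have to be added to the mapping, nothing in the thread confirms it):

```python
# Variant 1: the built-in ids filter, which matches against _id
# (stored internally as a string).
ids_filter = {
    "filter": {
        "ids": {
            "values": [1, 4, 100],  # up to ~300k values in practice
        }
    }
}

# Variant 2: a terms filter on a duplicated numeric field
# (hypothetical field name "id_long", mapped as type long).
terms_filter = {
    "filter": {
        "terms": {
            "id_long": [1, 4, 100],  # same values, numeric field
        }
    }
}
```

Either body would be sent to the `_search` endpoint; only the filter clause differs.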

Please help, and thanks a lot.

An elaborate discussion is here:
https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/fastest$20way$20to/elasticsearch/zgO_qK6kbxE/C26ltlWt7WsJ

--


(Clinton Gormley) #2

On Mon, 2012-09-10 at 15:06 -0700, Pradeep wrote:

In my query I have to pass many values to filter on, like:

"ids" : {
    "values" : [1, 4, 100, ... (up to 300k such values) ]
}

Since _id is of type string, I am wondering: if I store the value of _id as a long field and filter on that, will it be faster?

Very difficult to answer - your use case is not clearly described, I'm afraid.

clint

--


(Pradeep-2) #3

Hi Clinton,

I have a use-case where I have to query Elasticsearch with a huge number of values for one field. (I get these values from another system, so there is no way around querying with all of them.)

This field also happens to be the _id of the documents I have indexed.

So in my query, I have a filter like below...

"filter": { "ids" :{
"values" : [1, 3, 4, 6, 7,... (upto 300,000 such
values) ]
}
}

( ...eventually I do a facet with term_stats ... on the docs matching this
filter)
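For context, the facet step mentioned above might be combined with the filter roughly like this (a sketch using the pre-1.0 Elasticsearch facet syntax; the field names `category` and `price` are made-up placeholders, since the thread never names the actual fields):

```python
# Sketch of a full request body: the large ids filter plus a
# terms_stats facet. "category" and "price" are placeholder
# field names, not fields from the thread.
id_values = [1, 3, 4, 6, 7]  # up to 300,000 values in practice

request_body = {
    "query": {"match_all": {}},
    "filter": {
        "ids": {"values": id_values},
    },
    "facets": {
        "stats_per_term": {
            "terms_stats": {
                "key_field": "category",   # group results by this field
                "value_field": "price",    # compute stats on this field
            },
            # restrict the facet to the same filtered documents
            "facet_filter": {"ids": {"values": id_values}},
        }
    },
}
```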

I noticed that for 300,000 such values, the query was taking 3-4 seconds to return on my single node.

So I am looking for ways to reduce this time.

One thought is that, since _id is of type string, filtering on it is probably slower than filtering on a long type? I am not sure; that is actually my question: is it faster?

If it is faster, I can duplicate the value of _id as a field of type long in the document and use a "terms" filter.
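The duplication idea could be sketched at index time like this (a sketch under the assumption of a hypothetical numeric field `id_long`; none of the mapping or field names below come from the thread):

```python
# Sketch: duplicate the document _id into a numeric field at index
# time, so a "terms" filter on a long field can be compared against
# the "ids" filter. "id_long" is a made-up example field name.
mapping = {
    "doc": {
        "properties": {
            "id_long": {"type": "long"},  # numeric copy of _id
        }
    }
}

def with_id_copy(doc_id, source):
    """Return the document source with _id duplicated as a long field."""
    enriched = dict(source)           # shallow copy of the source
    enriched["id_long"] = int(doc_id)  # numeric duplicate of the _id
    return enriched

doc = with_id_copy("42", {"name": "example"})
# doc now carries {"name": "example", "id_long": 42}
```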

Hope it's clear now. Thanks a lot.

(By the way, I have tested Solr with this use case, and it was many times slower than Elasticsearch for such a massive list of query terms!)


--


(Clinton Gormley) #4

Hi Pradeep

I have a use-case where I have to query Elasticsearch with a huge number of values for one field. (I get these values from another system, so there is no way around querying with all of them.)

This field also happens to be the _id of the documents I have indexed.

OK - I understand now. Unfortunately, I don't know what the answer is :slight_smile: I'd suggest just trying it out and seeing what happens.

I think the ids filter uses an internal UID cache, and is probably going to be faster than reindexing the values. But that's just a guess.

Try it and see - I'd be interested in hearing the results.

clint

--


(phill) #5

On Monday, September 10, 2012 6:06:06 PM UTC-4, Pradeep wrote:

In my query I have to pass many values to filter on, like:

"ids" : {
    "values" : [1, 4, 100, ... (up to 300k such values) ]
}

Is it possible you could do some pre-processing to work out your large sets and already have that information in the index? If you can pre-process and generate the data before the time of the user's query, do it! I'd bet that finding fewer terms in an (untokenized) additional field would be even faster than sending thousands of IDs.
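Paul's idea could be sketched as follows: if the large ID sets are known ahead of time, tag each document with a precomputed set label at index time, so the query sends a single term instead of thousands of IDs. (A sketch only: the `set_ids` field and the set names below are hypothetical, and the thread's later reply explains why this may not apply here.)

```python
# Precompute which named sets each document belongs to, and index
# that membership as a field ("set_ids" is a made-up field name).
known_sets = {
    "premium_users": {1, 4, 100},
    "trial_users": {3, 6, 7},
}

def set_labels_for(doc_id):
    """Return the names of all precomputed sets containing doc_id."""
    return sorted(name for name, ids in known_sets.items() if doc_id in ids)

# At index time, each document carries its set membership:
doc = {"name": "example", "set_ids": set_labels_for(4)}

# At query time, one small term filter replaces 300k IDs:
query = {"filter": {"term": {"set_ids": "premium_users"}}}
```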

-Paul

--


(Pradeep-2) #6

Thanks, Clinton. I'll try that out and post the results.
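A minimal way to run the comparison might look like this (a sketch only: it assumes a local node at localhost:9200, an index named `myindex`, and a duplicated `id_long` field, none of which are given in the thread):

```python
import json
import time
import urllib.request

def search_seconds(body, url="http://localhost:9200/myindex/_search"):
    """POST a search body to a local node and return elapsed seconds."""
    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start

# The two candidate filters over the same 300k values:
ids_body = {"filter": {"ids": {"values": list(range(300_000))}}}
terms_body = {"filter": {"terms": {"id_long": list(range(300_000))}}}

# Uncomment with a running cluster to compare:
# print("ids filter:  ", search_seconds(ids_body))
# print("terms filter:", search_seconds(terms_body))
```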

Hi Paul, I am afraid that's not possible: this list of values is the result of another search, served by a different search engine. (That data can't be put into this engine, nor the other way around.) So the only way is to pass that many values.


--

