Extracting all values for a term


(plaflamme) #1

Hi,

I'm brand new in the elasticsearch world and I have to say I'm quite
impressed with its quality. Kudos!

One of my use cases is to extract all values (and document ID) of a
particular term sorted by document ID:

id,HEIGHT
1234,165.5
4321,170
...

I'm using this query:

{
"query": {
"match_all":{}
},
"fields" : ["HEIGHT"],
"sort": ["_id"]
}

This works fine, but I'm wondering if there's a way to improve this. I've
indexed 63K documents that have ~600 terms each; the index size is 1.2G
(single node for now). It takes ~10s to read all values:

$ curl -XPOST http://localhost:9200/_search -d '' > out.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9185k  100 9185k    0   106   960k     11  0:00:09  0:00:09 --:--:-- 2017k

(the value for "took" was 9162)

What can I do to improve performance here? Would increasing replicas, shards
and/or nodes help in this case? Is there a more efficient way to extract all
values of a term? Also, the result file is about 10 MB; is there a way to get a
more compact representation? Getting rid of "_index", "_type", "_score" and
"sort" from the result would help a lot here.

Thanks,
Philippe
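
For the compact-output part of the question, one option is to reduce the response on the client. This is only a minimal sketch, assuming the usual search response layout (`hits.hits[]` entries carrying `_id` and a `fields` object); the `hits_to_csv` helper and the sample response are hypothetical and not part of ES.

```python
import csv
import io


def hits_to_csv(response, field):
    """Reduce a search response to compact `id,<field>` CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", field])
    for hit in response["hits"]["hits"]:
        value = hit.get("fields", {}).get(field)
        # Some versions wrap field values in a list; unwrap single values.
        if isinstance(value, list) and len(value) == 1:
            value = value[0]
        writer.writerow([hit["_id"], value])
    return out.getvalue()


# Hypothetical response, shaped like the hits of a search with "fields": ["HEIGHT"].
sample = {
    "hits": {
        "hits": [
            {"_index": "idx", "_type": "doc", "_id": "1234", "_score": None,
             "fields": {"HEIGHT": 165.5}, "sort": ["1234"]},
            {"_index": "idx", "_type": "doc", "_id": "4321", "_score": None,
             "fields": {"HEIGHT": 170}, "sort": ["4321"]},
        ]
    }
}
print(hits_to_csv(sample, "HEIGHT"))
```

This drops the per-hit metadata after the fact; it does not reduce what the server sends over the wire.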


(Clinton Gormley) #2

Hi Philippe

> One of my use cases is to extract all values (and document ID) of a
> particular term sorted by document ID:
>
> {
> "query": {
> "match_all":{}
> },
> "fields" : ["HEIGHT"],
> "sort": ["_id"]
> }
>
> This works fine, but I'm wondering if there's a way to improve this.
> I've indexed 63K documents that have ~600 terms each; the index size
> is 1.2G (single node for now). It takes ~10s to read all values:

By default, ES stores the JSON that you index as the _source field. All
other fields are (by default) not stored separately, but you can change
that when you create the mapping, by setting {..., "store": "yes"}

If you request a particular field then either:
(a) that field is stored, and is returned to you directly, or
(b) it decodes your JSON, extracts that field and returns it

If your JSON doc is big, this can have quite a performance impact.

So I'd try setting your HEIGHT field to {"store": "yes"}

You can read more about mapping here:

http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html
http://www.elasticsearch.org/guide/reference/mapping/core-types.html

clint
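
A sketch of what such a mapping body might look like, built in Python so it can be printed and inspected. The layout follows the put-mapping documentation linked above as it stood at the time; the type name "person" and the field type "double" are assumptions, since the thread only names the HEIGHT field, and the `stored_field_mapping` helper is hypothetical.

```python
import json


def stored_field_mapping(type_name, field, field_type):
    """Build a put-mapping body that marks one field as stored."""
    return {
        type_name: {
            "properties": {
                field: {"type": field_type, "store": "yes"}
            }
        }
    }


# "person" and "double" are assumptions; the thread names only the HEIGHT field.
body = stored_field_mapping("person", "HEIGHT", "double")
print(json.dumps(body, indent=2))
```

Changing "store" on a field does not rewrite documents that are already indexed, so expect to reindex before measuring any difference.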


(plaflamme) #3

Hi Clinton,

Thanks for the quick reply.

So you're saying that most of the time is spent in extracting the field's
value from the source.

Is there a way to get some metrics on that? For example, having a
"query_time" and "fetch_time" in the result?

I've played with "query_and_fetch" vs. "query_then_fetch", but it's hard to
determine their impact without this kind of metric... Are such metrics available?

Thanks,
Philippe



(Clinton Gormley) #4

Hi Philippe

> So you're saying that most of the time is spent in extracting the
> field's value from the source.

That would be my guess, yes.

> I've played with "query_and_fetch" vs. "query_then_fetch", but it's
> hard to determine their impact without this kind of metric... Are
> such metrics available?

These are not quite the same thing as what you are talking about. These
have more to do with how the results are selected, not the fields that
are returned.

I'd suggest just trying to set the HEIGHT field to stored, and see what
difference there is in performance.

clint


(plaflamme) #5

Right. So with store=true on my fields:

  • first request takes approx same time: ~10s
  • subsequent requests take ~1.5s

With store=false, all requests take ~10s.

Will clustering help also? If I increase the number of replicas and nodes,
will this have an impact on such a query?

Philippe



(Clinton Gormley) #6

Hi Philippe

On Tue, 2011-04-19 at 14:54 -0400, Philippe Laflamme wrote:

> Right. So with store=true on my fields:
>
>   • first request takes approx same time: ~10s
>   • subsequent requests take ~1.5s

Makes sense - ES doesn't cache these values itself (as far as I'm aware)
but your filesystem would have cached the data, making it faster on
subsequent requests.

> With store=false, all requests take ~10s.
>
> Will clustering help also? If I increase the number of replicas and
> nodes, will this have an impact on such a query?

It should do. My question is why does this query take so long in the
first place? Does your node have sufficient memory/CPU to handle the
data that you have stored in it?

How many results are you asking for?

clint


(Shay Banon) #7

Hi,

When you sort by _id, all the _id values need to be loaded into memory. This is strongly discouraged, as it can take quite a bit of memory. This is why the first request is slower: it simply needs to load all those values into memory.

You said that you need to iterate over all the values, but I don't see how you do it. You issue a single search request that will return only 10 hits, or are you doing something more?


(plaflamme) #8

> It should do. My question is why does this query take so long in the
> first place? Does your node have sufficient memory/CPU to handle the
> data that you have stored in it?

There are 63K documents in the index with ~600 fields each.

I'm currently running ES on a single node that has 8G of RAM and an SSD
disk. I think the machine has plenty of horsepower to handle this, but I
started ES with default settings. So I'll try increasing the RAM allocated
to ES.

> How many results are you asking for?

I'm requesting all docs (63K). I need to fetch all values for a single term.

Thanks,
Philippe


(plaflamme) #9

> When you sort by _id, then all the ids values need to be loaded to
> memory. This is highly not recommended as it can take quite a bit of memory.
> This is why the first request is slower, it simply needs to load all those
> values to mem.

Right, but is there any way to get results sorted by document ID besides
asking for sorting on _id ?

> You said that you need to iterate over all the values, but I don't see
> how you do it. You issue a single search request that will return only 10
> hits, or are you doing something more?

Oops, my request should have shown "size":63000. I'm fetching every result
in one go.

I'm currently only testing things out, to see what I can do with ES. One of my
use cases is to iterate on all values for a term in document ID order. Since
I'm only testing, I'm using curl to fetch everything in one go, but I would
eventually use the scrolling abilities in ES. I thought that fetching
everything would provide a reasonable estimate of the performance for this
use case. Is this a valid assumption?

Philippe


(Clinton Gormley) #10

Hi Philippe

> Right, but is there any way to get results sorted by document ID
> besides asking for sorting on _id ?

Why do you need them sorted by ID? Is it really necessary?

> Oops, my request should have shown "size":63000 I'm fetching every
> result in one go.

OK - that is quite heavy. And will only get heavier as you add more
docs. You don't want to do that. There is a reason that Google never
returns more than 1,000 results for any search request.

> I'm currently only testing things out, see what I can do with ES. One
> of my use cases is to iterate on all values for a term in document ID
> order. Since I'm only testing, I'm using curl to fetch everything in
> one go, but I would eventually use the scrolling abilities in ES. I
> thought that fetching everything would provide a reasonable estimate
> of the performance for this use case. Is this a valid assumption?

What would be better is to use the 'scan' search_type (added in master),
which is a lightweight way of iterating through all matching docs. But
it doesn't allow sorting, which is why I ask if you really need that.

clint
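
The paging idea behind scrolling can be sketched as a small generator. This is a toy under stated assumptions: each response is taken to carry a `_scroll_id` and a page of hits, and an empty page is taken to mark the end (check how the first response behaves for the search type and version you use). `start_search` and `next_page` are hypothetical wrappers around whatever HTTP client you have; here they are faked with an in-memory dict so the example runs without a server.

```python
def iter_hits(start_search, next_page):
    """Yield every hit of a scrolled search, one page at a time.

    start_search() returns the first response; next_page(scroll_id)
    returns the following one. Both are hypothetical callables.
    """
    response = start_search()
    # Assumption: an empty page of hits means the scroll is exhausted.
    while response["hits"]["hits"]:
        for hit in response["hits"]["hits"]:
            yield hit
        response = next_page(response["_scroll_id"])


# Fake in-memory "server" standing in for real HTTP calls.
pages = {
    "s1": {"_scroll_id": "s2", "hits": {"hits": [{"_id": "4321"}, {"_id": "1234"}]}},
    "s2": {"_scroll_id": "s3", "hits": {"hits": [{"_id": "2222"}]}},
    "s3": {"_scroll_id": "s3", "hits": {"hits": []}},
}
hits = list(iter_hits(lambda: pages["s1"], lambda scroll_id: pages[scroll_id]))

# If the server does not sort, ordering can be restored on the client.
# Note that ids compare as strings here.
ids = sorted(hit["_id"] for hit in hits)
print(ids)
```

Sorting on the client means holding all ids in memory at once, which is the same cost the server-side sort was paying.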


(plaflamme) #11

Hi Clinton,

>> Right, but is there any way to get results sorted by document ID
>> besides asking for sorting on _id ?
>
> Why do you need them sorted by ID? Is it really necessary?

Yeah, right now, I need a consistent and predictable ordering for values
returned. It doesn't really need to be sorted, it just needs to be
predictable. Since I'm looking up values from different sources (RDBMS, flat
files, etc), there's no way to get the same ordering from all of them, so I
force the source of data to return values in a common way (sorted by PK).

I'm evaluating ES as a new source of data for the app I'm writing. One of
the methods that needs to be implemented by a source of data is one that
returns all values for a "column", sorted by PK. If the source cannot sort,
then it either doesn't implement that part of the API or has to sort
"client-side".

>> Oops, my request should have shown "size":63000 I'm fetching every
>> result in one go.
>
> OK - that is quite heavy. And will only get heavier as you add more
> docs. You don't want to do that. There is a reason that Google never
> returns more than 1,000 results for any search request.

Noted. I did plan to use the scroll capability instead of fetching
everything.

> What would be better is to use the 'scan' search_type (added in master)
> which is a lightweight way of iterating through all matching docs. But
> it doesn't allow sorting, which is why I ask if you really need that

I'll look into it, but if ES won't sort for me, I'll have to implement it
client-side. Considering the current requirements, I'm not sure how I can
get away without ordering.

Out of curiosity, does ES distribute the sorting? For example, does each
node return sorted results when a query requests sorting?

Thanks!
Philippe


(Clinton Gormley) #12

> Yeah, right now, I need a consistent and predictable ordering for
> values returned. It doesn't really need to be sorted, it just needs to
> be predictable. Since I'm looking up values from different sources
> (RDBMS, flat files, etc), there's no way to get the same ordering from
> all of them, so I force the source of data to return values in a
> common way (sorted by PK).

Why not just ask for the field you want AND the ID? That would probably
be more efficient for all of your data sources.

> Out of curiosity, does ES distribute the sorting? For example, does
> each node return sorted results when a query requests sorting?

That's what the query_and_fetch, query_then_fetch and
dfs_query_then_fetch / dfs_query_and_fetch search types are about.

http://www.elasticsearch.org/guide/reference/api/search/search-type.html

clint
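
As a rough mental model of the distributed sort (a toy, not ES code): as I understand query_then_fetch, each shard sorts its own matches and returns only ids and sort values, and the node that received the request merges those sorted lists before fetching the documents it actually needs. The `merge_shard_results` helper and the shard data below are hypothetical.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, size):
    """Merge per-shard sorted (sort_key, doc_id) lists; keep the global top `size`."""
    # heapq.merge assumes every input list is already sorted.
    return list(islice(heapq.merge(*shard_results), size))


# Each shard has already sorted its own hits by _id (as strings).
shard_a = [("1234", "1234"), ("4321", "4321")]
shard_b = [("2222", "2222"), ("9000", "9000")]
top = merge_shard_results([shard_a, shard_b], 3)
print(top)
```

The merge itself is cheap; the expensive part with "size": 63000 is that every shard has to produce and ship its full sorted list.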


(plaflamme) #13

> Why not just ask for the field you want AND the ID? That would probably
> be more efficient for all of your data sources.

The next version of the API will probably have such a signature for the
method because not all calls need ordering. But some calls will still
require ordering when it needs to combine the common "rows" from several
sources. For example, a call may read two "columns" and return their sum.

Since ordering is required for some calls, it's much more efficient to get
the sources of the data to sort instead of the client. This allows the
sorting to happen at the source which may be on a different machine, or even
on several machines (such as with ES).

>> Out of curiosity, does ES distribute the sorting? For example, does
>> each node return sorted results when a query requests sorting?
>
> That's what the query_and_fetch, query_then_fetch and
> dfs_query_then_fetch / dfs_query_and_fetch search types are about.
>
> http://www.elasticsearch.org/guide/reference/api/search/search-type.html

I did play with this setting, but I don't have many nodes, so I didn't see
any impact. I'll try to set up more nodes.

Thanks a bunch!
Philippe
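
The "read two columns and return their sum" case is where a shared ordering pays off: if every source returns (id, value) pairs sorted by PK, the rows can be combined in one pass without buffering either side. A minimal sketch, assuming both streams use the same id ordering and that rows missing from either source are dropped; `sum_columns` and the sample data are hypothetical.

```python
def sum_columns(left, right):
    """Merge-join two (id, value) streams sorted by id; yield (id, sum)."""
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1] + b[1]
            a, b = next(left, None), next(right, None)
        elif a[0] < b[0]:
            a = next(left, None)   # id only in the left source: skip it
        else:
            b = next(right, None)  # id only in the right source: skip it


heights = [(1234, 165.5), (2000, 180.0), (4321, 170.0)]
offsets = [(1234, 1.0), (4321, 2.5), (5000, 9.0)]
combined = list(sum_columns(heights, offsets))
print(combined)
```

The join is only correct if both sources agree on the ordering, so numeric PKs and string _id values should not be mixed without converting one side.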


(plaflamme) #14

So I added more nodes (all on the same machine though), and I got some
considerable improvements.

This is the approximate time for issuing the request (with query_then_fetch)
after nodes have been freshly started:

1 node: ~10s
2 nodes: ~3s
3 nodes: ~2.5s
4 nodes: ~2.5s

The performance increases as nodes are added, up to a certain point
obviously. Since all of them are running on one machine, the improvement is
probably due to the use of several CPU cores for searching/sorting. The
machine has a dual-core i7 M620 (with HT, so the OS sees 4 CPUs).

This is a very dumb benchmarking setup (everything on one machine, issuing a
single request, etc.), but I thought it may be interesting to report
anyway...

Cheers,
Philippe



(Barsk) #15

Well, as another newbie to the ES world who has already taken the first
steps, I would like to offer my 2 cents on the issue. I may be wrong in
some assumptions, but I am sure I will be corrected in that case.

ES is great at finding a *limited* subset of documents in a huge volume
of data, based on the index. It does that blazingly fast. So once you
have a query that limits the result set, you will get great results.

What you are doing here is asking ES to sort and return a very large
result set. It does that, but it is not at the heart of what ES and
other Lucene-based products do well. If you *really* want to iterate
over huge amounts of data, then the new SCAN search type does that, but
it does not support sorting.

If you set up your testing to use some queries that target a subset of
the data, you will find ES a very nice companion indeed. I use ES to
index OCR-interpreted text, and I can query the index with a very
complex and rich query language that pinpoints exactly what I need. It
does this faster and better than any SQL server I know of. And it is
scalable too.

So the key to using ES is to use the query language to pinpoint what you
are searching for, so that you get a limited search result. If you then
need it sorted, that will not slow things down much at all.

If you really, really need to iterate over a set of data as you say,
then an SQL solution with some cursor-based approach is probably better.

ES is about *indexing* and *search*.

/Kristian







