Hey mate,
0.11 will help, but still, the mentioned query will cause all the links to
be loaded into memory (at least per segment in an index) and processed. 1GB
might not be enough.
But the good news is that I have been thinking hard about exactly the
scenario you face. Basically: running a "heavy" search, usually with some
heavy-lifting components (facets and so on), every once in a while (and
possibly indexing the result as another document, something like a time
series db). For this, loading the data into memory is not always desired,
and either loading a stored field, or even parsing the source and fetching
the relevant data, is enough.
For that, the ability to access the _source in a script is already there.
In master I have already added a _fields option to just load stored fields.
The next step is to enhance the terms facet to allow a script to provide
the terms. Also, to provide richer options when it comes to scripts (more
language support). And last, to allow providing completely custom code that
defines a facet.
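As a purely illustrative sketch of the idea (hypothetical syntax; the
parameter names that actually ship in 0.12 may well differ), a terms facet
whose terms come from a script reading _source rather than the in-memory
field data could look something like:

```json
{
    "facets" : {
        "links" : {
            "terms" : {
                "script_field" : "_source.links.expanded",
                "size" : 5
            }
        }
    }
}
```

The point being that the facet would then pay the cost of fetching per
matching document instead of holding every term of the field in memory.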
0.12 will have the above; master already has some of it. 0.11 will at
least give you better caching management of facet fields, which I hope
means you will hit the wall much later (though that wall will be hit
eventually if you stay with the same number of nodes and the same amount of
memory while indexing new data).
-shay.banon
On Wed, Oct 6, 2010 at 10:35 PM, Thiago Souza tcostasouza@gmail.com wrote:
Shay,
That query is buggy; I posted the older version by mistake. Here is the
correct one:
{
    "query" : {
        "bool" : {
            "must" : [{
                "term" : { "links.expanded.domain" : "www.youtube.com" }
            }, {
                "range" : {
                    "timestamp" : {
                        "from" : "${header.start}",
                        "to" : "${header.end}",
                        "include_lower" : false,
                        "include_upper" : true
                    }
                }
            }],
            "must_not" : {
                "term" : { "links.expanded" : "YouTube" }
            }
        }
    },
    "facets" : {
        "links" : {
            "terms" : {
                "field" : "links.expanded",
                "script" : "term.contains('watch')",
                "size" : 5
            }
        }
    }
}
On Wed, Oct 6, 2010 at 17:24, Thiago Souza tcostasouza@gmail.com wrote:
Hi Shay,
Here is the youtube query: (header.end - header.start is always 2h
back from current time)
{
    "query" : {
        "bool" : {
            "must" : {
                "term" : { "links.expanded.domain" : "www.youtube.com" }
            },
            "must_not" : {
                "term" : { "links.expanded" : "YouTube" }
            },
            "must" : {
                "range" : {
                    "timestamp" : {
                        "from" : "${header.start}",
                        "to" : "${header.end}",
                        "include_lower" : false,
                        "include_upper" : true
                    }
                }
            }
        }
    },
    "facets" : {
        "links" : {
            "terms" : {
                "field" : "links.expanded",
                "script" : "term.contains('watch')",
                "size" : 5
            }
        }
    }
}
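As an aside, the ${header.start} and ${header.end} placeholders above are
filled in client-side. A minimal sketch (hypothetical helper, not part of
the original setup) of how a client could compute the 2h window mentioned
above:

```python
from datetime import datetime, timedelta, timezone

def two_hour_window(now=None):
    """Return (start, end) ISO-8601 strings covering the last two hours."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=2)
    return start.isoformat(), end.isoformat()

# The two values would be substituted for ${header.start} / ${header.end}.
start, end = two_hour_window()
```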
I'm planning on moving to 0.11, but not yet. I'll try to do it ASAP.
Regards,
Thiago Souza
On Wed, Oct 6, 2010 at 17:19, Shay Banon shay.banon@elasticsearch.com wrote:
Great! Can you share the youtube query? Also, can you move to 0.11?
On Wed, Oct 6, 2010 at 10:11 PM, Thiago Souza tcostasouza@gmail.com wrote:
Hi Shay,
First of all, I forgot to mention, I'm using ES 0.10.
The only query that is made is for the 5 most-mentioned youtube videos
in the last 2h. This query is run every 5 min.
The work dir is in a local disk.
I'll restart the indexing process without the youtube report and
see if it lasts longer than 10-20h.
Regards,
Thiago Souza
On Wed, Oct 6, 2010 at 17:04, Shay Banon shay.banon@elasticsearch.com wrote:
You might need to increase it. There is a limit to what can fit into
memory. For example, to provide fast search, terms are loaded into memory
(at an interval), so a lot of terms means more memory required. There are
other aspects, like sorting and faceting, that require more memory as well.
The indexing speed is strange. Simple tweets should be indexed much
faster. How do you interact with elasticsearch?
Another point: having the "work" dir on a local disk makes more sense
than having it on a remote dir. Local will (usually) be much faster.
-shay.banon
On Wed, Oct 6, 2010 at 9:31 PM, Thiago Souza tcostasouza@gmail.com wrote:
Unfortunately I cannot increase it to 2G.
On Wed, Oct 6, 2010 at 14:31, Pablo Borges pablort@gmail.com wrote:
I've seen the same problem, which was solved by increasing -Xmx to 2G,
but I also got index corruption, which required reindexing.
On Wed, Oct 6, 2010 at 1:41 PM, Thiago Souza tcostasouza@gmail.com wrote:
Hello ppl,
I'm currently experiencing a performance problem when indexing. The
system is indexing ~30 tweets/sec, and after a period (10h-20h) of indexing,
Elasticsearch stops responding and consumes 100% of a CPU core.
My current setup is:
- 2GHz Quad Core Xeon CPU
- 6GB RAM
- ~15,000,000 documents currently indexed
- 2-node elasticsearch setup:
  - Xms256M and Xmx768M
  - Default shard configuration
  - FS gateway data dir and node work dir on separate physical disks
    (although the work dir is shared on the same disk by the 2 nodes)
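For reference, heap settings like the ones above were typically passed to
the 0.x startup scripts via environment variables (a sketch; the exact
variable names are an assumption here, so verify them against your
version's bin/elasticsearch.in.sh):

```shell
# Hypothetical launch example: min/max heap picked up by the startup script.
ES_MIN_MEM=256m ES_MAX_MEM=768m ./bin/elasticsearch
```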
When Elasticsearch starts consuming 100% of a CPU core (that is
25% in total, since it's a quad core), I get the following in the log files:
- First, a series of OutOfMemoryError: Java heap space (from
  all sorts of stack trace points).
- Then tons of Long GC collection occurred, took [15.1s],
  breached threshold [10s] (with the first number varying from 10s to 60s).
- And finally all sorts of exceptions,
  like java.io.IOException: No commit point data and
  org.elasticsearch.transport.SendRequestTransportException.
The situation only normalizes after I shut down the cluster,
clean up the work dir, and wait for a full cluster recovery.
Any clues, anyone?
Regards,
Thiago Souza