Terms entropy computation by fields

Hrach_Pelibossian · October 28, 2013, 11:29am

Hello,
I need to compute something like the entropy of terms, per field, for a
subset of documents. For this I need the frequency of the terms in each
field, which I can get with termsFacet and/or termsStatsFacet.
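
To make the target quantity concrete, here is a minimal Java sketch of the Shannon entropy of one field's term distribution, assuming the term counts have already been fetched into a map. The class and method names are hypothetical and nothing here is Elasticsearch API. If the counts come from a facet that returns only the top N terms, the value covers only those terms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Elasticsearch: computes the Shannon
// entropy (in bits) of a term distribution from per-term counts.
class TermEntropy {

    static double entropy(Map<String, Long> termCounts) {
        long total = 0;
        for (long count : termCounts.values()) {
            total += count;
        }
        if (total == 0) {
            return 0.0; // no terms at all: define the entropy as zero
        }
        double h = 0.0;
        for (long count : termCounts.values()) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * (Math.log(p) / Math.log(2)); // log base 2
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up counts for a single field.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("red", 2L);
        counts.put("blue", 1L);
        counts.put("green", 1L);
        System.out.println("entropy (bits): " + entropy(counts));
    }
}
```

One such map per field is enough, so the per-field computations are independent of each other.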

If I understood correctly, the results are aggregated on the node that
receives the request.

I would like the frequency calculation to be distributed by field, for
performance reasons (speed and memory), because the total number of terms
may be too large.

Can you advise me how to do this in an easy way?

Do I have to distribute the calculation myself, by sending HTTP requests to
different ES nodes of the same cluster?
Or can I ask a node directly to handle this?

Thank you in advance for your answer

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Boaz_Leskes · October 28, 2013, 12:00pm

Hrach,

Just want to double check things are clear - the terms facet distributes the
request to all nodes hosting the index, where the top N terms are calculated
per shard. Those top N terms are streamed to the originating node, where they
are merged again and a global top M is extracted and returned. On v0.90.5, N
and M were equal and were set by the size parameter. For 1.0beta and 0.90.6
you can control them separately using the shard_size parameter
(Add support for `shard_size` for terms & terms_stats facets · Issue #3821 · elastic/elasticsearch · GitHub).
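
The two-phase reduce described above can be sketched in plain Java. This is an illustration with in-memory maps only, not Elasticsearch code: the class name, method names, and the `shardSize`/`size` arguments are hypothetical stand-ins for N and M.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: mimics a per-shard top N followed by a global
// top M on the originating node. None of these names come from Elasticsearch.
class TopTermsMerge {

    // Keep the n most frequent terms (ties broken alphabetically).
    static Map<String, Long> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Phase 1: each shard reports its top shardSize terms.
    // Phase 2: the originating node sums those lists and keeps the top size.
    static Map<String, Long> merge(List<Map<String, Long>> shardCounts, int shardSize, int size) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shardCounts) {
            for (Map.Entry<String, Long> e : topN(shard, shardSize).entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return topN(merged, size);
    }
}
```

With two made-up shards `{a=5, b=4, c=3}` and `{c=6, b=1, a=1}`, calling `merge` with `shardSize = 1` and `size = 1` reports `c` with a count of 6, while `shardSize = 3` reports `c` with its full count of 9. That is the kind of undercount a larger per-shard N is meant to reduce, and it matters for entropy because the counts feed directly into the probabilities.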

Currently there is no way of changing the above. If you want to distribute
the second reduce phase (where the top M is calculated) and you need the top
terms of multiple fields as well, you can call multiple nodes in parallel,
one request per field.
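
A minimal Java sketch of the "one request per field, in parallel" idea, assuming the actual facet request is wrapped in a function supplied by the caller. `fetchTermCounts` is a hypothetical placeholder for whatever sends the request for one field and returns term counts; the sketch itself makes no Elasticsearch calls.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative sketch: submits one task per field and gathers the results.
class ParallelFieldFacets {

    // fetchTermCounts is a hypothetical stand-in for "run a terms facet
    // on this field and return term -> count".
    static Map<String, Map<String, Long>> run(
            List<String> fields,
            Function<String, Map<String, Long>> fetchTermCounts) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, fields.size()));
        try {
            // Submit everything first so the requests overlap in time.
            Map<String, Future<Map<String, Long>>> pending = new LinkedHashMap<>();
            for (String field : fields) {
                Callable<Map<String, Long>> task = () -> fetchTermCounts.apply(field);
                pending.put(field, pool.submit(task));
            }
            // Then wait for each result, keeping the original field order.
            Map<String, Map<String, Long>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Map<String, Long>>> e : pending.entrySet()) {
                try {
                    results.put(e.getKey(), e.getValue().get());
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted waiting for " + e.getKey(), ex);
                } catch (ExecutionException ex) {
                    throw new IllegalStateException("request failed for " + e.getKey(), ex.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each field's request is then reduced on whichever node receives it, so the reduce work is spread out only if the requests actually go to different nodes.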

Cheers,
Boaz

Hrach_Pelibossian · October 28, 2013, 3:30pm

Thank you for your response.
If I have news I will write it here.

Hrach_Pelibossian · October 30, 2013, 1:54pm

Hello

How should I call several nodes to run queries/facets in parallel?
Should I use TransportClient?

I did a test with Elasticsearch 0.90.5, creating two nodes:

Node node1 = nodeBuilder()
        .settings(settingsBuilder().put(settings).put("name", "node1"))
        .data(false)
        .node();

Node node2 = nodeBuilder()
        .settings(settingsBuilder().put(settings).put("name", "node2"))
        .data(false)
        .node();
// data is false!

Then I create a client with TransportClient:

TransportClient clientt = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("192.168.0.195", 9300))
        .addTransportAddress(new InetSocketTransportAddress("192.168.0.195", 9301));

Then I get the list of DiscoveryNode:

ImmutableList<DiscoveryNode> dnodes = clientt.connectedNodes();

I get the info for the two nodes,
but isDataNode() returns true for both of them:

dnode1.isMasterNode(): true
dnode1.isClientNode(): false
dnode1.isDataNode(): true

dnode2.isMasterNode(): true
dnode2.isClientNode(): false
dnode2.isDataNode(): true

Yet both nodes should be non-data nodes.

So I do not know whether I can run queries/facets in parallel on two
different nodes with TransportClient. Maybe this is just a small bug in
DiscoveryNode.
My question is simply whether I can use TransportClient for this, or whether
I should try another way.

Boaz_Leskes · November 1, 2013, 10:19am

Hi Hrach,

The TransportClient will round-robin over the nodes you give it, so yes, you
can use it to load balance.

About the info you get back - it indeed does not reflect the node in
question. It can be confusing - I'll open an issue for that and fix it.
Thx!
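
For readers unfamiliar with the term, round-robin selection can be sketched as follows. This is a generic illustration in plain Java, not TransportClient's actual implementation; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Generic illustration of round-robin selection; not TransportClient code.
class RoundRobin<T> {
    private final List<T> targets;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobin(List<T> targets) {
        if (targets.isEmpty()) {
            throw new IllegalArgumentException("at least one target is required");
        }
        this.targets = targets;
    }

    // Returns the next target, wrapping around at the end of the list.
    // floorMod keeps the index valid even after the counter overflows.
    T next() {
        int index = Math.floorMod(counter.getAndIncrement(), targets.size());
        return targets.get(index);
    }

    public static void main(String[] args) {
        RoundRobin<String> nodes =
                new RoundRobin<>(Arrays.asList("192.168.0.195:9300", "192.168.0.195:9301"));
        for (int i = 0; i < 4; i++) {
            System.out.println("request " + i + " -> " + nodes.next());
        }
    }
}
```

With two addresses, consecutive requests alternate between them, which is what spreads per-field requests over both nodes.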

Cheers,
Boaz
