Real time vs On demand cluster


(Prashant Agrawal) #1

What is the limit on the number of documents that can be clustered with a real-time cluster versus an on-demand cluster?


(Binh Ly-2) #2

If you're referring to Carrot2 clustering, you might find that information
here:

http://project.carrot2.org/faq.html#scalability

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a85ff027-2200-4985-a309-932ae418b19d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Prashant Agrawal) #3

I just wanted to know whether there is any difference between a real-time and an on-demand cluster with respect to the number of documents to be indexed.


(Mark Walkom) #4

ES indexes data as soon as it receives it, and the data is available right
after that. It's as close to real time as it can get.

There is no concept of on demand, unless you are thinking of something
else.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com


(Prashant Agrawal) #5

Hi Mark,

I read somewhere that there are kinds of clustering called real time and on demand, so I was keen to know about that.

I have one more question, which might be a silly one, but I still wanted to ask: does clustering happen at the time we index the data into ES, or does it happen while retrieving the data from ES with a search query (e.g. via the carrot2 plugin)?

In other words: does indexing the data into ES also cluster the documents?


(Mark Walkom) #6

I don't think that applies to ES.

The indexing happens as soon as you post a document to ES, not as you query
it.
ES will automatically replicate the document (depending on your setup) at
the same time.

Regards,
Mark Walkom


(Prashant Agrawal) #7

> The indexing happens as soon as you post a document to ES, not as you query it.
> ES will automatically replicate the document (depending on your setup) at the same time.

This is fine for indexing.

But what exactly happens with respect to clustering? Say I submit one document to ES, so indexing happens at that time. Does clustering of documents also happen at the same time?

Or does the clustering happen when we fire the _search_with_clusters query at ES?


(Mark Walkom) #8

Clustering doesn't happen at the document level; it happens at the node
level, i.e. you have a cluster of N nodes.

Regards,
Mark Walkom


(Prashant Agrawal) #9

There are two scenarios.

  1. I am submitting the documents to ES for indexing.
  2. I am executing a search query using the clustering plugin.

To elaborate with an example:

Step 1:
curl -XPOST 'http://192.168.0.179:9200/prashant' -d '
{
  "mappings": {
    "emp": {
      "properties": {
        "empid":     { "type": "long",    "store": "no",  "precision_step": "0" },
        "empname":   { "type": "string",  "store": "yes", "index": "analyzed" },
        "empage":    { "type": "integer", "store": "yes", "precision_step": "0" },
        "empstatus": { "type": "string",  "store": "yes", "index": "analyzed" }
      }
    }
  }
}'

Step 2:
Now adding the documents with:
curl -XPOST 'http://192.168.0.179:9200/prashant/emp' -d '
{
  "empid": 1,
  "empname": "ABC",
  "empage": 25,
  "empstatus": "permanent"
}'

Step 3:
Now if I run a _search_with_clusters query for the documents, like this:

{
  "search_request": {
    "query": {
      "match": {
        "_all": "ABC"
      }
    },
    "size": 1000,
    "from": 0
  },
  "query_hint": "",
  "field_mapping": {
    "content": [
      "_source.empname"
    ]
  },
  "algorithm": "lingo3g"
}

It returns up to 1000 records in a hierarchical structure, with cluster labels and so on.
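
For reference, the Step 3 request could be assembled from Python roughly as follows. This is only a sketch: the helper name `build_cluster_request` is made up for illustration, and the URL layout (`/<index>/_search_with_clusters`) is an assumption about how the clustering plugin exposes its endpoint. The request is built but never sent.

```python
import json
import urllib.request


def build_cluster_request(base_url, index, query_text, content_field,
                          query_hint="", algorithm="lingo3g", size=1000):
    """Build (but do not send) a _search_with_clusters request.

    Hypothetical helper; the endpoint path is an assumption.
    """
    body = {
        "search_request": {
            "query": {"match": {"_all": query_text}},
            "size": size,
            "from": 0,
        },
        "query_hint": query_hint,
        "field_mapping": {"content": ["_source." + content_field]},
        "algorithm": algorithm,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search_with_clusters",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_cluster_request(
    "http://192.168.0.179:9200", "prashant", "ABC", "empname"
)
print(request.get_method(), request.full_url)
print(json.loads(request.data)["field_mapping"])
```

Nothing in this request refers to clusters computed earlier; it only carries a search and the name of the field to cluster on.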

So my question is: does the clustering of documents happen at step 2 or at step 3?


(Mark Walkom) #10

Step 2.

But as I said, you don't cluster a document, you might want to recheck your
terminology :)

Regards,
Mark Walkom


(David Pilato) #11

If I understand your question: when you get the response at Step 2, your document is on all the nodes that need to hold it.
It is not immediately available for search, though a get will work.
At most 1 second later, it will be available for search (on all of those nodes).

For demo purposes you can force a refresh using the refresh API, so your doc will be searchable once the refresh operation completes. Note that refresh can also be passed as a parameter to your index operation.
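
The behaviour described above (get works at once, search only after a refresh) can be pictured with a toy in-memory model. This is a minimal sketch and not how Elasticsearch is implemented; the class `ToyIndex` and its methods are invented for illustration only.

```python
class ToyIndex:
    """Toy model of near-real-time search; NOT Elasticsearch internals."""

    def __init__(self):
        self._by_id = {}       # every indexed document, readable by ID at once
        self._searchable = {}  # documents visible to search after a refresh

    def index(self, doc_id, doc, refresh=False):
        # Store the document; optionally refresh as part of the same call.
        self._by_id[doc_id] = doc
        if refresh:
            self.refresh()

    def refresh(self):
        # Make everything indexed so far visible to search.
        self._searchable = dict(self._by_id)

    def get(self, doc_id):
        return self._by_id.get(doc_id)

    def search(self, field, value):
        return [doc_id for doc_id, doc in self._searchable.items()
                if doc.get(field) == value]


idx = ToyIndex()
idx.index("1", {"empname": "ABC"})
before = idx.search("empname", "ABC")  # no refresh yet, so no hits
fetched = idx.get("1")                 # get by ID already works
idx.refresh()
after = idx.search("empname", "ABC")   # visible after the refresh
print(before, fetched, after)
```

In the toy model, passing `refresh=True` to `index` plays the role of the refresh parameter on the index operation mentioned above.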

Hope this helps

--
David ;)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


(Dawid Weiss) #12

> But as I said, you don't cluster a document, you might want to recheck your terminology :)

The terminology is fine. The same word applies to two different things
here, hence the confusion. Clustering in terms of infrastructure
arrangement and clustering as in statistical data analysis (or text
analysis).

> Clustering means I wanted to know like I submitted one docs to ES so Indexing will happen at that time. So is it like that clustering of documents will also happens at the same time.

The Carrot2 plugin to ES does post-retrieval document clustering, so
you get clusters for each individual query (and its set of hits). For
this reason the query is also important -- it provides a hint to the
algorithm as to which trivial clusters it should avoid.
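
To make the idea of post-retrieval clustering and the query hint concrete, here is a deliberately naive sketch. It is not the Lingo or Lingo3G algorithm and not the Carrot2 API; the function `cluster_hits` and the sample hits are assumptions made up for illustration. It groups the hits of one query by shared terms and skips terms that come from the query itself.

```python
from collections import defaultdict


def cluster_hits(hits, query_hint=""):
    """Group hits by shared terms, skipping terms that appear in the query.

    Toy stand-in for a real clustering algorithm.
    """
    ignored = set(query_hint.lower().split())
    clusters = defaultdict(list)
    for doc_id, text in hits:
        for term in sorted(set(text.lower().split()) - ignored):
            clusters[term].append(doc_id)
    # Keep only labels shared by at least two hits.
    return {label: ids for label, ids in clusters.items() if len(ids) >= 2}


# Hypothetical hits returned for the query "apple"
hits = [
    (1, "apple mobile phone"),
    (2, "apple mobile tablet"),
    (3, "apple laptop"),
]

with_hint = cluster_hits(hits, query_hint="apple")
without_hint = cluster_hits(hits)
print(with_hint)
print(without_hint)
```

Without the hint, the query term itself becomes a cluster that contains every hit, which is the kind of trivial cluster the hint helps to avoid.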

An off-line document clustering would have to be executed on all
documents in a collection (index), assign cluster labels and then just
filter these at query time (much like faceting does). Carrot2 does
not provide such a functionality (and very likely won't scale to
large indexes). You may want to check out Apache Mahout for this.
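
The off-line approach can be sketched in the same toy style: labels are assigned once over the whole collection, and query time is reduced to a filter on the stored label. Everything here is hypothetical, including `first_word_label`, which merely stands in for a real batch clustering algorithm.

```python
def assign_labels_offline(collection, label_for):
    """Batch pass over the whole collection: store a label on every document."""
    return {doc_id: dict(doc, cluster=label_for(doc))
            for doc_id, doc in collection.items()}


def filter_by_label(labelled, label):
    """Query time: no clustering, only a lookup on the precomputed label."""
    return sorted(doc_id for doc_id, doc in labelled.items()
                  if doc["cluster"] == label)


def first_word_label(doc):
    # Hypothetical stand-in for a real batch clustering algorithm.
    return doc["text"].split()[0].lower()


collection = {
    1: {"text": "Mobile phone review"},
    2: {"text": "Laptop buying guide"},
    3: {"text": "Mobile tariff comparison"},
}

labelled = assign_labels_offline(collection, first_word_label)
mobile_docs = filter_by_label(labelled, "mobile")
print(mobile_docs)
```

The expensive step runs once, without the context of any query, which is why it has to cover every document in the collection.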

Dawid


(Prashant Agrawal) #13

Hi Dawid,

If I understood correctly, with respect to ES:

  1. Once we send the docs to ES, it indexes the data. Clustering (i.e. assigning cluster labels) does not happen at this time; the documents are only stored in ES.

  2. Once we send the search query, it fetches the records, and the clustering (assigning cluster labels and building the hierarchy) happens on top of those records.

And one more thing: is there any concept like real-time clustering and on-demand clustering in ES?

~Prashant


(Jörg Prante) #14

Elasticsearch indexes the terms in documents for search (plus the source of
the documents in _source).

With the carrot2 clustering plugin, you can perform "unsupervised document
clustering". That means you dynamically organize search hits according to
their similarity after the hits have been calculated. Terms are clustered as
they appear in the hits, based on statistical relationships.
After the clustering has been done, the clusters are labeled.

The other alternative is "supervised document clustering", which relies on
a priori knowledge and is done with the aggregations feature. To use
aggregations, the underlying information about document relationships has
to be in place in the index.
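
As a rough illustration of grouping by information that is already in the index, the sketch below shows a terms aggregation request body next to a toy in-memory equivalent. The aggregation name `by_status`, the helper `terms_buckets` and the sample documents are assumptions for illustration; the field `empstatus` is borrowed from the mapping posted earlier in this thread.

```python
from collections import Counter

# Request body for a terms aggregation on an existing field.
# "by_status" is an arbitrary, made-up aggregation name.
aggregation_request = {
    "size": 0,
    "aggs": {
        "by_status": {"terms": {"field": "empstatus"}}
    },
}


def terms_buckets(docs, field):
    """Toy equivalent: count documents per distinct value of a stored field."""
    return Counter(doc[field] for doc in docs if field in doc)


docs = [
    {"empname": "ABC", "empstatus": "permanent"},
    {"empname": "DEF", "empstatus": "contract"},
    {"empname": "GHI", "empstatus": "permanent"},
]

buckets = terms_buckets(docs, "empstatus")
print(dict(buckets))
```

The groups are fully determined by values stored at indexing time, so nothing has to be discovered from the hits at query time.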

Generally speaking, supervised document clustering is faster than
unsupervised document clustering. But unsupervised document clustering can
be more powerful regarding document relationship exploration.

Instead of "realtime" and "ondemand", I'd suggest using the technical terms
"supervised" and "unsupervised" for document clustering, so people cannot
confuse this with other technology in ES.

Jörg


(Dawid Weiss) #15

> And one more thing, Is there any concept like Real time clustering and On
> demand clustering in ES?

I honestly think you've read two different marketing snippets that
both relate to the same thing... Both "on-demand" and "real-time"
clustering happen on a subset of documents from the index and should
return clusters within a reasonable amount of time (so that users can
interact with the system). Off-line or batch clustering would denote a
situation when you're clustering all your documents, without the context
of any query. And this can take significantly longer.

This paper has a writeup of on-line clustering techniques in the
context of clustering search results (disclosure: I'm partially
responsible for it).
http://dl.acm.org/citation.cfm?id=1541884&dl=ACM&coll=DL&CFID=298573694&CFTOKEN=23620876

@Jörg: clustering is a pretty established term in information
retrieval and it nearly always denotes an unsupervised technique. I
would be hesitant to talk about "supervised clustering", even if such
a thing could be imagined by either introducing a feedback control
loop (clustering-human evaluation-reclustering) or by introducing a
predefined concept ontology... in which case it effectively becomes a
classification problem.

Dawid


(Prashant Agrawal) #16

Hi Jörg,

Thanks for the clarification on most of these points. So I can say that clustering does not happen at indexing time; at search time, after the matching documents have been retrieved, carrot2 clusters them and returns the response as clusters.

Also, as you said, we can even do "supervised" clustering, so I wanted to know whether we need an add-on plugin to support supervised clustering or whether carrot2 is enough for that.

From what I have read about aggregations, they can be created with predefined functions like min, max, sum, avg, etc. Is there any way to create our own custom filter using aggregations?

For example, if I create a cluster label "Mobile", all documents containing the word "mobile" should go into that cluster.


(Prashant Agrawal) #17

Hi Dawid,

> "On-demand" and "real time" happens on a subset of documents from the index and should return clusters
> within a reasonable amount of time (so that users can interact with
> the system).

This means ES has no separate concepts of real-time and on-demand clustering. The documents are clustered and returned in the response only after the search query is fired.

> Off-line or batch clustering would denote a situation when you're clustering all your documents, without the context of any query.

Is offline clustering supported by carrot2 or by any other plugin available for ES?

