Hi,
as far as I understand the REST API, the following query will give me back 5+5 nodes (5 seed vertices plus 5 connected vertices), with a weight on the edges.
{
  "query": {
    "query_string": {
      "default_field": "_all",
      "query": "kibana"
    }
  },
  "controls": {
    "use_significance": true,
    "sample_size": 2000,
    "timeout": 5000
  },
  "connections": {
    "vertices": [
      {
        "field": "entities.hashtags.text.analyzed",
        "size": 5,
        "min_doc_count": 3
      }
    ]
  },
  "vertices": [
    {
      "field": "entities.hashtags.text.analyzed",
      "size": 5,
      "min_doc_count": 3
    }
  ]
}
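(For reference, a minimal way I submit this request from a script; the index name "tweets" and the _graph/explore endpoint path are assumptions here, so adjust host, index and path to your own setup and Graph version.)

import json
import requests

# Assumed endpoint: adjust host, index name and path to your Graph version.
ES_URL = "http://localhost:9200/tweets/_graph/explore"

request_body = {
    "query": {"query_string": {"default_field": "_all", "query": "kibana"}},
    "controls": {"use_significance": True, "sample_size": 2000, "timeout": 5000},
    "connections": {
        "vertices": [
            {"field": "entities.hashtags.text.analyzed", "size": 5, "min_doc_count": 3}
        ]
    },
    "vertices": [
        {"field": "entities.hashtags.text.analyzed", "size": 5, "min_doc_count": 3}
    ],
}

response = requests.post(ES_URL, data=json.dumps(request_body),
                         headers={"Content-Type": "application/json"})
# The response lists "vertices" (terms with weights) and "connections"
# (edges between vertex indices, each with a weight and a doc count).
print(response.json())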
I'd like to know the logic behind the selection of these connections.
I understand that the vertices are first ranked according to the ES significant-terms algorithm, based on the sample available in the specific shard. This sample comes from a subset of a ranked list of documents returned by the seed query.
Then, I suppose that for each shard, the edges that occur at least "shard_min_doc_count" times are retained, with no ranking at that level.
Then, I suppose the edge counts from the different shards are aggregated and a further filtering is performed based on "min_doc_count".
My open points in this process are the following:
- how / when do the edge weights enter the graph creation process?
- how / when is the vertices' "size" field taken into consideration in the graph creation process?
It's worth reiterating:
- vertices are terms (in your example, hashtags)
- edges are collections of documents that connect 2 terms (in your case, many tweets)
When we talk about min_doc_count we mean the number of documents in the results sample that contain a term.
Edge weights are a measure of the cohesion between 2 terms as evidenced by many docs. The edge kibana->elasticsearch may have a greater weight than kibana->awesome even though kibana appears more often with the tag awesome than it does with elasticsearch. This is because #awesome is commonly associated with many other tweets, whereas elasticsearch is what we call "uncommonly common" in the kibana tweets and therefore more important.
The "size" field is admittedly oddly named. It is the maximum number of vertex terms you want returned and is named "size" to be consistent with other existing aggregation parameters.
Hi Mark,
thanks for the explanation.
I am, however, a bit confused now by the explanation of the handling of these terms in
https://www.elastic.co/guide/en/graph/current/graph-api-rest.html
* min_doc_count acts as a certainty threshold - just how many documents have to contain a pair of terms before we consider this to be a useful connection? (default is 3)
* shard_min_doc_count is an advanced setting - just how many documents on a shard have to contain a pair of terms before we return this for global consideration? (default is 2)
Indeed, you mentioned this to me:
"When we talk about min_doc_count we mean the number of documents in the results sample that contain a term."
Is this filtering applied when counting occurrences of pairs of terms or occurrences of single terms?
It can be both, actually; it depends on the context.
1. Single terms when identifying the initial terms that match your seed query (e.g. the hashtags associated with a query for tweets posted today).
2. Pairs of terms when considering the other terms connected to the initial terms identified in 1).
3. Pairs of terms when considering the other terms connected to the secondary terms identified in 2) - and so on.
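A rough sketch of what is being counted at each of these stages (purely illustrative, using a small in-memory doc sample rather than the real per-shard machinery):

from collections import Counter
from itertools import combinations

# Hypothetical sample of documents, each reduced to its set of hashtag terms.
sample_docs = [
    {"kibana", "elasticsearch"},
    {"kibana", "awesome"},
    {"kibana", "elasticsearch", "dashboards"},
    {"awesome", "coffee"},
]

# Stage 1: single-term doc counts among the docs matching the seed query.
single_counts = Counter(term for doc in sample_docs for term in doc)

# Stages 2+: pair-of-terms doc counts, i.e. how many docs connect two terms.
pair_counts = Counter(pair for doc in sample_docs
                      for pair in combinations(sorted(doc), 2))

print(single_counts)  # e.g. kibana: 3, elasticsearch: 2, ...
print(pair_counts)    # e.g. ('elasticsearch', 'kibana'): 2, ...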
OK, then is it working like this:
initialization:
- filtering of single terms that match the seed query and for which occurrences >= "shard_min_doc_count"
- filtering of the top "size" (i.e. 5 in my case) most significant single terms
first "hop":
- filtering of pairs of terms that appear to be connected to the 5 initial terms identified at the initialization step and for which occurrences >= "shard_min_doc_count"
- filtering of the top "size" (i.e. 5 in my case again) most significant "edges"!
and it stops here, I suppose, according to my input JSON (only one hop).
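If that reading is correct, the flow per hop would look roughly like this sketch (purely illustrative: it ranks by plain popularity, ignores the per-shard / global split between shard_min_doc_count and min_doc_count, and is certainly not the actual implementation):

from collections import Counter

def toy_explore_one_hop(sample_docs, size, min_doc_count):
    """Toy one-hop exploration over an in-memory doc sample (docs are sets of
    terms). Ranking here is plain popularity; with use_significance=true the
    ordering would come from a significance score instead."""
    # Initialization: single-term doc counts, threshold, keep top `size`.
    term_counts = Counter(t for doc in sample_docs for t in doc)
    roots = [t for t, c in term_counts.most_common()
             if c >= min_doc_count][:size]

    # First hop: doc counts of pairs linking a root to some other term,
    # threshold, then keep the top `size` edges overall.
    pair_counts = Counter(
        (root, other)
        for doc in sample_docs
        for root in doc if root in roots
        for other in doc if other not in roots
    )
    edges = [(pair, c) for pair, c in pair_counts.most_common()
             if c >= min_doc_count][:size]
    return roots, edges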
In the end, is this sampling process documented somewhere? I need to characterize your sampling process in order to assess its impact on the connectivity pattern of the generated subgraphs with respect to the full graph.
It uses the relevance ranking behaviour of the underlying search engine, which has some well-established heuristics [1] from information theory. Additionally, you can take control of the ranking process using arbitrary query logic [2]. Looking at too much (low-quality) information has a tendency to dilute the signals we look for (terms bend towards the 'norm' line in [3]), so looking at high-quality doc samples keeps the signal strength high. However, diversity can be an important requirement when sampling, so we have controls that let you manage this [4].
All of these factors help in scenarios where you have noisy data and want to tune into just the useful links (e.g. a recommendation engine mining the "wisdom of crowds"). If you are not looking for summaries and want to simply explore all connections, it helps to turn off these controls [5].
[1] See TF-IDF and the more recent BM25
[2] Function Score Query | Elasticsearch Guide [2.3] | Elastic
[3] https://www.quora.com/Whats-the-best-way-to-display-significant-terms-aggregation-in-Kibana
[4] Sampler Aggregation | Elasticsearch Guide [2.3] | Elastic
[5] Troubleshooting | Elasticsearch Graph [2.4] | Elastic
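To make [4] and [5] concrete, both live in the "controls" section of the request. This is a sketch only: the sample_diversity block and the author field name are assumptions here, so check the Graph docs for your version before relying on them.

# Controls tuned for "show me everything" exploration rather than summaries.
exhaustive_controls = {
    "use_significance": False,   # rank by raw popularity, not significance
    "sample_size": 50000,        # consider many more docs from the seed query
    "timeout": 20000,
}

# Controls that keep the sample high-quality but diverse, so one prolific
# author cannot dominate the signal ("user.screen_name" is a hypothetical field).
diverse_controls = {
    "use_significance": True,
    "sample_size": 2000,
    "sample_diversity": {
        "field": "user.screen_name",
        "max_docs_per_value": 10,
    },
}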
Thanks for all the details regarding the relevance ranking behaviour. I'll review it in detail.
However, sorry for not being clear enough about what I want to know; I'll try to reshape my questions differently.
Suppose:
- use_significance = False
- min_doc_count=1
- shard_min_doc_count=1
- sample_size = the number of docs in my index
then what is the node (and edge?) filtering based on if I use size=5, as in my first query, in order to display just a subset of nodes?
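Concretely, that scenario corresponds to a request along these lines (a sketch reusing the field from my first query; the index size value is only a placeholder):

# Hypothetical request: significance off, thresholds at 1, sample = whole index.
doc_count_of_index = 1000000  # placeholder for the actual number of docs

hypothetical_request = {
    "query": {"query_string": {"default_field": "_all", "query": "kibana"}},
    "controls": {
        "use_significance": False,
        "sample_size": doc_count_of_index,
        "timeout": 5000,
    },
    "vertices": [
        {"field": "entities.hashtags.text.analyzed",
         "size": 5, "min_doc_count": 1, "shard_min_doc_count": 1}
    ],
    "connections": {
        "vertices": [
            {"field": "entities.hashtags.text.analyzed",
             "size": 5, "min_doc_count": 1, "shard_min_doc_count": 1}
        ]
    },
}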
What metrics are used in order to perform this filtering?
Are the metrics evaluated on the fly?
Does the filtering apply to the connections as well, or only to the set of terms?
Or is the filtering applied along the exploration process? If yes, how?
It is important for me to know how the graph exploration is done, as it may not be relevant for studying some social network processes with my current data structure; or, put in other words, the data structure may have to be tuned so that I can use your exploration process to study some specific social network processes.
Thanks for your availability.
The 5 most popular terms, measured by the number of docs with at least one occurrence.
"What metrics are used in order to perform this filtering? Are the metrics evaluated on the fly?"
We count the docs that contain the term.
The "size" filtering occurs to find the most popular N root terms for the starting query, then finds the most popular N branch terms under each root (so a maximum of N roots x N branch terms), which are then pruned to just the best N branches overall.
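In toy form, that pruning looks something like this (illustrative only; "score" here stands in for either popularity or significance, and this is not the actual implementation):

def prune_branches(roots, branch_candidates, n, score):
    """Toy version of the 'N roots x N branches pruned to best N' step.

    roots: the N root terms already selected for the starting query.
    branch_candidates: dict mapping each root to its candidate branch terms.
    score: callable returning the ranking value of a (root, branch) edge.
    """
    candidate_edges = []
    for root in roots:
        ranked = sorted(branch_candidates[root],
                        key=lambda branch: score(root, branch), reverse=True)
        candidate_edges.extend((root, b) for b in ranked[:n])  # up to N per root

    # Up to N x N candidate edges, pruned to the best N overall.
    candidate_edges.sort(key=lambda edge: score(*edge), reverse=True)
    return candidate_edges[:n]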
My last questions about the term "N best branches":
- N: what would be the value of N if I have defined size as n1 for the root terms and n2 for the first exploration vertices? I suppose it is n2, i.e. you don't re-evaluate the importance of the root terms, correct?
- Best branches: when use_significance=False, do you use the mean of the popularity of each node? When use_significance=True, you use the "cohesion" algorithm, correct? Is there a way to tweak this "cohesion" algorithm with my own implementation?
"I have defined size as n1 for the root terms and n2 for the first exploration vertices? I suppose it is n2"
Correct. When users ask for n1 root terms and n2 branch terms, we had a choice between returning n1*n2 results or n2. We opted for n2.
"do you use the mean of the popularity of each node?"
Actually, with both the significance and popularity approaches, the weights used to prioritize nodes are further derived from these inputs. The weights are a share-of-signal, where significance scores or doc counts act as the signal, and the values are flushed through the various branches with different decays and boosts depending on the connectivity of the graph: a branch term could be arrived at via multiple different roots and might therefore be seen to be more "about" the initial query than a common branch term that was only referenced by one root.
Long story short: like document relevance scores, I wouldn't try to read too much into the weights that are returned, as they only have meaning in prioritizing one term over another during a request.
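Purely as a toy illustration of that "share of signal" idea (the decay value and the aggregation are made up and this is not the actual Graph API weighting), a branch reached via several roots ends up with more weight than one reached via a single root:

def toy_branch_weights(root_weights, edges, decay=0.5):
    """Toy propagation: each root passes a decayed share of its own weight
    down every edge, and a branch sums the shares it receives.
    Just the general shape of the idea, not the real formula."""
    branch_weights = {}
    for root, branch in edges:
        share = root_weights[root] * decay
        branch_weights[branch] = branch_weights.get(branch, 0.0) + share
    return branch_weights

roots = {"elasticsearch": 1.0, "logstash": 0.8}
edges = [("elasticsearch", "beats"), ("logstash", "beats"),
         ("elasticsearch", "awesome")]
print(toy_branch_weights(roots, edges))
# 'beats' (~0.9) is reachable from two roots, so it accumulates more
# weight than 'awesome' (~0.5), which is reached from only one root.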
Can you tell me if there is a paper that specifies this ranking algorithm in detail? I mean the one that ranks the explored nodes based on the current graph connectivity and branch weights.