Returning non-connected results?



Often when I search for a term, I get a few disjoint sets, with no edges between them. I would think the results should be strongly connected. Why are some nodes returned that cannot be connected with the returned search term node?

(Mark Harwood) #2

The default settings are tuned for large data sets. Can you answer the following:

  1. Do you have significant_links turned on in the settings?
  2. What number for "certainty" is in the settings?
  3. How many docs do you have in the index?
  4. How many shards?
  5. How many docs match your query?

  1. Yep (although the actual setting is use_significance, correct?)
  2. Not using certainty
  3. It's the Enron dataset - 1,353,160
  4. 5 shards (the default?)
  5. Which query - the graph query or the basic search query for the term?

(Mark Harwood) #4

Ah, apologies. I assumed you were using the GUI not the API.
The certainty setting in the GUI equates to the "min_doc_count" parameter here [1]

The default value is 3, meaning that we need to see at least 3 documents asserting a connection between term A and term B before we consider it a reliable link rather than a one-off pairing. If you dial this back to 1 you should see more connections, but they may be of lower quality.

The more shards you have, the less well-informed each of them is about significance/relevance. If you don't expect the dataset to grow (e.g. there are unlikely to be more Enron emails) and it is small enough, like Enron, to fit comfortably on one node, then it might make more sense to index with one shard only.
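As a rough illustration of the single-shard suggestion, the index could be created with a settings body along these lines. This is only a sketch: the helper name `single_shard_index_settings` is made up for this example, and since the shard count of an existing index is fixed at creation time, the data would have to be reindexed into the new index.

```python
import json


def single_shard_index_settings() -> dict:
    """Build an index-creation body that asks for exactly one primary shard.

    Hypothetical helper for illustration; it only builds the JSON body
    and does not talk to a cluster.
    """
    return {"settings": {"index": {"number_of_shards": 1}}}


# Print the body that would be sent when creating the index.
print(json.dumps(single_shard_index_settings(), indent=2))
```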

I'd first try dialling the min_doc_count down though.
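For anyone using the API rather than the GUI, lowering the threshold might look something like the sketch below. It only builds the request body for a graph explore call and does not send it. The helper name `build_explore_body`, the search term, and the field names `body` and `recipients` are assumptions for illustration and need to be replaced with fields from your own mapping. The endpoint path also depends on the version you run, so check the docs for your release.

```python
import json


def build_explore_body(term, search_field, vertex_field, min_doc_count=1):
    """Build a graph explore request body with a lowered min_doc_count.

    Hypothetical helper: field names are placeholders, not taken from
    the Enron mapping used in this thread.
    """
    # min_doc_count is set per vertex definition; the default is 3.
    vertex = {"field": vertex_field, "min_doc_count": min_doc_count}
    return {
        "query": {"match": {search_field: term}},
        "controls": {"use_significance": True},
        "vertices": [vertex],
        # Apply the same threshold when hopping to connected terms.
        "connections": {"vertices": [dict(vertex)]},
    }


body = build_explore_body("california", "body", "recipients")
print(json.dumps(body, indent=2))
```

With `min_doc_count` at 1, a single document mentioning both terms is enough to create an edge, so expect more links between the previously disjoint groups, along with more noise.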

