I need to generate a co-occurrence graph

Hi there!
What is the best way, in terms of performance, to implement a co-occurrence search on a couple of fields?

I found this solution, but I'm not sure it is the best way:

{
	"size": 0,
	"aggs": {
		"cooccurrencePC": {
			"terms": {
				"min_doc_count": 1,
				"script": "Set pairs = new HashSet(); for (person in doc['people'].values) { for (company in doc['companies'].values) { pairs.add(person + '->' + company); } } return pairs;",
				"size": 100
			}
		}
	}
}

thanks
Diego

You should be able to use a pair of nested terms aggs if the number of people/companies is sufficiently small:

POST test/doc/1
{
  "people":["john", "joe", "jane"],
  "company":["Just Js"]
}
POST test/doc/2
{
  "people":["john", "joe", "rob"],
  "company":["Only Os"]
}
GET test/_search
{
  "size": 0,
  "aggs": {
	"companies": {
	  "terms": {
		"field": "company.keyword"
	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "people.keyword"
		  }
		}
	  }
	}
  }
}

Thank you for your prompt reply.

If I understand correctly, in the example you provided (with a sub-aggregation), the query responds with the co-occurrences between the top companies (10 by default) and the people present in the subset of documents selected by each bucket of the first aggregation. But if I draw a co-occurrence graph with agg "companies" and sub agg "peoples", I get one result, whereas with agg "peoples" and sub agg "companies" I get a different one. So the field used in the first aggregation plays a fundamental role.

How can I guarantee that I find all co-occurrences (best-quality result vs. best performance)?
Maybe with several query requests and some business logic to combine them?

Thanks,
Diego
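
The "several requests plus business logic" idea can be sketched on the client side as a union of the edges read from each response. This is a minimal sketch, not a confirmed approach from this thread: the helper name `edges_from_nested_terms` is hypothetical, and the two response dicts are hand-trimmed illustrations in the usual nested `buckets` shape, not real output.

```python
def edges_from_nested_terms(response, outer_agg, inner_agg, outer_is_person):
    """Collect (person, company) pairs from a nested terms aggregation response."""
    edges = set()
    for outer in response["aggregations"][outer_agg]["buckets"]:
        for inner in outer[inner_agg]["buckets"]:
            if outer_is_person:
                edges.add((outer["key"], inner["key"]))
            else:
                # Flip the pair so both directions use (person, company).
                edges.add((inner["key"], outer["key"]))
    return edges


# Hand-trimmed, illustrative responses (assumed shape, not real output).
by_person = {"aggregations": {"peoples": {"buckets": [
    {"key": "joe", "doc_count": 2, "employees": {"buckets": [
        {"key": "Just Js", "doc_count": 1},
        {"key": "Only Os", "doc_count": 1}]}}]}}}
by_company = {"aggregations": {"companies": {"buckets": [
    {"key": "Apple", "doc_count": 3, "employees": {"buckets": [
        {"key": "alice", "doc_count": 1},
        {"key": "bob", "doc_count": 1}]}}]}}}

# Union of the edges seen from either direction.
merged = (
    edges_from_nested_terms(by_person, "peoples", "employees", True)
    | edges_from_nested_terms(by_company, "companies", "employees", False)
)
```

Merging only recovers edges that at least one of the two requests returned, so on its own it does not guarantee that every co-occurrence is found.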

I think I need to see examples of your data to be sure about what you're tackling, but people->companies and companies->people terms aggs should yield the same connections, assuming:

  1. each doc asserts that all the people listed inside work for all the companies listed inside.
  2. you have the size property on the terms agg cranked up high enough to get all parties
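
The claim can be checked locally with a small simulation of nested terms counting, using the two example documents from earlier in the thread. This is only a sketch under the two assumptions above: `nested_terms` is a hypothetical in-memory stand-in with no size limit, not an Elasticsearch call.

```python
from collections import Counter, defaultdict

# The two example documents posted earlier in the thread.
docs = [
    {"people": ["john", "joe", "jane"], "company": ["Just Js"]},
    {"people": ["john", "joe", "rob"], "company": ["Only Os"]},
]


def nested_terms(docs, outer_field, inner_field):
    """Per outer term, count the docs that contain each inner term (no size limit)."""
    result = defaultdict(Counter)
    for doc in docs:
        for outer in set(doc[outer_field]):
            for inner in set(doc[inner_field]):
                result[outer][inner] += 1
    return result


company_to_people = nested_terms(docs, "company", "people")
people_to_company = nested_terms(docs, "people", "company")

# Normalise both directions to (person, company) pairs before comparing.
edges_a = {(p, c) for c, people in company_to_people.items() for p in people}
edges_b = {(p, c) for p, companies in people_to_company.items() for c in companies}

print(edges_a == edges_b)
```

With nothing truncated, both directions describe the same set of (person, company) connections; only the nesting of the output differs.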

Hi Mark, I'll try to explain an edge case:

POST test/doc/1
{
  "people":["john", "joe", "rob"],
  "company":["Just Js"]
}

POST test/doc/2
{
  "people":["john", "joe", "rob"],
  "company":["Only Os"]
}

POST test/doc/3
{
  "people":["bob"],
  "company":["Apple", "IBM", "Microsoft"]
}

POST test/doc/4
{
  "people":["diego", "alice"],
  "company":["Apple", "IBM", "Microsoft"]
}

POST test/doc/5
{
  "people":["jose"],
  "company":["Apple", "IBM", "Microsoft"]
}

With that document distribution we have:

Top people:
joe (2)
john (2)
rob (2)
alice (1)
bob (1)
diego (1)
jose (1)

Top companies:
Apple (3)
IBM (3)
Microsoft (3)
Just Js (1)
Only Os (1)

So if the first aggregation uses people, like this:

POST test/_search
{
  "size": 0,
  "aggs": {
	"peoples": {
	  "terms": {
		"field": "people.keyword",
		"size" : "3"

	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "company.keyword",
			"size" : "3"
		  }
		}
	  }
	}
  }
}

Note that the size parameter is set to 3 because we have a very small set of documents (5). The response is:

"aggregations": {
    "peoples": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 4,
        "buckets": [
            {
                "key": "joe",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "john",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "rob",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            }
        ]
    }
}

Then, if the first aggregation uses companies, like this:

POST /test/_search
{
  "size": 0,
  "aggs": {
	"companies": {
	  "terms": {
		"field": "company.keyword",
		"size" : "3"

	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "people.keyword",
			"size" : "3"
		  }
		}
	  }
	}
  }
}

the response is:

"aggregations": {
    "companies": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 2,
        "buckets": [
            {
                "key": "Apple",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "IBM",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "Microsoft",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            }
        ]
    }
}

So in this example we can note:

  • if I aggregate on "peoples" with a sub agg on "companies", I lose the top companies; in the same way,
  • if I aggregate on "companies" with a sub agg on "peoples", I lose the top people.

Diego
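
The effect can be reproduced offline with a small simulation of the five documents above. This is a sketch only: `top_terms` and `nested_edges` are hypothetical helpers that mimic a single-shard terms aggregation (ordered by document count, ties broken by key), not Elasticsearch itself.

```python
from collections import Counter

# The five documents from the edge case above.
docs = [
    {"people": ["john", "joe", "rob"], "company": ["Just Js"]},
    {"people": ["john", "joe", "rob"], "company": ["Only Os"]},
    {"people": ["bob"], "company": ["Apple", "IBM", "Microsoft"]},
    {"people": ["diego", "alice"], "company": ["Apple", "IBM", "Microsoft"]},
    {"people": ["jose"], "company": ["Apple", "IBM", "Microsoft"]},
]


def top_terms(docs, field, size):
    """Top `size` terms by doc count, ties broken by key (assumed ordering)."""
    counts = Counter(term for doc in docs for term in set(doc[field]))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


def nested_edges(docs, outer_field, inner_field, size):
    """(outer, inner) pairs that survive truncation at both levels."""
    edges = set()
    for outer, _ in top_terms(docs, outer_field, size):
        matching = [d for d in docs if outer in d[outer_field]]
        for inner, _ in top_terms(matching, inner_field, size):
            edges.add((outer, inner))
    return edges


people_first = nested_edges(docs, "people", "company", 3)
# Flip to (person, company) so both results are comparable.
company_first = {(p, c) for c, p in nested_edges(docs, "company", "people", 3)}

all_edges = {(p, c) for d in docs for p in d["people"] for c in d["company"]}
missing = all_edges - (people_first | company_first)
print(sorted(missing))
```

In this simulation the two directions return disjoint edge sets, and even their union leaves out the three edges for jose, who is cut off by size 3 in both directions.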

Why so small if there are >3 companies?

It is set to 3 to illustrate this specific stress case.

In production we will have millions of documents, and we may set this parameter as high as possible (2^31-1), performance permitting.

The number of documents is not necessarily the limiting factor. It is the number of unique entities (people/companies) that the docs reference. It's probably worth asking what you hope to do with a response that may be gigabytes of JSON - it doesn't sound like something you should attempt in a single request over a distributed store. There's a reason graph databases tend to want to put everything on one machine and use masses of RAM.
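
A rough back-of-the-envelope estimate shows how the response grows with the number of unique pairs rather than with the number of documents. The entity counts below are purely hypothetical placeholders, and the per-bucket size is measured from one compact example bucket, so treat the result as an order-of-magnitude sketch only.

```python
import json

# One inner bucket as it appears in the responses above.
bucket = {"key": "Just Js", "doc_count": 1}
bytes_per_bucket = len(json.dumps(bucket).encode("utf-8"))


def estimated_response_bytes(unique_pairs, per_bucket=bytes_per_bucket):
    """Lower-bound estimate: one inner bucket per unique (person, company) pair."""
    return unique_pairs * per_bucket


# Hypothetical figures, chosen only to illustrate the scaling.
unique_people = 1_000_000
companies_per_person = 50

total = estimated_response_bytes(unique_people * companies_per_person)
print(f"about {total / 1e9:.1f} GB of inner buckets alone")
```

The estimate ignores the outer buckets, the per-aggregation metadata and any pretty-printing, all of which would add to the total.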

You're right,
thanks for your support

DM
