I need to generate a co-occurrence graph

DieMal · October 2, 2018, 7:23am

Hi there!
What is the best way (performance-wise) to implement a co-occurrence search on a couple of fields?

I found this solution, but I'm not sure that this is the best way:

{
	"size": 0,
	"aggs": {
		"cooccurrencePC": {
			"terms": {
				"min_doc_count": 1,
				"script": "Set pairs = new HashSet(); for (person in doc['people'].values) { for (company in doc['companies'].values) { pairs.add(person + '->' + company); } } return pairs;",
				"size": 100
			}
		}
	}
}

thanks
Diego

Mark_Harwood · October 2, 2018, 8:56am

You should be able to use a pair of nested terms aggs if the number of people/companies is sufficiently small:

POST test/doc/1
{
  "people":["john", "joe", "jane"],
  "company":["Just Js"]
}
POST test/doc/2
{
  "people":["john", "joe", "rob"],
  "company":["Only Os"]
}
GET test/_search
{
  "size": 0,
  "aggs": {
	"companies": {
	  "terms": {
		"field": "company.keyword"
	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "people.keyword"
		  }
		}
	  }
	}
  }
}

DieMal · October 3, 2018, 8:14am

Thank you for your prompt reply.

If I understand correctly, in your example (with the sub-aggregation) the query responds with the co-occurrences between the top companies (default size = 10) and the people present in the subset of documents "filtered" by the first aggregation. But if I draw a co-occurrence graph with agg "companies" and sub-agg "peoples", I get a different result than with agg "peoples" and sub-agg "companies". So the field used in the first aggregation plays a fundamental role.

How can I guarantee that I find all co-occurrences (best result quality vs. best performance)?
Maybe with several query requests and some business logic to combine them?

Thanks,
Diego
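The "several requests plus business logic" idea could be sketched client-side as below. This is a minimal Python sketch, not a method confirmed anywhere in this thread: the helper `edges_from_response` and the aggregation names `by_person` and `by_company` are hypothetical, and the two dictionaries are hand-written, trimmed stand-ins for real responses.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (person, company)


def edges_from_response(aggs: dict, outer: str, inner: str,
                        outer_is_person: bool) -> Set[Edge]:
    """Flatten a two-level terms aggregation response into (person, company) pairs."""
    edges = set()
    for outer_bucket in aggs[outer]["buckets"]:
        for inner_bucket in outer_bucket[inner]["buckets"]:
            if outer_is_person:
                edges.add((outer_bucket["key"], inner_bucket["key"]))
            else:
                edges.add((inner_bucket["key"], outer_bucket["key"]))
    return edges


# Hypothetical, trimmed "aggregations" sections of two responses.
people_first = {"by_person": {"buckets": [
    {"key": "john", "by_company": {"buckets": [{"key": "Just Js"}, {"key": "Only Os"}]}},
]}}
companies_first = {"by_company": {"buckets": [
    {"key": "Just Js", "by_person": {"buckets": [{"key": "john"}, {"key": "jane"}]}},
]}}

# Union of the edges seen from either direction.
merged = (edges_from_response(people_first, "by_person", "by_company", True)
          | edges_from_response(companies_first, "by_company", "by_person", False))
print(sorted(merged))
# [('jane', 'Just Js'), ('john', 'Just Js'), ('john', 'Only Os')]
```

A union like this can only recover edges that at least one of the two requests returned, so on its own it does not guarantee a complete graph.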

Mark_Harwood · October 3, 2018, 8:33am

I think I need to see examples of your data to be sure about what you're tackling, but people->companies and companies->people terms aggs should yield the same connections, assuming:

- each doc asserts that all the people listed inside work for all the companies listed inside
- you have the size property on the terms agg cranked up high enough to get all parties
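The symmetry claim can be checked offline with a small Python sketch. It assumes both conditions hold and uses the two sample documents from the earlier post; the helper `group` is hypothetical and only mimics an uncapped terms agg with an uncapped terms sub-agg, it does not call Elasticsearch.

```python
from collections import defaultdict

docs = [
    {"people": ["john", "joe", "jane"], "company": ["Just Js"]},
    {"people": ["john", "joe", "rob"], "company": ["Only Os"]},
]


def group(docs, outer_field, inner_field):
    """Map each outer value to every inner value that shares a document with it."""
    groups = defaultdict(set)
    for doc in docs:
        for outer in doc[outer_field]:
            groups[outer].update(doc[inner_field])
    return groups


by_company = group(docs, "company", "people")
by_person = group(docs, "people", "company")

# Normalise both directions to (person, company) pairs before comparing.
edges_a = {(p, c) for c, people in by_company.items() for p in people}
edges_b = {(p, c) for p, companies in by_person.items() for c in companies}
print(edges_a == edges_b, len(edges_a))  # True 6
```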

DieMal · October 3, 2018, 3:21pm

Hi Mark, I'll try to explain a limit case:

POST test/doc/1
{
  "people":["john", "joe", "rob"],
  "company":["Just Js"]
}

POST test/doc/2
{
  "people":["john", "joe", "rob"],
  "company":["Only Os"]
}

POST test/doc/3
{
  "people":["bob"],
  "company":["Apple", "IBM", "Microsoft"]
}

POST test/doc/4
{
  "people":["diego", "alice"],
  "company":["Apple", "IBM", "Microsoft"]
}

POST test/doc/5
{
  "people":["jose"],
  "company":["Apple", "IBM", "Microsoft"]
}

With that document distribution we have:

Top people:
joe (2)
john (2)
rob (2)
alice (1)
bob (1)
diego (1)
jose (1)

Top companies:
Apple (3)
IBM (3)
Microsoft (3)
Just Js (1)
Only Os (1)

So if the first aggregation uses people, like this:

POST test/_search
{
  "size": 0,
  "aggs": {
	"peoples": {
	  "terms": {
		"field": "people.keyword",
		"size" : "3"

	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "company.keyword",
			"size" : "3"
		  }
		}
	  }
	}
  }
}

Note that the size parameter is set to 3 (because we have a very small set of documents, only 5). The response is:

"aggregations": {
    "peoples": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 4,
        "buckets": [
            {
                "key": "joe",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "john",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "rob",
                "doc_count": 2,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 0,
                    "buckets": [
                        {
                            "key": "Just Js",
                            "doc_count": 1
                        },
                        {
                            "key": "Only Os",
                            "doc_count": 1
                        }
                    ]
                }
            }
        ]
    }
}

Then, if the first aggregation uses companies, like this:

POST /test/_search
{
  "size": 0,
  "aggs": {
	"companies": {
	  "terms": {
		"field": "company.keyword",
		"size" : "3"

	  },
	  "aggs": {
		"employees": {
		  "terms": {
			"field": "people.keyword",
			"size" : "3"
		  }
		}
	  }
	}
  }
}

the response is:

"aggregations": {
    "companies": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 2,
        "buckets": [
            {
                "key": "Apple",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "IBM",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            },
            {
                "key": "Microsoft",
                "doc_count": 3,
                "employees": {
                    "doc_count_error_upper_bound": 0,
                    "sum_other_doc_count": 1,
                    "buckets": [
                        {
                            "key": "alice",
                            "doc_count": 1
                        },
                        {
                            "key": "bob",
                            "doc_count": 1
                        },
                        {
                            "key": "diego",
                            "doc_count": 1
                        }
                    ]
                }
            }
        ]
    }
}

So in this example we can note:

- if I use an agg on "peoples" and a sub-agg on "companies", I lose the top companies; in the same way,
- if I use an agg on "companies" and a sub-agg on "peoples", I lose the top people.

Diego
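The limit case above can be reproduced without a cluster. The Python sketch below is a simplified single-shard simulation, not Elasticsearch itself: `top_terms` and `nested_edges` are hypothetical helpers, and the assumption is that buckets are ordered by descending doc count with ties broken by key, which matches the responses shown above.

```python
from collections import Counter

docs = [
    {"people": ["john", "joe", "rob"], "company": ["Just Js"]},
    {"people": ["john", "joe", "rob"], "company": ["Only Os"]},
    {"people": ["bob"], "company": ["Apple", "IBM", "Microsoft"]},
    {"people": ["diego", "alice"], "company": ["Apple", "IBM", "Microsoft"]},
    {"people": ["jose"], "company": ["Apple", "IBM", "Microsoft"]},
]


def top_terms(docs, field, size):
    """Top `size` values of `field` by doc count, ties broken by key."""
    counts = Counter(v for d in docs for v in set(d[field]))
    return sorted(counts, key=lambda k: (-counts[k], k))[:size]


def nested_edges(docs, outer, inner, size):
    """(outer, inner) pairs kept by a terms agg plus terms sub-agg, both capped at `size`."""
    edges = set()
    for key in top_terms(docs, outer, size):
        matching = [d for d in docs if key in d[outer]]
        edges.update((key, sub) for sub in top_terms(matching, inner, size))
    return edges


all_edges = {(p, c) for d in docs for p in d["people"] for c in d["company"]}
people_first = nested_edges(docs, "people", "company", 3)
# Flip (company, person) to (person, company) so both sets are comparable.
companies_first = {(p, c) for c, p in nested_edges(docs, "company", "people", 3)}

missing = all_edges - (people_first | companies_first)
print(len(all_edges), len(people_first), len(companies_first))  # 18 6 9
print(sorted(missing))
# [('jose', 'Apple'), ('jose', 'IBM'), ('jose', 'Microsoft')]
```

With size 3 the two directions return disjoint edge sets, and even their union misses the three edges for jose.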

Mark_Harwood · October 3, 2018, 3:24pm

Why set the size so small if there are more than 3 companies?

DieMal · October 3, 2018, 3:40pm

It is set to 3 to illustrate this specific stress case.

In production we will have millions of documents, and we may set this parameter as high as possible (2^31-1), performance permitting.

Mark_Harwood · October 3, 2018, 3:45pm

The number of documents is not necessarily the limiting factor. It is the number of unique entities (people/companies) that the docs reference. It's probably worth asking what you hope to do with a response that may be gigabytes of JSON - it doesn't sound like something you should attempt in a single request over a distributed store. There's a reason graph databases tend to want to put everything on one machine and use masses of RAM.
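A back-of-the-envelope Python sketch shows why the entity count dominates. Everything in it is an assumption for illustration: the function `worst_case_response_bytes` is hypothetical, the 60 bytes per bucket is a rough guess at the JSON size of one inner bucket, and the worst case assumes every person is connected to every company, which real data is unlikely to reach.

```python
def worst_case_response_bytes(unique_people: int, unique_companies: int,
                              bytes_per_bucket: int = 60) -> int:
    """Upper bound on response size if every person/company pair gets its own bucket."""
    return unique_people * unique_companies * bytes_per_bucket


# Hypothetical entity counts, not taken from the thread.
size_bytes = worst_case_response_bytes(unique_people=100_000, unique_companies=1_000)
print(f"{size_bytes / 1e9:.1f} GB")  # 6.0 GB
```

The document count never appears in this bound; only the number of distinct people and companies does.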

DieMal · October 4, 2018, 7:17am

You're right,
thanks for your support

DM

system · November 1, 2018, 7:17am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
