Significant terms aggregation custom score with more than one background

I'm trying to implement different types of disproportionality metrics for pharmacovigilance reports in Elasticsearch. These are basically different flavours of significant terms aggregations, and I'm pretty sure I can use custom scoring functions, but I need marginal counts.

The individual documents in Elasticsearch are adverse drug reports that have some report metadata, at least one suspect drug, and at least one reaction term.

I would aggregate first by suspect drug, and then do significant terms aggregation on reactions within each suspect drug bucket. The problem is that I need access to two different background counts for the scoring function. I need the count of all documents without the suspect drug and the count of all documents without the specific reaction that is being scored.

I know I can specify a custom background, but I can't seem to define more than one.

Does anyone have any ideas on how something like this could be implemented?

Hi dbuijs,

First I'd like to check my assumptions - you want to identify significant reactions for each drug?
So the drug is the bucketing that defines subsets of interest and in each of these you are trying to identify only the significant reactions?

The significance score for each reaction/drug combo is computed from four numbers:

  1. Foreground size - the number of reports for a particular drug
  2. Foreground count - the number of reports for a particular reaction for a particular drug
  3. Background size - the number of all reports
  4. Background count - the number of all reports with a particular reaction

"I need the count of all documents without the suspect drug and the count of all documents without the specific reaction that is being scored."

Can these be derived by a scripted heuristic from the 4 numbers presented to the scoring function?
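For example, assuming the subset genuinely is a subset of the superset, both of those "without" counts fall straight out of the four numbers. A quick plain-Python sketch (illustrative names, not the actual Painless parameters):

```python
# Sketch: deriving the two "without" counts from the four numbers a
# scripted heuristic receives. Function and argument names are
# illustrative only.

def derived_counts(subset_freq, subset_size, superset_freq, superset_size):
    # All reports minus reports for this drug = reports without the drug
    without_drug = superset_size - subset_size
    # All reports minus reports with this reaction = reports without the reaction
    without_reaction = superset_size - superset_freq
    return without_drug, without_reaction

# e.g. 10 reports for the drug, 3 of them with the reaction,
# 1000 reports overall, 40 of them mentioning the reaction anywhere
print(derived_counts(3, 10, 40, 1000))  # -> (990, 960)
```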

The query I've used previously for reaction data like this is:

GET fdadrugs/_search
{
  "size": 0,
  "aggs": {
    "drug": {
      "terms": {
        "field": "drugs.keyword"
      },
      "aggs": {
        "significant_reactions": {
          "significant_terms": {
            "field": "reactions.keyword"
          }
        }
      }
    }
  }
}

Mark,

Thanks so much for this. I think that's exactly what I was looking for. It was just way simpler than I thought. Background count for the particular reaction is across all documents in the index, correct? Not just within each drug bucket?

Yes. So with that, "headache" might be something we can spot as "commonly common" across all drugs and ignore even if it is prevalent in any one drug's set of reports. It is the things that are very rare in the background (e.g. "blood in stools") and comparatively common in the foreground set for a drug that are of most interest. This visualisation may be illuminating. The items in the top left corner are the most interesting.

Can I just clarify the terminology here:

Foreground size = subset size
Foreground count = subset freq
Background size = superset size
Background count = superset freq

Did I get that right?

I think I've got something sorted out, but it's barfing due to a divide by zero. Is there a way to just ignore these cases or assign them a null score?

Yep.

If you're using a custom background_filter then your subset may no longer be a subset of the superset. This can produce superset_freq values of zero if your custom filter matches an entirely different set of docs to the subset. Some of the scoring heuristics have settings that add safeguards for when your background filter produces results that break the subset/superset assumptions.
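Here's a toy illustration of that failure mode, with plain Python sets standing in for docs (nothing Elasticsearch-specific):

```python
# If the custom background matches a disjoint set of docs, a term that is
# frequent in the foreground can have a frequency of zero in the background,
# and any heuristic that divides by superset_freq blows up.

foreground = [{"headache"}, {"headache", "nausea"}]     # docs for the drug
background = [{"dizziness"}, {"rash"}, {"dizziness"}]   # docs matching a custom filter

term = "headache"
subset_freq = sum(term in doc for doc in foreground)    # 2
superset_freq = sum(term in doc for doc in background)  # 0 -> unsafe denominator
print(subset_freq, superset_freq)  # -> 2 0
```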

Here's what I'm using to get the Proportional Reporting Ratio (PRR) (http://openvigil.sourceforge.net/doc/DPA.pdf):

POST drug_event/_search?size=0
{
  "aggregations": {
    "drug": {
      "terms": {
        "field": "report_ingredient_suspect.keyword"
      },
      "aggs": {
        "significant_reactions": {
          "significant_terms": {
            "field": "reaction_pt.keyword",
            "script_heuristic": {
              "script": "(params._subset_freq/params._subset_size)/((params._superset_freq - params._subset_freq)/(params._superset_size - params._subset_size))"
            }
          }
        }
      }
    }
  }
}

And it's giving me a divide by zero error. Any thoughts?

If a single drug's reports contain all of the incidents of a particular reaction then superset_freq == subset_freq and subtracting them gives zero?
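A quick check of that PRR arithmetic in plain Python (using the a/b/c/d definitions from the linked PDF; the numbers are made up):

```python
# PRR = (a / (a+b)) / (c / (c+d)) where a = subset_freq, a+b = subset_size,
# c = superset_freq - subset_freq, c+d = superset_size - subset_size.
# When every report of the reaction names this drug, c == 0 and the
# division fails; one option is to guard and return 0 for those terms.

def prr(subset_freq, subset_size, superset_freq, superset_size):
    reaction_without_drug = superset_freq - subset_freq   # c
    reports_without_drug = superset_size - subset_size    # c + d
    if reaction_without_drug == 0:
        return 0.0  # pick a convention for "all cases belong to this drug"
    return (subset_freq / subset_size) / (reaction_without_drug / reports_without_drug)

print(prr(3, 10, 40, 1000))  # ordinary case, PRR ~= 8.03
print(prr(5, 10, 5, 1000))   # all reaction reports are for this drug -> guarded, 0.0
```

One thing to be aware of: if I remember right, significant_terms only returns buckets with a positive score, so a guard that returns 0 quietly drops those terms rather than erroring.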

That does seem to have something to do with it. When I add +1 to the denominator, it doesn't error any more, but I don't get any significant terms.

I ran it again with the script returning each individual parameter on its own, and that works, but as soon as I try to do something with them, I get empty buckets. Is there a minimum score threshold I can adjust?

I’m not sure I understand your current problem. If your script returns >0 and doesn’t divide by zero then you should see some results.
Have you determined the out-of-the-box heuristics don't work for you? They all work with the same four numbers and generally just place a different emphasis on scoring for popular vs rare terms. Some of the differences are very subtle.

I think I got it working. These particular scoring algorithms have been published and validated in the scientific literature specifically for pharmacovigilance. I do want to look at how the other built-in metrics compare, but in order to get my scientist colleagues to look at it, I need to be able to demonstrate that Elasticsearch is accurately replicating metrics they already understand.

We're still seeing some strangeness in the _superset_size parameter. It seems to change based on the significant terms aggregation size parameter.

This is likely caused by having multiple shards - distribution is often the enemy of accuracy. When your data is not all in one place for analysis then different shards can have different ideas about what the top N terms are. When their results are glued together it’s not necessarily the same answer as the result for the same query on a single shard. The shard_size setting can help but for accurate significant terms analysis having a single index/shard is best.
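A toy simulation of how that per-shard truncation skews the merged counts (plain Python, invented numbers; not real shard behaviour):

```python
from collections import Counter

# Two "shards" hold different term distributions. Each returns only its
# local top-1 term, and the coordinating node sums whatever it received,
# so the merged counts can disagree with the true global counts.

shard_a = Counter({"headache": 50, "rash": 49})
shard_b = Counter({"rash": 60, "headache": 10})

def merged_top(shards, size):
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(size):  # per-shard truncation
            merged[term] += count
    return merged

print(merged_top([shard_a, shard_b], size=1))  # rash: 60, headache: 50
print(shard_a + shard_b)                       # true totals: rash: 109, headache: 60
```

Raising the per-shard `size` (or `shard_size`) narrows the gap, but only a single shard removes it entirely.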

It seems like there's significant variability in the value of _superset_size, even without touching anything, just running the same query repeatedly.

What is concerning is that every time it reports that it has heard back from all shards, and the reported upper bound for error is about 5K, but the actual variation I'm seeing within a few seconds is more than 2M (the index is read-only; nothing is getting written).

I'll also add that superset_size is quite a bit bigger than the total document count in the index. I'm guessing this is a result of the fields being arrays with more than one value, so it doesn't worry me as much as the number changing does.
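To illustrate my guess with toy data (plain Python, not the real mapping):

```python
# With array fields, the number of field *values* exceeds the number of
# documents, which could inflate value-based counts past the doc count.

docs = [
    {"reactions": ["headache", "nausea"]},
    {"reactions": ["rash"]},
    {"reactions": ["headache", "rash", "dizziness"]},
]

total_docs = len(docs)
total_values = sum(len(d["reactions"]) for d in docs)
print(total_docs, total_values)  # -> 3 6
```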

Is this normal behaviour?

Do you have updates to docs or deletes? Do you have replicas?

There are documented issues with how these stats are calculated with regard to deletes, which may vary between replicas due to differing merges.