Eliminate Multiple Values of the Same Field from a Collection of Documents


#1

Hello...

I spent much of yesterday doing some exhaustive--and exhausting!--analysis of a large collection of IT tickets.

One thing I'm trying to do at the moment is eliminate documents based on whether or not their respective 'resolution' fields consist solely of vague and useless phrases. Yes, alas, a lot of these tickets have as their resolution 'Your problem is related to a major incident that has been resolved...' with absolutely no descriptive data of the 'major incident' mentioned. It's clear some support folks have a palette of bromides they paste in from time-to-time as they close out tickets. They probably think this is protecting them from being replaced by a computer...and they're right--for these tickets, a parrot would suffice:-).

So, I have a collection of these weasel phrases. More or less, I was successful using Kibana's filter tools to create one negative filter at a time on each weasel phrase--that is, 'resolution' IS NOT [weasel phrase]. Together, these filters did reasonably well at paring the corpus down to tickets containing useful resolution information, which I'm happy about.

But, I'm not entirely confident in my results because the way Elasticsearch operates means I am not--as far as I can tell--always eliminating exact matches.

I realize one solution is to store the resolution field in keyword form as well, but some of these resolutions are lengthy, multi-line blurbs.

How should I proceed?

For example, doing this as a Lucene search looks like:

NOT resolution: ("weasel phrase 1", "weasel phrase 2", ..., "weasel phrase n")

Not much fun when n gets to be big--say around 30. Alternatively, I can take my list of weasel phrases and use Python to generate a JSON query I can submit or copy/paste; for that matter, I can also whip up the above Lucene query using this method.
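For what it's worth, here is a minimal sketch of that generation step. The phrase list is a placeholder, and the helper names (escape_quotes, lucene_exclusion, dsl_exclusion) are made up for illustration. Note that the Lucene form below joins the phrases with OR inside the parentheses, and that match_phrase on an analyzed field matches the phrase anywhere in the field rather than requiring the whole field to equal it.

```python
import json

# Placeholder phrases; substitute the real list of weasel phrases.
weasel_phrases = [
    "weasel phrase 1",
    "weasel phrase 2",
    "weasel phrase 3",
]


def escape_quotes(phrase):
    # Escape backslashes and double quotes so each phrase stays a single quoted phrase.
    return phrase.replace("\\", "\\\\").replace('"', '\\"')


def lucene_exclusion(field, phrases):
    # Builds a string of the form: NOT field:("p1" OR "p2" OR ...)
    quoted = " OR ".join('"{}"'.format(escape_quotes(p)) for p in phrases)
    return "NOT {}:({})".format(field, quoted)


def dsl_exclusion(field, phrases):
    # Builds a bool query with one match_phrase clause per phrase under must_not.
    return {
        "query": {
            "bool": {
                "must_not": [{"match_phrase": {field: p}} for p in phrases]
            }
        }
    }


lucene_query = lucene_exclusion("resolution", weasel_phrases)
dsl_query = dsl_exclusion("resolution", weasel_phrases)

print(lucene_query)
print(json.dumps(dsl_query, indent=2))
```

The first string can be pasted into the search bar; the JSON body is meant for a search request against the index.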

This little adventure has raised a few questions:

  1. Exact matching? Should it be happening when I raise a query for a text field? I'm guessing not unless the field is stored as a keyword, and that phrase matching will work better for a field stored in raw format; and I'm aware that multiple formats for a field are allowed.
  2. Confusion over Kibana's Filter Bar. If I create a couple of filters--let's call them P and Q for simplicity's sake--I know the filter bar will apply P & Q to the data--both filters must apply in the documents that pass through the filter. The filter bar offers a few options but the ones "Toggle" and "Invert" interest me. From what I can tell, "Toggle" just disables all of the filters if they are currently enabled or enables them if currently disabled. I would think, based on de Morgan's Theorems from Logic that "Invert" would switch "P & Q" to "~P || ~Q" (that's NOT P OR NOT Q). But it appears the "Invert" instead yields "~P & ~Q" (that's NOT P AND NOT Q). I guess I'm complaining mildly that your "Invert" feature on the filter bar doesn't obey de Morgan's formulae. This is a fine point, but if your filter bar's invert feature followed de Morgan's laws, I'd be able to do the other thing I want, which is find immediately the complement of the documents that use clear language--i.e., the rest of the documents in the index that use weasel words. I want those of course to go back to the support people and tell them to get the folks dispensing the weasel words to cease and desist;-)
  3. This brings up my final question: Suppose I go through a painstaking filtering process--as described earlier in this topic--and I save the resulting search as a Kibana "Saved Search" object. Is there an easy way to get the complement of this search with respect to the parent index? That is, let the parent index be denoted U and the set of documents in my saved search be A--I want what's in U but not in A. I suppose I could somehow pull all of the _id values from my saved search into a list and then feed them back in a query of the form NOT id: (id1, id2, ..., idn). Is there a quick/elegant way to do that?

Anyway, I'm grateful for any help/advice you can offer. Thanks in Advance!


(Brandon Kobel) #2

That is correct. If you're running the queries against an analyzed field (the analyzer is discussed more in-depth here), Elasticsearch performs "full text search" against these fields to rank the matches. A keyword field will let you do exact matches.
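As a rough sketch of the keyword approach: the field definition below adds an unanalyzed subfield next to the analyzed text. The subfield name "raw", the ignore_above value, and the helper exact_exclusion are illustrative assumptions, and the surrounding mapping request differs between Elasticsearch versions, so only the field definition is shown. Keep in mind that values longer than ignore_above are not indexed in the keyword subfield, which matters for lengthy multi-line resolutions.

```python
import json

# Assumed field definition: analyzed text plus an unanalyzed keyword subfield.
# "raw" and the ignore_above value are illustrative choices, not requirements.
resolution_field = {
    "resolution": {
        "type": "text",
        "fields": {
            "raw": {"type": "keyword", "ignore_above": 8191}
        },
    }
}


def exact_exclusion(field, phrases):
    # A terms query on a keyword field matches only when the whole stored
    # value equals one of the listed strings.
    return {
        "query": {
            "bool": {
                "must_not": [{"terms": {field: list(phrases)}}]
            }
        }
    }


exact_query = exact_exclusion(
    "resolution.raw",
    ["weasel phrase 1", "weasel phrase 2"],  # placeholder phrases
)

print(json.dumps(resolution_field, indent=2))
print(json.dumps(exact_query, indent=2))
```

With this shape, a ticket whose resolution merely contains a weasel phrase alongside real detail would not be excluded; only tickets whose entire resolution is one of the listed strings would be.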

Confusion over Kibana's Filter Bar. If I create a couple of filters--let's call them P and Q for simplicity's sake--I know the filter bar will apply P & Q to the data--both filters must apply in the documents that pass through the filter. The filter bar offers a few options but the ones "Toggle" and "Invert" interest me. From what I can tell, "Toggle" just disables all of the filters if they are currently enabled or enables them if currently disabled. I would think, based on de Morgan's Theorems from Logic that "Invert" would switch "P & Q" to "~P || ~Q" (that's NOT P OR NOT Q). But it appears the "Invert" instead yields "~P & ~Q" (that's NOT P AND NOT Q). I guess I'm complaining mildly that your "Invert" feature on the filter bar doesn't obey de Morgan's formulae. This is a fine point, but if your filter bar's invert feature followed de Morgan's laws, I'd be able to do the other thing I want, which is find immediately the complement of the documents that use clear language--i.e., the rest of the documents in the index that use weasel words. I want those of course to go back to the support people and tell them to get the folks dispensing the weasel words to cease and desist;-)

Agreed that this is a frustrating limitation of the filter bar. All of the filters are currently "AND"ed; there's no way to "OR" them, which would let you do what you're looking for. https://github.com/elastic/kibana/issues/3693 exists to track this request. If you wouldn't mind commenting on this issue with your use case or giving it a +1, it'll help us prioritize getting it added appropriately.
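To make the distinction concrete, here is a small sketch using Python sets. The document ids and the sets P and Q are made up for illustration; they stand in for the documents matched by two filters.

```python
# Toy universe of document ids; P and Q are the sets matched by two filters.
universe = set(range(10))
P = {0, 1, 2, 3, 4}
Q = {3, 4, 5, 6}

both = P & Q                                    # P AND Q (what the ANDed filters return)
complement = universe - both                    # NOT (P AND Q)
de_morgan = (universe - P) | (universe - Q)     # NOT P OR NOT Q
negate_each = (universe - P) & (universe - Q)   # NOT P AND NOT Q

# De Morgan: the complement of the ANDed filters is the OR of the negations.
assert complement == de_morgan

# Negating each filter while still ANDing them gives a smaller set here.
assert negate_each < complement

print(sorted(complement))
print(sorted(negate_each))
```

In this toy example, negating each filter individually misses every document that fails only one of the two filters, which is why an OR is needed to get the true complement.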

This brings up my final question: Suppose I go through a painstaking filtering process--as described earlier in this topic--and I save the resulting search as a Kibana "Saved Search" object. Is there an easy way to get the complement of this search with respect to the parent index? That is, let the parent index be denoted U and the set of documents in my saved search be A--I want what's in U but not in A. I suppose I could somehow pull all of the _id values from my saved search into a list and then feed them back in a query of the form NOT id: (id1, id2, ..., idn). Is there a quick/elegant way to do that?

There isn't an elegant way to do precisely this, unfortunately...
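Outside the Kibana UI, one possible workaround is to wrap the whole query in a bool/must_not clause and send it to Elasticsearch directly. The sketch below assumes you can obtain the Query DSL that the saved search sends; the saved_search_query value is a hypothetical stand-in, and complement_of is a made-up helper name.

```python
import json


def complement_of(query):
    # Wrap an existing query clause so that only documents that do NOT
    # match it are returned (U minus A, in the notation used above).
    return {"bool": {"must_not": [query]}}


# Hypothetical stand-in for the query behind a saved search.
saved_search_query = {
    "bool": {
        "must_not": [
            {"match_phrase": {"resolution": "weasel phrase 1"}},
            {"match_phrase": {"resolution": "weasel phrase 2"}},
        ]
    }
}

complement_body = {"query": complement_of(saved_search_query)}

print(json.dumps(complement_body, indent=2))
```

This avoids listing _id values, but it only works when the saved search can be expressed as a single query clause.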


#3

Hi Brandon,

Apologies for the delay in responding to your very helpful reply--long story whose details I will spare you.

In the end, I exported a CSV file of the search comprising the "good" results--the ones I wanted to eliminate from the index as a whole--and then did some quick work on them in a Jupyter notebook using Python. I just took the _id values of all of the "good" documents and then generated a Lucene query string that I pasted into Kibana's search bar--and, it worked! I got the number of documents I expected.
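For anyone following along, the notebook step was roughly like the sketch below. It reads from an in-memory sample rather than the real export, and it assumes the CSV has a column named "_id"; the sample rows and the helper names are made up for illustration. I've also included an ids-query variant in JSON form.

```python
import csv
import io
import json


def ids_from_csv(handle, column="_id"):
    # Collect the non-empty values of the id column from an exported CSV.
    reader = csv.DictReader(handle)
    return [row[column] for row in reader if row.get(column)]


def lucene_not_ids(ids):
    # Builds a string of the form: NOT _id:("id1" OR "id2" OR ...)
    quoted = " OR ".join('"{}"'.format(i) for i in ids)
    return "NOT _id:({})".format(quoted)


def dsl_not_ids(ids):
    # Same exclusion expressed with an ids query under bool/must_not.
    return {"query": {"bool": {"must_not": [{"ids": {"values": list(ids)}}]}}}


# Made-up sample standing in for the exported CSV of "good" documents.
sample = io.StringIO(
    '_id,resolution\n'
    'abc123,"Replaced the failed disk"\n'
    'def456,"Reset the password"\n'
)

good_ids = ids_from_csv(sample)
bad_docs_lucene = lucene_not_ids(good_ids)
bad_docs_dsl = dsl_not_ids(good_ids)

print(bad_docs_lucene)
print(json.dumps(bad_docs_dsl, indent=2))
```

With a real export, open the file with open(path, newline="") and pass the handle to ids_from_csv. The JSON form goes in a request body, which may be easier to handle than a very long query string.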

Just for grins, I figured maybe I should export the set of "bad documents" to a CSV file as well so that I could do some more intensive analysis off-line. Sadly, however, I got an error message: "Reporting Error: Request-URI Too Long." Any guess as to what is happening here? I was a little surprised simply because the query bar was able to handle this really long query, I could save the search, but somehow I can't export a CSV of this saved search?

Also, of course, thanks for the link regarding the analyzer. And, I have added a +1 to the issue you cited on Github.

Finally, I have to confess that I find the multiple query and scripting methodologies--Lucene, JSON, and Painless--could use better documentation than what exists. Don't get me wrong, I know the Elastic Team has no doubt invested considerable effort on this front. A lot more examples would help. From my point of view, the more that could be done to automatically generate queries, filters, et cetera, the better.

Thanks!