Hello...
I spent much of yesterday doing some exhaustive--and exhausting!--analysis of a large collection of IT tickets.
One thing I'm trying to do at the moment is eliminate documents based on whether or not their respective 'resolution' fields consist solely of vague and useless phrases. Yes, alas, a lot of these tickets have as their resolution 'Your problem is related to a major incident that has been resolved...' with absolutely no description of the 'major incident' mentioned. It's clear some support folks have a palette of bromides they paste in from time to time as they close out tickets. They probably think this is protecting them from being replaced by a computer...and they're right--for these tickets, a parrot would suffice:-).
So, I have a collection of these weasel phrases. More or less, I was successful using Kibana's filter tools to create one negative filter at a time for each weasel phrase--that is, 'resolution' IS NOT <weasel phrase>. Together, these filters performed reasonably well at paring down the corpus to tickets containing useful resolution information, for which I am happy.
But, I'm not entirely confident in my results because the way Elasticsearch operates means I am not--as far as I can tell--always eliminating exact matches.
I realize one solution is to store the resolution field in keyword form as well, but some of these resolutions are lengthy, multi-line blurbs.
How should I proceed?
For example, doing this as a Lucene search looks like:
NOT resolution: ("weasel phrase 1", "weasel phrase 2", ..., "weasel phrase n")
Not much fun when n gets big--say, around 30. Alternatively, I can take my list of weasel phrases and use Python to generate a JSON query I can submit or copy/paste; for that matter, I can also whip up the above Lucene query the same way.
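To make the Python idea concrete, here is a minimal sketch of what I have in mind. The function names and the placeholder phrases are my own inventions for illustration, and I'm assuming a bool query with one match_phrase clause per phrase under must_not is a reasonable JSON equivalent of the Lucene query above. I join the phrases with an explicit OR rather than commas, on the assumption that this is the safer spelling. Note that this still does phrase matching on the analyzed field, not exact whole-field matching.

```python
import json


def escape_phrase(phrase: str) -> str:
    """Escape backslashes and double quotes so the phrase can sit inside "..."."""
    return phrase.replace("\\", "\\\\").replace('"', '\\"')


def build_lucene_query(field: str, phrases: list[str]) -> str:
    """Build a single negated Lucene query covering every phrase."""
    quoted = " OR ".join(f'"{escape_phrase(p)}"' for p in phrases)
    return f"NOT {field}:({quoted})"


def build_bool_query(field: str, phrases: list[str]) -> dict:
    """Build a query-DSL body that excludes documents matching any phrase."""
    return {
        "query": {
            "bool": {
                "must_not": [{"match_phrase": {field: p}} for p in phrases]
            }
        }
    }


if __name__ == "__main__":
    # Placeholder phrases; the real list would be the ~30 weasel phrases.
    weasel_phrases = ["weasel phrase 1", "weasel phrase 2"]
    print(build_lucene_query("resolution", weasel_phrases))
    print(json.dumps(build_bool_query("resolution", weasel_phrases), indent=2))
```

The first line of output could be pasted into Kibana's query bar, and the JSON could be submitted as the body of a search request.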
This little adventure has raised a few questions:
- Exact matching? Should it be happening when I run a query against a text field? I'm guessing not unless the field is stored as a keyword, and that phrase matching will work better for the field stored in raw format; I'm aware that multiple formats for a field are allowed.
- Confusion over Kibana's Filter Bar. If I create a couple of filters--let's call them P and Q for simplicity's sake--I know the filter bar will apply P & Q to the data--both filters must apply in the documents that pass through the filter. The filter bar offers a few options, but the ones that interest me are "Toggle" and "Invert". From what I can tell, "Toggle" just disables all of the filters if they are currently enabled or enables them if currently disabled. I would think, based on de Morgan's Theorems from Logic, that "Invert" would switch "P & Q" to "~P || ~Q" (that's NOT P OR NOT Q). But it appears that "Invert" instead yields "~P & ~Q" (that's NOT P AND NOT Q). I guess I'm complaining mildly that your "Invert" feature on the filter bar doesn't obey de Morgan's formulae. This is a fine point, but if your filter bar's invert feature followed de Morgan's laws, I'd be able to do the other thing I want, which is to find immediately the complement of the documents that use clear language--i.e., the rest of the documents in the index that use weasel words. I want those, of course, to go back to the support people and tell them to get the folks dispensing the weasel words to cease and desist;-)
- This brings up my final question: Suppose I go through a painstaking filtering process--as described earlier in this topic--and I save the resulting search as a Kibana "Saved Search" object. Is there an easy way to get the complement of this search with respect to the parent index? That is, let the parent index be denoted U and the set of documents in my saved search be A--I want what's in U but not in A. I suppose I could somehow pull all of the _id values from my saved search into a list and then feed them back in a query of the form NOT _id: (id1, id2, ..., idn). Is there a quick/elegant way to do that?
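To spell out the logic behind the last two questions, here is a small sketch. The first part uses plain Python sets as a toy model of U, P and Q (the document ids are made up) to show that the complement of "P & Q" is "~P || ~Q" and not "~P & ~Q". The second part is a guess at how the complement might be expressed as a query: wrapping the saved search's query in a bool must_not clause. The complement() helper and the saved_query stand-in are hypothetical, and I don't know whether Kibana exposes anything like this directly.

```python
# Toy model: U is the whole index, P and Q are the documents passing each filter.
U = set(range(10))
P = {0, 1, 2, 3, 4, 5}
Q = {4, 5, 6, 7}

A = P & Q                      # what the filter bar shows for "P & Q"
true_complement = U - A        # what I actually want: U but not A

# de Morgan: NOT (P AND Q) == (NOT P) OR (NOT Q)
assert true_complement == (U - P) | (U - Q)
# Inverting each filter separately and AND-ing them gives a smaller set here.
assert (U - P) & (U - Q) != true_complement


def complement(query: dict) -> dict:
    """Wrap a query-DSL clause so it matches everything the clause does not."""
    return {"bool": {"must_not": [query]}}


# Hypothetical stand-in for the query behind my saved search.
saved_query = {
    "bool": {
        "must_not": [
            {"match_phrase": {"resolution": "weasel phrase 1"}},
            {"match_phrase": {"resolution": "weasel phrase 2"}},
        ]
    }
}
weasel_only = {"query": complement(saved_query)}
```

If something along these lines works, there would be no need to collect the _id values at all.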
Anyway, I'm grateful for any help/advice you can offer. Thanks in Advance!