*noob* is this the correct usage of bool?


(Basiclaser) #1

Hi guys! First post.

http://hastebin.com/ufasuxudel.vhdl

I’m trying to check for almost-matching phrases, followed by almost-matching words.
Is that a correct way of using bool?
Thanks.

My main confusion is surrounding how to extend the BOOL to include more and more conditions, to filter 'hits' using two range queries and also an ownership variable, so the full query would look something like this:

http://hastebin.com/opaguneduj.sm

I also tried my best to avoid deprecated methods like 'OR', but I'm not sure how to replace it in this example.

Thanks a lot for your help.
Chris


(Nik Everett) #2

Going over your paste from top to bottom:

//the root 'database' is the index

It's dangerous to use comparisons to relational databases for this stuff. The metaphor breaks down pretty quickly.

//the type of document eg. user, document, source,

Think of types as easy filters. They are useful, but they are not analogous to tables in relational databases, for example.

//find phrases (groups of words) which match the original query at least 80%

The search you have here doesn't search for phrases at all. It searches for terms and makes sure 80% of them are found in the document. I forget how minimum_should_match and fields interact, so you should test it just to be sure: do 80% of the terms have to be found in a single field, or do 80% of the terms have to be found anywhere in the document?
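
For contrast, here is a minimal sketch of what an actual phrase search could look like. The original paste isn't reproduced in this thread, so the `title` field and the query text are assumptions, and `slop` is only there to show how a phrase match can tolerate a little movement between words.

```json
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "quick brown fox",
        "slop": 1
      }
    }
  }
}
```

A plain match query with minimum_should_match set to "80%" only counts matching terms; it pays no attention to their order or position.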

"fuzziness": "2",

This is amazingly slow. You'll be better off using the term or phrase suggester to get suggestions and make a "did you mean". For short words fuzziness of 2 will find just about all your documents.
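
A "did you mean" request along those lines might look roughly like this. It is only a sketch: `did_you_mean` is an arbitrary label, and the `title` field and the misspelled text are assumptions.

```json
{
  "suggest": {
    "did_you_mean": {
      "text": "quick borwn fox",
      "term": {
        "field": "title"
      }
    }
  }
}
```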

If you want phrases to bring the score up but not be required, you can put them in the should clause of a bool query.

In a bool query:

  • If there are must clauses the document needs to match them all to be in the result set.
  • If there are should clauses the document needs to match minimum_should_match of them to be included in the result set.
  • minimum_should_match defaults to 1 if there are no must clauses and 0 if there are any must clauses.
  • If there are must_not clauses the document must match none of them to be included in the result set.
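
Applied to the original question (two range conditions plus an ownership check), those rules might combine as in the sketch below. The field names `year`, `pages`, `owner`, `title`, `abstract` and `status` and all the values are made up for illustration, since the actual paste is not shown in this thread. The term clauses also assume that `owner` and `status` are not analyzed.

```json
{
  "query": {
    "bool": {
      "must": [
        { "range": { "year": { "gte": 2000, "lte": 2015 } } },
        { "range": { "pages": { "gte": 10, "lte": 500 } } },
        { "term": { "owner": "chris" } }
      ],
      "should": [
        { "match": { "title": "quick brown fox" } },
        { "match": { "abstract": "quick brown fox" } }
      ],
      "minimum_should_match": 1,
      "must_not": [
        { "term": { "status": "deleted" } }
      ]
    }
  }
}
```

Here a document has to satisfy both ranges and the owner check, match at least one of the two text clauses, and not be marked deleted.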

The reason we prefer this to AND and OR is that this is how Lucene handles these things, so it's less abstracted from the actual queries being executed.
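
As a rough illustration, an OR of two conditions becomes a bool query with only should clauses; with no must clauses, at least one of them has to match. The fields `title` and `author` and the values are assumptions.

```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "elasticsearch" } },
        { "match": { "author": "everett" } }
      ]
    }
  }
}
```

An AND has the same shape with must in place of should.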

You might have a couple of must clauses with a bunch of should clauses. In this case the should clauses only influence the score.
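
One possible shape for that, assuming a hypothetical `title` field: the must clause decides which documents match, and the should clause only adds to the score of documents that also contain the words as a phrase.

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "quick brown fox" } }
      ],
      "should": [
        { "match_phrase": { "title": "quick brown fox" } }
      ]
    }
  }
}
```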

If you want to exclude a document only when it matches both of two queries, then you make something like:

"bool": {
  "must_not": [
    {
      "bool": {
        "must": [
          { "match": { "foo": "the quick brown fox" } },
          { "match": { "bar": "the lazy dog" } }
        ]
      }
    }
  ]
}

(Basiclaser) #3

Thanks a lot for your reply, it's helped me get my head around should/must for sure.
Could you please clarify for me how many different ways of querying Elasticsearch there are? I'm still trying to get my head around what is what.
So far I see that there are:

  • "query_string" lucene syntax
  • "basic" JSON query (is that queryDSL ?)
  • "compound" queryDSL query

Are there any other ways?
Also, are 'insert' and 'index' methods equivalent?

I really appreciate your help :grin:


(Nik Everett) #4

So there are really two ways:

  • Query string syntax on the URL. It's based on the Lucene query string syntax but it's extended a ton. It uses the same parser but Elasticsearch inserts itself in lots of places in the query parsing process.
  • The query DSL in the request body. The query DSL can contain a query_string query and in that you can get the same syntax.
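
For example, the second form can wrap query string syntax inside a request body like this. The field names and the query text are assumptions for illustration.

```json
{
  "query": {
    "query_string": {
      "query": "title:(quick AND fox) OR author:everett"
    }
  }
}
```

On the URL, the same string would go in the `q` parameter of a search request.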

I use the query DSL almost exclusively. The query_string syntax is brittle: any mistake causes the whole query to fail. In the query DSL there are obvious places to put user input, like match. Those are less brittle, and you can put your generated queries around them.

You can give query_string to a power user, but it is almost impossible to limit it to safe operations, so you'd better trust that the user won't send super heavy queries. It's a pretty convenient syntax in some cases. OTOH everything query_string can do can be done with the query DSL. The only fiddly thing is cross_fields multi_match queries. They work similarly to a behavior in query string but not 100% the same.
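
For reference, a cross_fields multi_match looks something like the following sketch. The `first_name` and `last_name` fields and the query text are assumptions; the idea is that the terms are matched as if the listed fields were one combined field.

```json
{
  "query": {
    "multi_match": {
      "query": "nik everett",
      "type": "cross_fields",
      "fields": ["first_name", "last_name"],
      "operator": "and"
    }
  }
}
```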

Index is the term Elasticsearch uses. "Insert" is one of those words from relational databases that has crept into the vocabulary because people are used to it, even though it's not a thing in Elasticsearch. I'm guilty of it too.


(Basiclaser) #5

This is all great, thanks. How would you recommend constructing a 'reasonable' or 'sensible' text-search query? I have to search 3 text fields, with matches in the 'title' field given a higher score.

Would that be a query > bool > must: title, should: author, abstract?
I suppose this would be a good use case for filters also, right? Or is 3 filters on 3 fields overkill?


(Nik Everett) #6

I'd have a look at multi_match. I think that is the way to go.
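
A minimal sketch for the three fields mentioned above, with title matches weighted more heavily. The `^3` boost and the query text are arbitrary assumptions that would need tuning.

```json
{
  "query": {
    "multi_match": {
      "query": "quick brown fox",
      "fields": ["title^3", "author", "abstract"]
    }
  }
}
```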

In general three filters/fields is fine. You'll notice it's more than one field, but it's not going to be more work than a single phrase or prefix query on one field.


(Basiclaser) #7

Thanks again.
I finally got a reasonable version working.
I'm discussing the issue of matching misspelt words to queries with my coworkers.
Is it possible to calculate fuzziness at indexing time in the same manner as a suggester?
If possible, would this help with query-time performance, and do you think it would bloat the memory usage too far to hell?


(Nik Everett) #8

I believe the completion suggester does exactly this. It trades up-front indexing time and memory usage for faster queries. A new version of the completion suggester that is more real time is coming super soon, though I don't know how soon.
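
To give an idea of the indexing-time side, the completion suggester needs a dedicated field in the mapping. The sketch below assumes a hypothetical `title_suggest` field and the typeless mapping layout of recent Elasticsearch versions; older versions nest `properties` under a type name.

```json
{
  "mappings": {
    "properties": {
      "title_suggest": {
        "type": "completion"
      }
    }
  }
}
```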

Fuzzy queries are pretty pricey, so I'd avoid doing them on every query. Something like using a suggester when you don't see enough results might be a useful alternative. Even the phrase and term suggesters are fast enough if you only ask for them when you think you need them. You can ask for them every time, but that is somewhat similar in cost to a fuzzy query.
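
A follow-up request for the low-results case could be as small as this sketch, which asks the phrase suggester for a single corrected phrase. The label `did_you_mean`, the `title` field and the text are assumptions; in practice the phrase suggester tends to work better against a field analyzed with shingles.

```json
{
  "suggest": {
    "did_you_mean": {
      "text": "quick borwn fox",
      "phrase": {
        "field": "title",
        "size": 1
      }
    }
  }
}
```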

