A simple answer to a QA question: how does Elasticsearch work?

Let's say I have a very simple Elasticsearch index: 100,000 documents with a single field called "text", which was indexed using the default text analyzer.

A QA engineer wants to know, when they type something in, what should they see?

I am having a hard time answering this question without algorithmic jargon. Does anyone know of any good, simple explanations for how the text analyzer works and what users should expect to see when typing anything in?

Basically, assuming you are doing simple text indexing and queries:

  1. It'll index the document and break it down into terms
  2. When you run the query, it uses the same analyser to break down that query and match it to the documents in the index
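To make those two steps concrete, here is a rough Python sketch. It is only a toy stand-in for the default analyzer: as far as I know, the default (standard) analyzer lowercases and splits text into words, with no stemming, no synonyms and no stop-word removal out of the box. The `analyze` function and the sample strings are made up for illustration, and the real tokenizer handles punctuation and Unicode more carefully than this regex does.

```python
import re

def analyze(text):
    """Toy stand-in for the default analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())

doc = "The Quick Brown Fox!"
query = "quick FOX"

# Step 1: at index time the document is broken down into terms.
doc_terms = set(analyze(doc))

# Step 2: at search time the query goes through the same process.
query_terms = analyze(query)

# A simple match query is OR by default: one shared term is enough to match.
matches = any(term in doc_terms for term in query_terms)
```

The point for QA is that case and punctuation in what they type do not matter, because the document and the query are both normalised the same way before being compared.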

That's super high level though, so let us know if you want to dive deeper.

Is there an "Explain it like I'm 5" answer?

Words like "terms" and "analyser" don't really mean much to non-developers.

It would be good to get more in-depth detail on this part too: "break down that query and match it to the documents" (but keeping it super simple).

How about: put text in and it works like Google Search :slight_smile:

Or maybe a little bit more?

Ingest / index / store a sentence as text. Elasticsearch breaks it up into words and then finds that sentence by whatever words you search for.

In this case, the sentence is the text and the words are the tokens.

That's about as basic as I can get.

Or tell them to try the search on Wikipedia, which is based on Elasticsearch :slight_smile:

The problem is nobody really knows how Google Search works...

We're moving from a clear-cut SQL text index system where we can definitively say "we have XYZ in the database, if you type XYZ you will get XYZ": no noise words, no stemming, no synonyms.

These are all details that QA needs in order to do their job of determining whether things are working the way they should. Do you see my problem here?

Yes...

You and/or your QA team will need to learn about full-text search in Elasticsearch, which, BTW, is based on relevance scoring. In my opinion it is not always easy to test, because it is not Boolean; it is scored.

And the score can change because of many factors.

Unstructured / full-text search is different from a terms search, which is an exact match. Those are much easier to test; they are like the WHERE clause in your SQL database.
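A rough way to picture the difference in Python. This is a toy model, not how Elasticsearch is implemented; `term_match`, `text_match` and `analyze` are names invented for this sketch, and it assumes the exact-match case is run against a value that was stored as-is, without analysis.

```python
import re

def analyze(text):
    """Toy analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())

def term_match(stored_value, query):
    """Exact match, like WHERE col = 'value' in SQL: nothing is transformed."""
    return stored_value == query

def text_match(stored_text, query):
    """Full-text style match: both sides are analyzed, any shared word matches."""
    return bool(set(analyze(stored_text)) & set(analyze(query)))

stored = "Quick Brown Fox"

exact_same = term_match(stored, "Quick Brown Fox")  # identical string
exact_partial = term_match(stored, "quick")         # different string
full_text = text_match(stored, "quick")             # shares the word "quick"
```

The exact-match case is a yes/no answer that is easy to assert on. The full-text case also comes with a score that decides the ordering of results, which is the part that is harder to test.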

So there's no short answer here. Your team and you are going to need to learn a bit about text search.

Which gets back to what @warkolm said: you need to learn about tokenizing and analyzers, assuming you want to be able to test.

Here are our docs

And I always liked Baeldung
here is an article by him

If you create the same set of documents, index and analyze them the same way, and then use the same queries every time, you should be able to make it basically repeatable.

But if you use an ever-growing QA database, that's not necessarily guaranteed, because the scoring includes the number of docs, etc.
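To show why the number of docs matters, here is a small sketch of the IDF ("how rare is this word?") part of BM25, which as far as I know is the default similarity. The function name and the numbers are made up for illustration, and the real score has other parts too (term frequency, field length), so treat this as a simplified picture only.

```python
import math

def bm25_idf(total_docs, docs_with_term):
    """IDF part of BM25: the rarer a term is across the index, the more it is worth."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

# The same word appears in 5 documents in both cases; only the index size differs.
idf_small_index = bm25_idf(total_docs=100, docs_with_term=5)
idf_large_index = bm25_idf(total_docs=100_000, docs_with_term=5)

print(f"100 docs:     idf = {idf_small_index:.3f}")
print(f"100,000 docs: idf = {idf_large_index:.3f}")
```

The word becomes "rarer" as the index grows, so its contribution to the score goes up even though the matching documents did not change. That is why a fixed, known set of test documents gives repeatable scores and a growing one does not.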

Perhaps someone else will have a perspective for you.

Yes, I've basically explained everything you've said here to the QA team, but it comes off as "hand-wavy": to a non-developer, you are effectively saying "nobody knows", which is similar to the Google situation.

What QA really needs to know is: given a set of documents, how can they be sure they are seeing the expected results for a given query? And if adding documents changes scoring, how can they craft documents in such a way that this is measurable?

Here is what I'm basically saying when introducing what a search engine is (when you come from the SQL world).

Think of a search engine as an index.

What is an index?

An index is what you can see at the end of a book.

When you are searching for a term, you are not going to open every single page and try to find the term in the page. Instead, you are going to open the index and navigate through it to find the term you are searching for. Luckily enough, the index is sorted, which helps you to quickly find the term. Once you have the term, you can get the page number and open the page.

Which means that the important part is how you build the index. This is where text analysis plays an important role.

If your text source is L'éléphant but the user is searching for elephants, you want to generate the same tokens at index time and search time.

So L'éléphant would be indexed as elephant.
And elephants would be analyzed as elephant.

Both terms would match... :slight_smile:
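Here is a toy Python sketch of that example. Note the assumptions: this is not what the default analyzer does. As far as I know, you would need a custom analyzer with token filters such as `elision`, `asciifolding` and a `stemmer` to get this behaviour in Elasticsearch. The `toy_analyze` function and its very naive plural rule are invented for illustration.

```python
import unicodedata

def toy_analyze(word):
    """Toy analysis chain: lowercase, drop French elision, fold accents, strip plural 's'."""
    token = word.lower()

    # Drop a leading elided article such as l' or d' (l'éléphant -> éléphant).
    for prefix in ("l'", "d'"):
        if token.startswith(prefix):
            token = token[len(prefix):]
            break

    # Fold accents: decompose characters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Very naive stemming: strip a trailing plural "s".
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]

    return token

indexed_token = toy_analyze("L'éléphant")  # what would be stored at index time
search_token = toy_analyze("elephants")    # what the user's query becomes
```

Both sides end up as the same token, which is why the search finds the document even though the user never typed the exact stored text.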

HTH

  1. It'll take the document, pull apart all the words (this is the basis of analysis), and then store them in an inverted index.
  2. An inverted index is like a book index, but rather than being organised by page or alphabetically, it is based on the connection between words (or terms) and documents.
  3. Then, when you search, it takes the words you search on, pulls them apart using the same process as happened the first time, then checks the inverted index for matching documents and returns the results
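The three steps above can be sketched in a few lines of Python. This is a minimal toy, not Elasticsearch's actual implementation: the `analyze`, `build_index` and `search` names and the sample documents are made up, and there is no scoring at all here, only matching.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Inverted index: for each word, the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Analyze the query the same way, then collect every document with a matching word."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

docs = {
    1: "The quick brown fox",
    2: "A lazy brown dog",
    3: "Quick thinking",
}
index = build_index(docs)

fox_hits = search(index, "fox")          # only document 1 contains "fox"
brown_hits = search(index, "BROWN")      # documents 1 and 2, case does not matter
missing_hits = search(index, "elephant") # no document contains this word
```

For QA, the useful takeaway is that a document can only be found by words that ended up in this index, so knowing how the text was pulled apart tells you what is findable.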

Is that better?

Thanks, that answers some of the questions around synonyms. If we're using all the default settings in Elasticsearch, is there a way to see a list of every synonym/transformation that occurs?

I'd like to give such a list to QA so that, for every search they attempt, they are able to cross-reference what they should be seeing.

Yes, that makes sense. Others in the thread have mentioned scoring and synonyms, which affect the outcomes of text searches. Is there anything else?

Things like sharding, the type of query you are running, the mappings of the document, and what (human) language you are searching in.

The Analyze API page in the Elasticsearch Guide [8.4] covers some of it, but that's a very large question.
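For example, here is a sketch of how a request to the Analyze API could be built with only the Python standard library. The `_analyze` endpoint and the `analyzer` / `text` body fields come from the Analyze API; the `build_analyze_request` helper, the sample text, and the `http://localhost:9200` address with no authentication are assumptions for illustration, so adjust them to your cluster.

```python
import json
import urllib.request

def build_analyze_request(text, analyzer="standard", base_url="http://localhost:9200"):
    """Build (but do not send) a request asking Elasticsearch how it would analyze `text`."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_analyze_request("The QUICK brown foxes")

# To actually send it against a running cluster (assumed reachable, no auth):
# with urllib.request.urlopen(request) as response:
#     tokens = [t["token"] for t in json.load(response)["tokens"]]
#     print(tokens)
```

QA can paste any document text or search phrase into a call like this and see exactly which tokens it produces, which is about as close to a "list of every transformation" as you can get for a given input.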

There are no synonyms by default.
