Let's say I have a very simple Elasticsearch index: 100,000 documents with a single field called "text", which was indexed using the default text analyzer.
A QA engineer wants to know, when they type something in, what should they see?
I am having a hard time answering this question without algorithmic jargon. Does anyone know of any good, simple explanations of how the text analyzer works and what users should expect to see when they type something in?
The problem is nobody really knows how Google Search works...
We're moving from a clear-cut SQL text index system where we can definitively say "we have XYZ in the database, if you type XYZ you will get XYZ": no noise words, no stemming, no synonyms.
These are all details that QA needs in order to do their job of determining whether things are working the way they should. Do you see my problem here?
You and/or your QA team will need to learn about full-text search in Elasticsearch, which, by the way, is based on relevance scoring. In my opinion it is not always easy to test, because the result is not Boolean; it is scored.
And the score can change because of many factors.
Unstructured / full-text search is different from a terms search, which is an exact match. Those are much easier to test; they are like the WHERE clause in your SQL database.
So there's no short answer here. Your team and you are going to need to learn a bit about text search.
Which gets back to what @warkolm said: learn about tokenizing and analyzers, assuming you want to be able to test.
Here are our docs
I have also always liked Baeldung;
here is an article from there
If you create the same set of documents, indexed and analyzed the same way, and then use the same queries every time, you should be able to make the results basically repeatable.
But if you use an ever-growing QA database, that is not guaranteed, because the scoring takes into account things like the number of documents in the index.
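To make the "number of docs" point concrete for QA, here is a small Python sketch of just the inverse document frequency (IDF) part of BM25, the default similarity in recent Elasticsearch versions. The function name `bm25_idf` and the document counts are made up for illustration, and the real score also depends on term frequency and field length, so treat this as a rough illustration rather than the full scoring formula.

```python
import math

def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """IDF part of BM25 as computed by Lucene: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

# Hypothetical numbers: the same term matches the same 10 documents,
# but the index around them grows from 100 to 100,000 documents.
small = bm25_idf(total_docs=100, docs_with_term=10)
large = bm25_idf(total_docs=100_000, docs_with_term=10)

print(f"idf with 100 docs:     {small:.3f}")
print(f"idf with 100,000 docs: {large:.3f}")
```

The term gets a higher weight in the larger index even though the matching documents did not change, which is why a fixed, known document set is the easier thing to test against.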
Perhaps someone else will have a perspective for you.
Yes, I've basically explained everything you've said here to the QA team, but it comes off as "hand-wavy": to someone who is not a developer, you are effectively saying "nobody knows", which is similar to the Google situation.
What QA really needs to know is given a set of documents, how can they be sure they are seeing the expected results for a given query, and if adding documents changes scoring, how can they craft documents in such a way that this is measurable.
When you are searching for a term in a book, you are not going to open every single page and try to find the term on the page. Instead, you are going to open the index and navigate through it to find the term you are searching for. Luckily enough, the index is sorted, which helps you find the term quickly. Once you have the term, you can get the page number and open the page.
Which means that the important part is how you are building the index... This is where the text analysis is playing an important role...
If your text source is L'éléphant but the user is searching for elephants, you want to generate the same tokens at index time and search time.
So L'éléphant would be indexed as elephant.
And elephants would be analyzed as elephant.
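As a rough sketch of that idea in Python: the `toy_analyze` function below is hypothetical and is not what Elasticsearch runs. As far as I know, the default standard analyzer mostly tokenizes and lowercases, so elision removal, accent folding and stemming like this would need a language analyzer or a custom one. The point is only that the same function is applied to the document text and to the query text.

```python
import re
import unicodedata
from typing import List

def toy_analyze(text: str) -> List[str]:
    """Hypothetical analysis chain: lowercase, drop French elisions,
    fold accents, then strip a plural "s". Not Elasticsearch's real analyzer."""
    text = text.lower()
    # Drop elided articles/pronouns such as l' or d' at the start of a word.
    text = re.sub(r"\b[ldjmnst]['’]", "", text)
    # Fold accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", folded)
    # Very crude "stemming": strip a trailing "s" from longer words.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

print(toy_analyze("L'éléphant"))  # index time
print(toy_analyze("elephants"))   # search time
```

Both calls produce the single token `elephant`, so the query term matches the indexed term.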
It'll take the document, pull apart all the words (this is the basis of analysis), and then store them in an inverted index.
An inverted index is like a book index, but rather than organised by page or alphabetically, it is based on the connection between words (or terms) to documents.
Then, when you search, it takes the words you search on, pulls them apart using the same process as was used at index time, checks the inverted index for matching documents, and returns the results.
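The index-then-search flow described above can be sketched in a few lines of Python. Everything here (`analyze`, `build_index`, `search`, the sample documents) is made up for illustration; the analyzer is a simplified stand-in that only lowercases and splits on non-alphanumeric characters, and there is no scoring at all.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

def analyze(text: str) -> List[str]:
    """Simplified stand-in for analysis: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents containing at least one query term (OR semantics)."""
    matches: Set[int] = set()
    for term in analyze(query):  # same analysis as at index time
        matches |= index.get(term, set())
    return matches

docs = {
    1: "The quick brown fox",
    2: "Quick thinking saves the day",
    3: "A lazy dog",
}
index = build_index(docs)
print(sorted(search(index, "QUICK dog")))  # prints [1, 2, 3]
```

A query for "QUICK dog" finds documents 1 and 2 through the term `quick` and document 3 through `dog`, because the query went through the same lowercasing as the documents did.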
Thanks, that answers some of the questions around synonyms. If we're using all the default settings in Elasticsearch, is there a way to see a list of every synonym/transformation that occurs?
I'd like to give such a list to QA so that, for every search they attempt, they can cross-reference what they should be seeing.