Analyzer processing when indexing and searching

ludehon · May 28, 2019, 9:03am

Hello, I am using the Elastic Python API and am having difficulty processing a document through an analyzer. My goal is to process the content of an Ingest document with an analyzer, but I cannot get any result even with a regular document.

This is the code I use:

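import time

import elasticsearch
from elasticsearch.client import IndicesClient

es = elasticsearch.Elasticsearch(['localhost:9200'])
ic = IndicesClient(es)
ic.delete(index="_all")  # deletes every index on the node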

index_body = {
    "settings":
    {
        "analysis":
        {
            "analyzer":
            {
                "my_analyzer":
                {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings":
    {   
        "properties": {
            "sentence": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }

    }
}
ic.create(index="index1", body=index_body)

body = {"sentence": "HELLO there"}
res = es.index(index="index1", id=1, body=body)

time.sleep(2)
print("SLEEP END")

search_body = {
    "query" : {
        "term" : {
            "sentence": 'hello',
        }
    },
    "highlight": {
        "fields": {
            "sentence": {
                "fragment_size": 20, # The size of the highlighted fragment in characters. Defaults to 100.
                "number_of_fragments": 5
            }
        }
    },
    "size": 10
}

res2 = es.search(index='index1', body=search_body)
print("SEARCH RESULT::")
print(res2)

Console:

SLEEP END
SEARCH RESULT::
{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 0.2876821, 'hits': [{'_index': 'index1', '_type': '_doc', '_id': '1', '_score': 0.2876821, '_source': {'sentence': 'HELLO there'}, 'highlight': {'sentence': ['HELLO there']}}]}}

The document I have at http://localhost:9200/index1/_doc/1 is

{"_index":"index1","_type":"_doc","_id":"1","_version":1,"_seq_no":0,"_primary_term":1,"found":true,"_source":{"sentence":"HELLO there"}}

so the field "sentence" is not processed by the analyzer.

I tried to code according to the Elastic documentation, and couldn't find a Python example online that worked for me. Any help is appreciated. To use an analyzer with Ingest, should I just change the "mappings" in index_body? By the way, I am using ES 7.1/Python3.
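
One way to check what an analyzer produces, independently of what `_source` shows, is the `_analyze` API. The following is a minimal sketch, assuming the elasticsearch-py 7.x client and the `index1` / `my_analyzer` names from the script above. The helper name `analyzed_tokens` is hypothetical.

```python
def analyzed_tokens(es, index, analyzer, text):
    """Return the tokens that `analyzer` (defined on `index`) emits for `text`.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x).
    """
    # The _analyze API runs the analysis chain without indexing anything.
    response = es.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": text},
    )
    # Each entry in "tokens" describes one term as the analyzer emits it.
    return [entry["token"] for entry in response["tokens"]]
```

With the index created above, calling `analyzed_tokens(es, "index1", "my_analyzer", "HELLO there")` should return the lowercased tokens `hello` and `there`, while the stored document stays "HELLO there".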

gabriel_tessier · May 29, 2019, 4:16am

Hi,

I think there is a misunderstanding about how the analyzer works. The script you show is correct: you store "HELLO there" in a field with a lowercase filter, and when you search for 'hello' your document is found.
I think you expect your document to be stored as "hello there", so what you want is to transform your data according to the analyzer definition.
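
The behaviour described here can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Elasticsearch is implemented: the `ToyIndex` class and the `analyze` function are hypothetical stand-ins for an index and for `my_analyzer`.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy stand-in for my_analyzer: split into words, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


class ToyIndex:
    def __init__(self):
        self.sources = {}                 # what a GET on the document returns
        self.postings = defaultdict(set)  # token -> ids of documents containing it

    def index(self, doc_id, text):
        self.sources[doc_id] = text       # the source is stored verbatim
        for token in analyze(text):       # only the inverted index sees the tokens
            self.postings[token].add(doc_id)

    def term(self, value):
        # A term query looks up the value as given, with no analysis.
        return sorted(self.postings.get(value, set()))

    def match(self, query):
        # A match query analyzes the query string first (OR between tokens).
        hits = set()
        for token in analyze(query):
            hits |= self.postings.get(token, set())
        return sorted(hits)


toy = ToyIndex()
toy.index(1, "HELLO there")
print(toy.sources[1])      # HELLO there
print(toy.term("hello"))   # [1]
print(toy.term("HELLO"))   # []
print(toy.match("HELLO"))  # [1]
```

The stored text never changes. Only the lookup structure holds the lowercased tokens, which is why the search for 'hello' finds a document whose source still reads "HELLO there".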

So for me, what you are doing is correct and works as it is supposed to. If you want to transform your data, look at ingest. It is also better to debug with the Kibana console first and move to Python once you know the Elasticsearch side works: this avoids confusion and gets you more help, since many people will see the Python API, assume the problem is related to Python, and skip your message.

What is your need? Do you really need to store your data in lowercase? You can also use multifield to search on both the lowercase and the original-case versions of your field.

Also, not related, but you can remove the time.sleep(2) and replace it with es.indices.refresh(index="index1").
Read about the refresh here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
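
A minimal sketch of indexing followed by an explicit refresh, assuming the elasticsearch-py 7.x client. The helper name `index_then_refresh` is hypothetical.

```python
def index_then_refresh(es, index, doc_id, document):
    """Index one document, then make it visible to search.

    `es` is assumed to be an elasticsearch.Elasticsearch client (7.x).
    """
    result = es.index(index=index, id=doc_id, body=document)
    # An explicit refresh replaces the fixed time.sleep(2): once this call
    # returns, the document indexed above is searchable.
    es.indices.refresh(index=index)
    return result
```

For a test script this keeps the search deterministic. Refreshing after every document is costly on a busy index, so it is probably best kept to small experiments like this one.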

ludehon · May 29, 2019, 7:49am

Hi @gabriel_tessier, thank you for your help.
My goal is to index files (pdf, docx, ...) and offer a "flexible" search that ignores accents, stop words, etc. I already use Ingest and have run successful searches (querying a term from attachment.content), but since the content returned was unchanged I thought the analyzer was ineffective.

As stated in the analysis docs, I can specify an "index time analyzer", but is its output only added to the inverted index? I thought the analyzer could process the text and write the converted text into the indexed document.

If I understand correctly, I don't need to transform the indexed data, right? I thought I should, because I can also specify a search-time analysis...
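
For reference, index-time and search-time analysis are both declared in the field mapping and neither touches the stored document. The sketch below builds such a mapping as a plain dict. The `analyzer` and `search_analyzer` mapping parameters are real, while the helper name `text_field` and the choice of `standard` at search time are only illustrative assumptions.

```python
def text_field(index_analyzer, search_analyzer=None):
    """Build a text field mapping with optional search-time analyzer."""
    # "analyzer" is applied when documents are indexed and, unless
    # "search_analyzer" is set, also to the query string of full-text queries.
    field = {"type": "text", "analyzer": index_analyzer}
    if search_analyzer is not None:
        field["search_analyzer"] = search_analyzer
    return field


# Same analyzer on both sides (the usual case):
same_both_sides = {"properties": {"sentence": text_field("my_analyzer")}}

# Different analyzer for query strings (illustrative only):
split_analysis = {"properties": {"sentence": text_field("my_analyzer", "standard")}}
```

In both variants the analyzers only shape the tokens used for matching.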

I looked at the Ingest docs and only read about processors that transform document fields, not the content.

Thanks again,
Lucien.

ludehon · May 29, 2019, 1:32pm

All right, I achieved analysis for ingest with the right mappings:
"mappings": {"properties": {"attachment": {"properties": {"content": {"type": "text", "analyzer": "my_analyzer"}}}}}

As Gabriel wrote, the ingested document is not changed by analysis, but I guess the processed tokens go into the inverted index. That's why they do not appear in the search results. Also, I changed the query type from "term" to "match", since "term" looks up the query string exactly as given, without analyzing it:
"query" : {"match" : {"attachment.content": word}}

gabriel_tessier · May 30, 2019, 12:53am

Hi,

If your goal is to index pdf, docx, etc., did you look at https://github.com/dadoonet/fscrawler ?

Also, did you look at multifield (I can't find the related URL in the docs)? And instead of making your own analyzer, did you look at the built-in language analyzers, which already handle stop words, accents, etc.? https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#analysis-lang-analyzer
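
As a point of comparison, here is a sketch of index settings that ignore case, accents, and stop words using only built-in token filters (`lowercase`, `asciifolding`, `stop`). The analyzer name `folding_analyzer` is hypothetical, and `stop` uses English stop words unless configured otherwise. A built-in language analyzer could be referenced by name in the mapping instead of the custom one.

```python
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical name; the tokenizer and filters are built in.
                "folding_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first, so stop-word matching sees lowercase tokens
                    "filter": ["lowercase", "asciifolding", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "attachment": {
                "properties": {
                    "content": {"type": "text", "analyzer": "folding_analyzer"}
                }
            }
        }
    },
}
```

Such a body would be passed to index creation in the same way as `index_body` in the first post.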

ludehon · June 3, 2019, 7:02am

Hi again, I saw the FSCrawler project from dadoonet, but since I have succeeded in indexing files I don't know if it will add value to my project...

By multifield, do you mean object/nested?

Indeed, I will use a built-in analyzer, but I first wanted to test basic indexing/queries.

dadoonet · June 3, 2019, 9:57am

Hey @ludehon

In case you are not aware, you can also ask questions in French in the "Discussions en français" category.

Regarding whether FSCrawler would add value to your project: what do you mean?

Regarding multifield: I don't think he meant object/nested. I believe he meant that a given field can be analyzed in multiple ways. See "fields" in the Elasticsearch Guide [8.11].
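
A sketch of what "analyzed in multiple ways" can look like in a mapping, using the `fields` mapping parameter. The sub-field names `raw` and `english` are hypothetical. `keyword` and the `english` analyzer are built in.

```python
mappings = {
    "properties": {
        "sentence": {
            "type": "text",
            "analyzer": "my_analyzer",  # main field: custom analyzer from the thread
            "fields": {
                # Same source value, indexed a second time without analysis.
                "raw": {"type": "keyword"},
                # And a third time with a built-in language analyzer.
                "english": {"type": "text", "analyzer": "english"},
            },
        }
    }
}

# Each sub-field is queried by its dotted path.
exact_query = {"query": {"term": {"sentence.raw": "HELLO there"}}}
analyzed_query = {"query": {"match": {"sentence": "hello"}}}
```

The document is sent once with a single `sentence` value, and the sub-fields only add extra entries to the index.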

ludehon · June 3, 2019, 12:15pm

Hi @dadoonet,
I did not know about native-language questions. Is it only intended for native speakers? Doesn't that reduce the discussions available to the ES community?

I meant that since I already use Ingest, I didn't fully understand the benefits of using FSCrawler over Ingest (besides being simpler?).

dadoonet · June 3, 2019, 2:55pm

Not specifically native speakers, but people who can actually read and write the given language. Not a big deal of course. Was just saying.

There are pros and cons.
On the "pro" side, FSCrawler allows running OCR, extracting from many more types of files, and sending much smaller content (the extracted text) to the elasticsearch nodes instead of the whole binary BASE64 content, which can consume a lot of HEAP in elasticsearch.
On the "con" side, it's another piece of code to run, and it's not officially supported by elastic as it's a community project...
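
To give a rough idea of the BASE64 point: with the ingest attachment approach the whole file is sent base64-encoded, which takes 4 characters for every 3 bytes. A small sketch follows, where the field name `data` and the helper `attachment_document` are assumptions for illustration.

```python
import base64


def attachment_document(raw_bytes, field="data"):
    """Build the document body an attachment pipeline would receive."""
    # The binary file travels as base64 text inside the JSON document.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {field: encoded}


raw = bytes(3000)  # stand-in for the bytes of a PDF or DOCX file
doc = attachment_document(raw)
print(len(raw), len(doc["data"]))  # 3000 4000
```

The node receives the full encoded file even if the text extracted from it is much smaller.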

My 2 cents

ludehon · June 3, 2019, 4:26pm

Thank you very much for the tip @dadoonet ! I will definitely keep this in mind for potential evolution. Have a good day

system · July 1, 2019, 4:26pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
