Analyzer processing when indexing and searching

Hello, I am using the Elastic Python API and having difficulties processing a document through an analyzer. My goal is to process the content of an ingested document with an analyzer, but I cannot get any result even with a regular document.

This is the code I use:

import time

import elasticsearch
from elasticsearch.client import IndicesClient

es = elasticsearch.Elasticsearch(['localhost:9200'])
ic = IndicesClient(es)
ic.delete(index="_all")

index_body = {
    "settings":
    {
        "analysis":
        {
            "analyzer":
            {
                "my_analyzer":
                {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings":
    {   
        "properties": {
            "sentence": {
                "type": "text",
                "analyzer": "my_analyzer"
            }
        }

    }
}
ic.create(index="index1", body=index_body)

body = {"sentence": "HELLO there"}
res = es.index(index="index1", id=1, body=body)

time.sleep(2)
print("SLEEP END")

search_body = {
    "query" : {
        "term" : {
            "sentence": 'hello',
        }
    },
    "highlight": {
        "fields": {
            "sentence": {
                "fragment_size": 20, # The size of the highlighted fragment in characters. Defaults to 100.
                "number_of_fragments": 5
            }
        }
    },
    "size": 10
}

res2 = es.search(index='index1', body=search_body)
print("SEARCH RESULT::")
print(res2)

Console:

SLEEP END
SEARCH RESULT::
{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 1, 'relation': 'eq'}, 'max_score': 0.2876821, 'hits': [{'_index': 'index1', '_type': '_doc', '_id': '1', '_score': 0.2876821, '_source': {'sentence': 'HELLO there'}, 'highlight': {'sentence': ['HELLO there']}}]}}

The document I have at http://localhost:9200/index1/_doc/1 is

{"_index":"index1","_type":"_doc","_id":"1","_version":1,"_seq_no":0,"_primary_term":1,"found":true,"_source":{"sentence":"HELLO there"}}

so the field "sentence" is not processed by the analyzer.

I tried to follow the Elastic documentation, but couldn't find a Python example online that worked for me. Any help is appreciated. Also, to use an analyzer with Ingest, should I just change the "mappings" in index_body? By the way, I am using ES 7.1 / Python 3.

Hi,

I think there's a misunderstanding about how analyzers work. The script you show is correct: you store "HELLO there" with a lowercase filter, and when you search for 'hello' your document is found.
I think you expected your document to have "hello there" stored... so what you want is to transform your data according to the analyzer definition.

So to me, what you do is correct and works as it is supposed to. If you want to transform your data, look into ingest pipelines. It's also better to first experiment in the Kibana console to debug, and only move to Python once you know the Elasticsearch side works. That prevents confusion and will get you more help, since many people will see "Python API", assume the problem is related to Python, and skip your message.
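To picture why the stored document stays "HELLO there" while the search for 'hello' still matches, here is a rough pure-Python sketch of what the custom analyzer does at index time (this imitates the standard tokenizer + lowercase filter; the real implementation lives in Lucene and is Unicode-aware):

```python
import re

def my_analyzer(text):
    """Rough imitation of the custom analyzer from the thread:
    a standard-like tokenizer followed by a lowercase filter."""
    tokens = re.findall(r"\w+", text)    # "standard"-ish tokenization
    return [t.lower() for t in tokens]   # "lowercase" token filter

source = {"sentence": "HELLO there"}     # what gets stored in _source

# Indexing: only the inverted index sees the analyzed tokens;
# the stored _source document is kept verbatim.
inverted_index = {token: {"doc_1"} for token in my_analyzer(source["sentence"])}

print(inverted_index)   # {'hello': {'doc_1'}, 'there': {'doc_1'}}
print(source)           # {'sentence': 'HELLO there'}  -- unchanged
```

This is why the GET on `/index1/_doc/1` returns the original text: `_source` is never rewritten by analysis.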

What is your actual need? Do you really need to store your data in lowercase? You can also use a multi-field to be able to search on both the lowercase and original versions of your field.

Also, not related, but you can remove the time.sleep(2) and replace it with es.indices.refresh(index="index1").
Read about refresh here: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html

Hi @gabriel_tessier, thank you for your help.
My goal is to index files (pdf, docx, ...) and offer a "flexible" search that ignores accents, stop words, etc. I already use Ingest and have run successful searches (querying terms from attachment.content), but since the returned content was unchanged I thought the analyzer was ineffective.

As stated in the analysis doc, I can specify an "index time analyzer", but is its output only added to the inverted index? I thought the analyzer could process and convert the text inside the indexed document itself.

If I understand correctly, I don't need to transform the indexed data, right? I thought I had to, because I can also specify a search-time analyzer...

I looked at the Ingest doc and only read about processors transforming document fields, not the attachment content.

Thanks again,
Lucien.

All right, I got analysis working for ingest with the right mappings:

"mappings": {
    "properties": {
        "attachment": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}

As Gabriel wrote, the ingested document is not changed by analysis, but the processed tokens go into the inverted index, which is why they don't appear in the search results. I also changed the query type from "term" to "match", since "term" is not analyzed and only looks for the exact string:
"query": {"match": {"attachment.content": word}}
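The term-vs-match difference can be sketched in plain Python (hypothetical helper names; the real query execution happens inside Lucene against the inverted index):

```python
import re

def analyze(text):
    # same standard + lowercase sketch as the index-time analyzer
    return [t.lower() for t in re.findall(r"\w+", text)]

indexed_tokens = analyze("HELLO there")   # what the inverted index holds

def term_query(query):
    # "term": the query string is NOT analyzed; it is compared verbatim
    return query in indexed_tokens

def match_query(query):
    # "match": the query is analyzed first, then each token is looked up
    return any(tok in indexed_tokens for tok in analyze(query))

print(term_query("HELLO"))   # False -- 'HELLO' never matches the lowercased index
print(match_query("HELLO"))  # True  -- analyzed down to 'hello' first
```

So a term query only matches when the query string happens to equal an already-analyzed token, which is why lowercase 'hello' worked earlier while a capitalized query would not.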

Hi,

if your goal is to index pdf, docx, etc., did you check https://github.com/dadoonet/fscrawler ?

Also, did you check multi-fields (I can't find the related url in the doc :bowing_man: )? And instead of building your own analyzer, did you check the built-in language analyzers that already handle stopwords, accents, etc.? https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#analysis-lang-analyzer
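For example, a built-in language analyzer could be wired into the mapping like this (a sketch only; the index and field names are the ones used earlier in this thread, and "french" is just one of the built-in language analyzers):

```python
# Sketch: using the built-in "french" analyzer instead of a custom one.
# It handles French stop words and elision out of the box.
index_body = {
    "mappings": {
        "properties": {
            "attachment": {
                "properties": {
                    "content": {
                        "type": "text",
                        "analyzer": "french",   # built-in language analyzer
                    }
                }
            }
        }
    }
}

# ic.create(index="index1", body=index_body)   # as in the original script
```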


Hi again, I saw the FSCrawler project from dadoonet, but since I succeeded in indexing files I don't know if it would add value to my project...

By multifield do you mean object/nested ?

Indeed I will use a built-in analyzer, but I first wanted to test basic indexing/queries :slight_smile:

Hey @ludehon

In case you are not aware, you can also ask questions in French in Discussions en français.

What do you mean?

I don't think he meant that. I believe he meant that a given field can be analyzed in multiple ways. See fields | Elasticsearch Guide [8.11] | Elastic
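A minimal multi-field mapping sketch, reusing the "sentence" field and "my_analyzer" from earlier in the thread (the "raw" sub-field name is just an illustrative choice):

```python
# Sketch: one field analyzed in multiple ways via "fields".
# "sentence" is full-text searched through my_analyzer, while
# "sentence.raw" keeps the exact, unanalyzed value for term queries.
mapping = {
    "properties": {
        "sentence": {
            "type": "text",
            "analyzer": "my_analyzer",
            "fields": {
                "raw": {
                    "type": "keyword"   # exact value, not analyzed
                }
            }
        }
    }
}

# A term query on "sentence.raw" would then match the verbatim value
# "HELLO there", while a match query on "sentence" matches 'hello'.
```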


Hi @dadoonet,
I did not know about native-tongue questions; is it only intended for native speakers? Doesn't that reduce the discussions available to the ES community?

I meant that since I already use Ingest, I didn't fully understand the benefits of using FSCrawler over Ingest (besides being simpler?).

Not specifically native speakers, but people who can actually read and write the given language. Not a big deal of course. Was just saying :slight_smile:

There are pros and cons.
On the "pro" side, FSCrawler can run OCR, extracts from many more types of files, and sends much smaller content to the elasticsearch nodes (just the extracted text) instead of the whole binary BASE64 content, which can consume a lot of heap in elasticsearch.
On the "con" side, it's another piece of code to run, and it's not officially supported by Elastic as it's a community project...

My 2 cents

Thank you very much for the tip @dadoonet ! I will definitely keep this in mind for potential evolution. Have a good day :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.