How to assign index & search analyzers when ingesting attachments [ES 5.0]


(Susan Liu) #1

My autocomplete implementation worked in v2.3. But ever since the upgrade to 5.0, I can't figure out how to apply index-time and search-time analyzers to the attachment's content field for all ingested attachments, such as PDFs. Does anyone have an example or a pointer to get me started? Thank you so much in advance!

I have followed the example listed under "The Future" in https://www.elastic.co/blog/the-future-of-attachments-for-elasticsearch-and-dotnet, and that's what I have so far. See below:

********** For better visualization of the code, look at the last GitHub post that I created here: https://github.com/elastic/elasticsearch/issues/21486 **********

My NEST code:

  1. Creating the index:
    var response = CreateIndex(_currentIndex, i => i
        .InitializeUsing(indexState)
        .Mappings(m => m
            .Map(t => t
                .AutoMap()
                .AllField(all => all
                    .Enabled(false)
                )
                .Properties(p => p
                    .Object(a => a
                        .Name(n => n.Attachment)
                        .AutoMap()
                    )
                )
            )
        )
    );

  2. Creating the pipeline for the attachment:
    IPutPipelineResponse response = PutPipeline("attachments", p => p
        .Description("Document attachment pipeline")
        .Processors(pr => pr
            .Attachment(a => a
                .Field(f => f.Content)
                .TargetField(f => f.Attachment)
            )
            .Remove(r => r
                .Field(f => f.Content)
            )
        )
    );
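
For readers not using NEST, the call in step 2 should correspond roughly to the ingest pipeline definition below, which would be sent as the body of `PUT _ingest/pipeline/attachments`. This is only a sketch: the JSON field names `content` and `file` are assumptions taken from the mapping in this post, since the actual names depend on how the C# properties `Content` and `Attachment` are serialized.

```json
{
  "description": "Document attachment pipeline",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "file"
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}
```

The `attachment` processor extracts text and metadata from the base64-encoded `content` field into the target field, and the `remove` processor then drops the original base64 payload so it is not stored in the index.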

My mapping:
{
  "es5" : {
    "aliases" : { },
    "mappings" : {
      "topic" : {
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "content" : {
            "type" : "text",
            "store" : true,
            "analyzer" : "autocomplete",
            "search_analyzer" : "search"
          },
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "properties" : {
              "author" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content" : {    <-- I want to add analyzers for this
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "content_length" : {
                "type" : "long"
              },
              "content_type" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "date" : {
                "type" : "date"
              },
              "detect_language" : {
                "type" : "boolean"
              },
              "indexed_chars" : {
                "type" : "long"
              },
              "keywords" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "language" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "name" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              },
              "title" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          },
          "hash_id" : {
            "type" : "text"
          },
          "path" : {
            "type" : "text"
          },
          "title" : {
            "type" : "text"
          }
        }
      }
    },
    "settings" : { ... }
  }
}
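
To make the goal concrete, here is a sketch of the mapping fragment that the question is aiming for: the same `analyzer` / `search_analyzer` pair already used on the top-level `content` field, applied to `file.content`. It assumes that the `autocomplete` and `search` analyzers are defined in the index settings (elided as `...` above) and that the ingest pipeline writes the extracted text to `file.content`. How to produce this from NEST is the open question; the fragment only shows the intended result.

```json
{
  "mappings": {
    "topic": {
      "properties": {
        "file": {
          "properties": {
            "content": {
              "type": "text",
              "analyzer": "autocomplete",
              "search_analyzer": "search"
            }
          }
        }
      }
    }
  }
}
```

Because the analyzer of an existing field cannot be changed in place, a mapping like this would have to be supplied when the index is created, before the first attachment is ingested. Otherwise dynamic mapping produces the default `text` plus `keyword` sub-field shown in the current mapping.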


(evert) #2

Sorry Susan (@totessusan),

I don't know about the settings in the C# library... The problem/difference I ran into with JSON and PHP was that in previous versions the attachment was written along with the code, but in version 5+ it goes through ingest attachment.

Sorry I couldn't be of more help.


(Susan Liu) #3

Thanks @evert! Actually, my code already accounts for that change to use ingest attachment. Did you happen to need to implement an autocomplete search for the content of the attachment at all? If you did, were you able to specify the analyzers for the ingested attachment in PHP? What did your code look like? Maybe I can decipher your code and see what I can do in C#. :slight_smile:


(evert) #4

@totessusan I did not implement autocomplete... One of my main issues, regarding stopwords and highlights, is a problem that still exists in Solr, so the ES devs are still waiting for that fix before they implement it. Check this issue to see almost all of my code: https://github.com/elastic/elasticsearch/issues/22346

Hope it helps.

