Mapper Attachments Plugin with .NET client

richetdan · June 22, 2016, 9:10am

Hi all, I'm new to Elasticsearch and the Mapper Attachments plugin.
I am using both .NET clients together: Elasticsearch.Net and NEST.

I've created an index with a mapping using the following REST command:

POST /trkindex
{
    "mappings": {
        "trkdocument": {
            "properties": {
                "file": {
                    "type": "attachment",
                    "fields": {
                        "content": {
                            "type": "string",
                            "term_vector": "with_positions_offsets",
                            "store": true
                        },
                        "content_type": { "store": "yes" }
                    }
                }
            }
        }
    },
    "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 1 } }
}

I've indexed some documents and searched for them matching the content and the content_type (always using Elasticsearch.net + NEST).

Everything works as expected, except that in the .NET objects mapped to the ES type (TRKDocument), the fields of the file property (of type attachment) are null when they are set automatically by the plugin.
Here is the code snippet for the search:

    var a = new Nest.SearchRequest<TRKDocument>("trkindex")
    {
        Query = new Nest.MatchQuery
        {
            Query = "application",
            Field = "file.content_type"
        }
    };

    var result = client.Search<TRKDocument>(a);
    Debug.WriteLine(result.Documents.FirstOrDefault<TRKDocument>().File.ContentType);

The content type printed by the debug statement is null, even though the document correctly matches the query (the query filters on content type as expected).
If I set content_type explicitly at indexing time, then it is returned.
I don't understand this behavior.
How can I get the full object filled with all the properties which are set automatically?

Thanks in advance

-Daniele-

forloop · June 23, 2016, 2:12am

Hey @richetdan, the mapper-attachments plugin does not modify the source document sent to Elasticsearch; the extracted content and metadata are indexed into the inverted index (based on your attachment type mapping configuration), but the original source is untouched, which is why the extracted values don't appear in result.Documents (which maps to _source).

In order to get the extracted values, you can specify the fields that you are interested in, then obtain the values of these fields from the .Hits<T> collection on the result. For example,

var searchResponse = Client.Search<Document>(s => s
	.Fields(f => f
		// fields you're interested in
		.Field(d => d.Attachment.Name)
		.Field(d => d.Attachment.Author)
		.Field(d => d.Attachment.Content)
		.Field(d => d.Attachment.ContentLength)
		.Field(d => d.Attachment.ContentType)
		.Field(d => d.Attachment.Date)
		.Field(d => d.Attachment.Keywords)
		.Field(d => d.Attachment.Language)
		.Field(d => d.Attachment.Title)
	)
	.Query(q => q
		.MatchAll()
	)
);

and then

var documents = new List<Document>();

foreach (var hit in searchResponse.Hits)
{
	var document = new Document { Attachment = new Nest.Attachment() };
	document.Attachment.Name = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Name);
	document.Attachment.Author = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Author);
	document.Attachment.Content = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Content);
	document.Attachment.ContentLength = hit.Fields.ValueOf<Document, long?>(d => d.Attachment.ContentLength);
	document.Attachment.ContentType = hit.Fields.ValueOf<Document, string>(d => d.Attachment.ContentType);
	document.Attachment.Date = hit.Fields.ValueOf<Document, DateTime?>(d => d.Attachment.Date);
	document.Attachment.Keywords = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Keywords);
	document.Attachment.Language = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Language);
	document.Attachment.Title = hit.Fields.ValueOf<Document, string>(d => d.Attachment.Title);
	documents.Add(document);
}

In this example, I populate a collection of Document instances from values in the .Hits<T> collection, but you may do something different.
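For reference, here is a rough sketch of the equivalent request at the REST level, assuming Elasticsearch 2.x and the trkindex mapping from the original question (where file.content and file.content_type are stored). It uses the search body's fields parameter to ask for stored field values alongside the query; the values come back under each hit's fields key, as arrays, rather than in _source.

```
POST /trkindex/trkdocument/_search
{
    "fields": [ "file.content_type", "file.content" ],
    "query": {
        "match": { "file.content_type": "application" }
    }
}
```

Only fields mapped with "store" enabled can be returned this way, so any other extracted metadata you want back would need to be marked as stored in the mapping too.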

An Attachment type was added in NEST 2.3.3 to make working with the mapper-attachments plugin easier; it's not included in the documentation yet, but take a look at the tests for it to see how to use it.

richetdan · June 23, 2016, 5:25pm

Thank you very much forloop, your answer perfectly covers my question.
You have even anticipated my next questions.

I also tried to disable storing of the "_source" field and everything seems to work properly.
Is there any downside to this approach, apart from the fact that I will not be able to trigger a complete rebuild of the inverted index?

Does it make sense to use NEST instead of Elasticsearch.Net for searching documents?

-Daniele-

forloop · June 26, 2016, 7:30am

It is fairly common to not store the base64 encoded string of the document in the index to save space, but as you say, it does mean that you would not be able to rebuild the index from the current index source documents. You may also want to store the path to where the original document can be obtained, e.g. on the file system, an S3 bucket, Azure blob storage, etc.
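As a minimal sketch of one way to do this, assuming Elasticsearch 2.x: rather than disabling _source entirely, the mapping below excludes only the base64 file field from the stored _source and keeps a separate path field pointing at the original document. The path field name is hypothetical and not part of the original mapping; the rest mirrors the mapping from the question.

```
POST /trkindex
{
    "mappings": {
        "trkdocument": {
            "_source": { "excludes": [ "file" ] },
            "properties": {
                "path": { "type": "string", "index": "not_analyzed" },
                "file": {
                    "type": "attachment",
                    "fields": {
                        "content": { "type": "string", "store": true },
                        "content_type": { "store": "yes" }
                    }
                }
            }
        }
    }
}
```

With this kind of setup, the remaining properties still come back in _source, while the extracted attachment values are read from stored fields as before. Reindexing would still require fetching the original files from wherever path points.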

Completely up to you. The advantage of using NEST is that all requests and responses are strongly typed, making them easier to work with, and you still have access to the low level client via client.LowLevel whenever you want to drop lower.
