Exception in Ingest using NEST.Attachment after upgrade to 8.2.2

Hi!

I have upgraded Elasticsearch from 7.17.3 to 8.2.2, but we are still using Elasticsearch.Net and NEST 7.17.2. We use an ingest pipeline for file search, and our model has a field of type NEST.Attachment.

Then we have a pipeline:

_ = _indexService.Client.Ingest.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pr => pr
        .Attachment<ContentDocumentModel>(a => a
            .Field(f => f.FileContent)
            .TargetField(f => f.File))
        .Remove<ContentDocumentModel>(r => r
            .Field(f => f
                .Field(ff => ff.FileContent)))));

The document looks just fine in Elastic:

          "file" : {
            "date" : "2019-01-07T15:52:47Z",
            "content_type" : "application/pdf",
            "format" : "application/pdf; version=1.4",
            "modified" : "2019-03-18T10:01:54Z",
            "language" : "en",
            "metadata_date" : "2019-03-18T11:01:54Z",
            "creator_tool" : "Adobe InDesign CC 13.1 (Windows)",
            "content" : """1

Voting at Swedish foreign missions
European Parliament elections...

But as soon as I get a hit on a document, the following exception is thrown:

expected:',', actual:'"application/pdf; version=1.4"', at offset:548

My guess is that the parsing of the file object fails.
If I remove the file object from the index, search works fine, but then I cannot search the documents.
It seems like a bug in Elasticsearch.Net or a compatibility problem. Any ideas?

/Kristoffer

Hi @mbooh,

The first thing to check is whether you have enabled REST API compatibility.
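
For reference, a minimal sketch of how compatibility mode is typically switched on in a 7.17.x client talking to an 8.x server. The node address is an assumption, not taken from this thread; EnableApiVersioningHeader() is the connection setting that makes the client send the "compatible-with=7" media type.

```csharp
using System;
using Nest;

// Assumed node address; replace with your own cluster URI.
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    // Ask the 8.x server to respond in the 7.x-compatible format.
    .EnableApiVersioningHeader();

var client = new ElasticClient(settings);
```

If the header is not enabled, an 8.x server may reject or answer requests in a form the 7.x client does not expect, so it is worth ruling this out first.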

The other thing to check is that the C# class model matches the returned JSON. Generally it should, if the same class was used during indexing.
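
One way to compare the two is to fetch the raw JSON without deserializing it into the model. This is only a sketch: the index name "documents" and the node address are hypothetical, and it needs a running cluster to return anything.

```csharp
using System;
using Elasticsearch.Net;
using Nest;

// Assumed node address and compatibility setting, as for an 8.x server.
var client = new ElasticClient(
    new ConnectionSettings(new Uri("http://localhost:9200"))
        .EnableApiVersioningHeader());

// "documents" is a hypothetical index name; use the index that holds the attachments.
var response = client.LowLevel.Search<StringResponse>(
    "documents",
    PostData.Serializable(new { size = 1, query = new { match_all = new { } } }));

// Print the unparsed response body so it can be compared with the C# model.
Console.WriteLine(response.Body);
```

Because the body is returned as a string, this call does not go through the typed deserialization that throws the exception.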

Otherwise this might be a bug in the client that we'd need to investigate. Ideally, could you provide a reproduction on the elastic/elasticsearch-net GitHub repository so that we can look into it?

Thanks Steve!

So the Attachment class (the "file" field is of type Attachment) looks like this:

public class Attachment
    {
        public Attachment();

        //
        // Summary:
        //     The author
        [System.Runtime.Serialization.DataMemberAttribute(Name = "author")]
        public string Author { get; set; }
        //
        // Summary:
        //     Whether the attachment contains explicit metadata in addition to the content.
        //     Used at indexing time to determine the serialized form of the attachment.
        [Ignore]
        [System.Runtime.Serialization.IgnoreDataMemberAttribute]
        public bool ContainsMetadata { get; }
        //
        // Summary:
        //     The base64 encoded content. Can be explicitly set
        [System.Runtime.Serialization.DataMemberAttribute(Name = "content")]
        public string Content { get; set; }
        //
        // Summary:
        //     The length of the content before text extraction.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "content_length")]
        public long? ContentLength { get; set; }
        //
        // Summary:
        //     The content type of the attachment. Can be explicitly set
        [System.Runtime.Serialization.DataMemberAttribute(Name = "content_type")]
        public string ContentType { get; set; }
        //
        // Summary:
        //     The date of the attachment.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "date")]
        public DateTime? Date { get; set; }
        //
        // Summary:
        //     Detect the language of the attachment. Language detection is disabled by default.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "detect_language")]
        public bool? DetectLanguage { get; set; }
        //
        // Summary:
        //     Determines how many characters are extracted when indexing the content. By default,
        //     100000 characters are extracted when indexing the content. -1 can be set to extract
        //     all text, but note that all the text needs to be allowed to be represented in
        //     memory
        [System.Runtime.Serialization.DataMemberAttribute(Name = "indexed_chars")]
        public long? IndexedCharacters { get; set; }
        //
        // Summary:
        //     The keywords in the attachment.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "keywords")]
        public string Keywords { get; set; }
        //
        // Summary:
        //     The language of the attachment. Can be explicitly set.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "language")]
        public string Language { get; set; }
        //
        // Summary:
        //     The name of the attachment. Can be explicitly set
        [System.Runtime.Serialization.DataMemberAttribute(Name = "name")]
        public string Name { get; set; }
        //
        // Summary:
        //     The title of the attachment.
        [System.Runtime.Serialization.DataMemberAttribute(Name = "title")]
        public string Title { get; set; }
    }

It seems some fields are missing here compared to the "file" object in the index above (for example "format", "modified", "metadata_date" and "creator_tool").
But that is strange, since it is the same object that is indexed on the model.

/Kristoffer

OK, so it looks like NEST.Attachment was the problem. I created my own class, CustomAttachment, that matches the JSON from Elasticsearch, and then it works just fine.

public class CustomAttachment
    {
        [JsonProperty("date")]
        public DateTime? Date { get; set; }

        [JsonProperty("content_type")]
        public string ContentType { get; set; }

        [JsonProperty("format")]
        public string Format { get; set; }

        [JsonProperty("modified")]
        public DateTime? Modified { get; set; }

        [JsonProperty("language")]
        public string Language { get; set; }

        [JsonProperty("metadata_date")]
        public DateTime? MetadataDate { get; set; }

        [JsonProperty("creator_tool")]
        public string CreatorTool { get; set; }

        [JsonProperty("content")]
        public string Content { get; set; }

        [JsonProperty("content_length")]
        public long? ContentLength { get; set; }
    }

What could cause this? NEST.Attachment is just another object, so I would expect it to work just as well.

Thanks for testing that, @mbooh. I've also been able to reproduce this. I suspect there's a bug either in the custom formatter for that type or in the v8 server response when in compatibility mode. I need to dig a bit deeper to confirm which. I've created an issue on our repository to track it.

Thank you @stevejgordon for the fast reply and the hint about the problem.
