Integration test for ingest-attachment plugin

Hello,

I'm trying to upgrade to ES 5 (5.1.2, latest as of today) and to replace the mapper-attachments plug-in (which has been deprecated) with the ingest-attachment plug-in.

For starters I would assume that I have to register the plug-in within my integration test (as was the case for the mapper-attachments plug-in), but I can't find ingest-attachment in the Maven repository.
Any hints on how to go forward?

Then I'd go and create a pipeline with PutPipelineRequest.

The next step would be to create an IndexRequest and set the pipeline ID on it before adding the request to my bulk processor (or executing it directly).
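
Roughly what I have in mind, as a sketch (I'm assuming an existing TransportClient client and a BulkProcessor bulkProcessor; the pipeline id "attachment", the index/type names and the base64EncodedFile variable are just placeholders):

    // Sketch: register an ingest pipeline and push a document through it.
    // `client`, `bulkProcessor` and `base64EncodedFile` are assumed to exist already.
    BytesReference pipelineSource = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")
            .startObject()
            .startObject("attachment")
            .field("field", "data")                 // field holding the base64-encoded file
            .endObject()
            .endObject()
            .endArray()
            .endObject()
            .bytes();

    client.admin().cluster().preparePutPipeline("attachment", pipelineSource).get();

    IndexRequest request = new IndexRequest("myindex", "files")
            .setPipeline("attachment")              // run the document through the pipeline
            .source(jsonBuilder()
                    .startObject()
                    .field("data", base64EncodedFile)
                    .endObject());

    bulkProcessor.add(request);                     // or execute it directly with client.index(request)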

Is this the correct way to go or am I missing something?

Finally, one more issue/question. I used a mapping that contained a copy_to of the extracted content to other fields in order to process the content with different analyzers (but run the Tika extraction only once). This isn't possible anymore; according to the error log the reason is that it is a multi field. How can I copy the content to my other fields without having to run the ingest-attachment pipeline for each field (which would be very time consuming)?

Thanks in advance.

You need to create a real integration test server.

I wrote something about this here: http://david.pilato.fr/blog/2016/10/18/elasticsearch-real-integration-tests-updated-for-ga/

A bit complicated but at least really realistic.

Maybe download the artifact with http://www.mojohaus.org/wagon-maven-plugin/, like:

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>wagon-maven-plugin</artifactId>
    <version>1.0</version>
    <executions>
        <execution>
            <id>download-ingest-attachment</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>download-single</goal>
            </goals>
            <configuration>
                <url>https://artifacts.elastic.co</url>
                <fromFile>downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-5.1.2.zip</fromFile>
                <toDir>${project.build.directory}/es-plugins</toDir>
            </configuration>
        </execution>
    </executions>
</plugin>
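
If I recall correctly, you can then install the downloaded zip into the test node with the standard plugin script, e.g. bin/elasticsearch-plugin install file:///path/to/ingest-attachment-5.1.2.zip (adjust the path to wherever the wagon plugin put the file), before starting the node.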

I used a mapping that contained a copy_to of the extracted content to other fields in order to process the content with different analyzers

Why not use multifields with ingest in that case?

So define a mapping with several analyzers for the content field, something like:

{
  "content": {
    "type": "text",
    "fields": {
      "english": { "type": "text", "analyzer": "english" },
      "french":  { "type": "text", "analyzer": "french" }
    }
  }
}

Then use ingest to extract the content into this content field.

That should work.
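
From the Java client, creating the index with such a multifields mapping could look like this (just a sketch; "myindex" and "files" are placeholder names):

    // Sketch: "content" is indexed with the default analyzer, and its sub-fields
    // "content.english" and "content.french" with the english/french analyzers.
    client.admin().indices().prepareCreate("myindex")
            .addMapping("files", jsonBuilder()
                    .startObject()
                    .startObject("properties")
                    .startObject("content")
                    .field("type", "text")
                    .startObject("fields")
                    .startObject("english")
                    .field("type", "text").field("analyzer", "english")
                    .endObject()
                    .startObject("french")
                    .field("type", "text").field("analyzer", "french")
                    .endObject()
                    .endObject() // fields
                    .endObject() // content
                    .endObject() // properties
                    .endObject())
            .get();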

Thanks for the instructions on the real integration tests. I went for the non-automated way (started a node locally) for now.

Regarding the pipeline and the mapping, I still have some open issues.

The pipeline is defined like this:

    XContentBuilder cb = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")
            .startObject()
            .startObject("attachment")
            .field("field", "data")            // field holding the base64-encoded file
//          .field("target_field", "file")
            .field("indexed_chars", -1)        // -1 = no limit on the number of extracted characters
            .endObject()
            .endObject()
            .endArray()
            .endObject();

The mapping of the field I tried to write the extracted content to was defined like this:

        "file": { 
            "type": "text", "index": true
        },

In the pipeline definition I tried to set the field file as the target_field, but I always get the exception failed to parse [file]. When I remove the target_field definition, I can see that the resulting mapping is quite complex, so that might be the reason why I get the exception.

So how would I proceed with your suggested mapping with many analyzers?

In my definition I also have a date that I want to pass along (it is the date on which I receive the file, not metadata of the file itself):

        "startDate": {
            "type": "date", "index": true, "store": false, 
            "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
        },

How do I have to update the pipeline in order to forward the startDate to the designated mapping in the index?

Do you have a full stacktrace?

target_field should be supported. Can you try creating the pipeline manually using curl or the Kibana console and report here if it fails?

Just figured out that the problem isn't the target_field, it's my mapping definition. If I use an unmapped field, e.g. hugo, as the target_field, it works fine.

So my question is: how do I define the mapping for the results of ingest-attachment, especially with the different analyzers you mentioned above?
Currently I have:

{
"files": {
    "properties": {
        "startDate": {
            "type": "date", "index": true, "store": false, 
            "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
        },
        "file": { 
            "type": "text", "index": true
        }
  ...

and I want to write to the field file and have the extracted content analyzed with different analyzers.

Something like:

{
  "files": {
    "properties": {
      "file": {
        "type": "text", "index": true,
        "fields": {
          "french": {
            "type": "text", "analyzer": "french"
          },
          "english": {
            "type": "text", "analyzer": "english"
          }
        }
      }
    }
  }
}

Then you can search in file or file.french or file.english.
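
From the Java client, a search on one of the sub-fields would be something like (sketch; "myindex" and the query text are placeholders):

    // Sketch: query the french sub-field of "file"
    SearchResponse response = client.prepareSearch("myindex")
            .setQuery(QueryBuilders.matchQuery("file.french", "bonjour"))
            .get();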

Thanks for your hint.

I updated my pipeline: I used set processors to set the file and other fields, and remove processors to drop some fields I don't need anymore.

For the sake of completeness, here is my pipeline definition:

    XContentBuilder cb = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")

            // extract the file content into the attachment object
            .startObject()
            .startObject("attachment")
            .field("field", "data")
            .field("target_field", "attachment")
            .field("indexed_chars", -1)        // -1 = no limit on the number of extracted characters
            .endObject()
            .endObject()

            // copy the extracted content into the fields with different analyzers
            .startObject()
            .startObject("set")
            .field("field", "file")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            .startObject()
            .startObject("set")
            .field("field", "fileEn")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            .startObject()
            .startObject("set")
            .field("field", "fileLang1")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            // remove the original data field and the duplicated attachment.content
            .startObject()
            .startObject("remove")
            .field("field", "data")
            .endObject()
            .endObject()

            .startObject()
            .startObject("remove")
            .field("field", "attachment.content")
            .endObject()
            .endObject()

            .endArray()
            .endObject();

My mapping definition is still (excerpt):

        "file": { 
            "type": "text", "index": true
        },
        "fileEn": { 
            "type": "text", "index": true, "analyzer": "alangen"
        },
        "fileLang1": { 
            "type": "text", "index": true, "analyzer": "alang1"
        },

I also tried your mapping, but I'm not quite sure whether the inner fields actually get anything.
Should I see those fields when I just do a simple GET of my document?

No. They are indexed but not part of the _source field.

OK, thanks.

Does this mean that the content of the file field gets copied to the inner fields automatically? Or could you point me to the documentation of that feature?

Yes. Somehow.

https://www.elastic.co/guide/en/elasticsearch/reference/5.1/multi-fields.html

Thanks for your time!
