Integration test for ingest-attachment plugin

Hello,

I'm trying to upgrade to ES 5 (5.1.2, latest as of today) and to replace the mapper-attachments plug-in (which has been deprecated) with the ingest-attachment plug-in.

For starters I would assume that I have to register the plug-in within my integration test (as was the case for the mapper-attachments plug-in), but I can't find ingest-attachment in the Maven repository.
Any hints on how to go forward?

Then I'd go and create a pipeline with PutPipelineRequest.

The next step would be to create an IndexRequest and set the pipeline ID on it before adding the request to my bulk processor (or executing it directly).
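
Roughly what I have in mind, as a sketch (I'm assuming an existing TransportClient client and a BulkProcessor bulkProcessor; the pipeline id "attachment", the index/type names and the base64EncodedFile variable are just placeholders):

    // Sketch: register an ingest pipeline and push a document through it.
    // `client`, `bulkProcessor` and `base64EncodedFile` are assumed to exist already.
    BytesReference pipelineSource = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")
            .startObject()
            .startObject("attachment")
            .field("field", "data")                 // field holding the base64-encoded file
            .endObject()
            .endObject()
            .endArray()
            .endObject()
            .bytes();

    client.admin().cluster().preparePutPipeline("attachment", pipelineSource).get();

    IndexRequest request = new IndexRequest("myindex", "files")
            .setPipeline("attachment")              // run the document through the pipeline
            .source(jsonBuilder()
                    .startObject()
                    .field("data", base64EncodedFile)
                    .endObject());

    bulkProcessor.add(request);                     // or execute it directly with client.index(request)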

Is this the correct way to go or am I missing something?

Finally, one more issue/question. I used a mapping that contained a copy_to of the extracted content to other fields in order to process the content with different analyzers (but run the Tika extraction only once). This isn't possible anymore; according to the error log the reason is that it is a multi field. How can I copy the content to my other fields without having to run the ingest-attachment pipeline for each field (which would be very time consuming)?

Thanks in advance.

You need to create a real integration test server.

I wrote something about this here: http://david.pilato.fr/blog/2016/10/18/elasticsearch-real-integration-tests-updated-for-ga/

A bit complicated but at least really realistic.

Maybe download the artifact with http://www.mojohaus.org/wagon-maven-plugin/, like:

<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>wagon-maven-plugin</artifactId>
    <version>1.0</version>
    <executions>
        <execution>
            <id>download-ingest-attachment</id>
            <phase>pre-integration-test</phase>
            <goals>
                <goal>download-single</goal>
            </goals>
            <configuration>
                <url>https://artifacts.elastic.co</url>
                <fromFile>downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-5.1.2.zip</fromFile>
                <toDir>${project.build.directory}/es-plugins</toDir>
            </configuration>
        </execution>
    </executions>
</plugin>
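
If I recall correctly, you can then install the downloaded zip into the test node with the standard plugin script, e.g. bin/elasticsearch-plugin install file:///path/to/ingest-attachment-5.1.2.zip (adjust the path to wherever the wagon plugin put the file), before starting the node.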

I used a mapping that contained a copy_to of the extracted content to other fields in order to process the content with different analyzers

Why not use multifields with ingest in that case?

So define a mapping with several analyzers for the content field, something like:

{
  "content": {
    "type": "text",
    "fields": {
      "english": { "type": "text", "analyzer": "english" },
      "french":  { "type": "text", "analyzer": "french" }
    }
  }
}

Then use ingest to extract the content into this content field.

That should work.
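
From the Java client, creating the index with such a multifields mapping could look like this (just a sketch; "myindex" and "files" are placeholder names):

    // Sketch: "content" is indexed with the default analyzer, and its sub-fields
    // "content.english" and "content.french" with the english/french analyzers.
    client.admin().indices().prepareCreate("myindex")
            .addMapping("files", jsonBuilder()
                    .startObject()
                    .startObject("properties")
                    .startObject("content")
                    .field("type", "text")
                    .startObject("fields")
                    .startObject("english")
                    .field("type", "text").field("analyzer", "english")
                    .endObject()
                    .startObject("french")
                    .field("type", "text").field("analyzer", "french")
                    .endObject()
                    .endObject() // fields
                    .endObject() // content
                    .endObject() // properties
                    .endObject())
            .get();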

Thanks for the instructions on the real integration tests. I went for the non-automated way (started a node locally) for now.

Regarding the pipeline and the mapping, I still have some open issues.

The pipeline is defined like this:

    XContentBuilder cb = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")
            .startObject()
            .startObject("attachment")
            .field("field", "data")            // field holding the base64-encoded file
//          .field("target_field", "file")
            .field("indexed_chars", -1)        // -1 = no limit on the number of extracted characters
            .endObject()
            .endObject()
            .endArray()
            .endObject();

The mapping of the field I tried to write the extracted content to was defined like this:

        "file": { 
            "type": "text", "index": true
        },

In the pipeline definition I tried to set the field file as the target_field, but I always get the exception failed to parse [file]. When I remove the target_field definition, I can see that the resulting mapping is quite complex, so that might be the reason why I get the exception.

So how would I proceed with your suggested mapping with many analyzers?

In my definition I also have a date that I want to pass along (it is the date on which I receive the file, not metadata of the file itself):

        "startDate": {
            "type": "date", "index": true, "store": false, 
            "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
        },

How do I have to update the pipeline in order to forward the startDate to the designated mapping in the index?

Do you have a full stacktrace?

target_field should be supported. Can you try creating the pipeline manually using curl or the Kibana console and report here if it fails?

Just figured out that the problem isn't the target_field, it's my mapping definition. If I use an unmapped field, e.g. hugo, as the target_field, it works fine.

So my question is: how do I define the mapping for the results of ingest-attachment, especially with the different analyzers you mentioned above?
Currently I have:

{
"files": {
    "properties": {
        "startDate": {
            "type": "date", "index": true, "store": false, 
            "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
        },
        "file": { 
            "type": "text", "index": true
        }
  ...

and I want to write to the field file and have the extracted content analyzed with different analyzers.

Something like:

{
  "files": {
    "properties": {
      "file": {
        "type": "text", "index": true,
        "fields": {
          "french": {
            "type": "text", "analyzer": "french"
          },
          "english": {
            "type": "text", "analyzer": "english"
          }
        }
      }
    }
  }
}

Then you can search in file or file.french or file.english.
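
From the Java client, a search on one of the sub-fields would be something like (sketch; "myindex" and the query text are placeholders):

    // Sketch: query the french sub-field of "file"
    SearchResponse response = client.prepareSearch("myindex")
            .setQuery(QueryBuilders.matchQuery("file.french", "bonjour"))
            .get();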

Thanks for your hint.

I updated my pipeline: I used set processors to set the file and other fields, and remove processors to drop some fields I don't need anymore.

For the sake of completeness, here is my pipeline definition:

    XContentBuilder cb = jsonBuilder()
            .startObject()
            .field("description", "ingest attachment pipeline")
            .startArray("processors")

            // extract the file content into the attachment object
            .startObject()
            .startObject("attachment")
            .field("field", "data")
            .field("target_field", "attachment")
            .field("indexed_chars", -1)        // -1 = no limit on the number of extracted characters
            .endObject()
            .endObject()

            // copy the extracted content into the fields with different analyzers
            .startObject()
            .startObject("set")
            .field("field", "file")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            .startObject()
            .startObject("set")
            .field("field", "fileEn")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            .startObject()
            .startObject("set")
            .field("field", "fileLang1")
            .field("value", "{{ attachment.content }}")
            .endObject()
            .endObject()

            // remove the original data field and the duplicated attachment.content
            .startObject()
            .startObject("remove")
            .field("field", "data")
            .endObject()
            .endObject()

            .startObject()
            .startObject("remove")
            .field("field", "attachment.content")
            .endObject()
            .endObject()

            .endArray()
            .endObject();

My mapping definition is still (excerpt):

        "file": { 
            "type": "text", "index": true
        },
        "fileEn": { 
            "type": "text", "index": true, "analyzer": "alangen"
        },
        "fileLang1": { 
            "type": "text", "index": true, "analyzer": "alang1"
        },

I also tried your mapping, but I'm not quite sure whether the inner fields actually get anything.
Should I see those fields when I just do a simple GET of my document?

No. They are indexed but not part of the _source field.

OK, thanks.

Does this mean that the content of the file field gets copied to the inner fields automatically? Or could you point me to the documentation of that feature?

Yes. Somehow.

https://www.elastic.co/guide/en/elasticsearch/reference/5.1/multi-fields.html

Thanks for your time!
