Ingest-attachment using CBOR example


(Mark Baumgarten) #1

Hello, sorry for repeating this question, which IMO was left unanswered: "Example for ingest attachment plugin with CBOR format for indexing documents"

  • I too would really like to see an example.

We are struggling to index large documents, and we currently have no idea how to use "CBOR...without using json" as described in the docs here: https://www.elastic.co/guide/en/elasticsearch/plugins/5.2/ingest-attachment.html


(David Pilato) #2

@spinscale do you know?


(Alexander Reelsen) #3

Hey,

you can use the built-in xcontent builder to create CBOR data; see the following links for some help.

The trick is to use the CBOR builder to create your content and to put a byte array, instead of a base64-encoded string, into the field you want to process.

Hope this helps a bit.

--Alex


(Mark Baumgarten) #4

Thanks for your kind replies, but I'm still stuck with this issue.

This is how I fail with CBOR using Python (in case someone can help):

  1. I assume I need to have a pipeline set up, so I start with curl here:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment?pretty' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
'
  2. Here's my Python code, which reads an .odt file, cborcifies it, and then tries to index it:
import elasticsearch
import cbor2

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.delete(index="test-index", ignore=[400, 404])
filename = 'forf.odt'

with open(filename, 'rb') as f:
        doc = {
                'data': cbor2.loads(f.read())
        }
        res = es.index(index="test-index", doc_type='tweet', id=1,
                                  body=doc, pipeline='attachment')
  3. The error from the client side is:

python test_cbor_es.py
Traceback (most recent call last):
  File "test_cbor_es.py", line 13, in <module>
    res = es.index(index="test-index", doc_type='tweet', id=1, body=doc, pipeline='attachment')
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/__init__.py", line 300, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self.log_request_fail(method, full_url, url, body, duration, response.status, raw_data)
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/base.py", line 100, in log_request_fail
    body = body.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc6 in position 60: invalid continuation byte

  4. The ES server error:

[2017-03-24T10:29:48,466][ERROR][o.e.a.i.TransportIndexAction] [8_gffZR] failed to execute pipeline [attachment]
elasticsearch1 | org.elasticsearch.ElasticsearchParseException: Failed to parse content to map
elasticsearch1 | at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:107) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:78) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.action.index.IndexRequest.sourceAsMap(IndexRequest.java:410) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:164) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:41) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.ingest.PipelineExecutionService$1.doRun(PipelineExecutionService.java:65) [elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
elasticsearch1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
elasticsearch1 | at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
elasticsearch1 | Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x32
elasticsearch1 | at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@892135f; line: 1, column: 63]
elasticsearch1 | at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1702) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:558) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3550) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3557) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._decodeUtf8_2(UTF8StreamJsonParser.java:3327) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2517) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2469) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:315) ~[jackson-core-2.8.6.jar:2.8.6]
elasticsearch1 | at org.elasticsearch.common.xcontent.json.JsonXContentParser.text(JsonXContentParser.java:86) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:352) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:300) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:263) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.support.AbstractXContentParser.map(AbstractXContentParser.java:218) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:105) ~[elasticsearch-5.2.2.jar:5.2.2]
elasticsearch1 | ... 10 more
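
Two things stand out in the traces above. The server-side exception comes from Jackson's `UTF8StreamJsonParser`, so the request body was parsed as JSON rather than CBOR. Also, `cbor2.loads` decodes CBOR into Python objects; the encoding direction is `cbor2.dumps`. A minimal sketch of the alternative, assuming the whole document is CBOR-encoded and sent with an `application/cbor` content type, could look like this. The helper name `build_cbor_index_request` is hypothetical, the URL parts are taken from the code above, and the request is only built here, not sent.

```python
import urllib.request

# Assumed for illustration: host, index, type, id and pipeline name as used
# earlier in this post.
ES_URL = "http://localhost:9200/test-index/tweet/1?pipeline=attachment"


def build_cbor_index_request(cbor_body, url=ES_URL):
    """Build (but do not send) a PUT request whose body is CBOR, not JSON."""
    return urllib.request.Request(
        url,
        data=cbor_body,
        headers={"Content-Type": "application/cbor"},
        method="PUT",
    )


# Pre-encoded CBOR for the map {"data": b"\x00\xc6"}. With cbor2 installed,
# the body would instead come from cbor2.dumps({"data": raw_file_bytes}).
example_body = b"\xa1ddataB\x00\xc6"
request = build_cbor_index_request(example_body)
```

Sending would then be a call such as `urllib.request.urlopen(request)` against a running node. Whether the Python Elasticsearch client can pass a pre-encoded CBOR body through unchanged depends on the client version, so that part is left out of the sketch.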


(David Pilato) #5

Please format your code using the </> icon, as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```
