Hello There!
I am looking for a tutorial on how to ingest, parse, and analyze files such as .pdf, .jpeg/.jpg, and other types.
I have found this article, but I don't use Docker or Kubernetes.
Could you point me to other tutorials or manuals? I'd like to learn this as soon as possible.
I use Elasticsearch version 7.17.
You don't have to use Docker for that.
You can use the ingest attachment plugin.
There is an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.17] | Elastic
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
The data field is basically the Base64 representation of your binary file.
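As a rough sketch of how that Base64 value can be produced from a file on disk, the shell function below builds the request body. The name attachment_body is made up for illustration. It assumes a base64 utility that reads standard input (the GNU coreutils and BSD/macOS versions both do) and a file small enough to hold in a shell variable.

```shell
# Hypothetical helper: print a JSON body for the attachment pipeline.
# $1 is the path of the file to encode.
attachment_body() {
  # Some base64 implementations wrap their output; strip the newlines
  # so the value is a single valid JSON string.
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"data": "%s"}\n' "$encoded"
}

# Possible usage (not run here):
#   attachment_body test.txt | curl -X PUT \
#     "https://{{ip_address}}:{{port_number}}/my_index/_doc/my_id?pipeline=attachment" \
#     -H 'Content-Type: application/json' --data-binary @- -k -u elastic
```

Piping into curl with --data-binary @- keeps the encoded payload off the command line, which helps once files get larger than a few kilobytes.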
You can also use FSCrawler. There's a tutorial to help you get started.
@dadoonet Thank you very much. I will try it today!
It looks pretty good:
# base64 test.txt
dG8gamVzdCBqYWtpcyB0ZWtzdAo=
root@{{hostname}} ~
# curl -X PUT "https://{{ip_address}}:{{port_number}}/my-index-000002/_doc/my_id?pipeline=attachment&pretty" -H 'Content-Type: application/json' -d'
{
"data": "dG8gamVzdCBqYWtpcyB0ZWtzdAo="
}' -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000002",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
root@{{hostname}} ~
# curl -X GET "https://{{ip_address}}:{{port_number}}/my-index-000002/_doc/my_id?pretty" -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000002",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "data" : "dG8gamVzdCBqYWtpcyB0ZWtzdAo=",
    "attachment" : {
      "content_type" : "text/plain; charset=ISO-8859-1",
      "language" : "et",
      "content" : "to jest jakis tekst",
      "content_length" : 21
    }
  }
}
The response from Elasticsearch is:
"content" : "to jest jakis tekst",
and my test file was:
This is a nice solution for one file. As I understand it, if I want to do this for many files (thousands), I should use
Am I right?
Hello Again,
One thing I don't understand is why, after sending a new file to the index, the document is updated and only the last file remains. It works like a versioning mechanism. Should I really use FSCrawler for many files, and is it the only way?
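It's probably because you used my_id as the document _id. Indexing another document with the same _id replaces the existing one and increments _version, which is why only the last file remains.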
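To keep files from overwriting each other, each one needs its own _id. Here is a minimal sketch, assuming the file name is an acceptable _id and contains no characters that need JSON escaping (quotes, backslashes). bulk_body is a hypothetical helper name. It prints a body for the _bulk API, which accepts the same pipeline query parameter as the single-document request.

```shell
# Hypothetical helper: print a _bulk request body (NDJSON) with one
# index action per file, using the file's base name as the _id.
# Assumes file names need no JSON escaping.
bulk_body() {
  for f in "$@"; do
    id=$(basename "$f")
    # Action line: which document to index.
    printf '{"index":{"_id":"%s"}}\n' "$id"
    # Source line: Base64 content for the attachment processor.
    printf '{"data":"%s"}\n' "$(base64 < "$f" | tr -d '\n')"
  done
}

# Possible usage (not run here):
#   bulk_body /path/to/files/* | curl -X POST \
#     "https://{{ip_address}}:{{port_number}}/my-index-000004/_bulk?pipeline=attachment" \
#     -H 'Content-Type: application/x-ndjson' --data-binary @- -k -u elastic
```

For thousands of files it is probably safer to send several smaller batches than one huge request, since HTTP request size is limited.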
Hello Again!
Even if I use different IDs, the behavior still seems strange to me...
The command I used (and its result) is here:
curl -X PUT "https://{{ip_address}}:{{port_number}}/my-index-000004/_doc/0003?pipeline=attachment&pretty" -H 'Content-Type: application/json' -d'
{
"data": "bGFsYWxhbAo="
}' -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000004",
  "_type" : "_doc",
  "_id" : "0003",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 11,
  "_primary_term" : 1
}
But in Kibana I can see only this:
Shouldn't I have three documents there?
What is the output of this?
GET /my-index-000004/_search
This is the result:
curl -X GET "https://{{ip_address}}:{{port_number}}//my-index-000004/_search" -k -u elastic
Enter host password for user 'elastic':
{"took":2875,"timed_out":false,"_shards":{"total":52,"successful":52,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
Here the doc count is 0.
You previously shared a screen capture which shows 1.
Could you explain that?
Alternatively, could you share the output of this?
GET /_cat/index?v
Nope. I have done nothing to that index.
curl -X GET "https://{{ip_address}}:{{port_number}}//_cat/index?v" -k -u elastic
Enter host password for user 'elastic':
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [//_cat/index] contains unrecognized parameter: [v]"}],"type":"illegal_argument_exception","reason":"request [//_cat/index] contains unrecognized parameter: [v]"},"status":400}
Try with:
GET /_cat/indices?h
It looks pretty much the same:
curl -X GET "https://{{ip_address}}:{{port_number}}//_cat/index?h" -k -u elastic
Enter host password for user 'elastic':
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [//_cat/indices] contains unrecognized parameter: [h]"}],"type":"illegal_argument_exception","reason":"request [//_cat/indices] contains unrecognized parameter: [h]"},"status":400}
Am I missing something?
Maybe the //_cat should be /_cat?
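One way to avoid an accidental double slash when the base URL and the path are kept in separate variables is to normalize the join. This is a small sketch. join_url is a hypothetical helper, not part of curl or Elasticsearch, and localhost:9200 in the comment is only a placeholder.

```shell
# Hypothetical helper: join a base URL and a path with exactly one slash.
# $1 is the base URL, $2 is the path (with or without leading slashes).
join_url() {
  # Drop any trailing slashes from the base and leading ones from the path.
  base=$(printf '%s' "$1" | sed 's#/*$##')
  path=$(printf '%s' "$2" | sed 's#^/*##')
  printf '%s/%s\n' "$base" "$path"
}

# Example: join_url "https://localhost:9200/" "/_cat/indices?v"
# prints https://localhost:9200/_cat/indices?v
```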
OK... this is kind of strange...
curl -X GET "https://{{ip_address}}:{{port_number}}/_cat/indices?h" -k -u elastic
Enter host password for user 'elastic':
root@{{hostname}}
but with the v argument the result is:
curl -X GET "https://{{ip_address}}:{{port_number}}/_cat/indices?v" -k -u elastic
Enter host password for user 'elastic':
health status index                                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .siem-signals-default-000001              JVRVtBIxQDOMcWMsWLYacg   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.02                rNeTEPTZR3aauRK6TEBWBw   1   1     214804            0     48.5mb         24.2mb
green  open   .siem-signals-default-000002              8wRU4bOJRSqqqX9IMUJMsA   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.05                3lcwEvveQWmcpTYu_eF1Mg   1   1     231778            0     44.9mb         22.4mb
green  open   filebeat-7.17.6-2022.08.30                u3vvVdn_Q-SlEpHLKnykFQ   1   1      71701            0     37.5mb         18.7mb
green  open   .items-default-000001                     faFLGiiZRI-9FMBu16YI8w   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.06                9bgDFUSUSN2NdyXzxvdF5w   1   1      67719            0       18mb            9mb
green  open   .kibana_7.17.6_001                        S10OBEa6Rgmc0Z_LCFzbXA   1   1       9829          610     25.1mb         12.4mb
green  open   filebeat-7.17.6-2022.09.07                cr-L0reKS6KilNWfUKZzRA   1   1       6969            0      4.4mb          2.1mb
green  open   .kibana_7.17.5_001                        qidgdt6-QymHiGroYt4J7Q   1   1       4663         3189     29.5mb         14.7mb
green  open   .apm-custom-link                          NivBG4DkSFSwa1txBJtm6g   1   1          0            0       452b           226b
green  open   my-index-000003                           RU-z0UshSz-xG-yfdrb1lQ   1   1          1            0     16.7kb          8.3kb
green  open   my-index-000002                           am1zUIXoTkWggi_4IGmMqQ   1   1          1            0     12.8kb          6.4kb
green  open   my-index-000001                           YGdB3rw9T0iqJoV4mPT-YQ   1   1          1            0     13.3kb          6.6kb
green  open   filebeat-7.17.6-2022.09.13                ME3UExF6SouvZl_rK6mK4w   1   1      37580            0     10.5mb          5.2mb
green  open   filebeat-7.17.6-2022.09.14                2dnmNZFWR4yiNGjGQPj1TQ   1   1      16952            0      5.6mb          2.8mb
green  open   my-index-000004                           CyAC2uH6RR-axCseaHwnFQ   1   1          4            0    241.4kb        123.4kb
green  open   .fleet-enrollment-api-keys-7              k9yLqLSNSgeDVpRQq9DIBw   1   1          2            0     13.3kb          6.6kb
green  open   filebeat-7.17.6-2022.09.12                IEtwwPlQQbGLzp_9sKzkug   1   1      18446            0      7.1mb          3.5mb
green  open   .apm-agent-configuration                  WU-6ZnUKRKK44PMhZ53T2g   1   1          0            0       452b           226b
green  open   .tasks                                    KRi--4lIQrSL_xOhA95Cdg   1   1         58            0      146kb         78.9kb
green  open   .monitoring-kibana-7-2022.09.13           m1l6qx93SKSZIMULoBHglg   1   1      51838            0     20.3mb         10.1mb
green  open   .monitoring-kibana-7-2022.09.12           8-qGQU61RbC1PSW0-PBR1w   1   1      51838            0     20.1mb           10mb
green  open   .monitoring-kibana-7-2022.09.11           IVrE1HfyRFq5iXI7podXGw   1   1      51838            0     19.7mb         10.1mb
green  open   metricbeat-7.17.6-2022.08.29              aTwLimmrTN21nx157p9q9g   1   1        206            0      1.1mb        594.4kb
green  open   .monitoring-kibana-7-2022.09.10           PGT936NmR2qVfHmhG4HwRQ   1   1      51840            0       20mb          9.9mb
green  open   metricbeat-7.17.6-2022.09.05-000001       kVaziaI8SrS067ytC-Iydg   1   1          0            0       452b           226b
green  open   .monitoring-kibana-7-2022.09.14           M4FATkQVQbGeahwFZJWmMg   1   1         18            0      3.2mb          1.6mb
green  open   .fleet-policies-7                         DKlqxoiaTAiAUbVK54qsuA   1   1          8            0    105.9kb         52.9kb
green  open   .metrics-endpoint.metadata_united_default 191wdoN2RXCajVeqrnMGZg   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.08.30-000001         hR_GG_-zRYOUYaWqVqdUuQ   1   1     159912            0     38.6mb         19.1mb
green  open   .geoip_databases                          4aKuteb9Ro6evZ0jOff5Qw   1   1         41           36     82.6mb         41.3mb
green  open   .monitoring-es-7-2022.09.09               3Ky4uczzTR6NhEYAYPv3xg   1   1     537252       349184    563.2mb        279.9mb
green  open   .monitoring-es-7-2022.09.08               SbbS6M5uSXqL9Vvs7KNd4w   1   1     537373       348316    563.1mb        275.5mb
green  open   filebeat-7.17.5-2022.08.29                Zz_A72-tS2ur7tRdFbaTjQ   1   1      14019            0      8.2mb            4mb
green  open   .monitoring-kibana-7-2022.09.09           QI_j2H6ITAGpR_BHENtsUw   1   1      51838            0     20.1mb         10.2mb
green  open   .monitoring-kibana-7-2022.09.08           SO2cPVvGSJGf3t2hP5d3WQ   1   1      51836            0     20.3mb         10.1mb
green  open   .kibana_task_manager_7.17.4_001           bhHoWY7ITR2IxHEDq-oskQ   1   1         17         1024    480.7kb        240.3kb
green  open   .fleet-artifacts-7                        HEhy5deMQM6sDTeQXkrjkw   1   1         12            0     61.5kb         26.9kb
green  open   .transform-internal-007                   JtUniNzIRfeg9ngkzuHEHA   1   1          6            0     94.1kb           47kb
green  open   .lists-default-000001                     oG0Xlk8DQjeuxT-7kLhseg   1   1          0            0       452b           226b
green  open   .kibana_task_manager_7.17.5_001           ekDIzBssS1mUW6j45EsNmA   1   1         18          930    418.1kb          209kb
green  open   filebeat-7.17.6-2022.08.29                iaWFvgz8ThCgbemHH3PmIA   1   1      40457            0     21.6mb         10.8mb
green  open   .kibana_7.17.4_001                        TaSnmWSBTeq-9NH1mhZAPA   1   1         52            0      4.7mb          2.3mb
green  open   metrics-endpoint.metadata_current_default 5j3dk6VESROWHNe56O7OKg   1   1          0            0       452b           226b
green  open   .security-7                               IkXDgPqyTjCJ8_osVOFveA   1   1         66            0    388.9kb        194.4kb
green  open   filebeat-7.17.5-2022.08.30                nX1KekWNSW2TGomFNOzZMw   1   1      58855            0     29.8mb         14.9mb
green  open   .kibana_task_manager_7.17.6_001           z4jGWlxuQbqG77pbubkRxQ   1   1         19       646140       91mb           73mb
green  open   .monitoring-es-7-2022.09.13               Z1u3PMVWTvKAAVtzpkaP2Q   1   1     555121        32256    546.4mb          274mb
green  open   .monitoring-es-7-2022.09.12               -LRFhcZPQx6SiETZFF9hJQ   1   1     543551         8820    533.9mb        266.5mb
green  open   .async-search                             qK4QM58TR-q5CJAHKYvM-w   1   1          0           22     68.8kb         28.3kb
green  open   .monitoring-es-7-2022.09.14               nyjBofkPQ42cXqtnf7CXVA   1   1     132503       258938    185.6mb         93.2mb
green  open   .monitoring-es-7-2022.09.11               mzzRBCc0Q6SJDL_kLJUnig   1   1     537882       349928    575.5mb        286.8mb
green  open   .monitoring-es-7-2022.09.10               W8YfsGLKR_qzvTcpvRh16Q   1   1     541533       345340    571.4mb        284.1mb
Apparently there are 4 documents in my-index-000004:
health status index           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   my-index-000004 CyAC2uH6RR-axCseaHwnFQ   1   1          4            0    241.4kb        123.4kb
Could you run again:
GET /my-index-000004/_search
and not:
GET //my-index-000004/_search