Ingesting PDF, JPEG/JPG and other file types

Hello there!
I am looking for a tutorial on how to ingest/parse/analyze files such as .pdf, .jpeg/.jpg and other types.
I have found this article, but I don't use Docker/k8s.
Could you be so kind as to point me to other tutorials/manuals? I'd like to learn this as soon as possible.
I use Elasticsearch version 7.17.

You don't have to use docker for that.

You can use the ingest attachment plugin.

There's an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.17] | Elastic

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
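
For a real file, the request body can be built programmatically. Here is a minimal Python sketch using only the standard library. The helper name build_attachment_body is made up for illustration, and "data" simply matches the field configured in the pipeline above:

```python
import base64
import json


def build_attachment_body(raw: bytes, field: str = "data") -> str:
    """Return the JSON request body for a document whose `field`
    holds the BASE64 encoding of the file's raw bytes."""
    # b64encode returns bytes; JSON needs a str, and BASE64 is pure ASCII.
    encoded = base64.b64encode(raw).decode("ascii")
    return json.dumps({field: encoded})
```

Read the file in binary mode (open(path, "rb").read()) before passing it in, so that PDFs and images are not altered by text decoding.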

You can also use FSCrawler. There's a tutorial to help you get started.

@dadoonet Thank you very much. I will try it today!

It looks pretty good:

# base64 test.txt 
dG8gamVzdCBqYWtpcyB0ZWtzdAo=

root@{{hostname}} ~
# curl -X PUT "https://{{ip_address}}:{{port_number}}/my-index-000002/_doc/my_id?pipeline=attachment&pretty" -H 'Content-Type: application/json' -d'
{
  "data": "dG8gamVzdCBqYWtpcyB0ZWtzdAo="
}' -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000002",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
root@{{hostname}} ~
# curl -X GET "https://{{ip_address}}:{{port_number}}/my-index-000002/_doc/my_id?pretty" -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000002",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "data" : "dG8gamVzdCBqYWtpcyB0ZWtzdAo=",
    "attachment" : {
      "content_type" : "text/plain; charset=ISO-8859-1",
      "language" : "et",
      "content" : "to jest jakis tekst",
      "content_length" : 21
    }
  }
}

The response from Elasticsearch is:

      "content" : "to jest jakis tekst",

and my test file was:
[image]

This is a nice solution for one file. As I understand it, if I want to do this for many files (thousands), I should use

Am I right?

Hello Again,
One thing I don't understand: after I send a new file to the index, the index is updated, but it contains only the last file. It works like a versioning mechanism. Should I really use FSCrawler for many files, and is that the only way?

It's probably because you used my_id as the document _id.
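
Indexing with an _id that already exists replaces the stored document and increments its _version, which is why only the last file remains. For many files, one option is to give each file its own _id and send them through the _bulk API. Here is a rough Python sketch that only builds the request body. Using the file name as the _id is an assumption made for this example, and build_bulk_body is a made-up helper name:

```python
import base64
import json
from typing import Mapping


def build_bulk_body(files: Mapping[str, bytes]) -> str:
    """Build an NDJSON body for the _bulk API, one document per file.

    `files` maps a file name to its raw bytes. The file name is used as
    the document _id (an assumption for this sketch), so different files
    never overwrite each other, while re-sending the same file replaces
    its earlier version.
    """
    lines = []
    for name, raw in files.items():
        # Action line: which document to index.
        lines.append(json.dumps({"index": {"_id": name}}))
        # Source line: the BASE64 payload read by the attachment processor.
        lines.append(json.dumps({"data": base64.b64encode(raw).decode("ascii")}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting body would be sent to POST my_index/_bulk?pipeline=attachment with the header Content-Type: application/x-ndjson. For thousands of large files, it is probably wise to send them in several smaller batches.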

Hello Again!
Even if I use different IDs, it still behaves strangely...
The command I used (and a result) is here:

curl -X PUT "https://{{ip_address}}:{{port_number}}/my-index-000004/_doc/0003?pipeline=attachment&pretty" -H 'Content-Type: application/json' -d'
{
  "data": "bGFsYWxhbAo="
}' -k -u elastic
Enter host password for user 'elastic':
{
  "_index" : "my-index-000004",
  "_type" : "_doc",
  "_id" : "0003",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 11,
  "_primary_term" : 1
}

But in Kibana I can see only this:

Shouldn't I have three documents there?

What is the output of this?

GET /my-index-000004/_search

This is the result:

curl -X GET "https://{{ip_address}}:{{port_number}}//my-index-000004/_search" -k -u elastic
Enter host password for user 'elastic':
{"took":2875,"timed_out":false,"_shards":{"total":52,"successful":52,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}

Here the doc count is 0.
You previously shared a screen capture which shows 1.

Could you explain that?

Alternatively, could you share the output of this?

GET /_cat/index?v

Nope. I have done nothing to that index.

curl -X GET "https://{{ip_address}}:{{port_number}}//_cat/index?v" -k -u elastic
Enter host password for user 'elastic':
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [//_cat/index] contains unrecognized parameter: [v]"}],"type":"illegal_argument_exception","reason":"request [//_cat/index] contains unrecognized parameter: [v]"},"status":400}

Try with:

GET /_cat/indices?h

It looks pretty much the same:

curl -X GET "https://{{ip_address}}:{{port_number}}//_cat/index?h" -k -u elastic
Enter host password for user 'elastic':
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [//_cat/indices] contains unrecognized parameter: [h]"}],"type":"illegal_argument_exception","reason":"request [//_cat/indices] contains unrecognized parameter: [h]"},"status":400}

Am I missing something?

Maybe //_cat should be /_cat?

OK... this is kind of strange...

curl -X GET "https://{{ip_address}}:{{port_number}}/_cat/indices?h" -k -u elastic
Enter host password for user 'elastic':

root@{{hostname}}

But with the v parameter, the result is:

curl -X GET "https://{{ip_address}}:{{port_number}}/_cat/indices?v" -k -u elastic
Enter host password for user 'elastic':
health status index                                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .siem-signals-default-000001              JVRVtBIxQDOMcWMsWLYacg   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.02                rNeTEPTZR3aauRK6TEBWBw   1   1     214804            0     48.5mb         24.2mb
green  open   .siem-signals-default-000002              8wRU4bOJRSqqqX9IMUJMsA   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.05                3lcwEvveQWmcpTYu_eF1Mg   1   1     231778            0     44.9mb         22.4mb
green  open   filebeat-7.17.6-2022.08.30                u3vvVdn_Q-SlEpHLKnykFQ   1   1      71701            0     37.5mb         18.7mb
green  open   .items-default-000001                     faFLGiiZRI-9FMBu16YI8w   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.09.06                9bgDFUSUSN2NdyXzxvdF5w   1   1      67719            0       18mb            9mb
green  open   .kibana_7.17.6_001                        S10OBEa6Rgmc0Z_LCFzbXA   1   1       9829          610     25.1mb         12.4mb
green  open   filebeat-7.17.6-2022.09.07                cr-L0reKS6KilNWfUKZzRA   1   1       6969            0      4.4mb          2.1mb
green  open   .kibana_7.17.5_001                        qidgdt6-QymHiGroYt4J7Q   1   1       4663         3189     29.5mb         14.7mb
green  open   .apm-custom-link                          NivBG4DkSFSwa1txBJtm6g   1   1          0            0       452b           226b
green  open   my-index-000003                           RU-z0UshSz-xG-yfdrb1lQ   1   1          1            0     16.7kb          8.3kb
green  open   my-index-000002                           am1zUIXoTkWggi_4IGmMqQ   1   1          1            0     12.8kb          6.4kb
green  open   my-index-000001                           YGdB3rw9T0iqJoV4mPT-YQ   1   1          1            0     13.3kb          6.6kb
green  open   filebeat-7.17.6-2022.09.13                ME3UExF6SouvZl_rK6mK4w   1   1      37580            0     10.5mb          5.2mb
green  open   filebeat-7.17.6-2022.09.14                2dnmNZFWR4yiNGjGQPj1TQ   1   1      16952            0      5.6mb          2.8mb
green  open   my-index-000004                           CyAC2uH6RR-axCseaHwnFQ   1   1          4            0    241.4kb        123.4kb
green  open   .fleet-enrollment-api-keys-7              k9yLqLSNSgeDVpRQq9DIBw   1   1          2            0     13.3kb          6.6kb
green  open   filebeat-7.17.6-2022.09.12                IEtwwPlQQbGLzp_9sKzkug   1   1      18446            0      7.1mb          3.5mb
green  open   .apm-agent-configuration                  WU-6ZnUKRKK44PMhZ53T2g   1   1          0            0       452b           226b
green  open   .tasks                                    KRi--4lIQrSL_xOhA95Cdg   1   1         58            0      146kb         78.9kb
green  open   .monitoring-kibana-7-2022.09.13           m1l6qx93SKSZIMULoBHglg   1   1      51838            0     20.3mb         10.1mb
green  open   .monitoring-kibana-7-2022.09.12           8-qGQU61RbC1PSW0-PBR1w   1   1      51838            0     20.1mb           10mb
green  open   .monitoring-kibana-7-2022.09.11           IVrE1HfyRFq5iXI7podXGw   1   1      51838            0     19.7mb         10.1mb
green  open   metricbeat-7.17.6-2022.08.29              aTwLimmrTN21nx157p9q9g   1   1        206            0      1.1mb        594.4kb
green  open   .monitoring-kibana-7-2022.09.10           PGT936NmR2qVfHmhG4HwRQ   1   1      51840            0       20mb          9.9mb
green  open   metricbeat-7.17.6-2022.09.05-000001       kVaziaI8SrS067ytC-Iydg   1   1          0            0       452b           226b
green  open   .monitoring-kibana-7-2022.09.14           M4FATkQVQbGeahwFZJWmMg   1   1         18            0      3.2mb          1.6mb
green  open   .fleet-policies-7                         DKlqxoiaTAiAUbVK54qsuA   1   1          8            0    105.9kb         52.9kb
green  open   .metrics-endpoint.metadata_united_default 191wdoN2RXCajVeqrnMGZg   1   1          0            0       452b           226b
green  open   filebeat-7.17.6-2022.08.30-000001         hR_GG_-zRYOUYaWqVqdUuQ   1   1     159912            0     38.6mb         19.1mb
green  open   .geoip_databases                          4aKuteb9Ro6evZ0jOff5Qw   1   1         41           36     82.6mb         41.3mb
green  open   .monitoring-es-7-2022.09.09               3Ky4uczzTR6NhEYAYPv3xg   1   1     537252       349184    563.2mb        279.9mb
green  open   .monitoring-es-7-2022.09.08               SbbS6M5uSXqL9Vvs7KNd4w   1   1     537373       348316    563.1mb        275.5mb
green  open   filebeat-7.17.5-2022.08.29                Zz_A72-tS2ur7tRdFbaTjQ   1   1      14019            0      8.2mb            4mb
green  open   .monitoring-kibana-7-2022.09.09           QI_j2H6ITAGpR_BHENtsUw   1   1      51838            0     20.1mb         10.2mb
green  open   .monitoring-kibana-7-2022.09.08           SO2cPVvGSJGf3t2hP5d3WQ   1   1      51836            0     20.3mb         10.1mb
green  open   .kibana_task_manager_7.17.4_001           bhHoWY7ITR2IxHEDq-oskQ   1   1         17         1024    480.7kb        240.3kb
green  open   .fleet-artifacts-7                        HEhy5deMQM6sDTeQXkrjkw   1   1         12            0     61.5kb         26.9kb
green  open   .transform-internal-007                   JtUniNzIRfeg9ngkzuHEHA   1   1          6            0     94.1kb           47kb
green  open   .lists-default-000001                     oG0Xlk8DQjeuxT-7kLhseg   1   1          0            0       452b           226b
green  open   .kibana_task_manager_7.17.5_001           ekDIzBssS1mUW6j45EsNmA   1   1         18          930    418.1kb          209kb
green  open   filebeat-7.17.6-2022.08.29                iaWFvgz8ThCgbemHH3PmIA   1   1      40457            0     21.6mb         10.8mb
green  open   .kibana_7.17.4_001                        TaSnmWSBTeq-9NH1mhZAPA   1   1         52            0      4.7mb          2.3mb
green  open   metrics-endpoint.metadata_current_default 5j3dk6VESROWHNe56O7OKg   1   1          0            0       452b           226b
green  open   .security-7                               IkXDgPqyTjCJ8_osVOFveA   1   1         66            0    388.9kb        194.4kb
green  open   filebeat-7.17.5-2022.08.30                nX1KekWNSW2TGomFNOzZMw   1   1      58855            0     29.8mb         14.9mb
green  open   .kibana_task_manager_7.17.6_001           z4jGWlxuQbqG77pbubkRxQ   1   1         19       646140       91mb           73mb
green  open   .monitoring-es-7-2022.09.13               Z1u3PMVWTvKAAVtzpkaP2Q   1   1     555121        32256    546.4mb          274mb
green  open   .monitoring-es-7-2022.09.12               -LRFhcZPQx6SiETZFF9hJQ   1   1     543551         8820    533.9mb        266.5mb
green  open   .async-search                             qK4QM58TR-q5CJAHKYvM-w   1   1          0           22     68.8kb         28.3kb
green  open   .monitoring-es-7-2022.09.14               nyjBofkPQ42cXqtnf7CXVA   1   1     132503       258938    185.6mb         93.2mb
green  open   .monitoring-es-7-2022.09.11               mzzRBCc0Q6SJDL_kLJUnig   1   1     537882       349928    575.5mb        286.8mb
green  open   .monitoring-es-7-2022.09.10               W8YfsGLKR_qzvTcpvRh16Q   1   1     541533       345340    571.4mb        284.1mb

Apparently there are 4 documents in my-index-000004:

health status index                                     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   my-index-000004                           CyAC2uH6RR-axCseaHwnFQ   1   1          4            0    241.4kb        123.4kb
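
If you want to check the count from a script instead of reading the table by eye, something like the following Python sketch could work. It assumes every column has a value in every row, as in the output above, and the function names are invented for this example:

```python
from typing import Dict, List


def parse_cat_table(text: str) -> List[Dict[str, str]]:
    """Turn the whitespace-aligned output of a _cat request made with ?v
    into a list of dicts keyed by the header names."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def docs_count(text: str, index: str) -> int:
    """Return docs.count for `index`; raise KeyError if it is not listed."""
    for row in parse_cat_table(text):
        if row["index"] == index:
            return int(row["docs.count"])
    raise KeyError(index)
```

The _cat APIs also accept format=json, which avoids parsing the text table altogether.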

Could you run again:

GET /my-index-000004/_search

and not:

GET //my-index-000004/_search
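
The doubled slash most likely comes from joining a base URL that ends in / with a path that starts with /. In this thread, the request sent with // reported 52 shards and 0 hits, so it clearly was not treated as a search on that one index. If the URLs are built in a script, a small helper can prevent this. Here is a minimal Python sketch, where build_url and the localhost address are placeholders for illustration only:

```python
def build_url(base: str, path: str) -> str:
    """Join a base URL and an API path with exactly one slash between them."""
    return base.rstrip("/") + "/" + path.lstrip("/")


# Both spellings produce the same, correct URL:
# build_url("https://localhost:9200/", "/my-index-000004/_search")
# build_url("https://localhost:9200", "my-index-000004/_search")
```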