Parse PDF catalogs to extract product information

Hello,

I need to parse some PDF files to find the most relevant words and work out what each PDF is about.

These PDFs are vendor catalogs. For example, I need to extract products so they can be offered in a web search engine (currently powered by Elasticsearch, with entities parsed from a database).

Some catalogs may also be Excel files, and those files are not standardized.

I used Elasticsearch a long time ago and I am not a data scientist :upside_down_face:

My project uses a standalone Elasticsearch instance with a PHP client:

curl -XGET 'http://localhost:9200'
{
  "name" : "n1es6kx",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "5ya3CLzuQYKWMtqLk4Libw",
  "version" : {
    "number" : "6.8.10",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "537cb22",
    "build_date" : "2020-05-28T14:47:19.882936Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.3",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

"friendsofsymfony/elastica-bundle": "~5.2",
"ruflin/elastica": "~6.1"

I am currently reading the documentation, but any advice or recommendations would be useful.

Thank you.

First, upgrade! At least to 7.17.8 but better to 8.5.3.

You can use the ingest attachment plugin.

There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the Base64 representation of your binary file.
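
As a rough sketch of how that `data` value could be produced from a file on disk, the Python snippet below Base64-encodes a file and builds the same `PUT my_index/_doc/my_id?pipeline=attachment` request shown above. The function names, the `catalog.pdf` file name, and the `http://localhost:9200` default are assumptions for illustration; only the standard library is used, and the request is built but not sent.

```python
import base64
import json
import urllib.request


def encode_file(path):
    """Return the file's raw bytes as a Base64 ASCII string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def build_request(path, doc_id, base_url="http://localhost:9200",
                  index="my_index", pipeline="attachment"):
    """Build (but do not send) the PUT request that indexes one file.

    base_url, index and pipeline are illustrative defaults matching the
    example above; adjust them to your own cluster.
    """
    body = json.dumps({"data": encode_file(path)}).encode("utf-8")
    url = f"{base_url}/{index}/_doc/{doc_id}?pipeline={pipeline}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Usage (hypothetical file name; requires a running cluster with the
# attachment pipeline already created):
# req = build_request("catalog.pdf", "my_id")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Large catalogs produce large request bodies this way, so it may be worth checking the HTTP request size limit of your cluster before sending whole files.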

You can also use FSCrawler. There's a tutorial to help you get started.

Hello,

Thank you, I will check the ingest attachment plugin documentation.

Any advice on how to make my search within PDFs more relevant? I thought I could count words to find out which ones are the most relevant, but I don't think that would be the best solution.
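
For illustration only, the word-counting idea could be sketched roughly as below, assuming the text has already been extracted from the PDF. The `top_terms` name and the tiny stopword list are made up for this example; this is a naive term-frequency count, not a recommendation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be
# longer and language-specific.
STOPWORDS = {"the", "and", "a", "an", "of", "to", "in", "with", "for", "is"}


def top_terms(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent words as (word, count) pairs.

    Words are lowercased, stopwords are dropped, and tokens of one or
    two characters are ignored.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords and len(w) > 2)
    return counts.most_common(n)


sample = "The drill and the drill bits. A cordless drill with bits."
print(top_terms(sample, n=2))
```

A plain count like this favors words that are frequent everywhere in a catalog, which is one reason it may not be the best relevance signal on its own.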

Maybe there is some kind of AI for this? Or should I parse my PDFs first with another application?

What do you mean? Do you have an example for this question?

Ideally open a new discussion about it because this one is marked as solved and your original question is not directly related to the new question IMO. :wink:
