How can I ingest PDF and Word files and extract keywords from these documents?

Akimbo · May 24, 2018, 4:30pm

Hi all,
Currently, I use FSCrawler to ingest files (PDF, Word). All files are indexed in ES. Now I want to build a search engine that uses Elasticsearch and behaves like Google search. As a first step, it should display keywords from the files that match the text I am typing.
Is it possible, when I ingest files with FSCrawler, to index the keywords of each file into the ES document?

For example:
"_source": {
"content": {...},
"meta": {...},
"file": { "keywords" : { "keyword": "building", "keyword": "floor" ... } , ... },
"path": {...}
}

Thank you in advance

dadoonet · May 28, 2018, 10:35am

You can probably run a terms aggregation on field file.keywords.keyword I guess.
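
A minimal sketch of such an aggregation, assuming the field `file.keywords.keyword` actually exists and is mapped as a `keyword` type (neither is confirmed in this thread). The aggregation name `top_keywords` and the index name `docs` are placeholders. The code only builds the request body for the `_search` endpoint and does not contact a cluster:

```python
import json


def keyword_terms_agg(field="file.keywords.keyword", size=20):
    """Build a _search body that returns the most frequent values of `field`."""
    return {
        "size": 0,  # skip the hits; only the aggregation buckets are needed
        "aggs": {
            "top_keywords": {  # hypothetical aggregation name
                "terms": {"field": field, "size": size}
            }
        },
    }


body = keyword_terms_agg()
# Would be sent as: GET /docs/_search  (the index name `docs` is an assumption)
print(json.dumps(body, indent=2))
```

Each bucket in the response would hold one keyword and the number of documents containing it, which could feed a keyword list in the UI.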

Akimbo · May 28, 2018, 12:28pm

The problem is that I have no keywords. "file.keywords" does not exist. I need to index all terms of the document as keywords.

dadoonet · May 28, 2018, 12:41pm

Two things:

If the document has "real" keywords, FSCrawler should be able to provide them.
If you don't have any keywords, then I guess you can only build something like a tag cloud, which I believe is going to be messy. Anyway, in that case, the only way to build this from raw text content is to enable fielddata on the content field. But I think this is going to put a lot of pressure on your JVM memory.
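
A rough sketch of the second option, assuming the extracted text lives in a `text` field named `content`. The aggregation name `tag_cloud` and the bucket count are placeholders, and the mapping body should repeat whatever parameters the existing `content` mapping already has (an analyzer, for instance), which are not known here. The code only builds the two request bodies:

```python
import json


def enable_fielddata(field="content"):
    """Mapping-update body that turns on fielddata for a text field."""
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def tag_cloud_agg(field="content", size=50):
    """_search body returning the most frequent analyzed terms of `field`."""
    return {
        "size": 0,  # no hits, only buckets
        "aggs": {
            "tag_cloud": {  # hypothetical aggregation name
                "terms": {"field": field, "size": size}
            }
        },
    }


mapping_body = enable_fielddata()
search_body = tag_cloud_agg()
# The mapping body goes to the index's _mapping endpoint with PUT (the exact
# path depends on the Elasticsearch version); the search body goes to _search.
print(json.dumps(mapping_body))
print(json.dumps(search_body))
```

The buckets would contain every analyzed term, including very common words, which is where both the messiness and the memory pressure mentioned above come from.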

My 2 cents on this.

Akimbo · May 28, 2018, 12:53pm

The documents don't have "real" keywords... So what is a good way to do auto-completion with these documents?

dadoonet · May 28, 2018, 1:13pm

One way is the 2nd point in my previous answer.
The other way may be to use suggesters: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html

Akimbo · May 29, 2018, 9:48am

I chose the term suggester for the auto-completion. Currently, I can only query one index, like this:

GET /_search
{
  "suggest" : {
    "my-suggestion" : {
      "text" : "build",
      "term" : {
        "field" : "content",
        "min_word_length": 2,
        "prefix_length": 5
      }
    }
  }
}

But is it possible to run the query against fields in different indices? I want to query the "content" field of one index and the "message" field of another index.

I have tried this without success:

GET _search
{
  "suggest": {
    "text" : "build",
    "my-suggest-1" : {
      "term" : {
        "field" : "message"
      }
    },
    "my-suggest-2" : {
       "term" : {
        "field" : "content"
       }
    }
  }
}

dadoonet · May 29, 2018, 9:59am

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Also, could you create a new question for this as the title is not really related anymore?
You can link to this question from the new one if you wish.

system · June 26, 2018, 10:12am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
