How can I ingest PDF and Word files and extract keywords from these documents?

Hi all,
Currently, I use FSCrawler to ingest files (PDF, Word). All files are indexed in ES. Now I want to build a search engine on top of Elasticsearch that behaves like the Google search engine. The first step is to display keywords from the files that match the text I am typing.
Is it possible, when I ingest files with FSCrawler, to index the keywords of each file into the ES document?

For example:
"_source": {
  "content": {...},
  "meta": {...},
  "file": { "keywords": { "keyword": ["building", "floor", ...] }, ... },
  "path": {...}
}

Thank you in advance

You can probably run a terms aggregation on the field file.keywords.keyword.
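
A terms aggregation like the one suggested here could be sketched as follows. This is only a sketch: it builds the search body in Python and prints it as JSON so it can be sent to a `_search` endpoint. The field name file.keywords.keyword comes from the reply above; the helper `build_keyword_agg`, the aggregation name `top_keywords` and the bucket count are illustrative assumptions, and the request only returns buckets if that field actually exists in the index.

```python
import json


def build_keyword_agg(field="file.keywords.keyword", size=20):
    """Build a search body returning the most frequent values of `field`.

    `top_keywords` is a hypothetical aggregation name chosen for this sketch.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "top_keywords": {
                "terms": {"field": field, "size": size}
            }
        },
    }


body = build_keyword_agg()
print(json.dumps(body, indent=2))
```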

The problem is that I have no keywords: "file.keywords" does not exist. I need to index all terms of the document as keywords.

Two things:

  • If the document has "real" keywords, FSCrawler should be able to provide them.
  • If you don't have any keywords, then you can only build something like a tag cloud, which I believe is going to be messy. In that case, the only way to build this from raw text content is to enable fielddata on the content field. But I think this is going to put a lot of pressure on your JVM memory.
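
The fielddata route from the second point could look roughly like this. It is a sketch under assumptions, not a tested recipe: `fielddata_mapping` builds a body for the index `_mapping` endpoint, and `prefix_terms_agg` builds a terms aggregation restricted to terms starting with what the user typed. Both helper names and the aggregation name `term_cloud` are made up for illustration. Depending on how content is already mapped, the existing field settings may need to be repeated in the mapping update, and the prefix is lowercased on the assumption that the field's analyzer lowercases terms.

```python
import json


def fielddata_mapping(field="content"):
    """Mapping update body that enables fielddata on a text field."""
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def prefix_terms_agg(prefix, field="content", size=10):
    """Search body: most frequent indexed terms of `field` starting with `prefix`."""
    if not prefix.isalnum():
        # Keep the sketch simple: avoid having to escape regex characters.
        raise ValueError("prefix must contain only letters and digits")
    return {
        "size": 0,  # only the buckets are needed
        "aggs": {
            "term_cloud": {
                "terms": {
                    "field": field,
                    "size": size,
                    # `include` takes a regular expression over the terms
                    "include": prefix.lower() + ".*",
                }
            }
        },
    }


print(json.dumps(fielddata_mapping(), indent=2))
print(json.dumps(prefix_terms_agg("build"), indent=2))
```

Every distinct term of the field is loaded into heap memory for this, which is the JVM pressure mentioned above.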

My 2 cents on this.

The documents don't have "real" keywords... So what is a good way to do auto-completion with these documents?

One way is the second point in my answer above.
The other way may be to use suggesters: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html

I chose the term suggester to do the auto-completion. Currently, I can only query one index, like this:

GET /_search
{
  "suggest" : {
    "my-suggestion" : {
      "text" : "build",
      "term" : {
        "field" : "content",
        "min_word_length": 2,
        "prefix_length": 5
      }
    }
  }
}

But is it possible to query multiple fields across different indices? I want to run the query on the "content" field of one index and on the "message" field of another index.

I have tried it without success:

GET _search
{
  "suggest": {
    "text" : "build",
    "my-suggest-1" : {
      "term" : {
        "field" : "message"
      }
    },
    "my-suggest-2" : {
      "term" : {
        "field" : "content"
      }
    }
  }
}
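
One possible approach, which nobody in this thread has confirmed, is to send one suggest request per index through the multi search API (`_msearch`), so that each field is only queried on the index that contains it. The sketch below builds the newline-delimited body in Python. The index names `documents` and `logs`, the helper names and the suggestion name are hypothetical placeholders.

```python
import json


def term_suggest_body(text, field, name="my-suggestion"):
    """Search body holding a single term suggester on `field`."""
    return {"size": 0, "suggest": {name: {"text": text, "term": {"field": field}}}}


def build_msearch(text, targets):
    """Build an _msearch payload from (index, field) pairs.

    Each search is two lines: a header naming the index, then the body.
    """
    lines = []
    for index, field in targets:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(term_suggest_body(text, field)))
    return "\n".join(lines) + "\n"  # the payload must end with a newline


# Hypothetical index names, one per field to query.
payload = build_msearch("build", [("documents", "content"), ("logs", "message")])
print(payload)
```

The payload would be sent to `_msearch` with the `application/x-ndjson` content type, and the response contains one entry per search, in the same order.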

Please format your code, logs or configuration files using the </> icon as explained in this guide, not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Also, could you create a new question for this as the title is not really related anymore?
You can link to this question from the new one if you wish.
