Word count from documents


(manas) #1

Is it possible to index a PDF document and visualize the count of words, or the top 10 words with their counts?

Thanks in advance.


(David Pilato) #2

You can do that by indexing the content (with ingest-attachment) in a text field with fielddata: true. Or maybe add a keyword subfield, but you might hit a limit.
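For reference, a minimal mapping sketch along these lines (index, type, and field names are just placeholders, and this is untested against any particular version):

```
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "fielddata": true,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
```

fielddata: true lets you run terms aggregations on the analyzed tokens of the text field; the keyword subfield aggregates on whole values, and ignore_above is the kind of limit mentioned above, guarding against very long extracted strings.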

My 2 cents.


(manas) #3

Hello dadoonet, thanks for your quick response.

I have tried using the keyword subfield, but I am unable to make it work! (I am using Python code to index my documents; link - https://gist.github.com/stevehanson/7462063.)

As for the other solution you mentioned, ingest-attachment, I am not familiar with how to do that!

Please help.

Thanks.


(David Pilato) #4

I don't read Python code, so could you provide a full recreation script as described in About the Elasticsearch category? It will help me better understand what you are doing. Please try to keep the example as simple as possible.

As for the other solution you mentioned, ingest-attachment, I am not familiar with how to do that!

Not really another solution, but part of it. If you want to extract text from a PDF document, you can use the ingest-attachment plugin.
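For example, a minimal ingest-attachment pipeline sketch (the pipeline name and field names are placeholders; the base64 string is the small sample document from the plugin documentation — for a real PDF you would send the base64-encoded file bytes in the data field):

```
PUT _ingest/pipeline/attachment
{
  "description": "Extract text from base64-encoded documents",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
PUT test/doc/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```

The extracted text ends up in the attachment.content field of the indexed document, which is the field you would then map as text.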


(manas) #5

I have tried this:

  1. I used Tika with the Python code I shared; it indexes the data with a '.keyword' subfield, but it doesn't show the count of individual words in a PDF file.

  2. I used FSCrawler; it indexes the data as content and not in '.keyword' format, so the field doesn't even show up in the visualization tab.

  3. Using the ingest plugin, I am still working on it; I haven't yet found a way to index a PDF file and am running into a lot of issues. I will keep working on that.

You asked me to provide a script, but the tools I went through don't require one. All I need to do is give the directory name in which the files are stored, and they do the work for me!

I have been working on this for days now, and I am losing confidence that Elasticsearch will be able to count the individual words in a PDF file.

Can you please give me some references where someone has actually done this, because I don't want to waste any more time on it!

You are my only hope. Please help!

Regards,
Manas


(manas) #6

Here's a part of the search response:

GET /test/_search
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 19,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "gdYkmWABst9CE15hAvBX",
        "_score": 1,
        "_source": {
          "file": """
Elastic Search Logstash Kibana


(David Pilato) #7

My first answer was wrong. Sorry.

You can use term vectors I think. Like:

DELETE test
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "foo": {
          "type": "text",
          "term_vector": "yes",
          "store": true
        }
      }
    }
  }
}
POST test/doc/1
{
  "foo": "a b c c"
}
GET test/doc/1/_termvectors

It gives:

{
  "_index": "test",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 5,
  "term_vectors": {
    "foo": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        "a": {
          "term_freq": 1
        },
        "b": {
          "term_freq": 1
        },
        "c": {
          "term_freq": 2
        }
      }
    }
  }
}

More on this at: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
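If the goal is a "top 10 words" list, the terms map in a _termvectors response like the one above can be reduced client-side. Here is a small Python sketch (the response is hard-coded from the sample output above; in practice you would load the JSON returned by Elasticsearch):

```python
from collections import Counter

def top_terms(term_vectors_response, field="foo", n=10):
    """Return the n most frequent terms from a _termvectors response body."""
    terms = term_vectors_response["term_vectors"][field]["terms"]
    # Map each term to its in-document frequency, then rank by count.
    counts = Counter({term: stats["term_freq"] for term, stats in terms.items()})
    return counts.most_common(n)

# Sample _termvectors response body, taken from the output above.
response = {
    "term_vectors": {
        "foo": {
            "terms": {
                "a": {"term_freq": 1},
                "b": {"term_freq": 1},
                "c": {"term_freq": 2},
            }
        }
    }
}

print(top_terms(response))  # → [('c', 2), ('a', 1), ('b', 1)]
```

This reduces the per-document term frequencies outside Elasticsearch, which sidesteps the Kibana visualization issue entirely.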

HTH


(manas) #8

Actually, I am able to get the term vectors even now, but I am not able to visualize them in Kibana!

But I'll create a new index using the mapping format you suggested and let you know.

Thanks.


(manas) #9

Hello David,

Even now it treats the foo field as a whole string, not as individual words!


(David Pilato) #10

You need to index that information, I guess, within the document.

So you could call elasticsearch like:

GET /test/doc/_termvectors
{
  "doc" : {
    "foo" : "a b c c"
  }
}

And then enrich your document with that information somehow.

Not sure what you would do with that information then, though. Maybe sum the number of words and store it in your doc?

Or maybe write an ingest script to do that computation at index time? https://www.elastic.co/guide/en/elasticsearch/reference/master/script-processor.html
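For instance, a sketch of a script processor that stores a whitespace word count at index time (pipeline and field names are placeholders; this assumes whitespace-separated text, and splitOnToken is used because Painless regex support is disabled by default — it is only available in recent Painless versions):

```
PUT _ingest/pipeline/word_count
{
  "description": "Compute a word count at index time",
  "processors": [
    {
      "script": {
        "source": "ctx.word_count = ctx.foo.splitOnToken(' ').length"
      }
    }
  ]
}
PUT test/doc/2?pipeline=word_count
{
  "foo": "a b c c"
}
```

The indexed document would then carry a numeric word_count field, which Kibana can aggregate on directly.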


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.