Word count from documents


(manas) #1

Is it possible to index a PDF document and visualize the count of words, or the top 10 words with their counts?

Thanks in advance.


(David Pilato) #2

You can do that by indexing the content (with ingest-attachment) in a text field with fielddata: true. Or maybe add a keyword subfield, but you might hit a limit.
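For reference, a minimal mapping sketch along these lines (index, type, and field names are just placeholders, and this is untested against any particular version):

```
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "fielddata": true,
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
```

fielddata: true lets you run terms aggregations on the analyzed tokens of the text field; the keyword subfield aggregates on whole values, and ignore_above is the kind of limit mentioned above, guarding against very long extracted strings.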

My 2 cents.


(manas) #3

Hello dadoonet, thanks for your quick response.

I have tried using the keyword subfield, but I am unable to make it work! (I am using Python code to index my documents; link - https://gist.github.com/stevehanson/7462063.)

As for the other solution you mentioned, ingest-attachment, I am not familiar with how to do that!

Please help.

Thanks.


(David Pilato) #4

I don't read Python code, so could you provide a full recreation script as described in About the Elasticsearch category? It will help me better understand what you are doing. Please try to keep the example as simple as possible.

As for the other solution you mentioned, ingest-attachment, I am not familiar with how to do that!

Not really another solution, but part of it. If you want to extract text from a PDF document, you can use the ingest-attachment plugin.
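For example, a minimal ingest-attachment pipeline sketch (the pipeline name and field names are placeholders; the base64 string is the small sample document from the plugin documentation — for a real PDF you would send the base64-encoded file bytes in the data field):

```
PUT _ingest/pipeline/attachment
{
  "description": "Extract text from base64-encoded documents",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
PUT test/doc/1?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
```

The extracted text ends up in the attachment.content field of the indexed document, which is the field you would then map as text.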


(manas) #5

I have tried this:

  1. I used Tika with the Python code I shared; it indexes the data with a '.keyword' subfield, but it doesn't show the count of individual words in a PDF file.

  2. I used FSCrawler; it indexes the data as content and not in '.keyword' format, so the field doesn't even show up in the visualization tab.

  3. Using the ingest plugin, I am still working on it; I haven't yet found a way to index a PDF file and am running into a lot of issues. I will keep working on that.

You asked me to provide a script, but the tools I went through don't require one. All I need to do is give the directory name in which the files are stored, and they do the work for me!

I have been working on this for days now, and I am losing confidence that Elasticsearch will be able to count the individual words in a PDF file.

Can you please give me some references where someone has actually done this, because I don't want to waste any more time on it!

You are my only hope. Please help!

Regards,
Manas


(manas) #6

Here's a part of the search response:

GET /test/_search
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 19,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "attachment",
        "_id": "gdYkmWABst9CE15hAvBX",
        "_score": 1,
        "_source": {
          "file": """
Elastic Search Logstash Kibana


(David Pilato) #7

My first answer was wrong. Sorry.

You can use term vectors I think. Like:

DELETE test
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "foo": {
          "type": "text",
          "term_vector": "yes",
          "store": true
        }
      }
    }
  }
}
POST test/doc/1
{
  "foo": "a b c c"
}
GET test/doc/1/_termvectors

It gives:

{
  "_index": "test",
  "_type": "doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 5,
  "term_vectors": {
    "foo": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        "a": {
          "term_freq": 1
        },
        "b": {
          "term_freq": 1
        },
        "c": {
          "term_freq": 2
        }
      }
    }
  }
}

More on this at: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
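If the goal is a "top 10 words" list, the terms map in a _termvectors response like the one above can be reduced client-side. Here is a small Python sketch (the response is hard-coded from the sample output above; in practice you would load the JSON returned by Elasticsearch):

```python
from collections import Counter

def top_terms(term_vectors_response, field="foo", n=10):
    """Return the n most frequent terms from a _termvectors response body."""
    terms = term_vectors_response["term_vectors"][field]["terms"]
    # Map each term to its in-document frequency, then rank by count.
    counts = Counter({term: stats["term_freq"] for term, stats in terms.items()})
    return counts.most_common(n)

# Sample _termvectors response body, taken from the output above.
response = {
    "term_vectors": {
        "foo": {
            "terms": {
                "a": {"term_freq": 1},
                "b": {"term_freq": 1},
                "c": {"term_freq": 2},
            }
        }
    }
}

print(top_terms(response))  # → [('c', 2), ('a', 1), ('b', 1)]
```

This reduces the per-document term frequencies outside Elasticsearch, which sidesteps the Kibana visualization issue entirely.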

HTH


(manas) #8

Actually, I am able to get the term vectors even now, but I am not able to visualize them in Kibana!

But I'll create a new index using the mapping format you suggested and let you know.

Thanks.


(manas) #9

Hello David,

Even now it treats the foo field as a whole string, not as individual words!


(David Pilato) #10

You need to index that information, I guess, within the document.

So you could call elasticsearch like:

GET /test/doc/_termvectors
{
  "doc" : {
    "foo" : "a b c c"
  }
}

And then enrich your document with that information somehow.

Not sure what you would do with that information then, though. Maybe sum the number of words and store it in your doc?

Or maybe write an ingest script to do that computation at index time? https://www.elastic.co/guide/en/elasticsearch/reference/master/script-processor.html
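For instance, a sketch of a script processor that stores a whitespace word count at index time (pipeline and field names are placeholders; this assumes whitespace-separated text, and splitOnToken is used because Painless regex support is disabled by default — it is only available in recent Painless versions):

```
PUT _ingest/pipeline/word_count
{
  "description": "Compute a word count at index time",
  "processors": [
    {
      "script": {
        "source": "ctx.word_count = ctx.foo.splitOnToken(' ').length"
      }
    }
  ]
}
PUT test/doc/2?pipeline=word_count
{
  "foo": "a b c c"
}
```

The indexed document would then carry a numeric word_count field, which Kibana can aggregate on directly.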


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.