Elasticsearch: total term frequency and doc count from given set of documents


(Prasoon Kirar) #1

I am trying to get the total term frequency and document count for a given set of documents, but _termvectors in Elasticsearch returns ttf and doc_count computed over all documents in the index. Is there a way to specify a list of documents (document IDs) so that the result is based on those documents only?

Below are the document details and the query used to get the total term frequency:
Index details:

PUT /twitter
{
  "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}

Document Details:

PUT /twitter/tweets/1
{
  "name":"Hello bar"
}

PUT /twitter/tweets/2
{
  "name":"Hello foo"
}

PUT /twitter/tweets/3
{
  "name":"Hello foo bar"
}

This creates three documents with IDs 1, 2 and 3. Now suppose the tweets with IDs 1 and 2 belong to user1 and tweet 3 belongs to another user, and I want to get the term vectors for user1.

Query to get this result:

GET /twitter/tweets/_mtermvectors
{
  "ids" : ["1", "2"],
  "parameters": {
      "fields": ["name"],
      "term_statistics": true,
      "offsets":false,
      "payloads":false,
      "positions":false
  }
}

Response:

{
  "docs": [
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "bar": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    },
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "foo": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    }
  ]
}

Here we can see that hello has doc_freq 3 and ttf 3, which are counted across the whole index. How can I make it consider only the documents with the given IDs?

One approach I am considering is to create a separate index for each user, but I am not sure this is the right approach, since the number of indices would grow with the number of users. Is there another solution?


(David Pilato) #2

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

(Prasoon Kirar) #3

Thanks for the suggestion. I have updated the formatting accordingly.


(David Pilato) #4

Awesome. Thanks!

I don't have the answer to your question. Indeed, creating specific indices would work. If it's a one-shot or small operation, you could imagine something like:

  • call the reindex API with the query you want, to copy the few matching docs into a tmp_timestamp index, for example
  • call the termvectors API on that index
  • drop the tmp_timestamp index
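
The three steps above could be sketched as a sequence of REST calls. The snippet below only builds the requests and does not send them. The helper name build_temp_index_plan and the tmp_<timestamp> naming are made up for illustration. It also assumes the temporary index is created with the same mapping as the source first, since reindex does not copy mappings and the english analyzer would otherwise be lost.

```python
import time


def build_temp_index_plan(source_index, doc_type, doc_ids, field, mappings, timestamp=None):
    """Return the REST calls (method, path, body) for the temporary-index approach.

    Hypothetical helper: it only describes the requests, it does not send them.
    """
    if timestamp is None:
        timestamp = int(time.time())
    tmp_index = "tmp_%d" % timestamp

    # Copy only the wanted documents, selected with an ids query.
    reindex_body = {
        "source": {"index": source_index, "query": {"ids": {"values": doc_ids}}},
        "dest": {"index": tmp_index},
    }
    # Same request as in the question, but run against the temporary index.
    termvectors_body = {
        "ids": doc_ids,
        "parameters": {"fields": [field], "term_statistics": True},
    }
    return [
        # Assumption: create the index up front so the analyzer matches the source.
        ("PUT", "/" + tmp_index, {"mappings": mappings}),
        # refresh=true so term statistics see the copied documents.
        ("POST", "/_reindex?refresh=true", reindex_body),
        ("GET", "/%s/%s/_mtermvectors" % (tmp_index, doc_type), termvectors_body),
        ("DELETE", "/" + tmp_index, None),
    ]


mappings = {"tweets": {"properties": {"name": {"type": "text", "analyzer": "english"}}}}
plan = build_temp_index_plan("twitter", "tweets", ["1", "2"], "name", mappings, timestamp=0)
for method, path, body in plan:
    print(method, path)
```

Because the temporary index holds only the selected documents, doc_freq and ttf in its term vectors response would cover just those documents.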

But maybe @jimczi has a much better idea?


(Jimferenczi) #5

Since the response already includes the term_freq per term per document, it should be straightforward to derive the total term frequency for each term: just sum the term_freq values for that term across the documents. The doc count is simply the number of documents in the response that contain the term.
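
As a rough client-side sketch of that idea, the following sums term_freq over the docs array of an _mtermvectors response. The function name aggregate_term_stats is made up for illustration, and the sample response is trimmed from the one posted above to just the fields the function reads.

```python
from collections import defaultdict


def aggregate_term_stats(response, field):
    """Derive per-term ttf and doc count from the documents in the response only."""
    ttf = defaultdict(int)
    doc_count = defaultdict(int)
    for doc in response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            ttf[term] += stats["term_freq"]  # occurrences within this document
            doc_count[term] += 1             # this document contains the term
    return {term: {"ttf": ttf[term], "doc_count": doc_count[term]} for term in ttf}


# Trimmed copy of the response for ids 1 and 2 from the question.
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"name": {"terms": {
            "bar": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            "hello": {"doc_freq": 3, "ttf": 3, "term_freq": 1},
        }}}},
        {"_id": "2", "term_vectors": {"name": {"terms": {
            "foo": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            "hello": {"doc_freq": 3, "ttf": 3, "term_freq": 1},
        }}}},
    ]
}

stats = aggregate_term_stats(response, "name")
print(stats["hello"])  # {'ttf': 2, 'doc_count': 2}
```

For the two documents of user1, hello comes out with ttf 2 and doc count 2 instead of the index-wide 3 and 3. term_statistics is not needed in the request for this, since only term_freq is used.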

