Elasticsearch: total term frequency and doc count from a given set of documents

I am trying to get the total term frequency and document count for a given set of documents, but _termvectors in Elasticsearch returns ttf and doc_count computed over all documents in the index. Is there a way to specify a list of documents (document ids) so that the result is based on those documents only?

Below are the index and document details, and the query I use to get the total term frequency.
Index details:

PUT /twitter
{
  "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    }
  }
}

Document Details:

PUT /twitter/tweets/1
{
  "name":"Hello bar"
}

PUT /twitter/tweets/2
{
  "name":"Hello foo"
}

PUT /twitter/tweets/3
{
  "name":"Hello foo bar"
}

This creates three documents with ids 1, 2 and 3. Now suppose the tweets with ids 1 and 2 belong to user1 and tweet 3 belongs to another user, and I want to get the term vectors for user1.

Query to get this result:

GET /twitter/tweets/_mtermvectors
{
  "ids" : ["1", "2"],
  "parameters": {
      "fields": ["name"],
      "term_statistics": true,
      "offsets":false,
      "payloads":false,
      "positions":false
  }
}

Response:

{
  "docs": [
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "bar": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    },
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "foo": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    }
  ]
}

Here we can see that hello has a doc_freq of 3 and a ttf of 3, and the field-level doc_count is 3, even though only two documents were requested. How can I make it consider only the documents with the given ids?

One approach I am considering is to create a separate index for each user, but I am not sure this is the right approach, because the number of indices would grow with the number of users. Is there another solution?

Please format your code using the </> icon, as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Thanks for the suggestion. I have updated the formatting accordingly.

Awesome. Thanks!

I don't have the answer to your question. Creating specific indices would indeed work. If it's a one-shot or small operation, you could imagine something like:

  • call the reindex API with the query you want, reindexing the few matching docs into a tmp_timestamp index, for example
  • call the termvectors API on that index
  • drop the tmp_timestamp index

But maybe @jimczi has a much better idea?
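The three steps above could be sketched roughly as follows. This is only an illustration that builds the request bodies without sending them. The helper name `build_temp_index_plan` and the `tmp_<timestamp>` naming are hypothetical, and it assumes the same typed (pre-7.x) endpoints used in the question. Note that reindex does not copy mappings, so the temporary index would need the same mapping as the source (here the english analyzer) created beforehand; the sketch leaves that out.

```python
import time


def build_temp_index_plan(source_index, doc_type, doc_ids, field, timestamp=None):
    """Return the (method, path, body) requests for the reindex workaround.

    Hypothetical helper: it only builds the requests, it does not send them.
    """
    ts = int(time.time()) if timestamp is None else timestamp
    tmp_index = f"tmp_{ts}"

    # Step 1: copy only the selected documents into the temporary index.
    reindex_body = {
        "source": {
            "index": source_index,
            "query": {"ids": {"values": list(doc_ids)}},
        },
        "dest": {"index": tmp_index},
    }

    # Step 2: same multi term vectors request as in the question,
    # but run against the temporary index.
    termvectors_body = {
        "ids": list(doc_ids),
        "parameters": {
            "fields": [field],
            "term_statistics": True,
            "offsets": False,
            "payloads": False,
            "positions": False,
        },
    }

    return [
        ("POST", "/_reindex?refresh=true", reindex_body),
        ("GET", f"/{tmp_index}/{doc_type}/_mtermvectors", termvectors_body),
        # Step 3: drop the temporary index once the statistics are read.
        ("DELETE", f"/{tmp_index}", None),
    ]


plan = build_temp_index_plan("twitter", "tweets", ["1", "2"], "name",
                             timestamp=1700000000)
for method, path, body in plan:
    print(method, path)
```

Because the temporary index holds only the selected documents, the ttf and doc_freq values returned in step 2 would be computed over that subset alone.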

Since the response contains the term_freq per term per document, it should be straightforward to derive the total term frequency for each term over just those documents: sum the term_freq of that term across the documents in the response. The doc count for a term is simply the number of documents in the response that contain it.
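A minimal sketch of that client-side aggregation, assuming the _mtermvectors response has already been parsed into a Python dict. The function name `aggregate_term_stats` is hypothetical, and the sample data is the response from the question trimmed to the fields that matter:

```python
from collections import defaultdict


def aggregate_term_stats(response, field):
    """Compute ttf and doc_freq per term over only the docs in the response."""
    ttf = defaultdict(int)
    doc_freq = defaultdict(int)
    for doc in response.get("docs", []):
        if not doc.get("found"):
            continue  # skip ids that did not resolve to a document
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            ttf[term] += stats["term_freq"]  # occurrences within this doc
            doc_freq[term] += 1              # this doc contains the term
    return {t: {"ttf": ttf[t], "doc_freq": doc_freq[t]} for t in ttf}


# Response from the question, reduced to the relevant fields.
sample_response = {
    "docs": [
        {"_id": "1", "found": True, "term_vectors": {"name": {"terms": {
            "bar": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            "hello": {"doc_freq": 3, "ttf": 3, "term_freq": 1},
        }}}},
        {"_id": "2", "found": True, "term_vectors": {"name": {"terms": {
            "foo": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            "hello": {"doc_freq": 3, "ttf": 3, "term_freq": 1},
        }}}},
    ]
}

stats = aggregate_term_stats(sample_response, "name")
print(stats["hello"])  # {'ttf': 2, 'doc_freq': 2}
```

For documents 1 and 2, hello comes out with a ttf of 2 and a doc_freq of 2, instead of the index-wide value of 3 reported by the API.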
