I have a use case where I want to find the most frequent shingles in a text corpus.
I made a first implementation some time ago in Elasticsearch 2.3.1. I would extract shingles from the text and index them in Elasticsearch, using an MD5 hash of the shingle as the document ID.
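For illustration, the indexing step can be sketched roughly as follows. This is a minimal sketch rather than my actual code: the whitespace tokenization, the shingle size of 2, and the helper names `shingles` and `shingle_id` are all assumptions made up for the example.

```python
import hashlib


def shingles(text, size=2):
    """Return word-level shingles of `size` words.

    Tokenization is a plain lowercase whitespace split (an assumption
    for this sketch; a real analyzer would do more).
    """
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def shingle_id(shingle):
    """Derive the document ID as the hex MD5 digest of the shingle text."""
    return hashlib.md5(shingle.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Each (ID, shingle) pair would be sent to Elasticsearch as one document.
    for s in shingles("rock and roll and rock and roll"):
        print(shingle_id(s), s)
```

Because the ID depends only on the shingle text, the same shingle always maps to the same document.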
Here's the mapping:
{
  "dynamic": "strict",
  "_all": {
    "enabled": false
  },
  "properties": {
    "assoOrigin": {
      "type": "keyword"
    },
    "assoShingle": {
      "type": "keyword",
      "fields": {
        "assoFirst": {
          "type": "text",
          "analyzer": "assoFirst"
        },
        "french": {
          "type": "text",
          "analyzer": "french"
        },
        "simple": {
          "type": "text",
          "analyzer": "simple"
        }
      }
    }
  }
}
Then I would query like this:
POST asso_index/asso/_search
{
  "version": true,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "dis_max": {
                "tie_breaker": 0,
                "queries": [
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "assoShingle.simple": "rock"
                          }
                        }
                      ],
                      "should": [
                        {
                          "match": {
                            "assoShingle.assoFirst": "rock"
                          }
                        }
                      ]
                    }
                  },
                  {
                    "bool": {
                      "must": [
                        {
                          "match": {
                            "assoShingle.french": "rock"
                          }
                        }
                      ],
                      "should": [
                        {
                          "match": {
                            "assoShingle.assoFirst": "rock"
                          }
                        }
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "_version"
          }
        }
      ]
    }
  }
}
This worked pretty well: since no document is ever deleted from the index, _version gives the number of times the same shingle was indexed, which is conveniently the number of occurrences of the shingle in the text.
However, after migrating to Elasticsearch 5.6.6, this no longer works, and I get the following error:
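To make the counting trick concrete, here is a toy model of it in plain Python. It does not talk to Elasticsearch; the `VersionedIndex` class is hypothetical and only mimics the behaviour I rely on, namely that indexing a document under an existing ID bumps its version by one, starting at 1.

```python
import hashlib


def shingle_id(shingle):
    """Hex MD5 digest of the shingle, used as the document ID."""
    return hashlib.md5(shingle.encode("utf-8")).hexdigest()


class VersionedIndex:
    """Toy stand-in for an index with internal versioning and no deletes."""

    def __init__(self):
        self.versions = {}

    def index(self, doc_id):
        # First write creates version 1; every overwrite increments it.
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1
        return self.versions[doc_id]


idx = VersionedIndex()
for s in ["rock and", "and roll", "rock and"]:
    idx.index(shingle_id(s))

# "rock and" was indexed twice, so its version equals its occurrence count.
print(idx.versions[shingle_id("rock and")])  # prints 2
```

This only holds as long as nothing is deleted or indexed for any other reason, which is the case in my setup.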
Fielddata is not supported on field [_version] of type [_version]
I don't see a good way to keep track of the number of occurrences of each shingle in ES, and since the text can be quite large, I'd rather avoid calculating it externally.
I'd appreciate advice on how to work around this issue.
Cheers,
Nicolas