String prefix grouping in chunks of similar size


(ibotty) #1

hi,

i'm not very proficient in elasticsearch (yet), so please bear with me if
this is a stupid question. in case you haven't noticed, i'm not a native
speaker either.

say i have many documents, each containing some space-separated words (see
the script below for unrealistic test data).

i'd like to get an overview of all words in all documents. as far as i
understand elasticsearch, that means i have to use facets.

using the terms facet i can get all words and filter them with regular
expressions. e.g.:

curl -sXGET 'localhost:9200/test/test/_search?pretty&search_type=count' -d '
{
  "facets": {
    "terms": {
      "terms": {
        "field": "sentence",
        "regex": "^(elastic|search)",
        "all_terms": true
      }
    }
  }
}'

btw: is this the recommended way? it smells inefficient to me (for this
small index it is pretty fast though).

i can get the count of words starting with "elastic" using the query facet:
{
  "facets": {
    "query": {
      "query": {
        "prefix": {"sentence": "elastic"}
      }
    }
  }
}

but i have not figured out how to get a quantile-like view out of it, e.g.
about 100 words in the range AA-BC, then about 100 words in the range
BC-HG, and so on.
i.e. something you might know from the index cards of a library catalogue.

with some frontend work i could get this using the terms facet above, but i
would need to fetch all terms and loop through them. this could be
short-circuited if i could get (say) the 100th term in the list without
the ones before it.
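
a minimal sketch of that client-side loop, assuming the terms have already
been fetched, sorted and deduplicated, one per line (for example extracted
from the terms facet response). the helper name chunk_ranges and its output
format are made up for illustration; they are not part of elasticsearch.

```shell
#!/bin/sh
# chunk_ranges N (hypothetical helper): read sorted, unique terms from stdin,
# one per line, and print one line per chunk of N terms in the form
# "<first term>-<last term> <number of terms in the chunk>".
chunk_ranges() {
    awk -v n="$1" '
        (NR - 1) % n == 0 { first = $0 }    # a new chunk starts on this line
        { last = $0 }                       # remember the most recent term
        NR % n == 0 { printf "%s-%s %d\n", first, last, n }    # chunk is full
        END { if (NR % n) printf "%s-%s %d\n", first, last, NR % n }    # partial last chunk
    '
}
```

with a sorted word list in a file (words.txt is a placeholder name),
something like `LC_ALL=C sort -u words.txt | chunk_ranges 100` would print
one range per 100 terms. this still transfers every term to the client, so
it only illustrates the grouping and does not avoid the full loop.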

thank you in advance,
tobias florek

test_data.sh

#!/bin/sh
# start from an empty index
curl -s -XDELETE localhost:9200/test > /dev/null

bulkfile=$(mktemp)

# put 3 words from a large dictionary in each document of the test index;
# the counter n only counts matching lines, so every document gets 3 words
awk '/^[a-z]+$/ {printf("%s%s", $0, (++n % 3 ? " " : "\n"))}' /usr/share/dict/words |
while read -r words;
do
    echo '{"index": {"_index": "test", "_type": "test"}}' >> "$bulkfile"
    printf '{"sentence": "%s"}\n' "$words" >> "$bulkfile"
done

# index all documents with a single bulk request, then clean up
curl -s -XPOST localhost:9200/_bulk --data-binary "@${bulkfile}" > /dev/null
rm "$bulkfile"

