Hi all, first time using ES Discuss. (I also posted the same question to the ES Slack, as I don't know where this community mostly hangs out.)
At the moment I'm using trial ES Serverless resources and the Python API. I used to know Lucene pretty well, and I'm trying to do things that used to be straightforward, e.g.: How can I get the full list of keywords/tokens in an index for the fields text and desc, and iterate over them to get document counts? What if it is a named entity recognition index with keywords stored in ner_desc.entities.entity? Thanks for any hints, e.g. pointers to how (or whether) ES exposes lower-level Lucene features.
Great question - I want to push back on an assumption I think you've made.
Elasticsearch is not a wrapper around Lucene, in the sense of taking Lucene and putting a REST interface on top of it. Elasticsearch is implemented using Lucene. There are many, many aspects of Elasticsearch that extend, complement, or replace functionality provided by Lucene. As such, Lucene doesn't really 'exist' as a separate standalone 'thing' within Elasticsearch. So there are no features of Lucene that can be exposed separately from what Elasticsearch exposes - everything is done by Elasticsearch, using many areas of Lucene functionality to do so, but not as Elasticsearch-wrapping-Lucene.
Some Lucene concepts don't make sense in Elasticsearch due to how we've implemented something. Some other things map pretty well onto underlying Lucene functionality. And some are entirely Elasticsearch-only, not really using Lucene. You can't really split them up from outside Elasticsearch.
Simon, thanks very much for your thoughtful response. The more I work with ES, the more I appreciate all the things it is trying to do, far beyond Lucene search. It would help to have some pointers to how functionality is divided up, along the lines you describe:
Some Lucene concepts don't make sense in Elasticsearch due to how we've implemented something.
Some other things map pretty well onto underlying Lucene functionality.
And some are entirely Elasticsearch-only,
I'd really appreciate pointers to design docs that let me know which is which.
Here's a simple example: I was looking for access to what I recall Lucene calling the Vocabulary (no access to the Lucene docs this minute): a full enumeration of all assigned keywords. I wound up writing a script that accomplishes this, but I wonder if I did it right.
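For reference, one way such a script could look is a paged composite aggregation, which walks every distinct value of a field together with its document count. This is only a minimal sketch, not necessarily what I ended up with: it assumes the field is a keyword field (or has a keyword sub-field; the name desc.keyword below is a guess at the mapping), and the names composite_body, iter_vocabulary and the aggregation name vocab are made up for illustration. The search argument is any callable that takes a request body and returns the response as a dict, so the paging logic does not depend on a particular client version.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple


def composite_body(field: str, after: Optional[Dict[str, Any]] = None,
                   page_size: int = 1000) -> Dict[str, Any]:
    """Build a search body for one page of a composite aggregation over `field`."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"term": {"terms": {"field": field}}}],
    }
    if after is not None:
        # Resume after the last bucket key of the previous page.
        composite["after"] = after
    # "size": 0 suppresses document hits; only the aggregation is returned.
    return {"size": 0, "aggs": {"vocab": {"composite": composite}}}


def iter_vocabulary(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                    field: str, page_size: int = 1000) -> Iterator[Tuple[Any, int]]:
    """Yield (value, doc_count) for every distinct value of `field`."""
    after: Optional[Dict[str, Any]] = None
    while True:
        resp = search(composite_body(field, after, page_size))
        agg = resp["aggregations"]["vocab"]
        for bucket in agg["buckets"]:
            yield bucket["key"]["term"], bucket["doc_count"]
        after = agg.get("after_key")
        # Stop when there is no key to resume from or the page was empty.
        if after is None or not agg["buckets"]:
            break
```

With the official Python client, the callable could be something like `lambda body: es.search(index="my-index", **body)` (the index name is a placeholder, and the exact keyword arguments may differ by client version). Two caveats under the same assumptions: on an analyzed text field, aggregations see the keyword sub-field values rather than individual tokens, and if ner_desc.entities is mapped as a nested type, the composite aggregation would need to be wrapped in a nested aggregation on that path.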