I am wondering if elasticsearch can support me in the following scenario
out of the box and if not then whether a new feature can be implemented to
support it.
In my case I have various document types (mails, blogs, IRC logs, etc.).
Each document has an author, but in reality an author (for example "Lukas
Vlcek") can use various nicks across the whole document corpus; for example, he
uses the nick "Lukas" for mails and the nicks "luk1" and "luk2" for IRC logs. Now,
I would like to provide name-consolidated filtering and query
capabilities via the search UI.
In other words, if a user selects author: "Lukas Vlcek", then:
- search results would contain mail results for "Lukas" and IRC log
  results for both "luk1" and "luk2"
- facets would aggregate "Lukas", "luk1" and "luk2" under a single "Lukas
  Vlcek" item
As of now it would probably be possible to work around this somehow, but not
very generally, and it would probably require frequent reindexing with every
change in nicks for a particular user. Also, I do not think this is search
grouping (it is not about deduplication, it is more like synonyms, but
without the need to reindex). And AFAIK parent/child has somewhat limited query
capabilities (think of complex facets and filtering, custom scoring and
things like that). Nested types would require expensive reindexing.
Maybe my idea is too naive, but I think it shouldn't be that hard to have
direct support for something like that in ES, given that the set of "nicks"
data is not too large.
- Let's have a separate index that defines the author - nicks relation.
  It would contain documents like { author: "Lukas Vlcek", nicks: [ "Lukas",
  "luk1", "luk2" ]}
- Have this index be automatically replicated to all nodes (or at least to
  those nodes that contain shards with data that needs to be queried when
  doing the searches described above).
- Then, when doing a search, ES could expand the author field values for the
  search (that would be something like real-time synonyms) and also use the
  index for facet aggregations (this could probably be the expensive part,
  depending on the size of the data).
As a result, I could keep the author - nicks relation in a separate
(hopefully small) index that could be updated anytime, and search requests
would take it into account (in "real-time" fashion), yielding aggregated
facets (nicks would be mapped to the author name) and search results (where
individual search hits would provide both the original nick and the
corresponding author name). Is that doable?
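
To make the idea concrete, here is a minimal Python sketch of the two lookups such an author - nicks index would have to provide: author to nicks (for expanding a search) and nick to author (for labelling hits). It only models the data flow in memory and is not an Elasticsearch feature or API; the names ALIAS_DOCS, build_lookups, expand_author and annotate_hit are made up for illustration, and the hit layout (an "author" field holding the nick) is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the small author - nicks index.
ALIAS_DOCS = [
    {"author": "Lukas Vlcek", "nicks": ["Lukas", "luk1", "luk2"]},
]


def build_lookups(alias_docs: List[dict]) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Build both directions of the mapping from the alias documents."""
    author_to_nicks: Dict[str, List[str]] = {}
    nick_to_author: Dict[str, str] = {}
    for doc in alias_docs:
        author_to_nicks[doc["author"]] = list(doc["nicks"])
        for nick in doc["nicks"]:
            nick_to_author[nick] = doc["author"]
    return author_to_nicks, nick_to_author


def expand_author(author: str, author_to_nicks: Dict[str, List[str]]) -> List[str]:
    """Expand a selected author into the nicks to search for.

    An author without an entry is searched under the given name only.
    """
    return author_to_nicks.get(author, [author])


def annotate_hit(hit: dict, nick_to_author: Dict[str, str]) -> dict:
    """Return a copy of the hit carrying both the original nick and the author name."""
    annotated = dict(hit)
    annotated["author_name"] = nick_to_author.get(hit["author"], hit["author"])
    return annotated
```

Because the mapping lives outside the indexed documents, changing a nick assignment only means updating ALIAS_DOCS and rebuilding the two dictionaries; the mails and IRC logs themselves stay untouched.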
Yes, this would be a nice feature. And yes, this is even doable at the
moment, though not as optimally as you suggested (server-side fetching of
the filter-alias index replicated to all nodes).
But you can do this from the client side: you could create and query
that alias index and then use a Terms-query/filter to fetch documents
with authorX OR authorY OR ...
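
A rough sketch of that client-side approach in Python: look the author up in the alias data, then build the request body for a terms query on the nicks. It only constructs the JSON body and sends nothing. The field name "author" and the helper name terms_query_for_author are assumptions, and it presumes the author field is indexed so that the nicks match as exact terms.

```python
import json
from typing import Dict, List


def terms_query_for_author(
    author: str,
    author_to_nicks: Dict[str, List[str]],
    field: str = "author",  # assumed name of the field holding the nick
) -> dict:
    """Build a search body matching documents written under any nick of the author."""
    # Fall back to the plain name when the alias data has no entry.
    nicks = author_to_nicks.get(author, [author])
    return {"query": {"terms": {field: nicks}}}


if __name__ == "__main__":
    aliases = {"Lukas Vlcek": ["Lukas", "luk1", "luk2"]}
    print(json.dumps(terms_query_for_author("Lukas Vlcek", aliases), indent=2))
```

This costs one extra round trip to read the alias data (or a cached copy on the client) before every search that filters by author.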
Using a terms query/filter would not really help me with the facets portion,
and that is important. I did not specifically mention this use case, but I
want the author-nicks mapping to work even if the user does not select any
author at all. So, for example, the user just searches for the "Lucene" token.
Imagine that we search the Lucene mailing lists (dev/users/announcements/etc.)
and, besides the top-scoring documents, I want to display a top authors facet.
If one author is using several email addresses, then I would like to
consolidate the contributions from all individual email accounts of that
user under a single alias.
Generally speaking, I think (well ... hope) that such functionality is
doable in ES, and I am sure people would find many exotic use cases for it,
not just the one mentioned above. (In fact, the above problem can be solved
in many different ways, but if I could get it directly from ES
that would be really cool and it would save me a lot of work and
maintenance!)
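
For comparison, this is roughly what one of those "different ways" looks like when done on the client: take the per-nick counts returned by an ordinary author facet and roll them up under the consolidated name. It is only a sketch; consolidate_facet is a made-up helper and the input is assumed to be a plain nick-to-count dictionary.

```python
from collections import Counter
from typing import Dict, List, Tuple


def consolidate_facet(
    nick_counts: Dict[str, int],
    nick_to_author: Dict[str, str],
) -> List[Tuple[str, int]]:
    """Sum per-nick facet counts under the author each nick belongs to.

    Nicks without a mapping are kept under their own name.
    """
    totals: Counter = Counter()
    for nick, count in nick_counts.items():
        totals[nick_to_author.get(nick, nick)] += count
    # Highest count first, like a "top authors" facet.
    return totals.most_common()
```

The weak spot is that a facet normally returns only the top N entries, so nicks that fall below the cut-off never reach the client and the rolled-up totals can come out too low. That is the main reason I would prefer the mapping to be applied inside ES.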
A very common use case for this type of feature would be indexing
documents that were decorated after NLP named-entity recognition analysis.
This is very popular for social network analysis, where information like
identity, location, group, role, intent, ... is parsed out of documents
and included as metadata. Typically these are decorated as pointers back
to a point of reference, with some type of confidence score (see
"Named-entity recognition" on Wikipedia).
If one could use ES to efficiently store and query these types of
relationships, it would become an attractive sink for data out of systems
like Apache UIMA or GATE. I plan to add some features to my current work that
would benefit from this type of functionality, but not for 6-8 months.
I haven't considered it too much at this point, but I've always wanted
something like HBase's co-processor functionality for ES where you can do
pre/post processing on inserts/updates/deletes/...
Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.
The caveat is of course what is considered large or not. Self-limiting
is not always the best solution since some would "exploit" the feature
and then bad mouth the product when it does not work.
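
To illustrate the analogy, here is a toy Python sketch of the map-side join pattern applied to this case: every shard gets its own in-memory copy of the small nick-to-author table, maps each document locally while counting, and only the partial counts are merged. Shards are modelled as plain lists of dicts; shard_author_counts and merge_counts are made-up names, and nothing here is ES or Hadoop API.

```python
from collections import Counter
from typing import Dict, Iterable, List


def shard_author_counts(shard_docs: List[dict], nick_to_author: Dict[str, str]) -> Counter:
    """Count documents per consolidated author on one shard.

    The lookup table is local to the shard, so no remote call is made per document.
    """
    counts: Counter = Counter()
    for doc in shard_docs:
        counts[nick_to_author.get(doc["author"], doc["author"])] += 1
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Combine the per-shard partial counts into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total
```

Since the mapping is applied before counting, the totals are exact per author, at the price of keeping one copy of the table in memory on every node.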
If it were possible to have this functionality as an ES-independent
plugin, then I would not worry about bad-mouthing that much. They could only
blame the author of the plugin.
@Shay, do you think something like that is possible to implement as a
plugin?
Regards,
Lukas
On Fri, Dec 9, 2011 at 1:16 AM, Ivan Brusic ivan@brusic.com wrote:
Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.
The caveat is of course what is considered large or not. Self-limiting
is not always the best solution since some would "exploit" the feature
and then bad mouth the product when it does not work.
Ah, ok, this is indeed an additional requirement for the 'aliasing
index' I have in mind.
But wouldn't it be somehow possible with a script while faceting?
That is definitely possible, but you still have to reindex all related
documents when you need to change things (for example, when you
learn that you assigned a given nick to the wrong user name, or if you want to
change the name of the user). As I said, there are definitely many ways
to approach my use case, but having some sort of out-of-the-box support in ES
would be really great. Such functionality would open the door to other crazy
experiments...
Maybe I should have said that as of now I do not have any dataset of
"username - nicks". It will be built gradually over time, and I am looking
for some way to avoid reindexing the data with every update/change in this
relatively small "username-nicks" dataset.
The doAfterQuery approach will not help with facets. It is too late for that.
What would be cool is some kind of integration with a distributed in-memory
datastore that could be consulted at any phase of query execution and score
calculation (not only after the query). I am sure Shay has already thought
about this, but since such a feature is not available now, I am at least
looking (asking) for some intermediate step.