Not sure what to call it, but it is probably a new (and cool!) feature request?

Hi,

I am wondering whether Elasticsearch can support the following scenario
out of the box and, if not, whether a new feature could be implemented to
support it.

In my case I have various document types (mails, blogs, IRC logs, etc.).
Each document has an author, but in reality one author (for example "Lukas
Vlcek") can use various nicks across the whole document corpus; for example,
he uses the nick "Lukas" for mails and the nicks "luk1" and "luk2" for IRC
logs. Now, I would like to provide name-consolidated filtering and querying
capabilities via the search UI.

In other words, if the user selects author: "Lukas Vlcek", then

  1. search results would contain mail results for "Lukas" and IRC log
    results for both "luk1" and "luk2"
  2. facets would aggregate "Lukas", "luk1" and "luk2" under a single "Lukas
    Vlcek" item

As of now it would probably be possible to work around this somehow, but not
very generally, and it would probably require frequent reindexing with every
change in the nicks of a particular user. Also, I do not think this is search
grouping (it is not about deduplication, it is more like synonyms, but
without the need to reindex). And AFAIK parent/child has somewhat limited
query capabilities (think of complex facets and filtering, custom scoring and
things like that). Nested types would require expensive reindexing.

Maybe my idea is too naive, but I think it shouldn't be that hard to have
direct support for something like that in ES, given that the set of "nicks"
data is not too large.

  1. Let's have a separate index that defines the author-nicks relation.
    It would contain documents like { author: "Lukas Vlcek", nicks: [ "Lukas",
    "luk1", "luk2" ] }
  2. Have this index automatically replicated to all nodes (or at least to
    those nodes that hold shards with data that needs to be queried when
    doing the searches described above)
  3. Then, when doing a search, ES could expand the author field values for
    the search (something like real-time synonyms) and also use the index
    for facet aggregations (this could probably be the expensive part,
    depending on the size of the data).
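
To make the three steps a bit more concrete, here is a minimal Python sketch of the lookup side only. It assumes the alias documents have the { author, nicks } shape from step 1; the names build_lookups and expand_author are made up for illustration and are not part of any Elasticsearch API.

```python
from typing import Dict, Iterable, List, Tuple

# Alias documents in the shape proposed in step 1 (sample data for illustration).
ALIAS_DOCS = [
    {"author": "Lukas Vlcek", "nicks": ["Lukas", "luk1", "luk2"]},
]


def build_lookups(alias_docs: Iterable[dict]) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Return (author -> nicks, nick -> author) tables built from alias documents.

    If two authors share a nick, the later document wins; a real
    implementation would have to decide how to handle that case.
    """
    author_to_nicks: Dict[str, List[str]] = {}
    nick_to_author: Dict[str, str] = {}
    for doc in alias_docs:
        nicks = list(doc["nicks"])
        author_to_nicks[doc["author"]] = nicks
        for nick in nicks:
            nick_to_author[nick] = doc["author"]
    return author_to_nicks, nick_to_author


def expand_author(author: str, author_to_nicks: Dict[str, List[str]]) -> List[str]:
    """Query-time expansion (step 3): the nicks to search for when an author is selected.

    An author with no alias document is searched under the name as given.
    """
    return author_to_nicks.get(author, [author])


author_to_nicks, nick_to_author = build_lookups(ALIAS_DOCS)
print(expand_author("Lukas Vlcek", author_to_nicks))
print(nick_to_author["luk2"])
```

The first table would serve the filtering case, the second one the facet case (mapping each nick back to its author).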

As a result I could keep the author-nicks relation in a separate
(hopefully small) index that could be updated at any time, and search
requests would take it into account (in "real-time" fashion), yielding
aggregated facets (nicks would be mapped to the author name) and search
results (where individual search hits would provide both the original nick
and the corresponding author name). Is that doable?

Comments/suggestions welcome.

Regards,
Lukas

Hi Lukas,

Yes, this would be a nice feature. And yes, this is even doable at the
moment, but not as optimally as you suggested (server-side fetching of
the filter-alias index replicated to all nodes).

But you can do this from the client side: you could create and query
that alias index and then use a terms query/filter to fetch documents
with authorX OR authorY OR ...

http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter.html

http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html
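
A rough Python sketch of that client-side approach: look up the nicks first, then build the terms clause. Only the {"terms": {field: [values]}} clause follows the docs linked above; the field name "author", the ALIASES table and the helper name are assumptions, and the actual HTTP call is left out.

```python
import json
from typing import Dict, List

# Result of querying the alias index first (sample data, assumed shape).
ALIASES: Dict[str, List[str]] = {"Lukas Vlcek": ["Lukas", "luk1", "luk2"]}


def terms_clause_for(author: str, field: str = "author") -> dict:
    """Build a terms clause matching any nick of the author (authorX OR authorY OR ...)."""
    nicks = ALIASES.get(author, [author])  # fall back to the name itself
    return {"terms": {field: nicks}}


print(json.dumps(terms_clause_for("Lukas Vlcek")))
```

The clause can then be used as the filter (or query) part of the real search request.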

BTW: the alias index filtering feature should also be in your
toolbox; it could even be used to solve part of this problem:

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

Peter.

Peter,

Using a terms query/filter would not really help me with the facets portion,
and that is important. I did not specifically mention this use case, but I
want the author-nicks mapping to work even if the user does not select any
author at all. So, for example, the user just searches for the "Lucene" token.
Imagine that we search the Lucene mailing lists (dev/users/announcements/etc.)
and, besides the top-scoring documents, I want to display a top-authors facet.
If one author uses several email addresses, then I would like to
consolidate the contributions from all individual email accounts of that
user under a single alias.
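
For comparison, this is roughly what consolidating such a "top authors" facet on the client would look like. It is a Python sketch under assumptions: the facet is taken to arrive as a list of {"term", "count"} entries, and NICK_TO_AUTHOR is a hypothetical in-memory copy of the nick mapping.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical nick -> author table (sample data).
NICK_TO_AUTHOR: Dict[str, str] = {
    "Lukas": "Lukas Vlcek",
    "luk1": "Lukas Vlcek",
    "luk2": "Lukas Vlcek",
}


def consolidate_facet(entries: List[dict]) -> List[Tuple[str, int]]:
    """Merge per-nick facet counts into per-author counts, highest count first."""
    totals: Counter = Counter()
    for entry in entries:
        # Nicks without a known author are kept under their own name.
        author = NICK_TO_AUTHOR.get(entry["term"], entry["term"])
        totals[author] += entry["count"]
    # Sort by count descending, then by name, so the order is deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


facet = [
    {"term": "mike", "count": 5},
    {"term": "luk1", "count": 4},
    {"term": "Lukas", "count": 3},
    {"term": "luk2", "count": 1},
]
print(consolidate_facet(facet))
```

The catch is that a facet only returns its top entries, so nicks that fall below the cut-off never reach the client and the merged counts can come out too low. Doing the merge inside ES would not have that problem.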

Generally speaking, I think (well ... hope) that such functionality is
doable in ES and I am sure people would find many exotic use cases for it.
Not just the one mentioned above (in fact the above problem can be solved
in many different ways but I think that if I could get it directly from ES
that would be really cool and it would save me a lot of work and
maintenance!).

Regards,
Lukas

On Thu, Dec 8, 2011 at 9:46 PM, Karussell tableyourtime@googlemail.com wrote:

Hi Lukas,

Yes, this would be a nice feature. And yes this is even doable at the
moment but not that optimal as you suggested (server side fetching of
the filter-alias-index replicated to all nodes).

But you can do this from the client side: you could create and query
that alias index and then use a Terms-query/filter to fetch documents
with authorX OR authorY OR ...

http://www.elasticsearch.org/guide/reference/query-dsl/terms-filter.html

http://www.elasticsearch.org/guide/reference/query-dsl/terms-query.html

BTW: the alias index filtering feature should be also in your
toolbox + could even be used to solve a part of this problem:

http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

Peter.

Lukáš,

A very common use case for this type of feature would be indexing
documents that were decorated after NLP Named Entity Recognition analysis.
This is very popular for social network analysis, where information like
identity, location, group, role, intent, ... is parsed out of documents
and included as metadata. Typically these are decorated as pointers back
to a point of reference, together with some type of confidence score.
Named-entity recognition - Wikipedia

If one could use ES to efficiently store and query these types of
relationships, it would become an attractive sink for data coming out of
systems like Apache UIMA or GATE. I plan to add some features to my current
work that would benefit from this type of functionality, but not for 6-8
months.

I haven't considered it too much at this point, but I've always wanted
something like HBase's coprocessor functionality for ES, where you can do
pre/post-processing on inserts/updates/deletes/...

--Mike

On Thu, Dec 8, 2011 at 4:37 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Peter,

Using terms-query/filter would not really help me with the facets portion
and that is important. I did not specifically mention such use case but I
want the author-nicks mapping to work even if user does not select any
author at all. So for example user just search for "Lucene" token. And
imagine that we search Lucene mail lists (dev/users/announcements/...etc)
and besides top scoring documents I want to display top authors facet. And
if one author is using more email addresses then I would like to
consolidate contributions from all individual email accounts of particular
user under a single alias.

Generally speaking, I think (well ... hope) that such functionality is
doable in ES and I am sure people would find many exotic use cases for it.
Not just the one mentioned above (in fact the above problem can be solved
in many different ways but I think that if I could get it directly from ES
that would be really cool and it would save me a lot of work and
maintenance!).

Regards,
Lukas

Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.
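
For readers who have not met the term: in a map-side join the small table is loaded into memory on every worker, and each record of the large dataset is joined against it locally. A toy Python version, with made-up data and names (this is not Hadoop code):

```python
from typing import Dict, Iterable, Iterator

# Small side of the join: fits in memory, so every worker holds a full copy.
NICK_TO_AUTHOR: Dict[str, str] = {"Lukas": "Lukas Vlcek", "luk1": "Lukas Vlcek"}


def map_side_join(records: Iterable[dict], small: Dict[str, str]) -> Iterator[dict]:
    """Stream the large side and attach the matching value from the in-memory table."""
    for record in records:
        joined = dict(record)  # copy, so the input record is not modified
        joined["author"] = small.get(record["nick"])  # None when there is no match
        yield joined


big_side = [{"nick": "luk1", "msg": "hi"}, {"nick": "bob", "msg": "hello"}]
for row in map_side_join(big_side, NICK_TO_AUTHOR):
    print(row)
```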

The caveat, of course, is what counts as large. Self-limiting
is not always the best solution, since some would "exploit" the feature
and then bad-mouth the product when it does not work.

Cheers,

Ivan

On Thu, Dec 8, 2011 at 2:33 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Hi,

May be my idea is too naive but I think it shouldn't be that hard to have
direct support for something like that in ES given that the set of "nicks"
data is not too large.

Hi Ivan,

If it were possible to have this functionality as an ES-independent
plugin, then I would not worry about the bad-mouthing that much. They could
only blame the author of the plugin.

@Shay, do you think something like that is possible to implement as a
plugin?

Regards,
Lukas

On Fri, Dec 9, 2011 at 1:16 AM, Ivan Brusic ivan@brusic.com wrote:

Your solution is similar to using a map-side join in Hadoop. There is
no point going out to HDFS for data if it can simply be stored in
memory, especially if the data is used by every machine in the
cluster.

The caveat is of course what is considered large or not. Self-limiting
is not always the best solution since some would "exploit" the feature
and then bad mouth the product when it does not work.

Cheers,

Ivan

Hi Lukas,

see below

So for example user just search for "Lucene" token. And
imagine that we search Lucene mail lists (dev/users/announcements/...etc)
and besides top scoring documents I want to display top authors facet. And
if one author is using more email addresses then I would like to
consolidate contributions from all individual email accounts of particular
user under a single alias.

Ah, ok, this is indeed an additional requirement for the 'aliasing
index' I have in mind :)

But wouldn't it be somehow possible with a script while faceting?

Peter.

Why not index a "real-username" (or an ID) for every user, which is not
displayed, and only add the current alias as the displayable
name?

Or is this not possible?

On 9 Dec., 09:46, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi Ivan,

if it would be possible to have this functionality as an ES-independent
plugin, then I would not worry about bad mouths that much. They could only
blame the author of the plugin.

@Shay, do you thing something like that is possible to implement as a
plugin?

Regards,
Lukas

That is definitely possible, but you would still have to reindex all related
documents whenever you need to change things (for example when you learn
that you assigned a given nick to the wrong user name, or if you want to
change the name of the user). As I said, there are definitely many ways to
approach my use case, but having some sort of out-of-the-box support in ES
would be really great; such functionality would open the door to other crazy
experiments...
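
To illustrate the cost: with the user name stored on every document, any change in the mapping touches all documents of the affected nicks. A small Python sketch (hypothetical helper and sample data) that works out which nicks those are:

```python
from typing import Dict, Set


def nicks_to_reindex(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Nicks whose author changed, appeared or disappeared between two mappings.

    Every document written under one of these nicks would have to be reindexed
    if the author name is stored on the document itself.
    """
    return {nick for nick in old.keys() | new.keys() if old.get(nick) != new.get(nick)}


old_mapping = {"Lukas": "Lukas Vlcek", "luk1": "Lukas Vlcek", "luk2": "Someone Else"}
# luk2 had been assigned to the wrong user; luk3 is newly discovered.
new_mapping = {
    "Lukas": "Lukas Vlcek",
    "luk1": "Lukas Vlcek",
    "luk2": "Lukas Vlcek",
    "luk3": "Lukas Vlcek",
}
print(sorted(nicks_to_reindex(old_mapping, new_mapping)))
```

With a query-time mapping, the same change would only mean updating the small mapping itself.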

Regards,
Lukas

On Fri, Dec 9, 2011 at 11:08 AM, Karussell tableyourtime@googlemail.com wrote:

Why not have a "real-username" for all users indexed which is not
displayed (or an ID) and only add the current alias as displayable
name?

Or is this not possible?

Maybe I should have said that as of now I do not have any "username-nicks"
dataset. It will be built gradually over time, and I am looking
for a way to avoid reindexing the data with every update/change in this
relatively small "username-nicks" dataset.

On Fri, Dec 9, 2011 at 11:33 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

That is definitely possible but still you have to reindex all related
documents if you learn that you need to change things (for example when you
learn that you assigned given nick a wrong user name, or if you want to
change the name of the user). As I said there are definitely many ways how
to approach my use case but having some sort of out of box support in ES
would be really great, such functionality would open door for other crazy
experiments...

Regards,
Lukas

Yes, please open one or even two issues :)

I think one that makes a more generic server-side refetching possible
via scripting or similar **

and then one issue attacking "external aliased query handling and
facet aggregation"

Peter.

**
query: {
    some normal term query selecting some docs, e.g. friends or nicks
    of a user

    doAfterQuery: myQueryScript
}

In myQueryScript the resulting hits would be available, and one could then
construct a terms query JSON from docs[i].nick.

How to handle pagination? And could the generated query itself have
another doAfterQuery part, or return several queries?
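
Until something like that exists on the server, the same two-step flow can be emulated on the client. The Python sketch below is only that: doAfterQuery is the hypothetical hook from above, `search` stands in for whatever client call you use, and the hit layout ({"nick": ...}) is an assumption. It reads a single page of hits, so the pagination question stays open.

```python
from typing import Callable, List

Search = Callable[[dict], List[dict]]  # takes a query body, returns a list of hits


def query_then_refetch(search: Search, first_query: dict, field: str = "author") -> List[dict]:
    """Run first_query, build a terms query from the hits' nicks, and run that."""
    hits = search(first_query)
    # Collect distinct nicks, keeping first-seen order.
    nicks = list(dict.fromkeys(hit["nick"] for hit in hits))
    if not nicks:
        return []  # nothing to refetch
    return search({"terms": {field: nicks}})


def fake_search(query: dict) -> List[dict]:
    """Stand-in for a real search call, backed by two tiny in-memory 'indices'."""
    if "term" in query:  # first step: lookup in the alias data
        return [{"nick": "luk1"}, {"nick": "luk2"}, {"nick": "luk1"}]
    docs = [{"author": "luk1", "text": "a"}, {"author": "bob", "text": "b"}]
    return [d for d in docs if d["author"] in query["terms"]["author"]]


print(query_then_refetch(fake_search, {"term": {"author": "Lukas Vlcek"}}))
```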

Peter,

The doAfterQuery approach will not help with facets; by then it is too late.

What would be cool is some kind of integration with a distributed in-memory
datastore that could be consulted at any phase of query execution and score
calculation (not only after the query). And I am sure Shay has already
thought about this... but since such a feature is not available now, I am at
least looking (asking) for some intermediate step :)

Regards,
Lukas

On Fri, Dec 9, 2011 at 12:16 PM, Karussell tableyourtime@googlemail.com wrote:

Yes, please open one or even two issues :)

I think one that makes a more generic server-side refetching possible
via scripting or similar **

and then one issue attacking "external aliased query handling and
facet aggregation"

Peter.

**
query: {
some normal term query selecting some docs e.g. friends or nicks
of a user

doAfterQuery: myQueryScript

}

in myQueryScript the resulting hits are available and then one could
construct a terms query JSON from docs[i].nick

how to attack pagination? and could the generated query even have
another doAfterQuery part or return several queries?