Advice on how to implement a chat feature w/ Elasticsearch as back end

John_Rose · February 13, 2015, 8:05pm

Hi,

We're new to Elasticsearch but quite impressed with the tools & community.
Currently we're using it to build an in-house chat/collaboration solution.

In a nutshell: it's an in-house version of HipChat that's tightly integrated
with our existing business software.

Currently we're using ES as the back end of the system to store + retrieve
messages.

Our next challenge is to allow users (and the system) to search for things
like "@firstname lastname", "firstname lastname", tags like #hashtag or
file#1234, email addresses, links to files, URLs, employee names, and so
on. These things aren't possible with the standard tokenizer/analyzer. We
could simply map the "body" field as "not_analyzed" in the index but then
we'd lose most searchability.

A special wrinkle here is that username mentions may include spaces - like
"@Bob Smith." - as well as employee names like "Bob Smith" without the "@"
prefix.

We're considering several approaches here and could really use
feedback/critique!

Approach #1: Multiple mappings to the "body" field: one of which uses the
standard tokenizer/analyzer, and one of which is not_analyzed. We could use
the not_analyzed version to search on things like @mentions and #hashtags.
(This would nearly double our storage requirements, right?)
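For illustration, Approach #1 could be sketched as a multi-field mapping. This is only a sketch under assumptions: the type name "message" and the sub-field name "raw" are hypothetical, and the syntax is the 1.x-era "string" / "not_analyzed" style this thread uses. Python is used just to build and print the JSON.

```python
import json


def build_body_mapping():
    """Return a mapping where "body" is indexed twice.

    "body" itself goes through the standard analyzer; "body.raw" is a
    not_analyzed copy. Both "message" and "raw" are made-up names.
    """
    return {
        "message": {
            "properties": {
                "body": {
                    "type": "string",
                    "analyzer": "standard",
                    "fields": {
                        # Second copy of the same text, kept as one exact term.
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_body_mapping(), indent=2))
```

One thing to keep in mind: a not_analyzed field indexes the whole message as a single term, so finding "@bob" inside "body.raw" would likely need a wildcard or regexp query rather than a plain term match.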

Approach #2: Implementing a custom tokenizer that treats things like
@username, "@Bob Smith", "Bob Smith" and #hashtags as single tokens so that
we can search on them later. We'd base it on something like ES's email
tokenizer:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-uaxurlemail-tokenizer.html
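As a rough starting point for Approach #2, and not the custom tokenizer itself, here is a sketch that swaps in the built-in whitespace tokenizer plus the lowercase filter, which keeps "@" and "#" attached to their words. The analyzer name "chat_tokens" is hypothetical, and the helper below only approximates in plain Python what that analyzer would emit.

```python
import json

# Index settings for a custom analyzer built from built-in parts.
# "chat_tokens" is a made-up name for this example.
CHAT_ANALYSIS_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "chat_tokens": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}


def approximate_tokens(text):
    """Approximate the analyzer locally: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]


if __name__ == "__main__":
    print(json.dumps(CHAT_ANALYSIS_SETTINGS, indent=2))
    print(approximate_tokens("Ping @Bob about #Deploy and file#1234"))
    # -> ['ping', '@bob', 'about', '#deploy', 'and', 'file#1234']
```

This does not solve the multi-word case: "@Bob Smith." would still come out as "@bob" and "smith." (trailing period included), so names with spaces would need either a real custom tokenizer or something like Approach #3.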

Approach #3: In our application layer (a Rails application) we could parse
out @usernames before saving the record to Elasticsearch. We could then
save the @usernames in a separate ES array field. Same for #hashtags and so
forth. So in addition to the "body" field we'd have a field called
"mentions", a field called "hashtags", a field called "hyperlinks", and so
forth.
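To make Approach #3 concrete, here is a minimal sketch of the parsing step. It is written in Python for illustration even though the application is Rails, and it assumes the app can supply a list of known display names (the hypothetical known_names argument) so that mentions with spaces can be matched. The function name and regexes are made up for this example.

```python
import re

HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")  # "#tag", but not the "#" in "file#1234"
URL_RE = re.compile(r"https?://\S+")


def extract_entities(body, known_names):
    """Build the document to index: the raw body plus extracted array fields."""
    hyperlinks = URL_RE.findall(body)
    # Remove URLs first so a "#fragment" inside a link is not read as a hashtag.
    text = URL_RE.sub(" ", body)

    mentions = []
    remaining = text
    # Try longer names first so "@Bob Smith" is not also counted as "@Bob".
    for name in sorted(known_names, key=len, reverse=True):
        pattern = re.compile(r"@" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        if pattern.search(remaining):
            mentions.append(name)
            remaining = pattern.sub(" ", remaining)

    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]
    return {
        "body": body,
        "mentions": mentions,
        "hashtags": hashtags,
        "hyperlinks": hyperlinks,
    }


if __name__ == "__main__":
    doc = extract_entities(
        "Hey @Bob Smith. see #Deploy notes at http://example.com/a#top for file#1234",
        ["Bob", "Bob Smith"],
    )
    print(doc["mentions"], doc["hashtags"], doc["hyperlinks"])
```

The extracted fields could then be mapped as not_analyzed, so each mention or hashtag is matched as an exact term while "body" stays fully searchable. Matching plain "Bob Smith" without the "@" would be a variation on the same loop.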

How would you do it?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/929cd26b-c17b-4ea1-b9d5-cb415ea037bf%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

warkolm · February 16, 2015, 9:41am

Ultimately you have to make a compromise.
I'd go with #1 as it's simplest and disk is not that expensive, especially
when you are only really talking about a few of the fields in a doc.

warkolm · February 16, 2015, 8:17pm

I just realised I misread this one, sorry. Keeping a not_analyzed copy
of your body would indeed be expensive.
