Creating a browse interface from ES


(Robin Sheat) #1

I am building a system that uses Elasticsearch to store and retrieve
library catalogue data. One thing I've been asked for is a browse interface.

Here's a definition of what this is:

  • The user does a search, for example "Author starts with" and they
    supply "Smith"
  • The system puts them into the middle of a list of authors, at or near
    the position of the first one that starts with "Smith", so they might see:
    • Smart, Murray
    • Smart, Murray J.
    • Smeaton, Duncan
    • Smieliauskas, Wally
    • Smillie, John
    • Smith Milway, Katie
    • Smith, A. M. C.
    • Smith, Andrew
    • Smith, Andrew M. C.
    • etc.
  • These will be paged, with about 20 results per page. If the user
    pages back, they head towards the start of the alphabet; if they page
    forwards, they go onward.
  • Each result shown will have a count beside it showing how many results
    (i.e. catalogue items) are associated with that author.
  • Clicking on a result takes you to everything by that author (this and
    everything beyond it is fairly easy and mostly implemented already.)

I'm wondering if anyone has any good ideas on how to approach this. At this
stage, I don't care too much about handling searches that aren't "field
starts with" searches, as exactly how that will be done is currently up in
the air and I'll deal with it when the time comes.

Here's what I'm thinking, but there are issues with it:

  • All the fields that are going to be browsed are faceted
  • I get a list of all the facets for that field, search through it to
    find the starting point, and handle the paging manually in code.
    • This has the big problem that I might be fetching hundreds of
      thousands of terms and processing them, which won't be quick.
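
As a rough illustration of the client-side approach described above (not a recommendation), the sketch below assumes the full facet list has already been fetched and sorted into (term, count) pairs. It finds the landing point with a binary search and slices pages around it. The function name `browse_page`, the `lead_in` parameter and the sample counts are made up for the example.

```python
from bisect import bisect_left

# Hypothetical facet output: (term, count) pairs, already sorted by term.
# The counts are invented for illustration.
FACETS = [
    ("Smart, Murray", 3),
    ("Smart, Murray J.", 1),
    ("Smeaton, Duncan", 2),
    ("Smieliauskas, Wally", 1),
    ("Smillie, John", 4),
    ("Smith Milway, Katie", 2),
    ("Smith, A. M. C.", 1),
    ("Smith, Andrew", 5),
    ("Smith, Andrew M. C.", 1),
]


def browse_page(facets, prefix, page=0, page_size=20, lead_in=5):
    """Return one page of (term, count) pairs around the first term >= prefix.

    Page 0 is the landing page, negative pages move towards the start of
    the alphabet, positive pages move onward. lead_in is the number of
    earlier terms shown before the landing point on page 0.
    """
    terms = [term for term, _ in facets]
    anchor = bisect_left(terms, prefix)  # first term sorting at or after prefix
    start = anchor - lead_in + page * page_size
    end = start + page_size
    # Clamp so that paging past the start yields a short or empty page.
    return facets[max(start, 0):max(end, 0)]


if __name__ == "__main__":
    for term, count in browse_page(FACETS, "Smith", page_size=4, lead_in=2):
        print(f"{term} ({count})")
```

The lookup itself is cheap; the cost noted above is in fetching and holding every term first. The comparison here is plain Python string ordering, which a real catalogue would probably need to replace with a proper collation.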

I'm open to any options here, whether I can somehow jump into the middle of
a large set of facets like the query "from" field, or if I should instead
put everything into another index specifically for this purpose (though I
don't know how I'd structure and query it), or something else.

From what I can see, my ideal solution would be that I can specify the
facet field, tell ES that I want to start at the one that starts with
"Smith", and it displays from around there, then I have the ability to say
"go 20 back", but I'm not sure that this is possible.

You can see an example of the sort of thing I'm talking about in action
here: http://hollisclassic.harvard.edu/ - put in Smith as "Author (last
name first)", and it gives you a (terribly ugly looking) browse list.

Any thoughts?

Thanks, Robin.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/606138e0-0989-4106-9073-c3fdc4b0b46e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

Welcome to the show :)

I also build library catalogs on Elasticsearch professionally.

Some time ago I wrote a Perl Dancer starter app just to show what very basic
features like a hit list and facets look like.

The browsing UI you mean is a traditional instrument librarians are used to
when they want machines to sort and not to search. Here, it is called
"register search" but in fact it is a sorted list. Besides this being a
traditional approach that is not the most modern form of search, my
experience is that librarians do not understand the difference between
search and sort - they learned in school that they must page manually through
all results, and that all relevance ranking comes from the devil himself.

Nevertheless, you can build sorted lists in ES but with a small trick - an
extra index.

The author list can be implemented with ES "sort" in the search action,
using "from" and "size" to page through results. You have to index the author
names in an author name index (or better, an authoritative name index).
There, you should index two forms of the author field, one for search
(tokenized) and another unique form to sort on with the keyword analyzer so
it's unchanged. You have to take care to use the "preferred author name" of
the "main entry" for sort, which is determined by library catalog rules and
is not necessarily the form in which the author name is written. Variant
names of an author should be kept aside, just for search. So you should index
all author names into one authority index for this purpose, where each author
is represented by a single document. This should be easy since librarians
are used to marking all author names with a unique key (at least they carry
the biographical dates with them).
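
A minimal sketch of what such an authority index and a sorted page request might look like, written as plain Python dicts that could be sent as request bodies. The field names `name`, `name.sort` and `variants` are assumptions for illustration, and the mapping uses the syntax of recent Elasticsearch versions (`text` with a `keyword` sub-field); on the 1.x releases current at the time of this thread, the rough equivalent would be a `string` field with a `not_analyzed` sub-field.

```python
# Hypothetical mapping for an author authority index: one document per
# author, with the preferred name indexed twice.
AUTHOR_MAPPING = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",  # tokenized form, used for searching
                "fields": {
                    "sort": {"type": "keyword"},  # unchanged form, used for sorting
                },
            },
            # Variant names are kept aside and only searched, never sorted on.
            "variants": {"type": "text"},
        }
    }
}


def sorted_page(page, page_size=20):
    """Build a search body for one page of the alphabetical author list."""
    if page < 0:
        raise ValueError("page must be >= 0")
    return {
        "query": {"match_all": {}},
        "sort": [{"name.sort": "asc"}],
        "from": page * page_size,
        "size": page_size,
    }


if __name__ == "__main__":
    print(sorted_page(2))
```

Note that this pages from the start of the list only; it does not by itself say where "Smith" sits in the sorted order.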

Besides author names, you'd have to deal with corporate names, conference
names, and subject names.

I do not fully understand how you intend to process facet results. This
seems a bit overengineered. ES cannot page through facet (aggregation)
results efficiently. As you already noted, this would have to be a task for
the client application, and for millions of entries it would be a very
sluggish experience, and the required heap resources would be far from
reasonable.

The "field starts with" is also easy to implement; look at the
"match_phrase_prefix" query. Here is a blog post by Zachary Tong:
http://www.elasticsearch.org/blog/starts-with-phrase-matching/

Jörg
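
For reference, the "match_phrase_prefix" suggestion above could be expressed roughly as the request body below. This is only a sketch: the helper name `starts_with_query` and the default field `name` are assumptions, not something taken from the blog post.

```python
def starts_with_query(text, field="name", size=20, max_expansions=50):
    """Build a search body that matches values containing a phrase whose
    last word starts with the last word of `text`."""
    return {
        "query": {
            "match_phrase_prefix": {
                field: {
                    "query": text,
                    # Upper bound on how many terms the final prefix expands to.
                    "max_expansions": max_expansions,
                }
            }
        },
        "size": size,
    }


if __name__ == "__main__":
    print(starts_with_query("Smith"))
```

A query like this restricts the hits to matching entries, so it returns a subset of the index and not a position within the full sorted list.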



(Robin Sheat) #3

On Wednesday 11 June 2014 20:11:39 UTC+12, Jörg Prante wrote:

The browsing UI you mean is a traditional instrument librarians are used
to when they want machines to sort and not to search. Here, it is called
"register search" but in fact it is a sorted list. Beside being a
traditional approach which is not the most modern form of search, I made
the experience that librarians do not understand the difference between
search and sort - they learned in school they must page manually through
all results, and all relevance ranking comes from the devil himself.

That sounds like it. I know that it's a thing very few people would want,
however the people who are giving us money to do this would like it, so
that makes it into an important thing :) I have the search aspects mostly
built already, but need this browse support.

Nevertheless, you can build sorted lists in ES but with a small trick - an
extra index.

The author list can be implemented with ES "sort" in the search action,
using "from" and "size" to page through results. You have to index the author
names in an author name index (or better, an authoritative name index).
There, you should index two forms of the author field, one for search
(tokenized) and another unique form to sort on with the keyword analyzer so
it's unchanged. You have to take care to use the "preferred author name" of
the "main entry" for sort, which is determined by library catalog rules and
not necessarily the form how the author name is written.

That's close, but a bit different from what I want. If I have an 'author'
index, and I search for things starting with 'Smith', sorting A->Z, I want
to be able to page back, and get the results that are closer to the start
of the alphabet. That is to say, it should tell me that "Smith" is the 524th
(e.g.) author entry across the whole index when sorted, then I can set up
my results page so the user can page backwards. Or, if I could do a
startswith search, and have a negative "from" so it looks backwards in the
results...

But that still won't work, as startswith doesn't give me a place in the
index, it gives me a subset of the index restricted by the query.

Essentially, I think that this would give me an authority searcher, when I
want an authority browser. I could browse through it starting at 'A', but
I really need to be able to jump to a point in the middle and go
backwards/forwards from there.

Variant names of an author should be kept aside, just for search. So you
should index all author names into one authority index for this purpose,
where each author is represented by a single document. This should be easy
since librarians are used to mark all author names by a unique key (at least
they carry the biographical dates with them)

Besides author names, you'd have to deal with corporate names, conference
names, and subject names.

Those are implementation details that I'll work out when I have something
working. Believe me, I've spent a lot of time working with MARC data, I
understand all the weird ways that it does things :)

I do not fully understand how you intend to process facet results. This
seems a bit overengineered. ES cannot page through facet (aggregation)
results efficiently. As you already noted, this would have to be a task for
the client application, and for millions of entries it would be a very
sluggish experience, and the required heap resources would be far from
reasonable.

Well, it's the only way I could think of that would function at all. But
it's far from an ideal solution, in that it's barely going to work at all.

The "field starts with" is also easy to implement, look at the
"match_phrase_prefix" query, here is a blog post by Zachary Tong

http://www.elasticsearch.org/blog/starts-with-phrase-matching/

Yeah, I have that working already, but it's not quite what I want.



(Jörg Prante) #4

What about this:

  • build author name index

  • page size is static (e.g. 20)

  • absolute position: you must index each author name with absolute position
    info (sort author names before indexing, use a counter and increment it
    while indexing)

  • sort asc/desc works on author's name keyword analyzed field

  • jump function: execute constant_score query, with an optional filtered
    query of prefix query on author name keyword analyzed field (search for 'A'
    jumps to author names with 'A', search for 'B' jumps to 'B' etc.)

  • search function is trivial

  • relative move: paging back and forth through the result is done by using
    the absolute position info from the hits and the 'from' / 'size' ES
    parameters, ignoring the filtered query (since this is used just for
    jumping)
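
The steps above can be sketched with a small in-memory model. This is one reading of the scheme, not the actual "register search" implementation: the list stands in for the author name index, `jump` stands in for a prefix query sorted ascending with size 1, and all function names are made up for the example.

```python
def build_index(names):
    """Sort the names and attach an absolute position to each one,
    as would be done with a counter while indexing."""
    return [{"name": n, "position": i} for i, n in enumerate(sorted(names))]


def jump(index, prefix):
    """Absolute position of the first entry starting with prefix, or None.
    In ES this would be the prefix query, sorted on the name, size 1."""
    return next(
        (doc["position"] for doc in index if doc["name"].startswith(prefix)),
        None,
    )


def page_window(position, page_size=20, page_offset=0):
    """'from'/'size' for the fixed-size page containing `position`,
    moved back or forth by `page_offset` pages. No prefix filter is involved."""
    page = max(position // page_size + page_offset, 0)
    return {"from": page * page_size, "size": page_size}


def fetch(index, window):
    """Stand-in for a match_all search sorted on the name with from/size."""
    return index[window["from"]:window["from"] + window["size"]]


if __name__ == "__main__":
    idx = build_index([
        "Smith, Andrew", "Smart, Murray", "Smillie, John", "Smeaton, Duncan",
        "Smith Milway, Katie", "Smart, Murray J.", "Smieliauskas, Wally",
        "Smith, A. M. C.", "Smith, Andrew M. C.",
    ])
    pos = jump(idx, "Smith")
    for doc in fetch(idx, page_window(pos, page_size=4)):
        print(doc["position"], doc["name"])
```

Because the page size is static, the landing page is simply the page that contains the matched position, and paging back or forth only changes "from".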

This is how I implement "register search"

Jörg



(Robin Sheat) #5

joergprante@gmail.com wrote on Mon 16-06-2014 at 13:12 [+0200]:

This is how I implement "register search"

This is interesting. It could work for me.

Though, I'm not sure I totally understand it. To find, say "Smith", I'd
search for it, get its index, and then use the from/size stuff to bring
up the list in that area. Is that essentially what you're using?

If so, that seems like what I need. The only issue is that it'll require
a total reindex every time something is added. But, I don't see a way
around that even with some other ideas I'm exploring.

--
Robin Sheat
Catalyst IT Ltd.
✆ +64 4 803 2204
GPG: 5FA7 4B49 1E4D CAA4 4C38 8505 77F5 B724 F871 3BDF



(Jörg Prante) #6

If you can use the sort key of the term (internal Java collation key or ICU
collation key) instead of the absolute position number, there is no longer
any need to reindex. One advantage is that you can adjust the sort key to
your requirements (in Germany we have complex sort requirements that are not
compatible with Unicode canonical sort order).
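
One possible way to use a sort key for browsing, sketched here as an assumption and not as the method described in this thread: store the key in its own field and fetch the neighbours of the user's input with two range queries, one ascending from the key and one descending below it. The field name `sort_key` is hypothetical, and the `sort_key` function is only a crude stand-in for a real Java or ICU collation key.

```python
import unicodedata


def sort_key(name):
    """Crude stand-in for a collation key: strip accents and fold case.
    A real setup would use a proper collator with the required rules."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def page_forward(key, page_size=20):
    """Entries whose key sorts at or after `key`, in ascending order."""
    return {
        "query": {"range": {"sort_key": {"gte": key}}},
        "sort": [{"sort_key": "asc"}],
        "size": page_size,
    }


def page_backward(key, page_size=20):
    """Entries whose key sorts before `key`, nearest first.
    The hits come back in descending order and need reversing for display."""
    return {
        "query": {"range": {"sort_key": {"lt": key}}},
        "sort": [{"sort_key": "desc"}],
        "size": page_size,
    }


if __name__ == "__main__":
    key = sort_key("Smith")
    print(page_backward(key, page_size=5))
    print(page_forward(key, page_size=15))
```

For later pages, the last (or first) key of the current page would take the place of the user's input, with "gt" in place of "gte" when moving forward so that the boundary entry is not repeated.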

One remaining challenge is creating the frequency count per term. In "register
search" each term in the result list should be paired with an occurrence
count (or even a prefix count). This can be achieved by iterating over the
result page (e.g. 20 entries) and executing a count query over the term (or
using a prefix query for the count). The count API

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-count.html

is very fast since no docs are returned.
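
A small sketch of the per-page counting loop described above. It only builds the count request bodies and pairs each term with whatever number a caller-supplied `count` function returns; that function, the helper names and the field name `author.sort` are assumptions for illustration (in practice `count` would wrap a call to the count API against the catalogue index).

```python
def count_bodies(page_terms, field="author.sort", prefix=False):
    """One count request body per term on the current result page.
    With prefix=True a prefix query is built, giving a prefix count."""
    kind = "prefix" if prefix else "term"
    return [(term, {"query": {kind: {field: term}}}) for term in page_terms]


def attach_counts(page_terms, count, field="author.sort", prefix=False):
    """Pair each term with its count. `count` is any callable that takes
    a request body and returns an integer."""
    return [
        (term, count(body))
        for term, body in count_bodies(page_terms, field, prefix)
    ]


if __name__ == "__main__":
    # Fake catalogue and counter so the sketch runs without a cluster.
    records = ["Smith, Andrew", "Smith, Andrew", "Smillie, John"]

    def fake_count(body):
        kind, clause = next(iter(body["query"].items()))
        _, value = next(iter(clause.items()))
        if kind == "term":
            return sum(r == value for r in records)
        return sum(r.startswith(value) for r in records)

    print(attach_counts(["Smillie, John", "Smith, Andrew"], fake_count))
```

With a static page size of 20 this means 20 small count requests per page shown.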

Sure, there are facets/aggregations but they have a slight disadvantage:
they do not return exact counts, only an estimated count. For "register
search" you need absolutely exact counts.

Jörg



(Robin Sheat) #7

On Tuesday 17 June 2014 19:28:18 UTC+12, Jörg Prante wrote:

If you can use the sort key of the term (internal Java collation key or
ICU collation key) instead of absolute position number, there is no longer
the need to reindex. One advantage is that you can adjust the sort key to
the requirements (in Germany we have complex sort requirements that are not
compatible with Unicode canonical sort order).

That has the potential to be useful.

The one thing I'm a bit stuck on is that if the user searches for "Smith",
and I don't have a numerical order (i.e. a precomputed index based on the
sorted data), then how can I tell ES to give me the records that precede
and follow "Smith" (such as the "Smillie" example shown above)?


