GSOC| Elasticsearch: Speed Up Some Highlighting


(Sohaib) #1

Hi,

I am interested in contributing to Elasticsearch, and GSoC provides a nice platform to start. I am somewhat late in pursuing this as I have been fiddling with a variety of products. I hope that it is not already too late. I have some experience with Solr and Lucene along with a primarily Java/Scala background, so onboarding hopefully shouldn't be too hard. My current experience with Elasticsearch is still somewhat elementary. I have used it as an inverted index but never really gone into the depth of its features.

I have a few questions about the project that I'd like to list down:

  1. What exactly does the document mean by "more efficient when in a different order"? An example could perhaps make it clearer why that is faster. For instance, I was looking through the PlainHighlighter and, from my understanding, it initializes a Lucene Highlighter and then calls getBestTextFragments on it. I did not understand why the order would matter here. Maybe this is true for a different highlighter, in which case it would be awesome if you could point me to the code.
  2. I saw only three classes for highlighters: Plain, Fast Vector and Unified. The document mentioned four. It would be great to have a reference to the last one.
  3. I will continue to read the highlighters to understand them, and will also look for 'low hanging fruit' to fix, but I'd be glad for some expert advice that could jumpstart my understanding.

Thanks,
Sohaib


(xeraa) #2

@sohaib definitely not too late; the official application period is still a week away. This project is actually fully based on Lucene now, so a deeper understanding of Elasticsearch should not be required (and we are happy to teach you what you will need to know).

We just had our annual conference and are still busy with an internal event right after that. We'll all travel home on Tuesday, so let us get back to you later this week with better technical answers.


(Sohaib) #3

Hi guys,

Been a bit silent here. I hope you guys are back from the conference now and we can have a discussion. So I fiddled a bit with the code. Managed to put in a couple of PRs (1 and 2). I seem to have reached a bit of a wall on the first one as it seems to be a non-issue but let's see how it goes.

Maybe you can recommend a more highlighter-specific issue for me to fix to get more involved with that part of the code. For now, I have had a look at the tests for the highlighters, which seem to be a nice way to get started on their usage.

Looking forward to more details on this issue.

Thanks,
Sohaib


(xeraa) #4

Sorry about the delay! We should have all made it home now and can get back to business. @Igor_Motov and @nik9000 can give you a better answer than me.

And thanks for the PRs!


(Nik Everett) #5

Yeah! Thanks for starting to hack on the code base!

Sure! HighlightPhase highlights each document in the order that the documents are returned, highlighting each field you request in the order that you requested it.

But sometimes it is probably faster to highlight by first sorting the documents by which segment they are a part of. Within each segment sort by field, then highlight each field for that document. We think that this is the case when highlighting using a postings list because the postings list is organized around this access pattern. This is complicated for two reasons:

  1. Sometimes users specify an explicit ordering of fields and when they do we have to continue to respect it. When they don't we can order the fields in whatever order we think is most efficient.
  2. Benchmarking this is hard because the performance difference will only show up in sufficiently heavy situations. Luckily, we have a tool specifically for things like this. In particular, we'd need a big enough search index and we'd need a big set of queries to see all of the advantage of doing things this way.

The reason this order is supposed to be better is that it makes the disk access less random. This order is supposed to make the reads entirely "forward" on disk.
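To make the two orderings concrete, here is a minimal Java sketch. It is not Elasticsearch code: `Hit`, `Task`, `returnedOrder`, and `segmentOrder` are hypothetical names, the alphabetical field sort is only a placeholder for "whatever order we think is most efficient", and it takes one reading of the description above (segment first, then field, then docid within the segment).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HighlightOrderSketch {

    /** Hypothetical stand-in for a search hit: its segment and its docid within that segment. */
    record Hit(int segment, int segmentDocId) {}

    /** One unit of highlighting work: one field of one hit. */
    record Task(Hit hit, String field) {}

    /** Order described for HighlightPhase: hits as returned, fields as requested. */
    static List<Task> returnedOrder(List<Hit> hits, List<String> fields) {
        List<Task> tasks = new ArrayList<>();
        for (Hit hit : hits) {
            for (String field : fields) {
                tasks.add(new Task(hit, field));
            }
        }
        return tasks;
    }

    /**
     * Assumed alternative: group hits by segment, then walk one field at a time,
     * visiting docids in increasing order so that reads only move forward.
     */
    static List<Task> segmentOrder(List<Hit> hits, List<String> fields, boolean explicitFieldOrder) {
        List<Hit> sortedHits = new ArrayList<>(hits);
        sortedHits.sort(Comparator.comparingInt(Hit::segment).thenComparingInt(Hit::segmentDocId));

        List<String> orderedFields = new ArrayList<>(fields);
        if (!explicitFieldOrder) {
            // Placeholder only: a real implementation would pick the cheapest order.
            orderedFields.sort(Comparator.naturalOrder());
        }

        List<Task> tasks = new ArrayList<>();
        int start = 0;
        while (start < sortedHits.size()) {
            // Find the end of the run of hits that share this segment.
            int segment = sortedHits.get(start).segment();
            int end = start;
            while (end < sortedHits.size() && sortedHits.get(end).segment() == segment) {
                end++;
            }
            for (String field : orderedFields) {
                for (Hit hit : sortedHits.subList(start, end)) {
                    tasks.add(new Task(hit, field));
                }
            }
            start = end;
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(new Hit(1, 7), new Hit(0, 3), new Hit(1, 2));
        List<String> fields = List.of("title", "body");
        returnedOrder(hits, fields).forEach(t -> System.out.println("returned: " + t));
        segmentOrder(hits, fields, false).forEach(t -> System.out.println("segment:  " + t));
    }
}
```

Passing `true` for `explicitFieldOrder` keeps the fields in the order the user requested, which is the constraint from the first point above. Whether the segment-ordered variant is actually faster is exactly what the benchmark would have to show.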

The plain highlighter you were looking at is one of the highlighters for which order doesn't matter. The "unified" highlighter is the only one that will likely benefit from going in this order. The "fast vector" highlighter would likely prefer a slightly different ordering. But the "fast vector" highlighter isn't super well loved.

We once had a "postings" highlighter integrated but it has been un-integrated in favor of using the "unified" highlighter.

I don't know that there are any low hanging fruit in the highlighter category. To be honest any low hanging fruit will do.

Two other important points:

  1. I plan to add other Elasticsearch related project proposals. If this one doesn't suit you but you still want to work on Elasticsearch you'll certainly have other choices.
  2. This proposal is interesting in that it requires both Elasticsearch changes and some benchmarking. It is neat because when all of it is done you'll be able to say how much you sped things up, but you'll have to build a good benchmark in the first place. It also requires reading Lucene, the library that implements the highlighting. At least, that is what I had in mind when I wrote the proposal. I admit to not having done all of the research required to be sure of exactly what ordering would be faster and I really don't know what the performance improvement might look like.

(Sohaib) #6

Thanks for the excellent and detailed reply. I will comment first on your end-notes and then ask a couple of questions based on your answers.

  1. I would be glad to look at the other topics and see if something interests me even more but I already find this interesting. Part of the reason is I have always wanted to dive into Lucene but my knowledge has mostly remained bookish (courtesy Manning). This seems like a good reason for diving in.
  2. About reading Lucene, I think I know on a high level how Lucene works and specifically how highlighting works (token sources, fragmenters, scorers etc.). From what I saw Elastic more or less just wraps around this. So hopefully I won't be starting from zero in that direction. As for benchmarking, I will look around and try to suggest a way to test the performance. Rally seems cool. Maybe we can design a different track with some open source corpus that fits this case better. I am not sure which case will fit better but I will try to search around.

Moving on to the questions.

Thank you. Makes perfect sense now. I believe Lucene indexes would be more important to look at as Elastic storage focuses on metadata and transaction logs?
This nice blog is the fuel for my new found knowledge :smiley:

At this point in time, what would you suggest? Research + formulating a proposal or still focussing on more PRs?

Thanks a ton for reading this through and looking forward to your other project proposals!


(Nik Everett) #7

I'm a fan of using the Wikipedia search index for this kind of thing, mostly because they do a fair bit of highlighting. I wrote a [blog](https://www.elastic.co/blog/loading-wikipedia) post on how to load it a while ago. I expect the instructions don't work properly for the master branch though.

I think it depends on your goal. PRs, even for somewhat unrelated things are outwardly visible signs of interest. Also they help the project which I figure is a nice side effect.

I honestly don't know the GSoC process well enough to know about the proposal stage. This is my first time being a mentor. Maybe @xeraa can comment on that?


(xeraa) #8

Once the application period is over we will evaluate all submitted proposals and rank them. Depending on how many slots we'll get from Google (being a first time organization I wouldn't expect too many), we'll then accept the first X proposals.

For the ranking the two most important things are:

  • A good project proposal: You understand the scope and goal of the project and provide a schedule that makes sense. If you want to share your abstract in advance, I've suggested an approach in Regarding Kibana : Calendar Visualization and Filtering — for the Elasticsearch projects that would be nik, igor, and philipp at elastic dot co (alternatively a secret Gist or something similar would also work). We'll do our best to take a look before the application period closes.
  • PR: It's not about opening as many PRs as possible. One good one is better than multiple average or bad ones. Of course the more PRs you have, or the more complex they are, the better, but with a complex project like Elasticsearch we don't expect anything spectacular in such a short amount of time.

So I'd write the project proposal next and let us know in case you want a review. If you have more time, more PRs are definitely good, with an emphasis on finishing the ones you have already started.

Hope that helps :slight_smile:


(Sohaib) #9

@xeraa Sorry for the delayed reply. Thank you for the answer. Following your advice, I have created a google doc which contains a proposed solution incorporating suggestions by @nik9000. Following the spirit of open source, the document is open to everyone to comment upon. This also helps me gather more feedback. I also have inlined some questions in the proposal itself and would be glad if they can be answered. Looking forward to your suggestions after which I can add the required padding (introduction, why me, precise timeline etc.) and turn it into a formal proposal for submission.

I looked at the new GSoC ideas for Elasticsearch and saw the API issue. I'll try to pick up some as PRs as they seem easy to dive into.

Other than the document I have a couple of questions about the project itself. Probably @nik9000 can answer these better.

  1. Given the direction in which we are going (optimizing disk access by trying to do sequential seeks), I would need access to a machine/cluster with magnetic disks. Would this be possible?
  2. I have done my research on Lucene and I understand that we can sort documents so that those belonging to the same segment are kept together. I still don't get how we can ensure that the ordering within a segment is strictly such that disk seeks are always forward. From what I understand, this would depend on whether the user specifies the document ID in a strictly increasing manner (which would not be the general case, especially since document updates lead to delete + insert). Does ES keep some special ID inside a document that would help make this trivial?

(Nik Everett) #10

Sorry I didn't get to replying to these earlier, here goes!

My opinion is that a few small, targeted PRs are a great way to get into contributing to Elasticsearch regardless of why you are submitting them. Complex PRs by folks who don't have a history of contributing are very likely to be a waste of time for everyone involved.

But I think you are going about this in the right way anyway. Asking questions is best, I think.

For the most part we don't optimize or test on magnetic disk machines, though we probably ought to do more of it because we recommend them in hot/warm style deploys. In this case the optimization should work well with SSDs because of read ahead.

Lucene's docid is this thing. Users can specify whatever id they want at the Elasticsearch level; that id is stored in a field on the document and indexed in Lucene, but it is just another field to Lucene.
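As a rough illustration of why the docid helps here, the sketch below maps a top-level docid to a segment and a segment-local docid using each segment's starting offset. It is plain Java, not Lucene code: `segmentOf`, `localDocId`, and the `docBases` array are hypothetical names, and it assumes the offsets are strictly increasing (no empty segments). In Lucene itself the corresponding pieces are, as far as I know, `LeafReaderContext.docBase` and `ReaderUtil.subIndex`.

```java
import java.util.Arrays;

public class DocIdSketch {

    /**
     * Index of the segment that contains globalDocId.
     * docBases[i] is the first top-level docid of segment i (assumed strictly increasing).
     */
    static int segmentOf(int globalDocId, int[] docBases) {
        int pos = Arrays.binarySearch(docBases, globalDocId);
        // Found: the doc is the first one of that segment.
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // containing segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Docid relative to the start of its segment. */
    static int localDocId(int globalDocId, int[] docBases) {
        return globalDocId - docBases[segmentOf(globalDocId, docBases)];
    }

    public static void main(String[] args) {
        int[] docBases = {0, 100, 250};   // three segments, made-up sizes
        int[] hits = {260, 5, 130, 101};  // top-level docids in "returned" order

        // Sorting top-level docids ascending groups them by segment and
        // orders them by local docid within each segment in one step.
        Arrays.sort(hits);
        for (int docId : hits) {
            System.out.println("doc " + docId
                + " -> segment " + segmentOf(docId, docBases)
                + ", local docid " + localDocId(docId, docBases));
        }
    }
}
```

Because the user-supplied id plays no part in this mapping, out-of-order ids or updates (delete + insert) do not affect it; only the internal docid does.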


(Sohaib) #11

Thanks for the help Nik. I have now put in the required padding after incorporating your suggestions (there weren't too many so I guess I am doing something right or something completely wrong :smiley: ) and shared the draft from the GSoC platform. Feel free to have a look and suggest changes if any.

Also would be cool if @xeraa could take a look to see if the format is what you guys expect :slight_smile:

Meanwhile, I will continue to work on the REST client issue to increase familiarity with the codebase and will keep you updated.


(Sohaib) #12

Maybe it is a bit late to ask this but I have been working on the high-level REST API client and seem to have familiarised myself by working on adding an API. The idea is fairly straightforward and well structured into subtasks so it does not look like a lot of work to me to design a proposal based on this. Is it okay for me to submit a proposal for the REST high-level client as well?

Since the project itself might get dropped on priority I thought it might be better to have a bit of diversity in terms of proposals. However, this does mean more work for you guys which is why I am asking this.


(xeraa) #13

Multiple proposals are definitely welcome — don't hold back ;-).

I think the number of slots we're getting is partly based on the submitted proposals (the more popular a project is, the more slots it will get), so we're definitely fine with receiving more. And we like to have more choice :smiley:

PS: I'll try to give feedback tomorrow, so you can still check it over the weekend / early next week. But I think Nik is more qualified for that than me, so I probably won't come up with anything major (either).


(Sohaib) #14

@xeraa Thanks for the feedback. @nik9000 I put in the other proposal for the Java High-Level REST Client as well. It is probably a tad too late but would be cool if you could find the time to have a cursory glance.

Thanks a ton for all the time and help that you guys provided.