Index and query source code


(Joris Rau) #1

Hi,

I want to use Elasticsearch to run regexp queries on source code. However, I could not find any information about indexing or querying source code with Elasticsearch. I know that GitHub uses Elasticsearch for its own source search. So what's the best way to go?

Regards,

Joris


(Mark Walkom) #2

That's a pretty broad question. Unfortunately, we can't share how GH does this, as it's their proprietary information.

But the biggest thing is likely to be analysis and what sort of regexp querying you want to do, as that sort of query is going to be pretty resource intensive.


(Joris Rau) #3

Thanks for your reply!

I was thinking about using a 3-gram analyzer for the source code and a different 3-gram analyzer that creates 3-grams out of regex queries. Is that a good idea? Or can you point me to a better solution?

Regards,

Joris
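
The 3-gram idea above can be sketched in plain Python, outside Elasticsearch. This is only an illustration under stated assumptions: the function names are hypothetical, and the regex handling is deliberately conservative. It understands literal text separated by `.`, `.*` or `.+`, and gives up (no pre-filter) on any other regex syntax.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return every 3-character substring of text (empty if text is too short)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def required_trigrams(pattern: str) -> set[str]:
    """Trigrams that any string matching the pattern must contain.

    Assumption: only literals joined by '.', '.*' or '.+' are understood.
    Any other metacharacter returns an empty set, meaning "cannot pre-filter".
    """
    # Replace the wildcard forms we understand with a separator character.
    simplified = re.sub(r"\.[*+]?", "\0", pattern)
    if re.search(r"[\\*+?()\[\]{}|^$]", simplified):
        return set()
    grams: set[str] = set()
    for run in simplified.split("\0"):
        grams |= trigrams(run)
    return grams


def may_match(source: str, pattern: str) -> bool:
    """Cheap candidate check: False means the regex cannot match source."""
    return required_trigrams(pattern) <= trigrams(source)


if __name__ == "__main__":
    print(may_match("def request_handler(self):", "def .*_handler"))
    print(may_match("class Handler:", "def .*_handler"))
```

A document that passes `may_match` is only a candidate; the real regex still has to be run against it to confirm the match.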


(Mark Harwood) #4

You might want to look at something that works with camelCase e.g. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html#_camelcase_tokenizer
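
To show what camelCase-aware tokenization produces, here is a small Python approximation. It is not the pattern from the linked docs: that one is a Java regex using Unicode classes, while this sketch assumes ASCII identifiers only, and `camel_tokens` is a hypothetical helper name.

```python
import re

# ASCII-only approximation of camelCase splitting:
#   1. an uppercase run that is followed by a capitalized word ("FTP" in "FTPClass")
#   2. an optionally capitalized lowercase word ("Class", "beta")
#   3. any remaining uppercase run ("X")
#   4. a run of digits
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camel_tokens(identifier: str) -> list[str]:
    """Split an identifier into lowercase word and number tokens."""
    return [match.lower() for match in _CAMEL.findall(identifier)]


if __name__ == "__main__":
    print(camel_tokens("MooseX::FTPClass2_beta"))
    print(camel_tokens("getHTTPResponseCode"))
```

Note that punctuation such as `::` and `_` is dropped here, so this style of analysis alone does not support searching for special characters.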


(Joris Rau) #5

Thanks. That's an interesting idea. In my case I still want to be able to use all the special characters in the search. So maybe I should combine that with a hierarchical analyzer? Is that a good idea?


(Mark Harwood) #6

Using multiple indexing strategies is a common technique. See https://www.elastic.co/guide/en/elasticsearch/reference/2.1/multi-fields.html#_multi_fields_with_multiple_analyzers
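
As a rough sketch of the multi-field approach, the following Python builds an index body in which one field is analyzed two ways. The field and analyzer names (`content`, `trigram`, `trigram_tokenizer`) are made up for illustration, and the body is written in the typeless mapping format of recent Elasticsearch versions; the 2.1 docs linked above use `string` fields and a mapping type name instead.

```python
import json


def source_index_body() -> dict:
    """Index settings and mappings with one source field analyzed two ways."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # Fixed-size 3-grams; by default the ngram tokenizer keeps all characters.
                    "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
                },
                "analyzer": {
                    "trigram": {"type": "custom", "tokenizer": "trigram_tokenizer"}
                },
            }
        },
        "mappings": {
            "properties": {
                "content": {
                    # Main field: split on whitespace only, so punctuation survives in terms.
                    "type": "text",
                    "analyzer": "whitespace",
                    "fields": {
                        # Sub-field "content.trigram": same text, indexed as 3-grams.
                        "trigram": {"type": "text", "analyzer": "trigram"}
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(source_index_body(), indent=2))
```

The printed JSON could be sent as the body of an index-creation request; queries then target `content` or `content.trigram` depending on which analysis suits them.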


(Joris Rau) #7

Alright. Awesome. Thanks!
One last question: I take it there is no proper way to run a full-text regex query in Elasticsearch, right? I am asking because the docs already mention that the regexp query runs only on terms.


(Mark Harwood) #8

A full-text regex query would be slow, so yes, the regexp query only works with the terms produced by the tokenization process.
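
The practical effect of term-level matching can be simulated in a few lines of Python. This is an illustration only: it assumes a whitespace-style tokenizer, uses Python's `re` syntax rather than Lucene's regexp syntax, and relies on the fact that a regexp has to match an entire term.

```python
import re


def whitespace_terms(text: str) -> list[str]:
    """Stand-in for a whitespace tokenizer: split the text into terms."""
    return text.split()


def regexp_on_terms(text: str, pattern: str) -> list[str]:
    """Return the terms that the pattern matches completely."""
    compiled = re.compile(pattern)
    return [term for term in whitespace_terms(text) if compiled.fullmatch(term)]


if __name__ == "__main__":
    line = "int total = compute_sum(values);"
    # Matches, because a single term satisfies the whole pattern.
    print(regexp_on_terms(line, r"compute_.*"))
    # Finds nothing: the pattern spans several terms, and no term contains a space.
    print(regexp_on_terms(line, r"total = compute.*"))
```

So a pattern can only ever match within the boundaries that the chosen analyzer leaves intact, which is why the analysis decision matters so much for regexp queries.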


(Joris Rau) #9

Thanks. Now I feel enlightened :)

