Index and query source code

Hi,

I want to use Elasticsearch to run regexp queries on source code. However, I could not find any information about indexing or querying source code with Elasticsearch. I know that GitHub uses Elasticsearch for its own source search. So what's the best way to go?

Regards,

Joris

That's a pretty broad question. Unfortunately, we can't share how GitHub does this, as it's their proprietary information.

But the biggest considerations are likely to be how you analyze the code and what sort of regexp queries you want to run, since that sort of query is going to be pretty resource intensive.

Thanks for your reply!

I was thinking about using a 3-gram analyzer for the source code and a second 3-gram analyzer that creates 3-grams out of the regex queries. Is that a good idea? Or can you point me to a better solution?

Regards,

Joris
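
For readers following along, here is a minimal sketch of the 3-gram idea in plain Python. It is not Elasticsearch code, and the function names and sample files are made up for illustration. It assumes you can pull a fixed literal out of the regex, look up the documents that contain all of that literal's 3-grams, and then run the real regex only on those candidates.

```python
def trigrams(text):
    """Return every overlapping 3-character window of `text`, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = {}
    for doc_id, source in docs.items():
        for gram in trigrams(source):
            index.setdefault(gram, set()).add(doc_id)
    return index


def candidates(index, literal):
    """Ids of documents that contain every trigram of `literal`.

    `literal` is a fixed substring taken from the regex (hypothetical step).
    The result can include false positives, so the full regex still has to
    be checked against each candidate afterwards.
    """
    grams = trigrams(literal)
    if not grams:
        return set()  # literal shorter than 3 characters: nothing to look up
    return set.intersection(*(index.get(gram, set()) for gram in grams))


# Hypothetical sample documents.
docs = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    pass",
}
index = build_index(docs)

print(sorted(candidates(index, "main(")))
print(sorted(candidates(index, "main(v")))
```

Note that special characters such as `(` survive in the 3-grams, which matters for code search.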

You might want to look at something that works with camelCase, e.g. the camelCase tokenizer example in the pattern analyzer docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html#_camelcase_tokenizer
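
To give a feel for what camelCase-aware tokenization produces, here is a rough Python approximation. It is a simplified, ASCII-only stand-in and not the pattern from the linked docs; the function name `split_identifier` is made up for this sketch.

```python
import re

# Simplified, ASCII-only approximation of camelCase splitting.
# Alternatives, in order:
#   1. an acronym followed by a capitalized word ("HTTP" in "HTTPResponse")
#   2. an optionally capitalized lowercase word ("parse", "Response")
#   3. a trailing run of capitals ("X")
#   4. a run of digits ("2")
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier):
    """Split an identifier into lowercased word tokens.

    Everything that is not an ASCII letter or digit (':', '_', ...) is
    treated as a separator and dropped.
    """
    return [token.lower() for token in _WORD.findall(identifier)]


print(split_identifier("parseHTTPResponse2"))
print(split_identifier("MooseX::FTPClass2_beta"))
```

Because punctuation is dropped here, this kind of tokenization on its own will not let you search for special characters.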

Thanks. That's an interesting idea. In my case I still want to be able to use all the special characters in the search. So maybe I should combine that with a hierarchical analyzer? Is that a good idea?

Using multiple indexing strategies is a common technique. See https://www.elastic.co/guide/en/elasticsearch/reference/2.1/multi-fields.html#_multi_fields_with_multiple_analyzers
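
As a sketch of what that could look like, the Python snippet below builds an index body in which one field is analyzed two ways. The names `content`, `trigram`, `code_trigram` and `trigram_tokenizer` are made up for illustration, and the `text` type and typeless `mappings` assume a recent Elasticsearch version (the 2.x docs linked above use `string` and a mapping type name instead). Treat it as a starting point rather than a tested configuration.

```python
import json

# Hypothetical index body: the same source text is indexed twice.
index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Fixed-size 3-grams; with no token_chars set, the ngram
                # tokenizer keeps all characters, including punctuation.
                "trigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                }
            },
            "analyzer": {
                "code_trigram": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                # Main field: the built-in whitespace analyzer keeps
                # special characters inside each token.
                "type": "text",
                "analyzer": "whitespace",
                "fields": {
                    # Sub-field "content.trigram": same text as 3-grams.
                    "trigram": {
                        "type": "text",
                        "analyzer": "code_trigram",
                    }
                },
            }
        }
    },
}

# JSON that could be sent as the body of an index-creation request.
print(json.dumps(index_body, indent=2))
```

Queries can then target `content` or `content.trigram` depending on which tokenization suits them.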

Alright. Awesome. Thanks!
One last question: I take it there is no proper way to run a full-text regex query in Elasticsearch, right? I am just asking because the docs mention that the regexp query runs only on terms.

A full-text regex query would be slow, so yes, the regexp query only works on the terms produced by the tokenization process.
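
To make the difference concrete, here is a small Python illustration. It uses Python's `re` module as a stand-in (Lucene's regex syntax differs), `str.split()` as a stand-in for a whitespace tokenizer, and a made-up field name `content` in the example query body.

```python
import re


def term_level_match(pattern, terms):
    """Return the terms that `pattern` matches in full.

    This mimics term-level behaviour: the pattern has to match a whole
    term, not a substring of it and not text spanning several terms.
    """
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


source = "int main (void)"
terms = source.split()  # stand-in for a whitespace tokenizer

# A pattern that fits inside one term finds that term.
print(term_level_match(r"ma.*", terms))

# A pattern that spans whitespace matches the raw text...
print(re.search(r"main\s*\(", source) is not None)

# ...but no single term, so a term-level regexp cannot find it.
print(term_level_match(r"main\s*\(", terms))

# Shape of a regexp query body against the hypothetical field "content".
regexp_query = {"query": {"regexp": {"content": "ma.*"}}}
```

This is why the choice of analyzer decides which regexps can match at all: the pattern only ever sees one term at a time.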

Thanks. Now I feel enlightened :)