Index and query source code

Joris_Rau · January 12, 2016, 12:25pm

Hi,

I want to use Elasticsearch to run regexp queries on source code. However, I could not find any information about indexing or querying source code with Elasticsearch. I know that GitHub uses Elasticsearch for its own source search. So what's the best way to go?

Regards,

Joris

warkolm · January 14, 2016, 6:48am

That's a pretty broad question. Unfortunately, we can't share how GitHub does this, as that is their proprietary information.

But the biggest factors are likely to be the analysis you apply and the kind of regexp querying you want to do, as that sort of query is going to be pretty resource intensive.

Joris_Rau · January 14, 2016, 10:58am

Thanks for your reply!

I was thinking about using a 3-gram analyzer for the source code and a different 3-gram analyzer that creates 3-grams out of the regex queries. Is that a good idea? Or can you point me to a better solution?

Regards,

Joris
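
To make the 3-gram idea above concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: the helper names `ngrams` and `may_contain` are hypothetical, and it only illustrates the principle that a document can contain a literal piece of a regex only if it contains every 3-gram of that literal.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Return every overlapping n-character window of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def may_contain(indexed: set[str], literal: str, n: int = 3) -> bool:
    """Cheap pre-filter: True if all n-grams of `literal` are in the index.

    A True result only marks the document as a candidate; the real regex
    still has to be run against it to confirm a match.
    """
    # Literals shorter than n yield no grams, so they cannot rule anything out.
    return all(gram in indexed for gram in ngrams(literal, n))


source = "int main() { return 0; }"
index = set(ngrams(source))  # special characters and spaces are kept

print(may_contain(index, "main()"))   # candidate document
print(may_contain(index, "printf("))  # can be skipped
```

Because the grams are taken over raw characters, punctuation such as `(` and `;` stays searchable, which matters for source code.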

Mark_Harwood · January 14, 2016, 11:09am

You might want to look at something that works with camelCase e.g. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html#_camelcase_tokenizer
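
As a rough illustration of what camelCase splitting does to identifiers, here is a small Python sketch. The pattern below is an assumption written for this example and is not the one from the linked documentation; `split_camel_case` is a hypothetical helper name.

```python
import re

# Assumed pattern for illustration: an acronym run, a capitalised or
# lowercase word, or a run of digits.
CAMEL_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_camel_case(identifier: str) -> list[str]:
    """Split an identifier such as 'parseHTTPResponse2' into word parts."""
    return CAMEL_PARTS.findall(identifier)


print(split_camel_case("parseHTTPResponse2"))
print(split_camel_case("getURL"))
```

Note that anything not matched by the pattern (underscores, brackets, operators) is dropped, which is why this approach alone does not cover searches on special characters.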

Joris_Rau · January 14, 2016, 11:33am

Thanks. That's an interesting idea. In my case I still want to be able to search for all the special characters. So maybe I should combine that with a hierarchical analyzer? Is that a good idea?

Mark_Harwood · January 14, 2016, 11:39am

Using multiple indexing strategies is a common technique. See https://www.elastic.co/guide/en/elasticsearch/reference/2.1/multi-fields.html#_multi_fields_with_multiple_analyzers
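
A sketch of what such a multi-field setup might look like, written as a Python dict that could be sent as the body of an index-creation request. It assumes the mapping syntax of recent Elasticsearch versions (7.x or later); the 2.x release current at the time of this thread used the `string` field type and mapping types instead. The names `content`, `trigram`, `code_trigram` and `trigram_tokenizer` are hypothetical.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Fixed-size 3-grams; with no token_chars set, every
                # character (including punctuation) is kept.
                "trigram_tokenizer": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                }
            },
            "analyzer": {
                "code_trigram": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                # Main field: whitespace-separated terms.
                "type": "text",
                "analyzer": "whitespace",
                "fields": {
                    # Same text indexed a second time as 3-grams.
                    "trigram": {"type": "text", "analyzer": "code_trigram"}
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Queries can then target `content` or `content.trigram` depending on which indexing strategy suits them.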

Joris_Rau · January 14, 2016, 11:57am

Alright. Awesome. Thanks!
One last question: I take it there is no proper way to run a full-text regex query in Elasticsearch, right? I am just asking because the docs already mention that the regexp query runs only on terms.

Mark_Harwood · January 14, 2016, 12:16pm

A full-text regex query would be slow, so yes, we only work with the terms produced by the tokenization process.
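
The difference between matching terms and matching full text can be sketched in plain Python. This is only a simulation under stated assumptions: `str.split()` stands in for a whitespace tokenizer, Python's `re` syntax differs from Lucene's regexp syntax, and `terms_matching` is a hypothetical helper. `fullmatch` is used because Lucene regexps have to match an entire term.

```python
import re


def terms_matching(pattern: str, text: str) -> list[str]:
    """Return the whitespace-separated terms that the pattern matches in full."""
    rx = re.compile(pattern)
    return [term for term in text.split() if rx.fullmatch(term)]


line = "return parse_config(path) or default_config"

# Matches a single term, so a term-level regexp can find it.
print(terms_matching(r".*_config", line))

# Spans several terms: a regex over the full text matches, but no term does.
print(re.search(r"parse.*default", line) is not None)
print(terms_matching(r"parse.*default.*", line))
```

Which patterns can match therefore depends entirely on how the analyzer cut the source code into terms.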

Joris_Rau · January 14, 2016, 12:29pm

Thanks. Now I feel enlightened.
