I want to use Elasticsearch to run regexp queries against source code. However, I could not find any information about indexing or querying source code with Elasticsearch. I know that GitHub uses Elasticsearch for its own source search. So what's the best way to go?
That's a pretty broad question. Unfortunately, we can't share how GitHub does this, as that is their proprietary information.
The biggest factors are likely to be how you analyze the code and what sort of regexp querying you want to do, as regexp queries are pretty resource intensive.
I was thinking about using a 3-gram analyzer for the source code and a different 3-gram analyzer that creates 3-grams out of the regex queries. Is that a good idea? Do you have a hint that could point me to a better solution?
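To make the 3-gram idea concrete, here is a minimal sketch and not a recommended setup. The settings use Elasticsearch's `ngram` tokenizer with `min_gram` and `max_gram` both set to 3 and an empty `token_chars` list so that no character class is dropped. The names `code_trigrams` and `code_trigram_analyzer` are made up for illustration, and the `trigrams` helper only imitates in plain Python what such a tokenizer would emit.

```python
from typing import List

# Hypothetical index settings: tokenizer and analyzer names are assumptions.
TRIGRAM_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "code_trigrams": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3,
                    # Empty list: keep all characters, including punctuation.
                    "token_chars": [],
                }
            },
            "analyzer": {
                "code_trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "code_trigrams",
                }
            },
        }
    }
}


def trigrams(text: str) -> List[str]:
    """Return every overlapping 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


if __name__ == "__main__":
    # Special characters survive, which matters for source code.
    print(trigrams("a+=b;"))
```

Turning a regex into a set of required 3-grams is the harder half of this idea and is not shown here.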
Thanks. That's an interesting idea. In my case I still want to be able to use all the special characters in the search. So maybe I should combine that with a hierarchical analyzer? Is that a good idea?
Alright. Awesome. Thanks!
One last question: I take it there is no proper way to run a full-text regex query in Elasticsearch, right? I am just asking because the docs already mention that the regexp query runs only on terms.
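As a rough illustration of what "runs only on terms" means, the sketch below builds a `regexp` query body and then simulates term-level matching in plain Python. It assumes, purely for the example, that the field was split on whitespace, and it uses Python's `re.fullmatch` as a stand-in for Lucene's anchored matching. Lucene's regex syntax differs from Python's, so only simple patterns carry over. The field name `content` and both helper functions are hypothetical.

```python
import re
from typing import Dict, List


def regexp_query(field: str, pattern: str) -> Dict:
    """Build the request body for an Elasticsearch regexp query."""
    return {"query": {"regexp": {field: {"value": pattern}}}}


def matching_terms(terms: List[str], pattern: str) -> List[str]:
    """Simulate term-level matching: the pattern must match a whole term."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


if __name__ == "__main__":
    # Assumption: a whitespace-style analysis produced these two terms.
    terms = "int main(void)".split()

    print(regexp_query("content", "ma.*"))
    # A pattern that fits inside one term finds it.
    print(matching_terms(terms, r"ma.*"))
    # A pattern that spans two terms finds nothing.
    print(matching_terms(terms, r"int ma.*"))
```

The second lookup comes back empty because no single term contains the space, which is the limitation the question is about.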