Gopa is an open source crawler project of mine, written in Go. You may have already tried other crawler software for the same job, and you may ask me "why are you reinventing the wheel?" Well, I just did it.
The goal here is to build a vertical search engine for everything related to Elasticsearch. Think of it as a "Google" focused on Elastic topics: we hope to use it to find Elastic articles, blog posts, and Discuss posts in one place, so we are going to index the websites elastic.co and discuss.elastic.co.
- An Elasticsearch service. I am going to use Elastic Cloud, which is the best Elasticsearch-as-a-service you can find! Of course, you can use your own Elasticsearch cluster.
- Gopa, the crawler I am going to use in this article.
- The Chrome browser. Yep, Gopa can work with it. It is optional in general, but we need it for crawling discuss.elastic.co.
Create a cluster in Elastic Cloud
Creating an Elasticsearch cluster with Elastic Cloud is really simple! Log into the console and select an Elasticsearch version greater than 5.0. Woot if you are choosing 6.1! With Elastic Cloud, you can stay on the cutting edge of Elasticsearch.
Save the Elasticsearch URL and password somewhere; we will be using them later.
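Before moving on, it can be worth confirming that the saved URL and password actually work. The following is only a minimal sketch, not part of Gopa: the function names are hypothetical, the endpoint and password are placeholders, and it assumes the default `elastic` user. It relies on the fact that the root URL of an Elasticsearch cluster returns a JSON document containing `version.number`.

```python
import base64
import json
import urllib.request


def build_cluster_request(endpoint, username, password):
    """Build a GET request for the cluster root URL with HTTP basic auth.

    Hypothetical helper for illustration; not part of Gopa or Elastic Cloud.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(endpoint.rstrip("/") + "/")
    request.add_header("Authorization", "Basic " + token)
    return request


def cluster_version(endpoint, username, password, timeout=10):
    """Return the version string reported by the cluster's root endpoint.

    Raises urllib.error.HTTPError (for example 401) if the credentials
    are rejected, and urllib.error.URLError if the host is unreachable.
    """
    request = build_cluster_request(endpoint, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        info = json.load(response)
    return info["version"]["number"]


# Example call (placeholders, replace with the values you saved):
# print(cluster_version("https://YOUR-ELASTIC-CLOUD-URL-HERE", "elastic", "YOUR-PASSWORD"))
```

If the call prints a version number, the URL and password are good to go into the Gopa config later; an HTTP 401 error usually means the password is wrong.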
Go to Gopa-Snapshot to download a pre-compiled binary.
I am going to download darwin64.tar.gz and run it on my MacBook.
Open your favourite terminal and execute the following command to download the Gopa binary:
After decompressing the archive, you will see some config files and the main executable, gopa-darwin64:
➜ tar vxzf darwin64.tar
x gopa-darwin64
x gopa.yml
x config/
x config/elasticsearch/
x config/elasticsearch/gopa-blob-mapping.sh
x config/elasticsearch/gopa-snapshot-mapping.sh
x config/elasticsearch/clean-all-gopa-data.sh
x config/elasticsearch/gopa-task-mapping.sh
x config/elasticsearch/gopa-index-mapping.sh
x config/elasticsearch/gopa-host-mapping.sh
x stop.sh
We need to create an index and mapping in Elasticsearch; the script is located in the Gopa folder that we just decompressed.
Open Kibana from the URL provided in your Elastic Cloud console, and log in with your password (which we saved in the previous step).
Copy the index creation script from the file config/elasticsearch/gopa-index-mapping.sh and execute the Query DSL it contains.
This step makes sure the search will work just fine.
Before we start gopa, let's make sure it is configured correctly.
Open gopa.yml, navigate to the following config section, and update the Elasticsearch URL and password (which we saved from Elastic Cloud in the previous step).
- name: index
  enabled: true
  ui:
    enabled: true
    site_name: Elasticsearch
    logo: https://static-www.elastic.co/cn/assets/blt6050efb80ceabd47/elastic-logo (2).svg?q=294
    favicon: https://www.elastic.co/favicon.ico
  elasticsearch:
    endpoint: https://YOUR-ELASTIC-CLOUD-URL-HERE
    index_prefix: gopa-
    username: elastic
    password: i-am-not-gonna-tell-you
Next, go to the chrome plugin section and check that Chrome's command path is correct for your own install.
plugins:
- name: chrome
  enabled: true
  command: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
  debug_port: 9223
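If you are not sure whether the command path is right, a quick check like the one below can help. This is just an illustrative sketch: the helper names are made up for this example, and the default path is simply the macOS path from the config above, so adjust it for your own system.

```python
import os

# Path copied from the chrome plugin config above (macOS install).
# This is an assumption about your machine; change it if Chrome lives elsewhere.
DEFAULT_CHROME_COMMAND = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def is_runnable(command_path):
    """Return True if command_path is an existing file that we may execute."""
    return os.path.isfile(command_path) and os.access(command_path, os.X_OK)


def describe(command_path):
    """Return a short human-readable status line for the given path."""
    if is_runnable(command_path):
        return "OK: " + command_path
    return "NOT FOUND OR NOT EXECUTABLE: " + command_path


# Example call:
# print(describe(DEFAULT_CHROME_COMMAND))
```

If the status line does not start with "OK", fix the command value in gopa.yml before starting gopa.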
Now we are all set; the next thing is to start it.
Starting gopa is simple. Execute the command below:
➜ ./gopa-darwin64
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 65e729c, Sun Dec 17 10:52:49 2017 +0800, medcl, disable chrome test for CI ///
[12-19 10:53:35] [INF] [instance.go:23] workspace: data/gopa/nodes/0
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: checker started with 5 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: fetch started with 1 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: update started with 4 shards
[12-19 10:53:35] [INF] [ui.go:148] http server listen at: http://127.0.0.1:9001
[12-19 10:53:35] [INF] [api.go:132] api server listen at: http://127.0.0.1:8001
As you can see, gopa-darwin64 started and is listening on two ports: one for the UI, and the other for the API.
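If you want to double-check that both servers are really up, you can try to open a TCP connection to each port. Below is a small sketch using only the standard library; the function name is hypothetical, and the ports 9001 (UI) and 8001 (API) are taken from the startup log above, so they may differ if you changed the config.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_gopa_ports(host="127.0.0.1", ports=(9001, 8001)):
    """Map each port to whether something is accepting connections on it.

    The default ports come from the log output shown in this article.
    """
    return {port: is_listening(host, port) for port in ports}


# Example call:
# print(check_gopa_ports())
```

This only tells you that something accepts connections on those ports, not that it is gopa, but it is usually enough for a quick sanity check.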
Open http://127.0.0.1:9001/admin/console/ with your favorite web browser (no IE, please). You will see a console page, where we type the command to start the crawling:
seed http://elastic.co and
That's all you need to do. Let's check out the search page, open:
elasticsearch and you will see the results:
Now a simple search engine for Elastic is up and running. You can easily roll your own for your own purposes.
Gopa is an experimental project and is not 100% stable. There are some sample configs available that are not covered in this article; you may explore them yourself. If you have any problems using it, please file an issue here. Like it? Give it a star!
That’s it, hope you enjoyed it.