Welcome to Elastic Advent Day 20. Today I am going to introduce a quick way to build your own Google with open source Elasticsearch and Gopa.
Gopa is an open source crawler project of mine, written in Golang. You may have already tried other crawler software for the same job, and you may ask me “why are you reinventing the wheel?” Well, I just did it.
The goal here is to build a vertical search engine for everything related to Elasticsearch. This is a “Google” focused on Elastic topics: we want to find Elastic articles, blog posts and Discuss posts in one place, so we are going to index the websites elastic.co and discuss.elastic.co.
Requirements
- An Elasticsearch service. I am going to use Elastic Cloud, which is the best Elasticsearch-as-a-Service you can find! Of course, you can use your own Elasticsearch cluster instead.
- Gopa, the crawler I am going to use in this article.
- Chrome browser. Gopa can run without it, but we need it to crawl discuss.elastic.co.
Create a cluster in Elastic Cloud
Creating an Elasticsearch cluster with Elastic Cloud is really simple! Log into the console and select an Elasticsearch version greater than 5.0. Woot if you choose 6.1! Elastic Cloud lets you stay on the cutting edge of Elasticsearch.
Save the Elasticsearch URL and password somewhere; we will use them later.
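One convenient way to keep the saved URL and password at hand is to put them into shell variables and ask the cluster's root endpoint to identify itself. This is only a minimal sketch: the variable names and the `check_cluster` function are hypothetical, the placeholder values must be replaced with your own, and it assumes `curl` is installed.

```shell
# Hypothetical variable names; replace the placeholders with your own values.
ES_URL="${ES_URL:-https://YOUR-ELASTIC-CLOUD-URL-HERE}"
ES_USER="${ES_USER:-elastic}"
ES_PASSWORD="${ES_PASSWORD:-YOUR-PASSWORD-HERE}"

# Request the root endpoint, which returns the cluster name and version.
# -s hides the progress meter, -f makes curl exit non-zero on HTTP errors.
check_cluster() {
  curl -sf -u "${ES_USER}:${ES_PASSWORD}" "${ES_URL}/"
}
```

Calling `check_cluster` should print a small JSON document describing the cluster; a non-zero exit status usually means the URL or the password is wrong.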
Get Gopa
Go to Gopa-Snapshot to download a pre-compiled binary. I am going to download darwin64.tar.gz and run it on my MacBook.
Open your favourite terminal and execute the following command to download the Gopa binary:
wget https://github.com/infinitbyte/gopa-snapshot/releases/download/master/darwin64.tar.gz
After decompressing the archive, you will see some config files and the main gopa-darwin64 binary.
➜ tar vxzf darwin64.tar.gz
x gopa-darwin64
x gopa.yml
x config/
x config/elasticsearch/
x config/elasticsearch/gopa-blob-mapping.sh
x config/elasticsearch/gopa-snapshot-mapping.sh
x config/elasticsearch/clean-all-gopa-data.sh
x config/elasticsearch/gopa-task-mapping.sh
x config/elasticsearch/gopa-index-mapping.sh
x config/elasticsearch/gopa-host-mapping.sh
x stop.sh
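Before moving on, it can help to confirm that the files used in the rest of this article were actually extracted. The sketch below checks for three of the names from the listing above; the function name `verify_gopa_files` is made up for illustration.

```shell
# Check that the extracted folder contains the files this article relies on.
# The file names are taken from the tar listing; the function name is hypothetical.
verify_gopa_files() {
  local dir="${1:-.}" missing=0 f
  for f in gopa-darwin64 gopa.yml config/elasticsearch/gopa-index-mapping.sh; do
    if [ ! -e "${dir}/${f}" ]; then
      echo "missing: ${f}"
      missing=1
    fi
  done
  # Returns 0 when every file is present, 1 otherwise.
  return "${missing}"
}
```

Run `verify_gopa_files .` from the folder where you extracted the archive; it prints nothing when everything is in place.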
Configure Elasticsearch
We need to create an index and its mapping in Elasticsearch. The script is located in the Gopa folder that we just decompressed.
Open Kibana from the URL provided in your Elastic Cloud console, and log in with the password we saved in the previous step.
Copy the index creation script from the file config/elasticsearch/gopa-index-mapping.sh and execute the Query DSL it contains.
This step makes sure that search will work properly.
Configure Gopa
Before we start gopa, let’s make sure it is configured correctly.
Open gopa.yml, navigate to the following config section, and update the Elasticsearch URL and password (which we saved from Elastic Cloud in the previous step).
- name: index
  enabled: true
  ui:
    enabled: true
    site_name: Elasticsearch
    logo: https://static-www.elastic.co/cn/assets/blt6050efb80ceabd47/elastic-logo (2).svg?q=294
    favicon: https://www.elastic.co/favicon.ico
  elasticsearch:
    endpoint: https://YOUR-ELASTIC-CLOUD-URL-HERE
    index_prefix: gopa-
    username: elastic
    password: i-am-not-gonna-tell-you
Next, go to the chrome plugin section and check that the Chrome command path is correct for your installation.
plugins:
- name: chrome
  enabled: true
  command: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
  debug_port: 9223
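A quick way to check the path from the chrome plugin section is to test whether it points at an executable file. The helper below is a hypothetical sketch, not part of Gopa; pass it whatever value you put in the command setting.

```shell
# Report whether the given path exists and is executable.
# check_chrome_path is a hypothetical helper name, not a Gopa command.
check_chrome_path() {
  local cmd="$1"
  if [ -x "${cmd}" ]; then
    echo "ok: ${cmd}"
  else
    echo "not found or not executable: ${cmd}"
    return 1
  fi
}
```

For the macOS path used in this article, that would be `check_chrome_path "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"`.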
Now we are all set; the next step is to start Gopa.
Start Gopa
Starting Gopa is simple. Execute the command below:
➜ ./gopa-darwin64
________ ________ __________ _____
/ _____/ \_____ \\______ \/ _ \
/ \ ___ / | \| ___/ /_\ \
\ \_\ \/ | \ | / | \
\______ /\_______ /____| \____|__ /
\/ \/ \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 65e729c, Sun Dec 17 10:52:49 2017 +0800, medcl, disable chrome test for CI ///
[12-19 10:53:35] [INF] [instance.go:23] workspace: data/gopa/nodes/0
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: checker started with 5 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: fetch started with 1 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: update started with 4 shards
[12-19 10:53:35] [INF] [ui.go:148] http server listen at: http://127.0.0.1:9001
[12-19 10:53:35] [INF] [api.go:132] api server listen at: http://127.0.0.1:8001
As you can see, gopa-darwin64 started and is listening on two ports: one for the UI, and the other for the API.
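If you want to confirm from another terminal that both ports answer, a small loop over the two addresses from the log is enough. This sketch assumes `curl` is installed and only prints the HTTP status code of each root URL; which status each server returns for `/` is not something this article specifies, so treat any code other than `000` simply as "something is listening". The function name is hypothetical.

```shell
# Print the HTTP status code returned by the UI and API ports from the log.
# gopa_status is a hypothetical helper name.
gopa_status() {
  local url
  for url in "http://127.0.0.1:9001/" "http://127.0.0.1:8001/"; do
    # -w prints only the status code; curl reports 000 when it cannot connect.
    printf '%s -> %s\n' "${url}" "$(curl -s -o /dev/null -w '%{http_code}' "${url}")"
  done
}
```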
Rocking now
Let’s open http://127.0.0.1:9001/admin/console/ with your favorite web browser (no IE, please). You will see a console page, where we type the following commands to start crawling:
seed http://elastic.co
seed discuss.elastic.co
That's all you need to do. Now let's check out the search page: open http://127.0.0.1:9001/, type elasticsearch, and you will see the results.
Now your own simple “Google” for Elastic is up and running. You can easily adapt it for your own purposes.
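To see whether the crawler is actually writing to your cluster, you can ask Elasticsearch how many documents sit in the indices that share the gopa- prefix from gopa.yml. This is a rough sketch: the variable and function names are hypothetical, the placeholders must be replaced with your own values, and the wildcard is used because the exact index names are not listed in this article.

```shell
# Hypothetical variable names; replace the placeholders with your own values.
ES_URL="${ES_URL:-https://YOUR-ELASTIC-CLOUD-URL-HERE}"
ES_USER="${ES_USER:-elastic}"
ES_PASSWORD="${ES_PASSWORD:-YOUR-PASSWORD-HERE}"

# Count the documents in every index whose name starts with "gopa-",
# using the Elasticsearch _count API.
count_gopa_docs() {
  curl -s -u "${ES_USER}:${ES_PASSWORD}" "${ES_URL}/gopa-*/_count?pretty"
}
```

Running `count_gopa_docs` a few times while the crawl is in progress should show the count growing.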
Final words
Gopa is an experimental project and is not 100% stable. There are some sample configs that are not covered in this article; you may explore them yourself. If you have any problems using it, please file an issue. Like it? Give it a star!
Also, if you need a more complete site search solution, you may be interested in the enterprise-level SaaS product from our Swiftype team.
That’s it. I hope you enjoyed it.