Dec 20th, 2017: [EN][Elasticsearch] Build your own Google with Elasticsearch and Gopa


(Medcl) #1

Welcome to Elastic Advent Day 20. Today I am going to introduce a quick way to build your own Google with open source Elasticsearch and Gopa.

Gopa is an open source crawler project of mine, written in Golang. You may have already tried other crawler software for the same job, and you may ask me “why are you reinventing the wheel?” Well, I just did it.

The goal here is to build a vertical search engine for everything related to Elasticsearch. It is a “Google” focused on Elastic topics: we want to find Elastic articles, blog posts and Discuss posts in one place, so we are going to index the websites elastic.co and discuss.elastic.co.

Requirements

  • An Elasticsearch service. I am going to use Elastic Cloud, which is the best Elasticsearch-aaS you can find! Of course, you can use your own Elasticsearch cluster instead.
  • Gopa, the crawler I am going to use in this article.
  • Chrome browser. It is optional for Gopa in general, but we need it for crawling discuss.elastic.co.

Create a cluster in Elastic Cloud

Creating an Elasticsearch cluster with Elastic Cloud is really simple! Log into the console and select an Elasticsearch version greater than 5.0. Woot if you are choosing 6.1! With Elastic Cloud, you can stay on the cutting edge of Elasticsearch.

Save the Elasticsearch URL and password somewhere; we will be using them later.
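
Before moving on, it may be worth confirming that the saved URL and password actually work. The following is only a minimal sketch, assuming curl is installed; es_status and explain_status are hypothetical helper names of mine, not part of Elastic Cloud or Gopa. It requests the cluster root endpoint with basic authentication and reports the HTTP status code.

```shell
# Hypothetical helpers (not part of Elastic Cloud or Gopa).

# Print the HTTP status code returned by the cluster root endpoint.
# $1 = Elasticsearch URL, $2 = username, $3 = password
es_status() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
       --max-time 10 --user "$2:$3" "$1"
}

# Turn a status code into a short hint.
explain_status() {
  case "$1" in
    200) echo "credentials accepted" ;;
    401) echo "wrong username or password" ;;
    000) echo "no response: check the URL" ;;
    *)   echo "unexpected HTTP status: $1" ;;
  esac
}
```

A possible way to use it, with your own URL and password filled in: explain_status "$(es_status https://YOUR-ELASTIC-CLOUD-URL-HERE elastic YOUR-PASSWORD)"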

Get Gopa

Go to Gopa-Snapshot to download a pre-compiled binary.
I am going to download darwin64.tar.gz and run it on my MacBook.
Open your favourite terminal and execute the following command to download the Gopa binary:

wget https://github.com/infinitbyte/gopa-snapshot/releases/download/master/darwin64.tar.gz

After decompressing the archive, you will see some config files and the main gopa-darwin64 binary.

➜ tar vxzf darwin64.tar.gz
x gopa-darwin64
x gopa.yml
x config/
x config/elasticsearch/
x config/elasticsearch/gopa-blob-mapping.sh
x config/elasticsearch/gopa-snapshot-mapping.sh
x config/elasticsearch/clean-all-gopa-data.sh
x config/elasticsearch/gopa-task-mapping.sh
x config/elasticsearch/gopa-index-mapping.sh
x config/elasticsearch/gopa-host-mapping.sh
x stop.sh

Configure Elasticsearch

We need to create an index and mapping in Elasticsearch. The script is located in the Gopa folder that we just decompressed.

Open Kibana from the URL provided in your Elastic Cloud console, and log in with your password (which we saved in the previous step).

Copy the index creation script from the file config/elasticsearch/gopa-index-mapping.sh and execute the request it contains in Kibana.

This step makes sure that search will work properly.
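
If you want to double-check that the index was really created, one option is to list the matching indices from the terminal. This is only a sketch under a few assumptions: curl is installed, the index name starts with the gopa- prefix used in Gopa's sample config, and indices_url and list_gopa_indices are hypothetical helper names of mine. It uses the Elasticsearch _cat/indices API.

```shell
# Hypothetical helpers (not part of Gopa).

# Build the _cat/indices URL for a given index prefix.
# $1 = Elasticsearch URL (a trailing slash is tolerated), $2 = index prefix
indices_url() {
  printf '%s/_cat/indices/%s*?v\n' "${1%/}" "$2"
}

# List the indices that start with the prefix, with a header row.
# $1 = Elasticsearch URL, $2 = username, $3 = password, $4 = index prefix
list_gopa_indices() {
  curl --silent --max-time 10 --user "$2:$3" "$(indices_url "$1" "$4")"
}
```

For example: list_gopa_indices https://YOUR-ELASTIC-CLOUD-URL-HERE elastic YOUR-PASSWORD gopa-
The output also includes a docs.count column, which is handy later for checking that the crawler is indexing pages.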

Configure Gopa

Before we start Gopa, let’s make sure it is configured correctly.
Open gopa.yml, navigate to the following config section, and update the Elasticsearch URL and password (which we saved from Elastic Cloud in the previous step).

- name: index
  enabled: true
  ui:
    enabled: true
    site_name: Elasticsearch
    logo: https://static-www.elastic.co/cn/assets/blt6050efb80ceabd47/elastic-logo (2).svg?q=294
    favicon: https://www.elastic.co/favicon.ico
  elasticsearch:
    endpoint: https://YOUR-ELASTIC-CLOUD-URL-HERE
    index_prefix: gopa-
    username: elastic
    password: i-am-not-gonna-tell-you

Next, go to the chrome plugin section and check that Chrome’s command path is correct for your own installation.

plugins:
- name: chrome
  enabled: true
  command: "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
  debug_port: 9223
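
A quick way to check the command path from the terminal is to test that it points to an executable file. This is just a minimal sketch; check_chrome is a hypothetical helper name of mine, not something Gopa provides.

```shell
# Hypothetical helper (not part of Gopa): succeed only if the given path
# is an executable file rather than a missing path or a directory.
check_chrome() {
  if [ -x "$1" ] && [ ! -d "$1" ]; then
    echo "ok: $1"
  else
    echo "not found or not executable: $1" >&2
    return 1
  fi
}
```

For the path in the config above: check_chrome "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"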

Now we are all set. The next thing is to start it.

Start Gopa

Starting Gopa is simple. Execute the command below:

➜  ./gopa-darwin64 
  ________ ________ __________  _____   
 /  _____/ \_____  \\______   \/  _  \  
/   \  ___  /   |   \|     ___/  /_\  \ 
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/ 
[gopa] 0.10.0_SNAPSHOT
///last commit: 65e729c, Sun Dec 17 10:52:49 2017 +0800, medcl, disable chrome test for CI ///

[12-19 10:53:35] [INF] [instance.go:23] workspace: data/gopa/nodes/0
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: checker started with 5 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: fetch started with 1 shards
[12-19 10:53:35] [INF] [pipeline.go:68] pipeline: update started with 4 shards
[12-19 10:53:35] [INF] [ui.go:148] http server listen at: http://127.0.0.1:9001
[12-19 10:53:35] [INF] [api.go:132] api server listen at: http://127.0.0.1:8001

As you can see, gopa-darwin64 started and is listening on two ports: one for the UI and the other for the API.
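
If the ports differ on your machine, the addresses can be pulled out of the startup log instead of read by eye. The sketch below assumes the log lines look like the ones shown above (ending in "listen at: http://..."); listen_urls is a hypothetical helper name of mine.

```shell
# Hypothetical helper (not part of Gopa): read startup log lines on stdin
# and print only the URLs that follow "listen at: ".
listen_urls() {
  sed -n 's|.*listen at: \(http://[^ ]*\).*|\1|p'
}
```

For example, if you saved the startup output to a file (gopa.log is just an assumed name): listen_urls < gopa.log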

Rocking now

Let’s open http://127.0.0.1:9001/admin/console/ with your favorite web browser (NO IE please). You will see a console page, where we type the commands to start crawling:
seed http://elastic.co and seed discuss.elastic.co

That's all you need to do. Let's check out the search page: open http://127.0.0.1:9001/, type elasticsearch, and you will see the results.

Now a simple Google for your own Elastic content is up and running. You can easily adapt it for your own purposes.

Final words

Gopa is an experimental project and is not 100% stable. There are some sample configs available that are not covered in this article; you may discover them by yourself. If you have any problems using it, please file an issue here. Like it? Give it a star!

Also, if you need a more complete site search solution, you may be interested in the enterprise-level SaaS product from our Swiftype team. More info and Here.

That’s it, hope you enjoyed it.


(Medcl) #2

This topic was automatically closed after 7 days. New replies are no longer allowed.