Dec 12th, 2024: [EN] Swing through web content like a superhero

ppf2 · December 12, 2024, 8:00am


Web crawlers at Elastic have evolved several times over the years to keep pace with the rapidly changing landscape of data ingestion, such as recent advancements in generative AI.

Swiftype is the creator of the popular SaaS product Site Search, which makes it easy for users to put a search box on their websites without needing deep technical skills. Site Search uses Elasticsearch behind the scenes to index and search web content. In 2017, Swiftype joined forces with Elastic, and in 2021 we released the App Search web crawler, available on AWS, GCP and Azure in more than 40 global regions for Elastic Cloud Hosted, as well as for self-managed environments. In 2022, the Elastic Web Crawler was released, allowing users to ingest crawled content directly into Elasticsearch indices.

All of the previous web crawlers (the Swiftype Site Search, App Search and Elastic web crawlers) require running a large, private Enterprise Search code base that bundles many other tools. In 2024, Elastic started development of the Elastic Open Crawler. This lightweight, open-code crawler is decoupled from Enterprise Search and performs significantly better than its predecessors. The Open Crawler is currently in beta.

This post provides a brief history of our crawler offerings, outlines the benefits of our latest Elastic Open Crawler, and provides resources to help you get started with it.

Crawler Evolution

Feature	Swiftype Site Search Crawler	App Search Web Crawler	Elastic Web Crawler (2)	Elastic Open Crawler
Release status	GA (Maintenance only)	GA (Maintenance only)	GA (Maintenance only)	Beta (Active development)
Compatible Elasticsearch versions	N/A	7.15-8.x	8.4-8.x	8.13+
Source code (Crawler)	Closed source	Closed source	Closed source	Open-code
Deployment option (Crawler setup)	Swiftype Service	Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed	Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed	Self-managed
Deployment option (Ingest destination)	Swiftype Service	Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed	Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed	Elastic Cloud Serverless, Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed
Deployment option (Index type)	Swiftype managed indices	App Search managed engines	Enterprise Search managed engines, Elasticsearch index engines and Standalone Elasticsearch indices	Standalone Elasticsearch indices
Vector embeddings pipeline	No	No	Yes (3)	Yes
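
As a rough illustration of how to read the "Compatible Elasticsearch versions" row, here is a minimal Python sketch that lists which crawlers apply to a given Elasticsearch version. The ranges are transcribed from the table above. Treating "8.x" as "anything below 9.0" and "8.13+" as having no upper bound are assumptions of this sketch, and the function names are made up for illustration. Versions are compared as integer tuples because a plain string comparison would wrongly rank "8.4" above "8.13".

```python
from typing import Optional

Version = tuple[int, int]

# (minimum inclusive, maximum exclusive) per crawler, transcribed from the table.
# Assumption: "8.x" means any version below 9.0; "8.13+" has no upper bound.
COMPATIBILITY: dict[str, tuple[Version, Optional[Version]]] = {
    "App Search Web Crawler": ((7, 15), (9, 0)),
    "Elastic Web Crawler": ((8, 4), (9, 0)),
    "Elastic Open Crawler": ((8, 13), None),
}


def parse_version(version: str) -> Version:
    """Turn a string such as '8.13.2' into (8, 13), ignoring the patch level."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def compatible_crawlers(version: str) -> list[str]:
    """Return the crawlers whose listed range contains the given version."""
    v = parse_version(version)
    return [
        name
        for name, (low, high) in COMPATIBILITY.items()
        if v >= low and (high is None or v < high)
    ]


for es_version in ("7.17.0", "8.4.0", "8.13.2"):
    print(es_version, "->", compatible_crawlers(es_version))
```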

Benefits of the Elastic Open Crawler

A lightweight, stateless and standalone command line tool decoupled from Enterprise Search.
Open-code repository.
Unified licensing model. Similar to Elastic Connectors, licensing is based on the ingest destination:
Standard+ when ingesting to Elastic Cloud Serverless or Elastic Cloud Hosted.
Platinum+ when ingesting to Elastic Cloud Enterprise (ECE), Elastic Cloud on Kubernetes (ECK) and Self-Managed.
Bring your own Elasticsearch indices with custom index mappings and settings.
Effortlessly incorporate Elasticsearch semantic search workflows.
Indexing uses the Elasticsearch Bulk API, and the number of indexing and search requests sent to Elasticsearch has been drastically reduced, which significantly improves performance.
Easily run within your own infrastructure using a Docker container.
Single tool that can ingest web content to all Elasticsearch platforms: Elastic Cloud Serverless, Elastic Cloud Hosted (ESS), Elastic Cloud Enterprise (ECE), Elastic Cloud on Kubernetes (ECK) and Self-Managed.
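
To make the "bring your own index" and Bulk API points a little more concrete, here is a minimal Python sketch of the general technique. It is not the Open Crawler's actual implementation: the index name, the field names (`url`, `title`, `body`) and the URL-hash document ID are all assumptions chosen for illustration. It only builds the request bodies, so it runs without an Elasticsearch cluster.

```python
import hashlib
import json

# Hypothetical index name, used only for this illustration.
INDEX_NAME = "my-crawled-pages"

# A custom index definition you might create yourself before crawling.
# The field names are illustrative, not the crawler's documented schema.
INDEX_BODY = {
    "settings": {"number_of_shards": 1},
    "mappings": {
        "properties": {
            "url": {"type": "keyword"},
            "title": {"type": "text"},
            "body": {"type": "text"},
        }
    },
}


def doc_id(url: str) -> str:
    """Derive a stable document ID from the page URL (an assumption of this sketch)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def build_bulk_body(index: str, pages: list[dict]) -> str:
    """Serialize pages into the newline-delimited JSON that the _bulk endpoint expects."""
    lines = []
    for page in pages:
        # Action line: index (create or overwrite) the document under its stable ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id(page["url"])}}))
        # Source line: the document itself.
        lines.append(json.dumps(page))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


pages = [
    {"url": "https://example.com/", "title": "Home", "body": "Welcome"},
    {"url": "https://example.com/about", "title": "About", "body": "About us"},
]
bulk_body = build_bulk_body(INDEX_NAME, pages)
print(bulk_body, end="")
```

Here two pages become a single request body of four lines, so one HTTP round trip replaces two separate index requests. Batching many documents per request in this way is how the Bulk API cuts down the number of calls to Elasticsearch.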

Wait, there's more coming!

Coming Soon: Elastic Open Crawler Roadmap

The following are features on the Elastic Open Crawler roadmap:

Elastic managed crawler
Programmatic story for ease of management
Event logging
Custom field extraction using meta tags and data attributes
Full HTML extraction
Transition Tool (1)

(1) Such tooling will support transitioning crawler configurations from App Search Web Crawler and Elastic Web Crawler. Users will need to recrawl the content using the Elastic Open Crawler.

(2) Additional details on feature comparison between Elastic Open Crawler and Elastic Web Crawler are available here.

(3) While you can create your own inference pipeline, the fields storing the vectors cannot be used by the App Search engine.

Get Started with Elastic Open Crawler

To begin your web-swinging journey, check out these resources!

Getting started
Open Crawler released for tech-preview
Open Crawler now in beta
Elastic Open Crawler video walkthrough
Semantic search using the Open Crawler and Semantic Text
View and adapt the code to meet your needs
Scale and manage deployments with simple CLI commands
Easily integrate to power hybrid, conversational search experiences

Keep your search applications ahead of the curve with our fully customizable web crawler. Got questions? Join our Slack workspace or discussion forum to connect with our developer community.

Have your-ELK a merry little crawler-day!
