Dec 12th, 2024: [EN] Swing through web content like a superhero

Web crawlers at Elastic have undergone multiple evolutions throughout the years to adapt to the rapidly changing landscape of data ingestion (e.g., recent advancements in generative AI).

Swiftype created the popular SaaS product Site Search, which makes it easy for users to put a search box on their websites without needing deep technical skills. Site Search uses Elasticsearch behind the scenes to index and search web content. In 2017, Swiftype joined forces with Elastic, and in 2021 we released the App Search web crawler, available on AWS, GCP and Azure in more than 40 global regions for Elastic Cloud Hosted and self-managed environments. In 2022, the Elastic Web Crawler was released, allowing users to ingest crawled content directly into Elasticsearch indices.

All of the previous web crawlers (Swiftype Site Search, App Search and Elastic Web Crawler) require running the large, private Enterprise Search code base, which bundles many other tools. In 2024, Elastic began developing the Elastic Open Crawler. This lightweight, open-code crawler is decoupled from Enterprise Search and performs significantly better than its predecessors. The Open Crawler is currently in beta.

This post provides a brief history of our crawler offerings, outlines the benefits of our latest Elastic Open Crawler, and provides resources to help you get started with it.

Crawler Evolution

| | Swiftype Site Search Crawler | App Search Web Crawler | Elastic Web Crawler (3) | Elastic Open Crawler | Elastic managed Web Crawler (1) |
|---|---|---|---|---|---|
| Release status | GA (Maintenance only) | GA (Maintenance only) | GA (Maintenance only) | Beta (Active development) | On the roadmap |
| Compatible Elasticsearch versions | N/A | 7.15-8.x | 8.4-8.x | 8.13+ | 9.x |
| Source code (Crawler) | Closed source | Closed source | Closed source | Open-code | Open-code |
| Deployment option (Crawler setup) | Swiftype Service | Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed | Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed | Self-managed | Elastic Cloud Serverless, Elastic Cloud Hosted |
| Deployment option (Ingest destination) | Swiftype Service | Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed | Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed | Elastic Cloud Serverless, Elastic Cloud Hosted, Elastic Cloud Enterprise, Elastic Cloud on Kubernetes and Self-managed | Elastic Cloud Serverless, Elastic Cloud Hosted, TBD |
| Deployment option (Index type) | Swiftype managed indices | App Search managed engines | Enterprise Search managed engines, Elasticsearch index engines and Standalone Elasticsearch indices | Standalone Elasticsearch indices | Standalone Elasticsearch indices |
| Vector embeddings pipeline | No | No | Yes (4) | Yes | Yes |

Benefits of the Elastic Open Crawler

  • A lightweight, stateless and standalone command line tool decoupled from Enterprise Search.

  • Open-code repository.

  • Unified licensing model. Similar to Elastic Connectors, licensing is based on the ingest destination:

      • Standard+ when ingesting to Elastic Cloud Serverless or Elastic Cloud Hosted.

      • Platinum+ when ingesting to Elastic Cloud Enterprise (ECE), Elastic Cloud on Kubernetes (ECK) and Self-Managed.

  • Bring your own Elasticsearch indices with custom index mappings and settings.

  • Effortlessly incorporate Elasticsearch semantic search workflows.

  • Indexing uses the Elasticsearch Bulk API, and the number of indexing and search requests sent to Elasticsearch has been drastically reduced, providing significant performance improvements.

  • Easily run within your own infrastructure using a Docker container.

  • A single tool that can ingest web content to all Elasticsearch platforms: Elastic Cloud Serverless, Elastic Cloud Hosted (ESS), Elastic Cloud Enterprise (ECE), Elastic Cloud on Kubernetes (ECK) and Self-Managed.
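
To give a feel for why bulk indexing cuts down on requests, here is a minimal Python sketch of how several crawled pages can be packed into a single Elasticsearch Bulk API request body (newline-delimited JSON: one action line followed by one document line per page). This is not the Open Crawler's own code. The function name `build_bulk_body`, the index name `web-crawl-demo`, the `url`/`title`/`body` fields, and the choice of the page URL as the document `_id` are all assumptions made for illustration.

```python
import json
from typing import Iterable


def build_bulk_body(index_name: str, pages: Iterable[dict]) -> str:
    """Pack crawled pages into one newline-delimited JSON Bulk API body.

    Hypothetical helper for illustration only; not part of the Open Crawler.
    """
    lines = []
    for page in pages:
        # Action line: tells Elasticsearch to index the next line as a document.
        # Using the page URL as _id is an assumption of this sketch.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": page["url"]}}))
        # Source line: the document itself.
        lines.append(json.dumps(page))
    if not lines:
        return ""
    # The Bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


# Two example pages (made-up data) become a single request body.
pages = [
    {"url": "https://example.com/", "title": "Home", "body": "Welcome"},
    {"url": "https://example.com/about", "title": "About", "body": "About us"},
]
print(build_bulk_body("web-crawl-demo", pages))
```

Sending a body like this to the `_bulk` endpoint indexes both pages in one round trip instead of two, which is the general idea behind the reduced request count mentioned above.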

Wait, there's more coming!

Coming Soon: Elastic Open Crawler Roadmap

The following are features on the Elastic Open Crawler roadmap (1):

  • Elastic managed crawler

  • Programmatic story for ease of management

  • Event logging

  • Custom field extraction using meta tags and data attributes

  • Full HTML extraction

  • Transition Tool (2)

(1) The roadmap information in this article is forward-looking and represents our current vision for the Elastic Open Crawler. These plans are subject to change at our discretion.

(2) Such tooling will support transitioning crawler configurations from App Search Web Crawler and Elastic Web Crawler. Users will need to recrawl the content using the Elastic Open Crawler.

(3) Additional details on feature comparison between Elastic Open Crawler and Elastic Web Crawler are available here.

(4) While you can create your own inference pipeline, the fields storing the vectors cannot be used by the App Search engine.

Get Started with Elastic Open Crawler

To begin your web-swinging :spider_web: journey, check out these resources!

Keep your search applications ahead of the curve with our fully customizable web crawler. Got questions? Join our Slack workspace or discussion forum to connect with our developer community.

Have your-ELK a merry little crawler-day! :christmas_tree:
