From Coveo to Elasticsearch

Dear Community,

I have a web application built for enterprise search using Coveo. It indexes many sources (CSVs, Excel files, SharePoint, S3, APIs, etc.) using the default Coveo connectors plus a few custom connectors written in C#.

I want to build a POC of the same application and demonstrate the Elasticsearch capabilities with all of those sources.

Key Points:

  1. Data is not only incremental but also mutable. That is, already-indexed documents can change, and the index must be able to pick up those changes.

  2. I would like a single technology to connect to and index all of those data sources, be it the Python client or Logstash. What is the best option? I do not have any logs for now.

  3. One of the main reasons to move to Elasticsearch is that it is open source, whereas Coveo requires a license.
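For point 1 (mutable data), the usual Elasticsearch approach is an upsert: an `update` action with `doc_as_upsert`, so a changed record overwrites the previously indexed version and a brand-new record is simply created. A minimal sketch with the official Python client's bulk-action format — the index name, id scheme, and record shape here are my own placeholders:

```python
# Sketch: building _bulk upsert actions for mutable source records.
# Index name and id scheme are illustrative assumptions.

def upsert_action(index, doc_id, doc):
    """One bulk 'update' action with doc_as_upsert: if the document
    exists, the changed fields are merged in; if not, it is created."""
    return {
        "_op_type": "update",
        "_index": index,
        "_id": doc_id,
        "doc": doc,
        "doc_as_upsert": True,
    }

# One action per source record; a stable per-source id (here a made-up
# SharePoint item id) is what lets re-crawled records update in place.
actions = [
    upsert_action(
        "enterprise-search",
        "sharepoint-1234",
        {"title": "Quarterly report", "source": "sharepoint", "version": 2},
    ),
]

# These actions can then be sent with elasticsearch.helpers.bulk(es, actions).
```

The key design point is the stable `_id`: as long as each connector derives the same id for the same source record on every run, re-indexing is idempotent and mutations are applied automatically.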

BTW, I am an Elastic Certified Engineer, but I don't have much experience with the front-end stack. I am looking for advice from both the front-end and back-end perspectives.

I really appreciate your recommendations and suggestions.

Thank you!

Welcome!

Why not use Elastic Workplace Search?

It's designed for that use case.

Hi @dadoonet

Thank you for your response.

I'm not sure I can go with that in the actual project, though I could cover it to some extent with the Basic version.

So I just wanted to go with an open-source stack.

Up to you.

Note that Workplace Search is available in the free tier with the built-in Basic license.

If you want to re-implement all the crawlers yourself in a single tool, plus the UI, I'm not sure what advice I can give. As a Java developer, I'd probably go the Java route, but it's up to you.

For binary documents (PDF, etc.), you can use the ingest attachment plugin.

There's an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.10] | Elastic

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is the Base64 representation of your binary file.

You can use FSCrawler. There's a tutorial to help you get started.
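For reference, an FSCrawler job is driven by a small settings file; a minimal sketch, assuming the FSCrawler 2.x layout (`~/.fscrawler/<job_name>/_settings.yaml`) — the job name, paths, and URL here are placeholders:

```yaml
# ~/.fscrawler/my_docs/_settings.yaml (hypothetical job "my_docs")
name: "my_docs"
fs:
  url: "/path/to/documents"   # local directory to crawl
  update_rate: "15m"          # re-scan interval, picks up changed files
elasticsearch:
  nodes:
  - url: "http://localhost:9200"
```

The job is then started with something like `bin/fscrawler my_docs`; check the FSCrawler documentation for the exact options of the version you install.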