Tokenizing a string in a canvas element

ebalmeida · March 29, 2019, 7:09pm

Hi, I've been having a little difficulty solving this particular problem using the tools available in Canvas.

In a relational database I have a table of contracts with ids and types of contract and each contract may have many tasks with ids, start date, end date and task name. Those were originally ingested in elastic as two indexes (one for contracts and one for tasks, which also contain all contract data) and I have limited say as on how to make this ingestion. Each contract doc in elastic also contains a column containing all tasks for that contract as a string combining start date, end date (if existing) and task name, in the following format:

start_date - (end_date | "N/A") - task_name

What I need is a way to count the sum of number of times two specific tasks occured (Task A or Task B for simplicity sake) for each type of contract, as long as Task A didn't occur right after Task B or A, in which case they should only be counted once. So:

C, A, C, A

should sum 2

C, B, C, A

should be 2 as well, while

C, B, A, C

should be 1 and

C, A, A, A

should be 1 as well.

There's also a time constraint, in which all start dates must be in the last "closed" month (right now that's Feb, 2019)
I can use essql with the tasks index to count all occurences of the tasks in the month, but that doesn't account for the duplicates (because looking at tasks alone I have no way of knowing what came before in the contract).
Using the contracts index I can bring the list of tasks as a string for each contract. My instinct is to tokenize, clean it of dupes in a list and then count it, but I found no way to do that using the functions or examples described in canvas.

Maybe someone has been through something similar before and can shed some light. Thank you very much.

tsullivan · March 29, 2019, 10:37pm

You'll need your data to be in a ready state for analysis. If it has dupes or signals that are hard to find, the best bet would be to re-index the data but pre-process it first.

I get that you have limited say on how to make the ingestion. But getting this to work well will probably involve a 2-step ingestion. What you have now is just raw data. You could set up a Logstash instance that uses Elasticsearch input plugin, watches for new raw data, send the data through a clean-up filter, and re-index clean docs into a 3rd index. Alternatively, you could use Watcher to do this, and use things like scripted metric aggregation to clean the data before it gets re-indexed.

ebalmeida · April 1, 2019, 3:07am

That's a nice solution and I'll check whether it'll be possible. I take it it's not possible to do that level of manipulation using the canvas expression language then?

tsullivan · April 1, 2019, 4:58pm

Hi,

With a lot of work, you might be able to get close to what you're looking for. But as soon as you find something interesting to track down, you're limited because of the data model. Anything extra you're interested to find out from here is going to be a repeat of the hard work.

With a more expressive data model, you'll be quicker in finding interesting things to track down, and quicker in tuning the searches just using basic concepts like filtering and Elasticsearch-based aggregations. That's also going to be way more performant: Elasticsearch brings the power of clustered nodes, whereas Kibana's processing with Canvas, for the most part is happening in a single thread in a single browser or NodeJS server.

ebalmeida · April 4, 2019, 6:10pm

I see, I'll test some proper ingestion (or re-ingestion, digestion?) using logstash in a test environment and pitch the case. Currently we're using some custom made robots using oracle's libraries to ingest from Oracle to elasticsearch.

Thanks a lot. I'd still be interested in seeing a possible solution mixing essql and the expression language if anyone can offer one, for learning sake.

system · May 2, 2019, 6:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.