Elasticsearch on Spark - case class limit of 22 fields in Scala

eliasah · August 14, 2015, 12:07pm

I'm trying to work on the KDD99 dataset for fraud detection. In the dataset, a record looks like this:

0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,.00,0.00,0.00,normal.

A record represents a connection. For each connection, the data set contains information like the number of bytes sent, login attempts, TCP errors, and so on. Each connection is one line of CSV-formatted data, containing 38 features.

I want to write the data into Elasticsearch using Spark so I can first explore it visually with Kibana, before running deeper computation with Spark to predict whether a record is fraudulent or not.

The issue is that up to and including Scala 2.10, a case class is limited to 22 fields, which means I can't create a case class to represent a record.

How can I work around this limitation without switching to Scala 2.11, which seems to solve the issue (SI-7296)?

I appreciate your help. Thanks in advance!

costin · August 18, 2015, 9:36am

Case classes are just one of the types that can be serialized out of the box. You can just as well use a Map (whether in Scala or Java) or a JavaBean, though I would recommend the former, especially considering the large number of parameters involved.
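
A minimal sketch of the Map approach, in case it helps. It only shows turning one CSV line into a Map[String, Any] in plain Scala. The object name RecordToMap, the placeholder column names (f0, f1, ...), the input path and the index/type name in the trailing comment are all assumptions for illustration, not taken from the dataset or from this thread.

```scala
// Sketch: represent one CSV line as a Map instead of a case class.
// Column names are placeholders (f0, f1, ...), not the real KDD99 names.
object RecordToMap {
  // Keep values that parse as numbers as Double, everything else as String.
  def parseValue(raw: String): Any =
    try raw.toDouble catch { case _: NumberFormatException => raw }

  // The last column is treated as the label; its trailing '.' is stripped.
  def toDoc(line: String): Map[String, Any] = {
    val cols = line.split(',')
    val features = cols.init.zipWithIndex.map { case (v, i) => s"f$i" -> parseValue(v) }
    features.toMap + ("label" -> cols.last.stripSuffix("."))
  }
}

// With elasticsearch-hadoop on the classpath, an RDD of such Maps could then be
// written roughly like this (path and "kdd99/connection" are made-up names):
//   import org.elasticsearch.spark._
//   sc.textFile("kddcup.data").map(RecordToMap.toDoc).saveToEs("kdd99/connection")
```

Since a Map has no arity limit, the number of columns does not matter here; the trade-off is that field names and types are no longer checked by the compiler.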

eliasah · August 18, 2015, 9:58am

OK, thanks! So you would actually recommend using a plain Map structure?

costin · August 18, 2015, 10:01am

A case class is just that - a strongly-typed Map. So why not use a Map? Especially if the properties are all of the same type and you can add some generics to it, the Map should work just fine and has no size limit.
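
To illustrate the "same type plus generics" point: if only the numeric columns are of interest, the Map can be typed as Map[String, Double]. This is a rough sketch under that assumption; the object name NumericFeatures, the helper numericOnly and the positional keys (f0, f1, ...) are hypothetical.

```scala
import scala.util.Try

object NumericFeatures {
  // Hypothetical helper: keep only the columns that parse as numbers,
  // keyed by their position in the line; non-numeric columns are dropped.
  def numericOnly(line: String): Map[String, Double] =
    line.split(',').zipWithIndex.flatMap { case (v, i) =>
      Try(v.toDouble).toOption.map(d => s"f$i" -> d)
    }.toMap
}
```

Non-numeric columns such as the protocol or the label would need to be carried separately, or the value type widened to Any.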

eliasah · August 18, 2015, 10:02am

Great. Thanks! Your solution seems quite logical now.