IFE Information


Prediction as a service with ensemble model in SparkML and Python ScikitLearn

Dec. 20, 2016

Watch the recording of the talk given at Spark Summit Brussels 2016 here:
https://www.youtube.com/watch?v=wyfTjd9z1sY

SparkML on Databricks is a strong platform for applying Ensemble Learning at massive scale. This presentation describes a Prediction-as-a-Service platform that can predict trends on 1 billion observed prices daily. To train an ensemble model on a multivariate time series in a space of thousands or millions of dimensions, one has to fragment the whole space into subspaces that exhibit significant similarity. To achieve this, the very sparse space undergoes dimensionality reduction into a parameter space, which is then used to cluster the observations. The data in the resulting clusters is modeled in parallel using machine learning tools capable of coefficient estimation at massive scale (SparkML and Scikit Learn). The estimated model coefficients are stored in a database and used to execute predictions on demand via a web service. This approach trains models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The framework is used to predict airfares as a support tool for airline Revenue Management systems.
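The flow described above (reduce each series to a few parameters, cluster on those parameters, fit one model per cluster, store the coefficients, predict on demand) can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for SparkML and Scikit Learn, and every name in it (`summarize`, `kmeans`, `fit_line`, `train`, `predict`) is hypothetical. The choice of features (mean level and average step), the plain k-means, and the straight-line model are assumptions made for illustration, not the method used in the talk.

```python
from collections import defaultdict


def summarize(series):
    """Reduce one price series to a small parameter vector (assumed features)."""
    mean = sum(series) / len(series)
    avg_step = (series[-1] - series[0]) / (len(series) - 1)
    return (mean, avg_step)


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k points so it is deterministic."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def fit_line(ys):
    """Ordinary least squares of y on t = 0..n-1; returns (intercept, slope)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return (y_mean - slope * t_mean, slope)


def train(series_by_key, k):
    """Cluster equal-length series by their parameters and fit one line per cluster."""
    keys = list(series_by_key)
    labels = kmeans([summarize(series_by_key[key]) for key in keys], k)
    grouped = defaultdict(list)
    for key, label in zip(keys, labels):
        grouped[label].append(series_by_key[key])
    coefficients = {}
    for label, members in grouped.items():
        # Fit on the pointwise average of the cluster's series.
        average = [sum(values) / len(values) for values in zip(*members)]
        coefficients[label] = fit_line(average)
    # The two dicts play the role of the coefficient store queried by the web service.
    return dict(zip(keys, labels)), coefficients


def predict(key, t, cluster_of, coefficients):
    """On-demand prediction from stored coefficients only; no refit is needed."""
    intercept, slope = coefficients[cluster_of[key]]
    return intercept + slope * t
```

Calling `train` once per day and serving `predict` from the stored dictionaries mirrors the split between batch training and on-demand prediction. In a real deployment the per-cluster fits would run in parallel and the coefficients would live in a database.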


Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions

May. 25, 2016

Prediction using Machine Learning (ML) techniques on Big Data is a challenging problem, both computationally and at the system level. Scalability is the prime concern, especially when the system processes approximately 10^9 observations per day. To rapidly train models covering the whole multivariate space, time series vectors that exhibit significant similarities are clustered into groups. The resulting vector clusters can then be modelled using ML tools capable of coefficient estimation at massive scale (Apache Spark with Scikit Learn). The presentation describes the application of Linear Regression and Support Vector Regression with a Radial Basis Function kernel. This approach trains models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The framework is used to predict airfares as a support tool for Revenue Management systems.
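As a rough illustration of the two model types named above, the sketch below fits a Linear Regression and a Support Vector Regression with an RBF kernel to the prices of a single cluster using scikit-learn. The helper name `fit_cluster_models`, the use of the day index as the only feature, and the hyperparameter values (`C`, `gamma`, `epsilon`) are assumptions chosen for the example. The talk does not state which features or settings were used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR


def fit_cluster_models(days, prices):
    """Fit both regressors on one cluster's (day, price) observations."""
    X = np.asarray(days, dtype=float).reshape(-1, 1)  # one feature: day index
    y = np.asarray(prices, dtype=float)

    # Linear trend: two coefficients (intercept and slope) to store per cluster.
    linear = LinearRegression().fit(X, y)

    # Non-linear alternative; the hyperparameters here are illustrative guesses.
    svr = SVR(kernel="rbf", C=100.0, gamma=0.1, epsilon=0.1).fit(X, y)

    return linear, svr
```

Both fitted objects expose `predict`, for example `linear.predict(np.array([[30.0]]))`, so either could back the same prediction call once trained.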


Extreme Apache Spark: how in 3 months we created a pipeline that can process 2.5 billion rows a day

Mar. 19, 2016

The presentation is a bundle of pro tips and tricks for building an insanely scalable data pipeline based on Apache Spark and Spark Streaming.

The presentation consists of 4 parts:
* Quick intro to Spark
* N-billion rows/day system architecture
* Data Warehouse and Messaging
* How to deploy Spark so it does not backfire