Sketching: accelerating big data computations

Full Featured (30 min.)

Working with big data can be hard, especially at extremely large scale, where processing demands massive resources and time. This often leads to long development cycles and production issues. In many cases, applications can be accelerated dramatically if some approximation is allowed. The idea is to build a small sample of the data that enables fast processing on one hand, yet achieves good result accuracy on the other. Some of these sampling techniques are used 'behind the scenes' in popular big data processing frameworks such as Hadoop MapReduce and Spark. In this talk, I will present sampling techniques that enable big data applications to run faster while still producing accurate results.
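
To make the idea concrete, here is a minimal, hypothetical Python sketch (not taken from the talk itself) showing the trade-off the abstract describes: a statistic computed on a small uniform random sample approximates the exact answer at a fraction of the cost. The function and parameter names (approximate_mean, sample_fraction) are illustrative assumptions.

```python
import random

def approximate_mean(values, sample_fraction=0.01, seed=42):
    """Estimate the mean of a large dataset from a small uniform random sample."""
    rng = random.Random(seed)
    # Keep each value independently with probability sample_fraction.
    sample = [v for v in values if rng.random() < sample_fraction]
    if not sample:                      # guard against an empty sample
        return None
    return sum(sample) / len(sample)    # sample mean estimates the true mean

# Exact vs. approximate mean on one million synthetic values.
data = [random.gauss(100, 15) for _ in range(1_000_000)]
exact = sum(data) / len(data)
approx = approximate_mean(data)
print(f"exact={exact:.3f}  approx={approx:.3f}")
```

With a 1% sample, the estimate typically lands within a fraction of a percent of the exact mean while touching only about one in a hundred records; the talk covers sampling schemes that give such guarantees more systematically.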