New Features in Spark-2.0
& API History of Spark
顾亮亮
2016.06.29


Agenda


New Features in Spark-2.0


Major Features in Spark-2.0

References


Tungsten


Tungsten Phase 2


TPC Performance


Whole Stage Code Generation

Code Generation VS Whole Stage Code Generation

Whole Stage Code Generation
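
The difference between per-operator execution and whole-stage code generation can be illustrated with a small plain-Scala sketch. It does not use Spark and is not Spark's generated code. The names `Operator`, `Scan`, `Filter`, `Sum`, `volcano` and `fused` are hypothetical and exist only for this illustration. `volcano` pulls rows through a chain of operators one `next()` call at a time, while `fused` is the kind of single tight loop that collapsing the whole stage into one function aims to produce.

```scala
object WholeStageSketch {
  // Iterator ("Volcano") model: each operator pulls rows from its child via next().
  trait Operator { def next(): Option[Long] }

  final class Scan(data: Array[Long]) extends Operator {
    private var i = 0
    def next(): Option[Long] =
      if (i < data.length) { val v = data(i); i += 1; Some(v) } else None
  }

  final class Filter(child: Operator, p: Long => Boolean) extends Operator {
    def next(): Option[Long] = {
      var row = child.next()
      // Skip rows that fail the predicate; stop when the child is exhausted.
      while (row.exists(v => !p(v))) row = child.next()
      row
    }
  }

  final class Sum(child: Operator) {
    def result(): Long = {
      var total = 0L
      var row = child.next()
      while (row.isDefined) { total += row.get; row = child.next() }
      total
    }
  }

  // SELECT sum(v) WHERE v % 2 = 0, evaluated operator by operator.
  def volcano(data: Array[Long]): Long =
    new Sum(new Filter(new Scan(data), _ % 2 == 0)).result()

  // The same query written as one loop: no per-row virtual calls, no Option boxing.
  def fused(data: Array[Long]): Long = {
    var total = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) total += v
      i += 1
    }
    total
  }
}
```

Both functions compute the same result; the fused form simply keeps intermediate values in local variables instead of passing them between operator objects.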

References


Structured Streaming API: Simple


Structured Streaming

References


Structured Streaming = Infinite DataFrame

DataFrame

Structured Streaming
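
A minimal sketch of the "infinite DataFrame" idea, assuming Spark 2.0 on the classpath: the same `wordCounts` function is applied to a bounded Dataset and to an unbounded one read with `readStream`. The object name, the function name, the local master setting and the socket host and port (`localhost:9999`) are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object InfiniteDataFrameSketch {
  // One transformation, usable on both batch and streaming input.
  def wordCounts(spark: SparkSession, lines: Dataset[String]): DataFrame = {
    import spark.implicits._
    lines.flatMap(_.split(" ")).groupBy("value").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run for the demo
      .appName("infinite-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    // Batch: a finite Dataset[String].
    val batchLines = Seq("spark streaming", "spark sql").toDS()
    wordCounts(spark, batchLines).show()

    // Streaming: an unbounded Dataset[String] from a socket source (assumed host/port).
    val streamLines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]

    // The result table is updated continuously and printed to the console.
    val query = wordCounts(spark, streamLines).writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

To try the streaming half, something has to be listening on the assumed port first, for example `nc -lk 9999`.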


Schedule


Goal


Unifying Datasets and DataFrames


Other Features in Spark-2.0


API History Of Spark


RDD -> Structured Streaming

Batch

  1. RDD
  2. SparkSQL(DataFrame)
  3. Dataset

Streaming

  1. DStream (based on the RDD API)
  2. StreamingSQL (based on the SparkSQL API) (Intel)
  3. Structured Streaming (based on the Dataset API)

Machine Learning

  1. Row API (based on the RDD API)
  2. ML Pipelines (based on the DataFrame API)

Spark API


RDD API

References


DStream API

References


Shark -> SparkSQL(DataFrame)

Shark: Hive on Spark

SparkSQL(DataFrame)

References


SparkSQL(DataFrame API) (1.3.0)

Architecture

Example


Fields based API

The DataFrame API is similar to the fields-based API of Twitter Scalding.
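
A short sketch of what "fields-based" means in practice, assuming Spark 2.0 on the classpath: columns are addressed by name, so a misspelled field is only reported when the query is analyzed at runtime, not by the compiler. The object name, function name and sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FieldsBasedSketch {
  // Fields are referenced by their string names ("age", "name").
  def adultNames(people: DataFrame): DataFrame =
    people.filter(people("age") >= 18).select("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run for the demo
      .appName("fields-based-sketch")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Ann", 31), ("Bob", 12)).toDF("name", "age")
    adultNames(people).show()
    spark.stop()
  }
}
```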


Dataset API (1.6.0)

Dataset API: Typed interface over DataFrame API

References


Dataset API (1.6.0) (CONT.)

Spark-1.6

!scala
abstract class RDD[T: ClassTag] { ... }
class DataFrame { ... }
class Dataset[T] { ... }

Spark-2.0

!scala
type DataFrame = Dataset[Row]  // a type alias in the org.apache.spark.sql package object

Type Safe API

The Dataset API is similar to the type-safe API of Twitter Scalding.
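
The typed side can be sketched as follows, assuming Spark 2.0 on the classpath. The case class `Person`, the object name and the sample data are hypothetical. The lambda passed to `filter` is checked by the Scala compiler, and since a DataFrame in Spark 2.0 is a `Dataset[Row]`, the two views convert with `toDF()` and `as[Person]`.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Defined at top level so Spark can derive an encoder for it.
final case class Person(name: String, age: Int)

object TypedApiSketch {
  // A typo such as _.agee would be a compile error here.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(_.age >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run for the demo
      .appName("typed-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 31), Person("Bob", 12)).toDS()
    adults(ds).show()

    // Untyped view of the same data, and back to the typed view.
    val df: DataFrame = ds.toDF()
    val rows: Dataset[Row] = df // compiles because DataFrame is Dataset[Row] in 2.0
    val typedAgain: Dataset[Person] = rows.as[Person]
    typedAgain.show()
    spark.stop()
  }
}
```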


Thank you