New Features in Spark-2.0
& API History of Spark
顾亮亮
2016.06.29

Presenter Notes

Agenda

  • New Features in Spark-2.0
    • Tungsten Phase 2
    • Structured Streaming
    • Unifying Datasets and DataFrames
    • Other new Features in Spark-2.0
  • API History Of Spark
    • Batch: RDD -> SparkSQL (DataFrame) -> Dataset
    • Streaming: DStream -> StreamingSQL -> Structured Streaming
    • Machine Learning: Row API -> ML Pipeline

New Features in Spark-2.0

Tungsten

Tungsten Phase 2

TPC Performance

Whole Stage Code Generation

Code Generation vs. Whole Stage Code Generation

  • Code Generation: only speeds up the evaluation of expressions (for example, 1 + a)
  • Whole Stage Code Generation: generates code for the entire query plan

Whole Stage Code Generation

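  • Eliminates virtual function calls
  • Keeps intermediate data in CPU registers instead of memory
  • Exploits modern CPU features through loop unrolling and SIMD; vectorization can speed up operators for which code generation is too complex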
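The points above can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark's actual generated code: `sumVolcano` evaluates a scan-filter-sum plan through iterator-style operators, where every row crosses a virtual `next()` call, while `sumFused` is roughly what a single fused function for the same plan might look like. All names (`RowIterator`, `Scan`, `Filter`, `sumVolcano`, `sumFused`) are hypothetical.

```scala
object WholeStageSketch {
  // Iterator-style operator interface: one virtual call per row per operator.
  trait RowIterator {
    def hasNext: Boolean
    def next(): Long
  }

  // Produces the values 0 until n.
  final class Scan(n: Long) extends RowIterator {
    private var i = 0L
    def hasNext: Boolean = i < n
    def next(): Long = { val v = i; i += 1; v }
  }

  // Passes through only the rows of the child that satisfy the predicate.
  final class Filter(child: RowIterator, p: Long => Boolean) extends RowIterator {
    private var buffered = 0L
    private var ready = false
    def hasNext: Boolean = {
      while (!ready && child.hasNext) {
        val v = child.next()
        if (p(v)) { buffered = v; ready = true }
      }
      ready
    }
    def next(): Long = {
      if (!hasNext) throw new NoSuchElementException("no more rows")
      ready = false
      buffered
    }
  }

  // Operator-at-a-time evaluation: sum of the even values below n.
  def sumVolcano(n: Long): Long = {
    val plan = new Filter(new Scan(n), (x: Long) => x % 2 == 0)
    var sum = 0L
    while (plan.hasNext) sum += plan.next()
    sum
  }

  // The same plan collapsed into one loop: no virtual calls, and the
  // running sum and loop counter can stay in registers.
  def sumFused(n: Long): Long = {
    var sum = 0L
    var i = 0L
    while (i < n) {
      if (i % 2 == 0) sum += i
      i += 1
    }
    sum
  }
}
```

Both functions return the same result for any `n`; the difference is only in how much per-row overhead sits between the operators.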

References

Structured Streaming API: Simple

Structured Streaming = Infinite DataFrame

DataFrame

Structured Streaming
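One way to read "Structured Streaming = Infinite DataFrame" is that the same query function can be applied to a bounded and an unbounded input. The sketch below assumes the Spark 2.0 Scala API (`spark.read` and `spark.readStream`); the schema, the `/tmp/events` directory, and the `countByDevice` helper are hypothetical and exist only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

object InfiniteDataFrameSketch {
  // The same query logic is reused for the bounded and the unbounded input.
  def countByDevice(events: DataFrame): DataFrame =
    events.groupBy("device").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InfiniteDataFrameSketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical schema and input directory, chosen only for illustration.
    val schema = new StructType().add("device", StringType).add("signal", StringType)
    val inputDir = "/tmp/events"

    // Batch: a bounded DataFrame, computed once.
    countByDevice(spark.read.schema(schema).json(inputDir)).show()

    // Streaming: files arriving in the directory form an unbounded DataFrame,
    // and the result is updated as new data appears.
    val query = countByDevice(spark.readStream.schema(schema).json(inputDir))
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Only the reader (`read` vs. `readStream`) and the sink differ between the two halves; the aggregation itself is unchanged.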

Schedule

Goal

Unifying Datasets and DataFrames

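Other New Features in Spark-2.0

  • A new ANSI SQL parser that can run all 99 TPC-DS queries
  • SparkSession replaces the old SQLContext and HiveContext
  • A new Accumulator API with a simpler type hierarchy and support for primitive types
  • The DataFrame-based Machine Learning API can now serve as the primary ML API
  • Machine learning pipeline persistence: ML pipelines and models can be saved and loaded in every language Spark supports
  • Distributed algorithms in R: Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means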
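Two of the items above, SparkSession and the new Accumulator API, can be shown together in a short sketch. It assumes the Spark 2.0 Scala API (`SparkSession.builder()` and `SparkContext.longAccumulator`); the object name, the `countEven` helper, and the accumulator name "evens" are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object Spark2EntryPointSketch {
  // Counts the even numbers in 1..n with a LongAccumulator, the
  // primitive-specialized accumulator of the new API.
  def countEven(spark: SparkSession, n: Int): Long = {
    val evens = spark.sparkContext.longAccumulator("evens")
    spark.sparkContext.parallelize(1 to n).foreach { x =>
      if (x % 2 == 0) evens.add(1L)
    }
    evens.sum
  }

  def main(args: Array[String]): Unit = {
    // A single entry point; no separate SQLContext or HiveContext is created.
    val spark = SparkSession.builder()
      .appName("Spark2EntryPointSketch")
      .master("local[2]")
      .getOrCreate()

    println(countEven(spark, 10))
    spark.stop()
  }
}
```

The accumulator is updated inside an action (`foreach`), which is where Spark applies each task's update exactly once.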

API History Of Spark

RDD -> Structured Streaming

Batch

  1. RDD
  2. SparkSQL(DataFrame)
  3. Dataset

Streaming

  1. DStream (based on the RDD API)
  2. StreamingSQL (based on the SparkSQL API) (Intel)
  3. Structured Streaming (based on the Dataset API)

Machine Learning

  1. Row API (based on the RDD API)
  2. ML Pipelines (based on the DataFrame API)

Spark API

DStream API

  • RDD interfaces
  • Windowing
  • Incremental Aggregation
  • Time-Skewed joins
  • Statically compiled into RDD operations for execution
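As a rough illustration of the RDD-style interface, windowing, and incremental aggregation listed above, here is a sketch of a windowed word count on the DStream API. The socket source on `localhost:9999`, the checkpoint directory, and the window and slide lengths are assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWindowSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Checkpointing is required when an inverse reduce function is used.
    // The directory is a hypothetical location.
    ssc.checkpoint("/tmp/dstream-checkpoint")

    // RDD-like transformations on a stream of text lines (hypothetical source).
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // 30-second window sliding every 10 seconds. The second function subtracts
    // the counts that leave the window, so each window is updated incrementally
    // rather than recomputed from scratch.
    val counts = words.map(w => (w, 1)).reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(30),
      Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch interval, the DStream operations above are turned into ordinary RDD jobs, which is the sense in which the API is compiled down to RDD execution.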

References

Shark -> SparkSQL(DataFrame)

Shark: Hive on Spark

SparkSQL(DataFrame)

References

SparkSQL(DataFrame API) (1.3.0)

Architecture

Example

Fields based API

The DataFrame API is similar to the Fields based API of Twitter Scalding.

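Dataset API (1.6.0)

Dataset API: Typed interface over DataFrame API

  • The RDD API is type-checked, but cannot be optimized by Catalyst
  • The DataFrame API can be optimized by Catalyst, but is not type-checked
  • The Dataset API sits between the two: it is type-checked and can also be optimized by Catalyst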
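The trade-off above can be sketched with one filter written against both APIs. The example uses the Spark 2.0 entry point (`SparkSession`) for brevity, which is an assumption; in 1.6 the same idea would go through `SQLContext`. The `Person` case class and both helper functions are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this illustration.
case class Person(name: String, age: Long)

object TypedVsUntypedSketch {
  // Untyped: the column name is a string that is resolved only at runtime,
  // so a misspelled column is reported when the query is analyzed.
  def adultsUntyped(df: DataFrame): DataFrame = df.filter("age >= 18")

  // Typed: the field access is checked by the Scala compiler,
  // and the query still runs through Catalyst.
  def adultsTyped(ds: Dataset[Person]): Dataset[Person] =
    ds.filter((p: Person) => p.age >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypedVsUntypedSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 31L), Person("Bob", 12L)).toDS()
    adultsUntyped(people.toDF()).show()
    adultsTyped(people).show()
    spark.stop()
  }
}
```

Writing `p.agee` in `adultsTyped` fails at compile time, whereas `"agee >= 18"` in `adultsUntyped` compiles and fails only when the query is analyzed.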

References

Dataset API (1.6.0) (CONT.)

Spark-1.6

abstract class RDD[T: ClassTag] { ... }
class DataFrame { ... }
class Dataset[T] { ... }

Spark-2.0

type DataFrame = Dataset[Row]
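In the Spark 2.0 Scala API, a `DataFrame` can be used wherever a `Dataset[Row]` is expected and vice versa. A small sketch, assuming a local `SparkSession`; the object and helper names are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DataFrameAliasSketch {
  // Accepts a Dataset[Row]; a DataFrame can be passed without conversion.
  def rowCount(rows: Dataset[Row]): Long = rows.count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameAliasSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = spark.range(3).toDF("id")

    // No cast or conversion call is needed in either direction.
    val rows: Dataset[Row] = df
    println(rowCount(df))

    // A typed view over the same data, via an encoder from spark.implicits.
    val ids: Dataset[Long] = rows.as[Long]
    println(ids.collect().mkString(","))

    spark.stop()
  }
}
```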

Type Safe API

The Dataset API is similar to the Type safe API of Twitter Scalding.

Thank U
