New Features in Spark-2.0
& API History of Spark
顾亮亮
2016.06.29

Presenter Notes

Agenda

  • New Features in Spark-2.0
    • Tungsten Phase 2
    • Structured Streaming
    • Unifying Datasets and DataFrames
    • Other new Features in Spark-2.0
  • API History Of Spark
    • Batch: RDD -> SparkSQL (DataFrame) -> Dataset
    • Streaming: DStream -> StreamingSQL -> Structured Streaming
    • Machine Learning: Row API -> ML Pipeline

New Features in Spark-2.0

Tungsten

Tungsten Phase 2

TPC Performance

Whole Stage Code Generation

Code Generation VS Whole Stage Code Generation

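  • Code Generation: only speeds up expression evaluation (e.g. 1 + a)
  • Whole Stage Code Generation: generates code for whole stages of the query plan, fusing the operators of a stage into a single function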

Whole Stage Code Generation

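  • Eliminates virtual function calls
  • Keeps intermediate data in CPU registers instead of memory
  • Uses modern CPU features to unroll loops and apply SIMD; vectorization speeds up operators for which code generation is complex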
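The contrast between operator-at-a-time execution and whole-stage code generation can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: Spark emits Java source for each stage and compiles it at runtime, whereas the fused loop here is written by hand. The query (`SELECT sum(a + 1) FROM t WHERE a > 10`), the class names, and the sample rows are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


# Iterator-style plan: every operator pulls rows from its child one at a
# time, so each row crosses several dynamically dispatched calls.
class Scan:
    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


class Filter:
    def __init__(self, child: Iterable[Row], pred: Callable[[Row], bool]) -> None:
        self.child = child
        self.pred = pred

    def __iter__(self) -> Iterator[Row]:
        for row in self.child:
            if self.pred(row):
                yield row


class Project:
    def __init__(self, child: Iterable[Row], expr: Callable[[Row], int]) -> None:
        self.child = child
        self.expr = expr

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            yield self.expr(row)


def iterator_plan(rows: List[Row]) -> int:
    """Evaluate sum(a + 1) where a > 10 through a chain of operators."""
    plan = Project(Filter(Scan(rows), lambda r: r["a"] > 10), lambda r: r["a"] + 1)
    total = 0
    for value in plan:
        total += value
    return total


def fused_plan(rows: List[Row]) -> int:
    """The same query collapsed into one loop, as a code generator might emit.

    There are no per-row operator calls, and the intermediate value `a`
    lives in a local variable instead of being passed between operators.
    """
    total = 0
    for row in rows:
        a = row["a"]
        if a > 10:
            total += a + 1
    return total


sample_rows = [{"a": 5}, {"a": 11}, {"a": 20}]
```

Both functions return the same result for `sample_rows`; the fused version simply does the work in a single tight loop, which is the shape that lets a JIT compiler keep values in registers and unroll the loop.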

References

Structured Streaming API: Simple

Structured Streaming = Infinite DataFrame

DataFrame

Structured Streaming
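The "infinite DataFrame" idea can be illustrated without Spark: the same aggregation is run once over a finite table, and incrementally over rows that keep arriving, and both yield the same result table. The sketch below is plain Python and does not use the Structured Streaming API; `batch_count`, `IncrementalCount`, and the word-count query are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Iterable, List


def batch_count(table: List[str]) -> Counter:
    """Batch query: count occurrences over a finite table in one pass."""
    return Counter(table)


class IncrementalCount:
    """The same query over an unbounded table.

    State is kept between micro-batches, and each call to process()
    folds the newly arrived rows into the running result.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process(self, new_rows: Iterable[str]) -> Counter:
        self.state.update(new_rows)
        # Return a copy so callers see a snapshot of the result table.
        return Counter(self.state)


# Rows arrive over time in small batches (sample data, assumed).
micro_batches = [["spark", "flink"], ["spark"], ["storm", "spark"]]

query = IncrementalCount()
snapshots = [query.process(rows) for rows in micro_batches]

# Everything seen so far, viewed as one finite table.
all_rows = [word for rows in micro_batches for word in rows]
```

After the last micro-batch, `snapshots[-1]` equals `batch_count(all_rows)`: the streaming result at any point in time matches what the batch query would return over the input received so far.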

Schedule

Goal

Unifying Datasets and DataFrames

Other Features in Spark-2.0

  • A new ANSI SQL parser that can run all 99 TPC-DS queries
  • SparkSession replaces the old SQLContext and HiveContext
  • A new Accumulator API with a simpler type hierarchy and support for primitive types
  • The DataFrame-based Machine Learning API is now the primary ML API
  • Machine learning pipeline persistence: ML pipelines and models can be saved and loaded in all languages Spark supports
  • Distributed algorithms in R: Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means

API History Of Spark

RDD -> Structured Streaming

Batch

  1. RDD
  2. SparkSQL(DataFrame)
  3. Dataset

Streaming

  1. DStream (based on the RDD API)
  2. StreamingSQL (based on the SparkSQL API) (Intel)
  3. Structured Streaming (based on the Dataset API)

Machine Learning

  1. Row API (based on the RDD API)
  2. ML Pipelines (based on the DataFrame API)

Spark API

RDD API