
Spark Release 1.2.0

Spark 1.2.0 is the third release on the 1.X line. This release brings performance and usability improvements in Spark’s core engine, a major new API for MLlib, expanded ML support in Python, a full H/A mode in Spark Streaming, and much more. GraphX has seen major performance and API improvements and graduates from an alpha component. Spark 1.2 represents the work of 172 contributors from more than 60 institutions in more than 1000 individual patches.

To download Spark 1.2, visit the downloads page.

Spark Core

In 1.2, Spark core upgrades two major subsystems to improve the performance and stability of very large scale shuffles. The first is Spark’s communication manager used during bulk transfers, which moves to a netty-based implementation. The second is Spark’s shuffle mechanism, which now defaults to the “sort based” shuffle initially released in Spark 1.1. Spark also adds an elastic scaling mechanism designed to improve cluster utilization during long-running ETL-style jobs. This is currently supported on YARN and will make its way to other cluster managers in future versions. Finally, Spark 1.2 adds support for Scala 2.11. For instructions on building for Scala 2.11, see the build documentation.
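
As a rough illustration of why a sort-based layout can help with very large shuffles, the toy sketch below contrasts the two approaches in plain Python. It is not Spark's implementation: the function names, the in-memory lists standing in for files, and the use of hash(key) % num_partitions as the partitioner are all assumptions made for this example. A hash-style writer keeps one output per reduce partition, while a sort-style writer produces a single output ordered by partition id plus an index of offsets.

```python
from collections import defaultdict


def hash_shuffle_write(records, num_partitions):
    """Hash-style layout: one bucket (think: one file) per reduce partition."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return dict(buckets)


def sort_shuffle_write(records, num_partitions):
    """Sort-style layout: a single output ordered by partition id, plus an index."""
    tagged = sorted(
        ((hash(key) % num_partitions, key, value) for key, value in records),
        key=lambda t: t[0],  # stable sort: order within a partition is preserved
    )
    data = [(key, value) for _, key, value in tagged]
    # index[p]:index[p + 1] is the slice of `data` that belongs to partition p.
    index = [0] * (num_partitions + 1)
    for pid, _, _ in tagged:
        index[pid + 1] += 1
    for p in range(num_partitions):
        index[p + 1] += index[p]
    return data, index


def read_partition(data, index, pid):
    """A reducer fetches its slice of the single output by using the index."""
    return data[index[pid]:index[pid + 1]]


# Integer keys keep the toy partitioner deterministic across runs.
records = [(k, k * 10) for k in range(10)]
buckets = hash_shuffle_write(records, 3)
data, index = sort_shuffle_write(records, 3)
print(len(buckets), "hash-style outputs vs. 1 sort-style output")
print(read_partition(data, index, 1))
```

In this sketch each map task in the hash-style layout produces as many outputs as there are reduce partitions, whereas the sort-style layout produces one output and one index regardless of the partition count.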

Spark Streaming

This release includes two major feature additions to Spark’s streaming library: a Python API and a write-ahead log for full driver H/A. First, the Python API covers almost all the DStream transformations and output operations. Input sources based on text files and text over sockets are currently supported. Support for Kafka and Flume input streams in Python will be added in the next release. Second, Spark Streaming now features H/A driver support through a write-ahead log (WAL). In Spark 1.1 and earlier, some buffered (received but not yet processed) data can be lost during driver restarts. To prevent this, Spark 1.2 adds an optional WAL, which buffers received data into a fault-tolerant file system (e.g. HDFS). See the streaming programming guide for more details.
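
The general idea behind a write-ahead log can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark Streaming's implementation: the WriteAheadLog and Receiver classes are hypothetical names for this example, and a local temporary file stands in for a fault-tolerant file system such as HDFS. The point it shows is the ordering: each record is made durable before it is placed in the in-memory buffer, so a restarted process can rebuild the buffer from the log.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Append-only log of JSON records, one per line (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Make the record durable before the caller buffers or acknowledges it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Return every logged record in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


class Receiver:
    """Buffers received data in memory, backed by the log for recovery."""

    def __init__(self, wal):
        self.wal = wal
        self.buffer = wal.replay()  # after a restart, recover unprocessed data

    def receive(self, record):
        self.wal.append(record)     # log first ...
        self.buffer.append(record)  # ... then buffer in memory


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "received.wal")
    first = Receiver(WriteAheadLog(log_path))
    first.receive({"offset": 1, "line": "hello"})
    first.receive({"offset": 2, "line": "world"})
    del first  # simulate a driver crash: the in-memory buffer is gone
    recovered = Receiver(WriteAheadLog(log_path)).buffer
    print(recovered)
```

A real system would also truncate the log once data has been processed; that step is left out here to keep the sketch short.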

MLlib

Spark 1.2 previews a new set of machine learning APIs in a package called spark.ml that supports learning pipelines, where multiple algorithms are run in sequence with varying parameters. This type of pipeline is common in practical machine learning deployments. The new ML package uses Spark’s SchemaRDD to represent ML datasets, providing direct interoperability with Spark SQL. In addition to the new API, Spark 1.2 extends decision trees with two tree ensemble methods: random forests and gradient-boosted trees, among the most successful tree-based models for classification and regression. Finally, MLlib’s Python implementation receives a major update in 1.2 to simplify the process of adding Python APIs, along with better Python API coverage.

Spark SQL

In this release, Spark SQL adds a new API for external data sources. This API supports mounting external data sources as temporary tables, with support for optimizations such as predicate pushdown. Spark’s Parquet and JSON bindings have been rewritten to use this API, and we expect a variety of community projects to integrate with other systems and formats during the 1.2 lifecycle.

Hive integration has been improved with support for the fixed-precision decimal type and Hive 0.13. Spark SQL also adds dynamically partitioned inserts, a popular Hive feature. An internal re-architecting around caching improves the performance and semantics of caching SchemaRDD instances and adds support for statistics-based partition pruning for cached data.
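
To make the idea of statistics-based pruning concrete, here is a minimal plain-Python sketch. It is an illustration under assumptions rather than Spark SQL's code: the CachedBatch class and the function names are hypothetical, and only a single integer column with a "greater than" predicate is modeled. Each cached batch keeps the minimum and maximum of its values, and a scan skips any batch whose statistics prove that no row can match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedBatch:
    """A batch of cached column values with simple min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_batches(values, batch_size):
    """Split values into fixed-size batches and record statistics for each."""
    batches = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        batches.append(CachedBatch(chunk, min(chunk), max(chunk)))
    return batches


def scan_greater_than(batches, threshold):
    """Return (matching values, number of batches that had to be scanned)."""
    matches, scanned = [], 0
    for batch in batches:
        if batch.max_value <= threshold:
            continue  # statistics prove that no row in this batch can match
        scanned += 1
        matches.extend(v for v in batch.values if v > threshold)
    return matches, scanned


batches = build_batches(list(range(100)), batch_size=10)
matches, scanned = scan_greater_than(batches, threshold=84)
print(f"scanned {scanned} of {len(batches)} batches, found {len(matches)} rows")
```

Pruning of this kind pays off most when the data is at least roughly ordered on the filtered column, so that each batch covers a narrow value range.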

GraphX

In 1.2, GraphX graduates from an alpha component and adds a stable API. This means applications written against GraphX are guaranteed to work with future Spark versions without code changes. A new core API, aggregateMessages, is introduced to replace the now-deprecated mapReduceTriplets API. The new aggregateMessages API features a more imperative programming model and improves performance. Some early test users saw performance improvements of 20% to 1X after switching to the new API.
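
The imperative style of aggregateMessages can be illustrated with a small single-machine analogue in plain Python. GraphX itself is a Scala API, so everything below is a hypothetical stand-in written for this example: the EdgeContext class, its send_to_src and send_to_dst methods, and the aggregate_messages function only mimic the shape of the idea. A send function is called once per edge and explicitly sends messages to either endpoint, and a merge function combines messages that arrive at the same vertex.

```python
class EdgeContext:
    """What the send function sees for one edge (toy analogue, not GraphX)."""

    def __init__(self, src_id, dst_id, attr, outbox):
        self.src_id = src_id
        self.dst_id = dst_id
        self.attr = attr
        self._outbox = outbox

    def send_to_src(self, msg):
        self._outbox.append((self.src_id, msg))

    def send_to_dst(self, msg):
        self._outbox.append((self.dst_id, msg))


def aggregate_messages(edges, send_msg, merge_msg):
    """Call send_msg once per edge, then merge the messages per vertex id."""
    result = {}
    for src_id, dst_id, attr in edges:
        outbox = []
        send_msg(EdgeContext(src_id, dst_id, attr, outbox))
        for vertex_id, msg in outbox:
            if vertex_id in result:
                result[vertex_id] = merge_msg(result[vertex_id], msg)
            else:
                result[vertex_id] = msg
    return result


# Edges are (source id, destination id, edge attribute) triples.
edges = [(1, 2, "a"), (1, 3, "b"), (2, 3, "c"), (3, 1, "d")]

# In-degree: every edge sends the message 1 to its destination vertex.
in_degree = aggregate_messages(
    edges,
    send_msg=lambda ctx: ctx.send_to_dst(1),
    merge_msg=lambda a, b: a + b,
)
print(in_degree)
```

Vertices that receive no message do not appear in the result, so a caller that needs a value for every vertex has to fill in a default afterwards.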

In addition, Spark now supports graph checkpointing and lineage truncation, which are necessary to support large numbers of iterations in production jobs. Finally, a handful of performance improvements have been added for PageRank and graph loading.

Other Notes

Upgrading to Spark 1.2

Spark 1.2 is binary compatible with Spark 1.0 and 1.1, so no code changes are necessary. This excludes APIs marked explicitly as unstable. Spark changes the default configuration in a handful of cases for improved performance. Users who want to preserve configurations identical to Spark 1.1 can roll back the following changes:

  1. spark.shuffle.blockTransferService has been changed from nio to netty
  2. spark.shuffle.manager has been changed from hash to sort
  3. In PySpark, the default batch size has been changed to 0, which means the batch size is chosen based on the size of the objects. Pre-1.2 behavior can be restored using SparkContext([... args... ], batchSize=1024).
  4. Spark SQL has changed the following defaults:
    • spark.sql.parquet.cacheMetadata: false -> true
    • spark.sql.parquet.compression.codec: snappy -> gzip
    • spark.sql.hive.convertMetastoreParquet: false -> true
    • spark.sql.inMemoryColumnarStorage.compressed: false -> true
    • spark.sql.inMemoryColumnarStorage.batchSize: 1000 -> 10000
    • spark.sql.autoBroadcastJoinThreshold: 10000 -> 10485760 (10 MB)
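
For readers who want to pin the pre-1.2 values, the configuration changes listed above can be collected in one place. The sketch below is plain Python and only restates the keys and old values from this list; the dictionary name and the helper function are hypothetical, and how the resulting key=value pairs are passed to Spark (command line, properties file, or SQL context settings) is left open and should be checked against the configuration guide. The PySpark batch size is not included because it is a SparkContext constructor argument rather than a configuration key.

```python
# Spark 1.1 values for the defaults that changed in Spark 1.2,
# copied from the list above.
SPARK_1_1_DEFAULTS = {
    "spark.shuffle.blockTransferService": "nio",
    "spark.shuffle.manager": "hash",
    "spark.sql.parquet.cacheMetadata": "false",
    "spark.sql.parquet.compression.codec": "snappy",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.sql.inMemoryColumnarStorage.compressed": "false",
    "spark.sql.inMemoryColumnarStorage.batchSize": "1000",
    "spark.sql.autoBroadcastJoinThreshold": "10000",
}


def to_key_value_lines(settings):
    """Render settings as sorted 'key=value' strings (hypothetical helper)."""
    return [f"{key}={value}" for key, value in sorted(settings.items())]


for line in to_key_value_lines(SPARK_1_1_DEFAULTS):
    print(line)
```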

Known Issues

A few smaller bugs did not make the release window. They will be fixed in Spark 1.2.1:

  • Netty shuffle does not respect secured port configuration. Workaround: revert to nio shuffle. SPARK-4837
  • java.io.FileNotFoundException when creating EXTERNAL Hive tables. Workaround: set hive.stats.autogather = false. SPARK-4892
  • Exception in PySpark zip function on text file inputs. SPARK-4841
  • MetricsServlet not properly initialized. SPARK-4595

Credits

  • Aaron Davidson – Improvements in Core; bug fixes in Core and Shuffle; improvement in Core and Shuffle
  • Aaron Staple – Improvements in Core, MLlib, and Streaming; new features in PySpark; bug fixes in SQL
  • Adam Pingel – Improvement in Core
  • Ahir Reddy – Improvements in Core
  • Akshat Aranya – Bug fixes in Core
  • Alex Liu – Bug fixes in SQL
  • Alexander Ulanov – New features in MLlib
  • Allan Douglas R. De Oliveira – Improvements in Core
  • Anand Avati – Improvement in Core
  • Anant Asthana – Improvement in Core, MLlib, and SQL
  • Andrew Ash – Documentation and bug fixes in Core
  • Andrew Bullen – Bug fixes in MLlib
  • Andrew Or – Improvements in Core and YARN; bug fixes in Windows, Core, and YARN; improvement in Core and YARN
  • Andy Konwinski – Documentation in Core
  • Aniket Bhatnagar – Bug fixes in Core and Streaming
  • Ankur Dave – Improvements and bug fixes in GraphX
  • Arun Ahuja – Documentation in Core
  • Benoy Antony – Bug fixes in Web UI and YARN
  • Bertrand Bossy – Bug fixes in Core
  • Bill Bejeck – Bug fixes in Core
  • Brenden Matthews – Bug fixes in Mesos
  • Burak Yavuz – New features in MLlib
  • Chao Chen – Improvements and documentation in Core
  • Cheng Hao – Tests, improvements, new features, and bug fixes in SQL
  • Cheng Lian – Improvements in Core and SQL; test in SQL; new features in SQL; bug fixes in Core and SQL; documentation in Core
  • Chester Chen – Bug fixes in YARN
  • Chip Senkbeil – New features in Core
  • Chirag Aggarwal – Bug fixes in SQL
  • Chris Cope – Bug fixes in YARN
  • Christoph Sawade – Improvements in MLlib and PySpark
  • Cody Koeninger – Improvements in SQL
  • Colin Patrick Mccabe – Improvements in Core
  • DB Tsai – Improvements in MLlib
  • Dale Richardson – Improvements in Core
  • Dan McClary – New features in SQL
  • Dan Osipov – New features in EC2
  • Daoyuan Wang – Improvements in Core and SQL; new features in SQL; bug fixes in Core and SQL; documentation in Core
  • Davies Liu – Improvements in Core, SQL, MLlib, and PySpark; new features in Core, Streaming, PySpark, and MLlib; bug fixes in Streaming, Core, SQL, MLlib, and PySpark; documentation in Core
  • Derek Ma – Bug fixes in Core and Streaming
  • DoingDone9 – Bug fixes in SQL
  • Egor Pahomov – Bug fixes in Core
  • Eric Eijkelenboom – Bug fixes in Core
  • Eric Liang – Bug fixes in Core and SQL
  • Erik Erlandson – Improvements in Core
  • Eugen Cepoi – Improvements in Core
  • Fairiz Azizi – Improvements in Core
  • Felix Maximilian Moller – Documentation in Core
  • Gankun Luo – Bug fixes in SQL
  • Grega Kespret – Documentation in Core
  • GuoQiang Li – Improvements in Core and MLlib; bug fixes in Core, Web UI, MLlib, and PySpark; improvement in YARN
  • Hari Shreedharan – Bug fixes and improvement in Streaming
  • Henry Cook – Documentation in Core
  • Holden Karau – Documentation in Core; bug fixes in PySpark
  • Hong Shen – Improvements in Core
  • Hossein Falaki – Bug fixes in Web UI
  • Ian Hummel – Improvements in Core
  • Jacky Li – Bug fixes in Core
  • Jakub Dubovsky – Bug fixes in Core
  • Jascha Swisher – Bug fixes in Core
  • Jay Vyas – Documentation in Core
  • Jeremy Freeman – New features in Streaming and MLlib; bug fixes in Core and PySpark
  • Jey Kottalam – Bug fixes in Core
  • Jie Huang – Documentation and bug fixes in Core
  • Jim Carroll – Improvements and bug fixes in SQL
  • Jim Lim – Improvements in Core; bug fixes in YARN
  • Jongyoul Lee – Bug fixes in Core and Mesos
  • Joseph Bradley – Improvements in MLlib
  • Joseph E. Gonzalez – Documentation in Core; bug fixes in GraphX and MLlib
  • Joseph K. Bradley – Improvements in Core and MLlib; new features in MLlib and SQL; bug fixes in MLlib; documentation in Core and MLlib
  • Josh Rosen – Improvements in Java API, Core, Web UI, and Shuffle; new features in Java API, Core, and Web UI; bug fixes in Core, PySpark, and Streaming; documentation in Core
  • Kai Sasaki – Bug fixes in Core
  • Kay Ousterhout – Improvements in Core and Web UI; bug fixes in Core and Web UI
  • Ken Takagiwa – Documentation in Core
  • Kenichi Maehashi – Improvements in Core
  • Kevin Mader – Improvements in Java API and Core
  • Kousuke Saruta – Improvements in Project Infra, Core, PySpark, YARN, SQL, and Web UI; bug fixes in Core, PySpark, MLlib, YARN, SQL, and Web UI; documentation in Core
  • Larry Xiao – Improvements and bug fixes in GraphX
  • Li Zhihui – Improvements in Core
  • Liang-Chi Hsieh – Improvements in Core; bug fixes in Core and SQL
  • Lianhui Wang – Bug fixes in GraphX
  • Lijie Xu – Bug fixes in Core and GraphX
  • Liquan Pei – Documentation in Core; new features in MLlib and PySpark
  • Liu Hao – Bug fixes in Core
  • Lu Lu – Improvements in GraphX
  • Madhu Siddalingaiah – Documentation in Core
  • Manish Amde – Improvements and new features in MLlib
  • Marcelo Vanzin – Test in YARN; improvement in Core and YARN; new features in Core; bug fixes in Core and YARN; improvements in Core
  • Mario Pastorelli – Documentation in Core
  • Mark G. Whitney – Documentation in YARN
  • Mark Hamstra – Bug fixes in Core
  • Mark Mims – Improvements in Web UI
  • Martin Weindel – Documentation in Core and Mesos
  • Masayoshi TSUZUKI – Improvements in Windows, Core, and PySpark; bug fixes in Windows, Core, and PySpark
  • Matei Zaharia – Improvement in Core and SQL; bug fixes in Core and SQL
  • Matthew Cheah – Bug fixes in Core
  • Matthew Farrellee – Improvements in Core; new features in PySpark; bug fixes in Core and PySpark
  • Matthew Rocklin – Bug fixes in Core
  • Matthew Taylor – Bug fixes in SQL
  • Michael Armbrust – Improvements in SQL; new features in SQL; bug fixes in Core, PySpark, and SQL; documentation in Core
  • Michael Griffiths – Bug fixes in PySpark
  • Michelangelo D’Agostino – Improvements in MLlib and PySpark
  • Mike Timper – Bug fixes in SQL
  • Min Shen – Bug fixes in YARN
  • Mingfei Shi – Bug fixes in Core
  • Mubarak Seyed – Improvements in Streaming
  • NamelessAnalyst – Improvements in GraphX
  • Nan Zhu – Bug fixes and improvements in Core
  • Nathan Artz – Documentation in Core
  • Nathan Howell – Bug fixes in SQL
  • Nicholas Chammas – Improvement in Core; improvements in Project Infra, Core, and EC2; bug fixes in Project Infra, EC2, and SQL; documentation in Core
  • Niklas Wilcke – Improvements in MLlib; bug fixes in Core
  • Nishkam Ravi – Bug fixes in Core
  • Oded Zimerman – Bug fixes in GraphX
  • Patrick Wendell – Improvements in Core; bug fixes in Project Infra, Core, and Mesos
  • Prashant Sharma – Improvements in Core; bug fixes in Streaming and Core; improvement in Core, YARN, and Streaming
  • Praveen Seluka – New feature in Core
  • Qiping Li – Improvements and new features in MLlib
  • RJ Nowling – Improvements in MLlib; bug fixes in GraphX; documentation in Core
  • Ravindra Pesala – Improvements, new features, and bug fixes in SQL
  • Raymond Liu – Improvement in Core and Shuffle
  • Renat Yusupov – Bug fixes in SQL
  • Reno Zhang – Improvements in YARN
  • Reynold Xin – Improvements in Core, Shuffle, EC2, and SQL; new features in Project Infra, Core, and EC2; bug fixes in Core and SQL; improvement in Core, Shuffle, and SQL
  • Reza Zadeh – Improvements in Core; new features in MLlib; documentation in Core
  • Rob O’Dwyer – Improvements in PySpark
  • Rong Gu – Improvements in Core
  • Rui Li – New features in Java API
  • Saisai Shao – Improvements in Streaming; bug fixes in Streaming and Shuffle
  • Sandy Ryza – Improvements in Core, MLlib, and YARN; new features in Core; bug fixes in Core and SQL
  • Santiago M. Mola – Documentation in Core
  • Sean Owen – Improvement in Streaming; improvements in Core and Streaming; new features in Core; bug fixes in Java API, Core, MLlib, and Streaming; documentation in Core
  • Shane Knapp – Bug fixes in Core
  • Shiti Saxena – Improvement in Core
  • Shivaram Venkataraman – Improvements in Core; bug fixes in Core and EC2
  • Shixiong Zhu – Test in Core; improvements in Core and Web UI; bug fixes in Core, Web UI, and YARN; documentation in Streaming and Core
  • Bai Shou – Improvements and bug fixes in SQL
  • Shuo Xiang – New features and bug fixes in MLlib
  • Su Yan – Bug fixes in Core
  • Sung Chung – Improvements in MLlib
  • Surong Quan – Improvements in Streaming
  • Takuya UESHIN – Test in SQL; documentation in Core; bug fixes in Core and SQL; improvements in SQL
  • Tal Sliwowicz – Bug fixes in Core
  • Tathagata Das – Improvements in Core and Streaming; bug fixes in Streaming and Core; improvement in Streaming
  • Ted Yu – Bug fixes and improvement in Core
  • Thomas Graves – Bug fixes in Core and YARN
  • Tianshuo Deng – Bug fixes in Core and Shuffle
  • Timothy Chen – Bug fixes in Mesos
  • Tingjun Xu – Bug fixes in YARN
  • Tomohiko K. – Bug fixes in Core and PySpark; improvement in PySpark
  • Uncle Gen – Improvements in GraphX
  • Uri Laserson – Improvements in PySpark
  • Varadharajan Mukundan – Improvements in Core
  • Venkata Ramana Gollamudi – New features and bug fixes in SQL
  • Victor Tso – Bug fixes in Core
  • Vida Ha – Improvements in SQL; bug fixes in EC2
  • Viper Kun – Documentation in Core
  • Wang Fei – Test in SQL; improvements in Core and SQL; bug fixes in Core and SQL; documentation in Core
  • Wang Tao – Improvements in Core, YARN, and SQL; bug fixes in Core and YARN
  • Ward Viaene – Bug fixes in PySpark
  • Wenchen Fan – Bug fixes in SQL
  • William Benton – Improvements and bug fixes in SQL
  • Xiangrui Meng – Improvements in Core, PySpark, MLlib, SQL, Java API, and Web UI; documentation in Core; new features in SQL, MLlib, and PySpark; bug fixes in Core, MLlib, and PySpark; improvement in PySpark, MLlib, and SQL
  • Xinyun Huang – Improvements in SQL
  • Yadong Qi – Test in Core; improvements and bug fixes in Streaming
  • Yanbo Liang – New features in MLlib
  • Yantang Zhai – Improvements in Core; bug fixes in Core, Web UI, and SQL
  • Yash Datta – Improvements in SQL
  • Ye Xianjin – Improvements in Core
  • Yin Huai – Documentation in Core; bug fixes in SQL
  • Zdenek Farana – Bug fixes in SQL
  • Zhan Zhang – Build fixes in SQL
  • Zhang, Liye – Improvements and bug fixes in Core

Thanks to everyone who contributed!

