Thrown when a query fails to analyze, usually because the query itself is invalid.
1.3.0
A column that will be computed based on the data in a DataFrame.
A new column can be constructed based on the input columns present in a DataFrame:
df("columnName") // On a specific `df` DataFrame. col("columnName") // A generic column not yet associated with a DataFrame. col("columnName.field") // Extracting a struct field col("`a.column.with.dots`") // Escape `.` in column names. $"columnName" // Scala short hand for a named column.
Column objects can be composed to form complex expressions:
$"a" + 1 $"a" === $"b"
1.3.0
The internal Catalyst expression can be accessed via expr, but this method is for debugging purposes only and can change in any future Spark releases.
A convenient class used for constructing schema.
1.3.0
Functionality for working with missing data in DataFrames.
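For example, a minimal sketch (assuming a DataFrame named df with nullable numeric and string columns):

// Drop any row that contains a null or NaN value.
val cleaned = df.na.drop()

// Replace nulls/NaNs in numeric columns with 0, and nulls in string columns with "unknown".
val filledNumeric = df.na.fill(0)
val filledStrings = df.na.fill("unknown")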
1.3.1
Interface used to load a Dataset from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this.
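A minimal sketch of typical usage (the format, option and paths below are illustrative):

val jsonDF = spark.read
  .format("json")
  .option("multiLine", "true")
  .load("path/to/input.json")

// Shorthand methods also exist for common formats.
val parquetDF = spark.read.parquet("path/to/data.parquet")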
1.4.0
Statistic functions for DataFrames.
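For example, a brief sketch (assuming a DataFrame df with numeric columns named x and y):

val pearson = df.stat.corr("x", "y")                              // Pearson correlation
val covariance = df.stat.cov("x", "y")                            // sample covariance
val medianApprox = df.stat.approxQuantile("x", Array(0.5), 0.01)  // approximate median
val counts = df.stat.crosstab("x", "y")                           // contingency table as a DataFrame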
1.4.0
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores, etc). Use Dataset.write to access this.
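For example, a minimal sketch (the path, table name and format are illustrative):

df.write
  .format("parquet")
  .mode("overwrite")
  .save("path/to/output")

// Or write into a table registered in the catalog.
df.write.mode("append").saveAsTable("my_table")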
1.4.0
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.
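For instance, a brief sketch (using a people Dataset such as the one created in the example below; the column name is illustrative and the $ syntax assumes import spark.implicits._):

val adults = people.filter($"age" > 30)  // transformation only: nothing is executed yet
adults.explain(true)                     // prints the parsed, analyzed, optimized and physical plans
adults.count()                           // action: triggers the actual computation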
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.
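For instance (using a Dataset of Person objects named people, as created in the example below; the printed schema is illustrative of the Person fields):

people.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = true)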
There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.
val people = spark.read.parquet("...").as[Person]  // Scala
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a map transformation on the existing one:
val names = people.map(_.name)  // in Scala; names is a Dataset[String]
Dataset<String> names = people.map(
  (MapFunction<Person, String>) p -> p.name, Encoders.STRING());  // in Java
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use the apply method in Scala and col in Java.
val ageCol = people("age")          // in Scala
Column ageCol = people.col("age");  // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10         // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession
val people = spark.read.parquet("...")
val department = spark.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession
Dataset<Row> people = spark.read().parquet("...");
Dataset<Row> department = spark.read().parquet("...");

people.filter(people.col("age").gt(30))
  .join(department, people.col("deptId").equalTo(department.col("id")))
  .groupBy(department.col("name"), people.col("gender"))
  .agg(avg(people.col("salary")), max(people.col("age")));
1.6.0
A container for a Dataset, used for implicit conversions in Scala.
To use this, import the implicit conversions provided by a SparkSession:
val spark: SparkSession = ...
import spark.implicits._
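With the implicits in scope, local collections gain toDS/toDF methods through this holder, e.g. (the data and column names are illustrative):

val ds = Seq(1, 2, 3).toDS()                                  // Dataset[Int]
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")  // DataFrame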
1.6.0
:: Experimental ::
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
Encoders are generally created automatically through implicits from a SparkSession, or can be explicitly created by calling static methods on Encoders.
import spark.implicits._

val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
In Java, encoders are specified by calling static methods on Encoders.
List<String> data = Arrays.asList("abc", "abc", "xyz");
Dataset<String> ds = context.createDataset(data, Encoders.STRING());
Encoders can be composed into tuples:
Encoder<Tuple2<Integer, String>> encoder2 = Encoders.tuple(Encoders.INT(), Encoders.STRING());
List<Tuple2<Integer, String>> data2 = Arrays.asList(new scala.Tuple2(1, "a"));
Dataset<Tuple2<Integer, String>> ds2 = context.createDataset(data2, encoder2);
Or constructed from Java Beans:
Encoders.bean(MyClass.class);
1.6.0
:: Experimental :: Holder for experimental methods for the bravest. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
spark.experimental.extraStrategies += ...
1.3.0
The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.
It is strongly recommended that any initialization for writing data (e.g. opening a connection or starting a transaction) is done only after the open(...) method has been called, which signifies that the task is ready to generate data.

The lifecycle of the methods is as follows:

For each partition with `partitionId`:
    For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
        Method `open(partitionId, epochId)` is called.
        If `open` returns true:
            For each row in the partition and batch/epoch, method `process(row)` is called.
        Method `close(errorOrNull)` is called with the error (if any) seen while processing rows.
Important points to note:
The partitionId and epochId can be used to deduplicate generated data when failures cause reprocessing of some input data. This depends on the execution mode of the query. If the streaming query is being executed in the micro-batch mode, then every partition represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data. Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data and achieve exactly-once guarantees. However, if the streaming query is being executed in the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.

The close() method will be called if the open() method returns successfully (irrespective of the return value), except if the JVM crashes in the middle.

Scala example:
datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
  }
  def process(record: String) = {
    // write string to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
})
Java example:
datasetOfString.writeStream().foreach(new ForeachWriter<String>() {
  @Override
  public boolean open(long partitionId, long version) {
    // open connection
  }
  @Override
  public void process(String value) {
    // write string to connection
  }
  @Override
  public void close(Throwable errorOrNull) {
    // close the connection
  }
});
2.0.0
:: Experimental ::
A Dataset has been logically grouped by a user specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.
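A brief sketch (the Person class and data are illustrative; assumes import spark.implicits._ is in scope):

case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 30), Person("Bob", 30), Person("Carol", 40)).toDS()

val byAge = people.groupByKey(_.age)   // KeyValueGroupedDataset[Int, Person]
val countsPerAge = byAge.count()       // Dataset[(Int, Long)]
val namesPerAge = byAge.mapGroups { (age, persons) =>
  (age, persons.map(_.name).mkString(","))
}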
2.0.0
Lower priority implicit methods for converting Scala objects into Datasets. Conflicting implicits are placed here to disambiguate resolution.
Reasons for including specific implicits:
newProductEncoder - to disambiguate for Lists, which are both Seq and Product
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot).
The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean and sum for convenience.
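For example, a short sketch (assuming a DataFrame df with department, gender and salary columns):

import org.apache.spark.sql.functions._

df.groupBy("department")
  .agg(avg("salary"), max("salary"))

// pivot turns the distinct values of a column into separate output columns
df.groupBy("department").pivot("gender").sum("salary")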
2.0.0
This class was named GroupedData in Spark 1.x.
Represents one row of output from a relational operator. Allows both generic access by ordinal, which will incur boxing overhead for primitives, as well as native primitive access.
It is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null.
To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
A Row object can be constructed by providing field values. Example:
import org.apache.spark.sql._

// Create a Row from values.
Row(value1, value2, value3, ...)
// Create a Row from a Seq of values.
Row.fromSeq(Seq(value1, value2, ...))
A value of a row can be accessed through both generic access by ordinal, which will incur boxing overhead for primitives, and native primitive access. An example of generic access by ordinal:
import org.apache.spark.sql._

val row = Row(1, true, "a string", null)
// row: Row = [1,true,a string,null]
val firstValue = row(0)
// firstValue: Any = 1
val fourthValue = row(3)
// fourthValue: Any = null
For native primitive access, it is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null.
An example of native primitive access:
// using the row from the previous example.
val firstValue = row.getInt(0)
// firstValue: Int = 1
val isNull = row.isNullAt(3)
// isNull: Boolean = true
In Scala, fields in a Row object can be extracted in a pattern match. Example:
import org.apache.spark.sql._

val pairs = sql("SELECT key, value FROM src").rdd.map {
  case Row(key: Int, value: String) => key -> value
}
1.3.0
Runtime configuration interface for Spark. To access this, use SparkSession.conf.
Options set here are automatically propagated to the Hadoop configuration during I/O.
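A minimal sketch (spark.sql.shuffle.partitions is a standard Spark SQL setting; the value and the missing key are illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.get("spark.sql.shuffle.partitions")   // "200"
spark.conf.getOption("spark.nonexistent.key")    // None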
2.0.0
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
1.0.0
A collection of implicit methods for converting common Scala objects into Datasets.
1.6.0
The entry point to programming Spark with the Dataset and DataFrame API.
In environments where this has been created up front (e.g. REPL, notebooks), use the builder to get an existing session:
SparkSession.builder().getOrCreate()
The builder can also be used to create a new session:
SparkSession.builder
  .master("local")
  .appName("Word Count")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
:: Experimental :: Holder for injection points to the SparkSession. We make NO guarantee about the stability regarding binary compatibility and source compatibility of methods here.
This currently provides extension points such as analyzer rules, optimizer rules, planning strategies, and customized parsers.
The extensions can be used by calling withExtensions on the SparkSession.Builder, for example:
SparkSession.builder()
  .master("...")
  .config("...", true)
  .withExtensions { extensions =>
    extensions.injectResolutionRule { session =>
      ...
    }
    extensions.injectParser { (session, parser) =>
      ...
    }
  }
  .getOrCreate()
Note that none of the injected builders should assume that the SparkSession is fully initialized and should not touch the session's internals (e.g. the SessionState).
Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across Spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources.
A Column where an Encoder has been given for the expected input and return type.
To create a TypedColumn, use the as function on a Column.
T: The input type expected for this expression. Can be Any if the expression is type checked by the analyzer instead of the compiler (i.e. expr("sum(...)")).
U: The output type of this column.
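A brief sketch (assuming a Dataset[Person] named people with an integer age field, and spark.implicits._ in scope):

import org.apache.spark.sql.functions._

// sum($"age") is an untyped Column; .as[Long] turns it into a TypedColumn[Any, Long]
val totalAge = sum($"age").as[Long]
val total: Dataset[Long] = people.select(totalAge)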
1.6.0
Functions for registering user-defined functions. Use SparkSession.udf to access this:
spark.udf
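For example, a small sketch (the UDF name and logic are illustrative, and a temporary view named people is assumed):

// Register a Scala function as a SQL user-defined function.
spark.udf.register("strLen", (s: String) => s.length)

// The UDF can now be used in SQL expressions.
spark.sql("SELECT strLen(name) FROM people")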
1.3.0
:: Experimental :: Methods for creating an Encoder.
1.6.0
1.3.0
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the created SQLContext instance.
It also provides utility functions to support a per-thread session preference in multi-session scenarios: setActive sets a SQLContext for the current thread, which will then be returned by getOrCreate instead of the global one.
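A brief sketch (assuming an existing SparkContext named sc; these helpers are superseded by SparkSession in 2.0):

// Build or reuse the global SQLContext for this SparkContext.
val sqlContext = SQLContext.getOrCreate(sc)

// Prefer this instance for the current thread.
SQLContext.setActive(sqlContext)
val active = SQLContext.getOrCreate(sc)  // returns the thread-active instance
SQLContext.clearActive()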
Contains API classes that are specific to a single language (i.e. Java).
Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists.
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) using the functions.expr() API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation.
As an example, isnan is a function that is defined here. You can use isnan(col("myCol")) to invoke the isnan function. This way the programming language's compiler ensures isnan exists and is of the proper form. You can also use the expr("isnan(myCol)") function to invoke the same function. In this case, Spark itself will ensure isnan exists when it analyzes the query.
regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
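Putting the two styles side by side, a brief sketch (df and the column names are illustrative):

import org.apache.spark.sql.functions._

// Compile-time checked: isnan is defined in functions.
df.select(isnan(col("myCol")))

// Resolved by Spark at analysis time from a SQL expression string.
df.select(expr("isnan(myCol)"))
df.select(expr("regr_count(yCol, xCol)"))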
1.3.0
Support for running Spark SQL queries using functionality from Apache Hive (does not require an existing Hive installation). Supported Hive features include HiveQL queries, reading metadata from the Hive Metastore, Hive SerDes, and Hive user-defined functions (UDFs, UDAFs, UDTFs).
Users that would like access to this functionality should create a HiveContext instead of a SQLContext.
A set of APIs for adding data sources to Spark SQL.
Contains a type system for attributes produced by relations, including complex types like structs, arrays and maps.
Allows the execution of relational queries, including those expressed in SQL using Spark.