Spark SQL & DataFrames | Apache Spark

Integrated

Seamlessly mix SQL queries with Spark programs.

Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. It is usable from Java, Scala, Python, and R.

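from pyspark.sql import HiveContext

# sc is an existing SparkContext
context = HiveContext(sc)
results = context.sql("SELECT * FROM people")
# Map over the underlying RDD; this also works on Spark 2.x
names = results.rdd.map(lambda p: p.name)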

Apply functions to results of SQL queries.
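
The same lookup can be written with the DataFrame API instead of a SQL string. The following is a minimal sketch, not the project's own example: the helper name `people_names` is hypothetical, and it assumes `context` is a HiveContext or SQLContext (Spark 1.x) or a SparkSession (Spark 2.x) with a `people` table that has a `name` column.

```python
def people_names(context):
    """Fetch the `name` column of the `people` table via the DataFrame API.

    `context` is assumed to be a HiveContext/SQLContext (Spark 1.x) or a
    SparkSession (Spark 2.x); each of these exposes `table()`.
    """
    # Roughly equivalent to: SELECT name FROM people
    rows = context.table("people").select("name").collect()
    # Rows support lookup by column name, so plain Python applies from here on.
    return [row["name"] for row in rows]
```

Note that `collect()` pulls every row to the driver, so this shape only suits small results; for large tables, keep the work in DataFrame or RDD operations.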

Uniform Data Access

Connect to any data source the same way.

DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.

# Load the JSON data and expose it to SQL as a temporary table
context.read.json("s3n://...").registerTempTable("json")
results = context.sql(
  """SELECT *
     FROM people
     JOIN json ...""")

Query and join different data sources.
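
A cross-source join can also be expressed without SQL. This is a sketch under stated assumptions: the helper `join_people_with_json` and its `json_path` and `key` parameters are hypothetical, since the page elides both the path and the join condition, and `context` is assumed to be a HiveContext, SQLContext, or SparkSession that provides `table()` and `read`.

```python
def join_people_with_json(context, json_path, key="name"):
    """Join the `people` table with a JSON dataset on a shared column.

    `json_path` and `key` are placeholders; supply the real location and
    join column for your own data.
    """
    people = context.table("people")        # e.g. a table from the Hive metastore
    records = context.read.json(json_path)  # schema is inferred from the JSON records
    # Passing a column name performs an equi-join on that column,
    # which must exist on both sides.
    return people.join(records, key)
```

The result is an ordinary DataFrame, so it can be filtered, aggregated, or registered as a temporary table like any other.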

Hive Compatibility

Run unmodified Hive queries on existing data.

Spark SQL reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. Simply install it alongside Hive.

Spark SQL can use existing Hive metastores, SerDes, and UDFs.

Standard Connectivity

Connect through JDBC or ODBC.

A server mode provides industry-standard JDBC and ODBC connectivity for business intelligence tools.

Use your existing BI tools to query big data.

Performance & Scalability

Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance, so you do not need a separate engine for historical data.
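
To see what the optimizer does with a given query, you can ask a DataFrame for its plan. A minimal sketch, assuming the `people` table from the earlier examples; the helper name `inspect_query` is hypothetical, and `context` is again assumed to be a HiveContext, SQLContext, or SparkSession.

```python
def inspect_query(context, query="SELECT * FROM people"):
    """Cache a query result and print the plans Spark SQL builds for it."""
    df = context.sql(query)
    # Mark the result for in-memory caching; nothing is computed until
    # an action (count, collect, ...) runs.
    df.cache()
    # With True, explain() prints the logical plans as well as the physical plan.
    df.explain(True)
    return df
```

The printed plans go to standard output, so this is best run from an interactive shell.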

Community

Spark SQL is developed as part of Apache Spark. It thus gets tested and updated with each Spark release.

If you have questions about the system, ask on the Spark mailing lists.

The Spark SQL developers welcome contributions. If you'd like to help out, read how to contribute to Spark, and send us a patch!

Getting Started

To get started with Spark SQL:

Download Spark. It includes Spark SQL as a module.
Read the Spark SQL and DataFrame guide to learn the API.
