Immuta SparkSession Query Plan
Audience: Data Owners and Data Users
Content Summary: The Immuta SparkSession (or Immuta Context in Spark 1.6) is the client-side plugin in the Immuta Spark ecosystem. This plugin is an extension of the open source SparkSession and, from a user's perspective, operates the same way (i.e., no modifications to the API). This document outlines how policies affect the query plan in the Immuta SparkSession.
Immuta SparkSession vs. Open Source SparkSession
The two key differences between Immuta's SparkSession and the open source SparkSession are:
- Immuta's external and session catalogs
- Immuta's logical replanning
In order to make the Immuta Spark ecosystem as user-friendly as possible, the Immuta SparkSession resolves relations by reaching out to the Immuta web service rather than resolving them directly in the Hive Metastore, as open source Spark normally does. All queryable Immuta data sources are available to the Immuta SparkSession.
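For example, an Immuta data source can be queried by name as if it were an ordinary table (a minimal sketch; immuta is an instance of the Immuta SparkSession, and customer_purchases is the example data source used later in this document):
// Relations resolve through the Immuta web service rather than the Hive
// Metastore, so any queryable Immuta data source is visible by name here.
val df = immuta.sql("SELECT * FROM customer_purchases")
df.printSchema()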
Hive and Impala data sources are queried by directly accessing the files on the cluster that compose the Hive or Impala table. This is the same type of query execution that occurs in open source Spark when accessing a table in the Hive Metastore.
Any non-Hive/Impala queryable data source in Immuta (MySQL, Oracle, PostgreSQL, etc.) is queried from the user's Immuta SparkSession via JDBC through the Immuta Query Engine. Users can provide query partition information, similar to what is available via the JDBC data source in Spark, in order to distribute their query across the Query Engine.
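For comparison, this is how a partitioned read looks with the open source Spark JDBC data source. The exact options Immuta accepts for Query Engine reads may differ, so treat the snippet below as the Spark analogy rather than Immuta's API (spark is an ordinary open source SparkSession; the URL is hypothetical and credentials are omitted):
// Open source Spark JDBC partitioned read, shown for comparison.
// Spark issues numPartitions parallel queries, each covering a slice of
// partitionColumn between lowerBound and upperBound.
val purchases = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com/sales")  // hypothetical
  .option("dbtable", "purchase")
  .option("partitionColumn", "customer_id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()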
In the case of a JDBC data source, policies are enforced at the Query Engine layer. In the case of a Hive or Impala (on-cluster) data source, policies are enforced via the following steps (sketched conceptually after this list):
- Plan modifications in the Immuta SparkSession
- Restrictions to field/method access via the Immuta SecurityManager
- Partition and file access token generation in the Immuta Partition Service
- Token validation and filesystem access enforcement in the Immuta NameNode plugin (HDFS)
- Token validation and remote object store proxying/enforcement in the Immuta Partition Service (S3/ADL/etc)
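As a conceptual illustration only, the token handshake for an on-cluster read might be modeled as below. Every name in this sketch is hypothetical; Immuta's actual interfaces are internal and not part of any public API:
// Hypothetical model of the partition/file token flow described above.
// None of these types exist in Immuta; they only illustrate the steps.
case class FileToken(user: String, path: String, signature: String)

trait PartitionService {        // token generation
  def requestToken(user: String, path: String): FileToken
}

trait NameNodePlugin {          // token validation (HDFS)
  def validate(token: FileToken): Boolean
}

def readPartition(ps: PartitionService, nn: NameNodePlugin,
                  user: String, path: String): Unit = {
  val token = ps.requestToken(user, path)
  require(nn.validate(token), s"access to $path denied for $user")
  // only after validation does the executor read the underlying file data
}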
Plan Modifications
When a user attempts to query any Hive or Impala data source via the Immuta SparkSession, the Immuta catalogs will first replace the relation in the user's plan with the proper plan that the data source represents. For example, if the user attempts the following query (immuta is an instance of the Immuta SparkSession)
immuta.sql("SELECT * FROM customer_purchases WHERE age BETWEEN 18 AND 24 AND product_id = 15")
and the customer_purchases data source is composed of the following query
SELECT * FROM customer JOIN purchase ON customer.id = purchase.customer_id
and, in Immuta, the columns id, first_name, last_name, age, country, ssn, product_id, department, and purchase_date were selected to be exposed in this data source,
the resulting Spark logical plan would look like this:
'Project [*]
+- 'Filter ((('age >= 18) && ('age <= 24)) && ('product_id = 15))
   +- 'Project ['id, 'first_name, 'last_name, 'age, 'country, 'ssn, 'product_id, 'department, 'purchase_date]
      +- 'Join Inner, ('customer.id = 'purchase.customer_id)
         :- 'UnresolvedRelation `customer`
         +- 'UnresolvedRelation `purchase`
After the data source is resolved, the policies specific to the querying user are applied to the logical plan. If a policy includes masking or filtering (row-level, minimization, time filters, etc.), those controls are applied to all corresponding underlying tables in the plan. For example, consider the following Immuta policies:
- Mask using hashing the column ssn for everyone
- Only show rows where user is a member of a group stored in Immuta that matches the value in the column department for everyone
These policies would modify the plan (assuming the current user is in the "Toys" and "Home Goods" groups) to look like this:
'Project [*]
+- 'Filter ((('age >= 18) && ('age <= 24)) && ('product_id = 15))
   +- 'Project ['id, 'first_name, 'last_name, 'age, 'country, 'ssn, 'product_id, 'department, 'purchase_date]
      +- 'Join Inner, ('customer.id = 'purchase.customer_id)
         :- 'Project ['id, 'first_name, 'last_name, 'age, 'country, unresolvedalias('immuta_hash('ssn), None)]
         :  +- 'UnresolvedRelation `customer`
         +- 'Project ['customer_id, 'product_id, 'department, 'purchase_date]
            +- 'Filter (('department = Toys) || ('department = Home Goods))
               +- 'UnresolvedRelation `purchase`
Notice that masked columns (such as ssn) are aliased to their original name after masking is applied, which means that transformations, filters, or functions applied to those columns operate on the masked values. Also notice that policy filters are applied to the plan before any user transformations or filters. This means that a user's query cannot modify or subvert the policies applied to the plan, just as if the user had applied transformations, filters, or a UDF to a view.
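For example, a user predicate on a masked column operates on the masked values (a sketch against the example data source above; immuta_hash stands in for the masking function shown in the plan):
// Because ssn is aliased back to its original name after masking,
// this predicate is evaluated against the hashed value, not the raw SSN,
// so filtering on a raw SSN literal cannot reveal unmasked rows.
val masked = immuta.sql(
  "SELECT * FROM customer_purchases WHERE ssn = '123-45-6789'")
// The filter effectively becomes: immuta_hash(ssn) = '123-45-6789'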
Immuta does not attempt to change or block optimizations made to the Spark plan by the Catalyst Optimizer.
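Because the policy logic lives in the logical plan itself, the standard Spark explain output can be used to inspect the injected projections and filters alongside Catalyst's optimizations (a minimal sketch):
// Standard Spark API: prints the parsed, analyzed, and optimized logical
// plans plus the physical plan for the policy-rewritten query.
immuta.sql("SELECT * FROM customer_purchases WHERE product_id = 15")
  .explain(true)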