Projects
Audience: Data Owners, Data Users, and Data Governors
Content Summary: Projects allow users to logically group work by linking data sources and can be created by Data Users who want to efficiently organize their work or by Data Owners who want to provide special access to data to specific users. Additionally, Data Governors can act as project owners for any project in their organization.
This overview describes concepts related to and major features of projects; the Project Owner, Project Member, and Project Governor guides provide tutorials for each of these user types.
Project Roles
The features and capabilities of each user differ based on the user's role within the project and within Immuta. Roles and their capabilities are outlined below.
Project Owner Capabilities
Users with the CREATE_PROJECT permission are considered owners of the projects they create and have the following capabilities:
- manage project members
- manage project documentation
- set subscription policies on the project
- enable Project Equalization
- enable Masked Joins
- manage project data sources
- manage project tags
- post, reply to, delete, and resolve discussion threads
- disable, delete, and restore the project
- create derived data sources
- switch project contexts
- manage native workspaces
Governor Capabilities
Governors have the following capabilities for any project in their organization, even for projects that are private or that they are not members of:
- manage project members
- set subscription policies on the project
- add data sources to and delete data sources from the project
- manage project tags
- post and reply to discussion threads and delete their own threads and replies
- disable and restore a project
- configure project purposes and acknowledgement statements
- switch project contexts
Project Member Capabilities
Once subscribed to a project, all Immuta users have the following capabilities as project members:
- add data sources to the project (unless Project Equalization or Masked Joins is enabled)
- remove data sources they’ve added to the project
- post and reply to discussion threads and delete their own discussion threads and replies
- create derived data sources
- switch project contexts
Project Purposes and Acknowledgement Statements
The Data Governor is responsible for configuring project purposes and acknowledgement statements.
- Purposes: Purposes help define the scope and use of data within a project and allow users to meet purpose restrictions on policies. Governors create and manage purposes and their sub-purposes, which project owners then add to their project(s) and use to drive Data Policies.
- Acknowledgement Statements: Projects containing purposes require owners and subscribers to acknowledge that they will only use the data for those purposes by affirming or rejecting acknowledgement statements. If users accept the statement, they become project members. If they reject the acknowledgement statement, they are denied access to the project. Once acknowledged, data accessed under the provision of a project will be audited and the purposes will be noted. Immuta provides default acknowledgement statements, but Data Governors can customize these statements in the Purposes tab.
Sub-Purposes
Purposes can be constructed as a hierarchy, meaning that purposes can contain nested sub-purposes, much like tags in Immuta. This design allows more flexibility in managing purpose-based restriction policies and transparency in the relationships among purposes.
For example, consider this organization of the sub-purposes of Research:
Instead of creating separate purposes, which must then each be added to policies as they evolve, a Governor could write the following Global Policy:
Limit usage to purpose(s) Research for everyone on data sources tagged PHI.
Now, any user acting under the purpose or sub-purpose of Research, whether Research.Marketing, Research.Onboarding.Customer, or Research.MedicalClaims, will meet the criteria of this policy. Consequently, purpose hierarchies eliminate the need for a Governor to rewrite these Global Policies when sub-purposes are added or removed. Furthermore, if new projects with new Research purposes are added, for example, the relevant Global Policy will automatically be enforced.
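The hierarchy check described above can be sketched in a few lines. This is a simplified model that assumes purposes are written as dot-separated names, as in the example; the function name `satisfies_purpose` is hypothetical and is not part of Immuta's API.

```python
def satisfies_purpose(acting_purpose: str, required_purpose: str) -> bool:
    """Return True if acting_purpose is required_purpose or one of its sub-purposes."""
    acting = acting_purpose.split(".")
    required = required_purpose.split(".")
    # Compare whole path segments so that "ResearchOps" does not match "Research".
    return acting[:len(required)] == required


# Every purpose named in the example meets a policy limited to Research.
for purpose in ("Research", "Research.Marketing",
                "Research.Onboarding.Customer", "Research.MedicalClaims"):
    assert satisfies_purpose(purpose, "Research")

# A parent purpose does not satisfy a policy limited to one of its sub-purposes.
assert not satisfies_purpose("Research", "Research.Marketing")
```

Because the match is made on the leading segments of the name, adding or removing a sub-purpose under Research requires no change to the policy itself.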
Switching Project Contexts
The Immuta UI provides a simple way to switch project contexts so that users can access various data sources while acting under the appropriate purpose. By default, there will be no project selected, even if the user belongs to one or more projects in Immuta.
When users change project contexts, all SQL queries or blob fetches that run through Immuta will reflect users as acting under the purposes of that project, which may allow additional access to data if there are purpose restrictions on the data source(s). This process also allows organizations to track not just whether a specific data source is being used, but why.
Project Equalization
The same security restrictions regarding data sources are applied to projects; project members still need to be subscribed to data sources in order to access data, and only users with appropriate attributes and credentials will be able to see the data if it contains any row-level or masking security.
However, Project Equalization improves collaboration by ensuring that the data in the project looks identical to all members, regardless of their level of access to data. When enabled, this feature automatically equalizes all permissions so that no project member has more access to data than the member with the least access.
Note: Only project owners can add data sources to the project if this feature is enabled.
For instructions on enabling Project Equalization, navigate to the Project Owner guide.
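As a rough illustration of what "equalizing" means, the sketch below models each member's access as the set of row IDs that member could see on their own and takes the intersection. The member names, the row-ID model, and the function `equalized_rows` are all assumptions made for illustration; Immuta's actual mechanism works through entitlements, not row sets.

```python
from typing import Dict, Set


def equalized_rows(visible_rows_by_member: Dict[str, Set[int]]) -> Set[int]:
    """Rows every member can see, so no one sees more than the least-privileged member."""
    views = list(visible_rows_by_member.values())
    if not views:
        return set()
    return set.intersection(*views)


# Hypothetical members with decreasing levels of access.
visible = {
    "alice": {1, 2, 3, 4},
    "bob": {1, 2, 3},
    "carol": {1, 2},
}
shared = equalized_rows(visible)
assert shared == {1, 2}  # everyone sees what carol, the least-privileged member, sees
```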
Project Equalization and Subscription Policies
Once Project Equalization is enabled, the project Subscription Policy builder locks and can only be adjusted by manually editing the Equalized Entitlements. Then, the Subscription Policy will combine with the entitlement settings, depending on the policy type.
For example, consider the Subscription Policy of the following sample project, Fraud Prevention, before Project Equalization is enabled:
Fraud Prevention
Subscription Policy: Allow users to subscribe when approved by anyone with permission Owner (of this project).
After enabling Project Equalization, the following Equalized Entitlement is recommended by Immuta: User is a member of group Accounting.
In this particular example, the Equalized Subscription Policy contains the Equalized Entitlement and the approval of the original policy, so users must satisfy both conditions to subscribe:
- the user must be a member of the group Accounting and
- the user must be approved by anyone with permission Owner (of this project).
However, the way entitlements and approvals combine differs depending on the policy type; for clarity, the table below illustrates various scenarios for each type. Every row demonstrates how a specific project Subscription Policy changes after Project Equalization is enabled (when an equalized entitlement is set and when no entitlement is set) and how the policy reverts if Project Equalization is subsequently disabled.
Original Policy | Equalized Policy (Example Entitlement: member of group Accounting) | Equalized Policy (No Entitlement) | Policy After Disabling Equalization |
---|---|---|---|
Anyone | Allow user to subscribe when user is a member of group Accounting | Individual Users You Select | Individual Users You Select |
Allow users to subscribe when approved by anyone with permission Owner (of this project) | Allow users to subscribe when they satisfy all of the following: is a member of group Accounting and is approved by anyone with permission Owner (of this Project) | Allow users to subscribe when approved by anyone with permission Owner (of this project) | Allow users to subscribe when approved by anyone with permission Owner (of this project) |
Allow users to subscribe to the project when user is a member of group Legal | Allow users to subscribe to the project when user is a member of group Accounting | Individual Users You Select | Individual Users You Select |
Individual Users You Select | Allow users to subscribe to the project when user is a member of group Accounting | Individual Users You Select | Individual Users You Select |
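The pattern in the table can be summarized in code: only an approval requirement carries over from the original policy, and an equalized entitlement, if set, is added alongside it. The sketch below is one reading of the table, not Immuta's implementation; the function names are hypothetical, and an empty list stands for "Individual Users You Select".

```python
from typing import List, Optional


def equalized_conditions(original_approval: Optional[str],
                         entitlement: Optional[str]) -> List[str]:
    """Conditions required to subscribe while Project Equalization is enabled.

    original_approval is None for "Anyone", group-based, and
    "Individual Users You Select" policies, because the table shows
    that none of those carry over.
    """
    conditions = []
    if entitlement:
        conditions.append(entitlement)
    if original_approval:
        conditions.append(original_approval)
    return conditions


def conditions_after_disabling(original_approval: Optional[str]) -> List[str]:
    """Policy once equalization is turned off again (last column of the table)."""
    return [original_approval] if original_approval else []


approval = "is approved by anyone with permission Owner (of this project)"
accounting = "is a member of group Accounting"

assert equalized_conditions(approval, accounting) == [accounting, approval]
assert equalized_conditions(None, accounting) == [accounting]
assert equalized_conditions(None, None) == []  # Individual Users You Select
```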
Equalized Entitlements
This setting adjusts the minimum entitlements (i.e., users' groups and attributes) required to join the project and to access data within the project. When Project Equalization is enabled, Equalized Entitlements default to Immuta's recommended settings, but project owners can edit these settings by adding or removing parts of the entitlements. However, making these changes entails two potential disadvantages:
- If you add entitlements, members might see more data as a whole, but at least some members of the project will be out of compliance. The status of users' compliance is visible from the Members tab within the project.
- If you remove entitlements, the project will be open to users with fewer privileges, but this change might make less data visible to all project members. Removing entitlements is only recommended if you foresee new users joining with less access to data than the current members.
Validation Frequency
This setting determines how often user credentials are validated, which is critical if users share data with project members outside of Immuta, as they need a way to verify that those members' permissions are still valid. Validation Frequency provides those means of verification.
Masked Joins
This feature allows masked columns to be joined within the context of a project.
Note: Masked columns cannot be joined across data sources that are not linked by a project.
For instructions on enabling Masked Joins, navigate to the Project Owner guide.
Derived Data Sources
When Project Equalization is enabled, members can use data sources within the project to create a derived data source, which dynamically inherits the Subscription Policies and purpose restriction Data Policies from the parent source(s).
For example, consider these data sources, which each contain a Subscription and Data Policy:
Data Source A
Subscription Policy: Allow users to subscribe to the data source when user is a member of group Medical Claims
Data Policy: Mask by making null the value in the column(s) address except for members of group Legal
Data Source B
Subscription Policy: Allow users to subscribe to the data source when user is approved by anyone with permission Owner and anyone with permission Governance
Data Policy: Limit usage to purpose(s) Research for everyone
If a user creates a derived data source, Data Source C, from these two data sources, Data Source C will inherit these policies, which will be unchangeable:
Data Source C
Subscription Policy: Allow user to subscribe when they satisfy all of the following:
- is a member of group Legal and is a member of group Medical Claims
- is approved by anyone with permission Owner (of Data Source B) and anyone with permission Governance
Data Policy: Limit usage to purpose(s) Research for everyone
Note: If members use data outside the project to create their data source, they must first add that data to the project and re-derive the data source through the project connection. When creating a derived data source, members are prompted to certify that their data is derived from the parent data sources they selected upon creation.
For detailed instructions on creating a derived data source, navigate to the Project Owner Guide.
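The Data Source C example can be reproduced with a small model. The sketch below follows that example literally: parent subscription conditions are combined, the group exempted from masking (Legal) becomes a subscription requirement, and purpose restrictions are carried over. The class, field, and function names are hypothetical, and the handling of masking exceptions is inferred from the example rather than from a documented rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSource:
    name: str
    # Conditions a user must meet to subscribe (plain strings for illustration).
    subscription_conditions: List[str]
    # Groups exempted from a masking policy, e.g. ["Legal"].
    masking_exception_groups: List[str] = field(default_factory=list)
    # Purposes named in "Limit usage to purpose(s) ..." Data Policies.
    purpose_restrictions: List[str] = field(default_factory=list)


def derive(name: str, parents: List[DataSource]) -> DataSource:
    """Combine parent policies the way the Data Source C example does."""
    conditions: List[str] = []
    purposes: List[str] = []
    for parent in parents:
        required = [f"is a member of group {g}" for g in parent.masking_exception_groups]
        required += parent.subscription_conditions
        for condition in required:
            if condition not in conditions:
                conditions.append(condition)
        for purpose in parent.purpose_restrictions:
            if purpose not in purposes:
                purposes.append(purpose)
    return DataSource(name, conditions, purpose_restrictions=purposes)


source_a = DataSource("Data Source A", ["is a member of group Medical Claims"],
                      masking_exception_groups=["Legal"])
source_b = DataSource(
    "Data Source B",
    ["is approved by anyone with permission Owner (of Data Source B) "
     "and anyone with permission Governance"],
    purpose_restrictions=["Research"],
)
source_c = derive("Data Source C", [source_a, source_b])
assert source_c.purpose_restrictions == ["Research"]
```

A user must satisfy every entry in `source_c.subscription_conditions`, which matches the "all of the following" wording in the example.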
Native Workspaces
HDFS
This workspace allows native access to data on the cluster without having to go through the Immuta SparkSession or Immuta Query Engine. Within a project, users can enable HDFS Native Workspace, which creates a workspace directory in HDFS (and a corresponding database in the Hive metastore) where users can write files.
Accessing Data
After a Project Owner creates a workspace, users are only able to access this HDFS directory and database while acting under the project, and they should use the SparkSQL session to copy data into the workspace. The Immuta Spark SQL Session applies policies to the data, so any data written to the workspace is already compliant with the restrictions of the equalized project, letting all members see data at the same level of access.
Once derived data is ready to be shared outside the workspace, it can be exposed as a derived data source in Immuta. At that point, the derived data source inherits policies appropriately, and it is then available through Immuta outside the project and can be used in future project workspaces by different teams in a compliant way.
Requirements
Administrators
- Administrators can opt to configure where all Immuta projects are kept in HDFS. The default is /user/immuta/workspace. Note: If an administrator changes the default directory, the Immuta user must have full access to that directory. Once any workspace is created, this directory can no longer be modified.
- Administrators can place a configuration value in the cluster configuration (core-site.xml) to mark that cluster as unavailable for use as a workspace.
Project Owners
- Once a project is equalized, Project Owners can enable a workspace for the project.
- If more than one cluster is configured, Immuta prompts for which to use.
- Once enabled, the full URI of where that workspace is located displays on the project page.
- Project Owners can also add connection information for Hive or Impala to allow a workspace source to be created. The connection information provided and the Kerberos credentials configured for Immuta are used for each derived Hive or Impala data source. The connection string for Hive or Impala is displayed on the project page with the full URI.
- Project Owners can disable the workspace at any time.
- When disabled, the workspace no longer allows project members to read from or write to it.
- Data sources living in this directory still exist, and their access is not changed. Subscribed users still have access as usual.
- All data in this directory still exists, regardless of whether it belongs to a data source or not.
- After it has been disabled, Project Owners can purge all data in the workspace. They can purge all non-data-source data only or purge all data (including data source data).
- When purging all data source data, sources can either be disabled or fully deleted.
Project Members
- When a user is acting under the project context, Immuta provides them read/write access to the project HDFS directory (using HDFS ACLs). If there are Immuta data sources already exposed in that directory and the user is acting under the project for the data in that directory, then the user bypasses the namenode plugin.
- Once a user is not acting under the project, all access to that directory is revoked, and they can only access data in that project as official Immuta data sources.
- When users with the CREATE_DATA_SOURCE_IN_PROJECT permission create a derived data source with workspace enabled, they are prompted with a modified workflow:
- The user selects the directory (starting with the project root directory) of the data they are exposing.
- If the directory contains Parquet or ORC files, then the data source options are Hive, Impala, and HDFS. If the directory does not contain Parquet or ORC files, then only HDFS is available.
- The Immuta user connection is used to create the data source. This ensures join pushdown and that the data source works even when the user is not acting in the project. Note: Hive or Impala workspace sources are only available if the Project Owner added Hive or Impala connection information to the workspace.
- If Hive or Impala is selected as the data source type, then Immuta infers schema/partitions from files and generates create table statements for Hive.
- Once the data source is created, policy inheritance takes effect.
Note: To avoid data source collisions, Immuta does not allow HDFS and Hive/Impala data sources to be backed from the same location in HDFS.
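The data source options in the workflow above reduce to a small decision rule, sketched here. Detecting Parquet and ORC files by file extension and the function name `available_source_types` are assumptions made for illustration; the source does not say how Immuta inspects the directory.

```python
from typing import Iterable, List


def available_source_types(file_names: Iterable[str],
                           has_hive_connection: bool = False,
                           has_impala_connection: bool = False) -> List[str]:
    """Data source types offered for a workspace directory.

    Hive and Impala are offered only when the directory holds Parquet or ORC
    files and the Project Owner added the matching connection information.
    HDFS is always available.
    """
    # Assumption: file format is recognized by extension.
    has_columnar_files = any(
        name.lower().endswith((".parquet", ".orc")) for name in file_names
    )
    options = []
    if has_columnar_files and has_hive_connection:
        options.append("Hive")
    if has_columnar_files and has_impala_connection:
        options.append("Impala")
    options.append("HDFS")
    return options


# Hypothetical directory listings.
assert available_source_types(["part-0000.parquet"], True, True) == ["Hive", "Impala", "HDFS"]
assert available_source_types(["notes.csv"], True, True) == ["HDFS"]
```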
Snowflake
Snowflake workspaces allow users to access protected data directly in Snowflake without having to go through the Immuta Query Engine.
Accessing Data
Typically, Immuta applies policies by forcing users to query through the Query Engine, which acts as a proxy in front of the database Immuta is protecting. However, this process becomes unnecessary with Snowflake's secure views: Immuta enforces policy logic on data by representing it as secure views in Snowflake. Secure views are static, however, and creating a secure view for every unique user in your organization for every table in your organization would result in secure view bloat. Immuta projects address this problem by virtually grouping users and tables and equalizing users to the same level of access; this ensures that all members of the project see the same view of the data. All members then share one secure view.
Beyond interacting directly with Snowflake secure views in these workspaces, users can create derived data sources and collaborate with other project members at a common access level. These derived data sources inherit all appropriate policies, making that data sharable outside of the project. Additionally, derived data sources use the credentials of the Immuta system Snowflake account, which allows them to persist after a workspace is disconnected.
Policy Enforcement
Immuta enforces policy logic on data and represents it as secure views in Snowflake. All members see the same view of the data because projects group users and tables to equalize members to the same level of access. Consequently, only one secure view is needed, and changes to policies immediately propagate to the relevant secure views.
Mapping Projects to Secure Views
Immuta projects are represented as Session Contexts within Snowflake. As they are linked to Snowflake, projects automatically create corresponding
- roles in Snowflake: IMMUTA_[project name]
- schemas in the Snowflake IMMUTA database: [project name]
- secure views in the project schema for any table in the project
If users switch projects, they change their Snowflake Session Context to the appropriate Immuta project. If users are not entitled to a data source contained by the project, they are not able to access the Context in Snowflake until they have access to all tables in the project. If changes are made to a user's attributes, the changes immediately propagate to the Snowflake context.
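The mapping and the access rule above can be sketched as follows. The role and schema patterns come from the list above; everything else is an assumption for illustration, including the function names, the project name FRAUD_PREVENTION, the idea that a secure view shares its table's name, and the omission of whatever normalization Immuta applies to project names.

```python
from typing import Dict, Iterable, List, Set, Union


def snowflake_objects(project_name: str,
                      tables: Iterable[str]) -> Dict[str, Union[str, List[str]]]:
    """Names of the Snowflake objects that correspond to an Immuta project."""
    return {
        "role": f"IMMUTA_{project_name}",
        "schema": f"IMMUTA.{project_name}",
        # Assumption: each secure view is named after its table.
        "secure_views": [f"IMMUTA.{project_name}.{table}" for table in tables],
    }


def can_use_context(entitled_tables: Set[str], project_tables: Set[str]) -> bool:
    """A user can use the context only with access to all tables in the project."""
    return project_tables <= entitled_tables


objects = snowflake_objects("FRAUD_PREVENTION", ["CLAIMS", "PAYMENTS"])
assert objects["role"] == "IMMUTA_FRAUD_PREVENTION"
assert not can_use_context({"CLAIMS"}, {"CLAIMS", "PAYMENTS"})
```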
Using Immuta with an Existing Snowflake Account
The following steps allow Immuta to be used with existing Snowflake accounts.
- Immuta is configured to integrate with the organization’s Snowflake account and to share a single sign-on (such as Okta), allowing users in Immuta to map to the same users in Snowflake. Alternatively, that mapping can be inferred by using the same usernames in both Snowflake and Immuta.
- CREATE_DATA_SOURCE permissions are granted to specific users to allow them to expose Snowflake table metadata and enforce policies.
- Tags can be used to drive policies by users manually adding tags when tables are imported, Immuta automatically tagging sensitive data (if Sensitive Data Detection is enabled), or users pulling tags from external catalogs that are mapped to the tables being exposed.
- Policies are created and enforced on tables.
- The CREATE_PROJECT permission is granted to specific users so they can create their own Immuta projects and create the appropriate Snowflake contexts. These users can drive what projects and hence what Snowflake contexts exist. Note: When users leave a project or a project is deleted, that Snowflake context is removed from their Snowflake accounts.
- The CREATE_DATA_SOURCE_IN_PROJECT permission is given to specific users so they can expose their derived tables in the project. The derived tables inherit the policies, and then the data can be shared outside the project.
- Users access data only through secure views in Snowflake (via Immuta projects), significantly decreasing the amount of role management for administrators in Snowflake. Organizations should also consider having one user in Snowflake who is able to create databases and make GRANTs on those databases, and separate users who are able to read from and write to those tables.
Benefits
- Few roles to manage in Snowflake: That complexity is pushed to Immuta, which is designed to simplify it.
- A small set of users has direct access to raw tables: Most users go through secure views only, but raw database access can be segmented across departments.
- Policies are built by the individual database administrators within Immuta and are managed in a single location: This lets changes to policies automatically propagate across thousands of tables’ secure views.
- Self-service access to data based on data policies.
- Users work in various contexts in Snowflake natively: Their contexts are based on their collaborators and their purpose, without fear of leaking data.
- All policies are enforced natively in Snowflake without performance impact: Security is maintained through Snowflake primitives of roles and secure views, and performance and scalability are maintained with no proxy.
- Policies can be driven by metadata: This allows massive scale policy enforcement with only a small set of actual policies.
- Derived tables can be shared back out through Immuta: This improves user collaboration.
- User access and removal are immediately reflected in secure views.
Limitations
- Snowflake workspaces do not support differential privacy policies. Any Snowflake sources with differential privacy policies applied cannot be created within the native Snowflake workspace.
- Native derived data sources cannot be query-backed.
Cloudera
This workspace allows native access to data on the cluster without having to go through the Immuta SparkSession or Immuta Query Engine.
Accessing Data
Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the SparkSQL session to copy data into the workspace.
Workspace Configuration Options
- Cloudera HDFS
- Cloudera S3A
Available Data Source Types
- Amazon S3 (Cloudera S3A)
- Apache Hive
- Apache HDFS (Cloudera HDFS)
- Apache Impala
Databricks
This workspace allows native access to data on the cluster without having to go through the Immuta SparkSession or Immuta Query Engine.
Accessing Data
Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the SparkSQL session to copy data into the workspace.
When acting in the workspace project, users can read data using calls like spark.read.parquet("immuta:///some/path/to/a/workspace").
To write Delta Lake data to a workspace and then expose that Delta table as a data source in Immuta, you must specify a table (rather than a directory) in the workspace when creating the derived data source.
Workspace Configuration
- AWS S3
- Microsoft Azure
EMR
This workspace allows native access to data on the cluster without having to go through the Immuta SparkSession or Immuta Query Engine.
Accessing Data
Users will only be able to access the directory and database created for the workspace when acting under the project. The Immuta Spark SQL Session will apply policies to the data, so any data written to the workspace will already be compliant with the restrictions of the equalized project, where all members see data at the same level of access. When users are ready to write data back to Immuta, they should use the SparkSQL session to copy data into the workspace.
Workspace Configuration Options
- EMR HDFS
- EMR S3
Available Data Source Types
- Apache Hive
- Apache HDFS (EMR HDFS)
- Amazon S3 (EMR S3)