Databricks Data Source Creation Tutorial
Audience: Data Owners
Content Summary: This guide details how to create a Databricks data source in Immuta. To explore data source creation guides for other storage technologies, see the Object-backed and Query-backed guides.
Databricks Permissions
Before you can create Databricks data sources in Immuta, you must have these permissions on the target tables and views you want to expose as data sources (a sketch of granting them follows the list):
SELECT
READ_METADATA
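If these privileges are missing, a workspace admin can grant them to the account Immuta connects as. The following is a minimal sketch, assuming legacy table access control is enabled on the cluster; the table name and principal below are placeholders, not values from this guide.

```python
# Hypothetical example: grant the privileges Immuta needs on one target table.
# Run in a Databricks notebook on a cluster with table access control enabled;
# "spark" is the SparkSession Databricks provides automatically.
target_table = "analytics.customer_orders"         # placeholder table name
immuta_principal = "`immuta_service@example.com`"   # placeholder account Immuta connects as

spark.sql(f"GRANT SELECT ON TABLE {target_table} TO {immuta_principal}")
spark.sql(f"GRANT READ_METADATA ON TABLE {target_table} TO {immuta_principal}")
```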
Step 1: Enter Connection Information
- Complete the first four fields in the Connection Information box:
- Server: hostname or IP address
- Port: port configured for Databricks, typically port 443
- SSL: when enabled, ensures communication between Immuta and the remote database is encrypted
- Database: the remote database
- Retrieve your Personal Access Token:
- Navigate to Databricks in your browser, and then click the user account icon in the top right corner of the page.
- Select User Settings from the dropdown menu.
- Navigate to the Access Tokens tab, and then click the Generate New Token button.
- Enter this token in the Personal Access Token field in the Immuta console.
- Retrieve the HTTP Path:
- Navigate to your cluster in Databricks.
- Click Advanced at the bottom of the page, and then select the JDBC/ODBC tab.
- Enter the unique HTTP Path from this page in the HTTP Path field in the Immuta console.
- You can then choose to enter Additional Connection String Options or Upload Certificates to connect to the database.
- Click Test Connection.
If the connection is successful, a check mark will appear and you can proceed. If an error occurs when attempting to connect, the error is displayed in the UI. You must be able to connect to this data source with the connection information you just entered before you can move on to the next step of data source creation. (A sketch for verifying these details outside Immuta follows the note below.)
Note: Only Immuta uses the connection you provide, and it injects all policy controls when users query the system. In other words, users always connect through Immuta with policies enforced and have no direct association with this connection. You can use a read-only restricted connection if desired, but Immuta only reads from exposed databases and never writes to them.
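If Test Connection fails, it can help to confirm the Server, HTTP Path, and personal access token outside Immuta before retrying. Below is a minimal sketch using the databricks-sql-connector Python package; it is not part of Immuta, and every value shown is a placeholder for your own workspace.

```python
# Sanity-check the Databricks connection details before entering them in Immuta.
# Requires: pip install databricks-sql-connector
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",        # Server field (placeholder)
    http_path="sql/protocolv1/o/1234567890123456/0123-456789-abcde123",  # HTTP Path field (placeholder)
    access_token="dapiXXXXXXXXXXXXXXXXXXXXXXXX",                         # personal access token (placeholder)
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchall())  # a single row back means the details are valid
```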
Step 2: Determine Virtual Population
Decide how to virtually populate the data source: Table or SQL Statement.
Table
- If you choose Table, click Edit in the table selection box that appears.
- By default, all tables are selected if the total number of tables is below the threshold setting, which is 2000. Select and deselect tables by clicking the checkbox to the left of the table name in the Import Tables menu. You can create multiple data sources at one time by selecting multiple tables.
- After making your selection(s), click Apply.
SQL Statement
Backing a data source with a SQL statement lets SQL build the data source while hiding the complexity of the statement and the specifics of the database. Users simply see a data source in Immuta and a PostgreSQL table in the Immuta Query Engine.
- Before entering a SQL statement, test it against Databricks to verify that it works (see the sketch after this list).
- Enter your SQL statement in the text box.
- Click Validate Statement.
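One way to test a candidate backing statement is to run it in a Databricks notebook before pasting it into Immuta. This is only a sketch; the statement, schema, and table names below are placeholders.

```python
# Placeholder backing statement: replace with your own tables and columns.
statement = """
SELECT o.order_id, o.order_ts, c.country
FROM analytics.orders o
JOIN analytics.customers c ON o.customer_id = c.customer_id
"""

# Run in a Databricks notebook ("spark" is provided automatically).
# If this returns rows without errors, the same text should validate
# when entered in the Immuta SQL Statement box.
spark.sql(statement).limit(10).show()
```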
Step 3: Enter Basic Information
Here you provide information about your source that makes it discoverable to users.
- Complete the Data Source Name field, which will be the name shown in the Immuta UI.
- The SQL Table Name, which auto-populates but is editable, will be the name of the table presented to users in the Immuta Query Engine; it is not the name of the table the data comes from.
- Click Create.
Advanced Options
None of the following steps are required. However, completing these steps will help maximize the utility of your data source.
Option 1: Select File Directory Structure
This step is specific to the Immuta S3 access pattern. If you were to provide your data source as-is to the users in Immuta S3, it would be one big file. However, in this step you can chunk your data into more than one file by building a directory structure.
- Click the Edit button in the File Directory Structure section.
- Click and drag different columns into position at the bottom of the screen to customize the directory structure in Immuta S3 for this data source.
- Click Apply.
When a column is dragged into position, values from that column are incorporated into the directory structure in Immuta S3, and the data source is partitioned into different records based on the values from the selected column; high-cardinality columns will result in many files. For example, if there were a column in your table called "sex" and the unique values were "Male" and "Female," selecting this column would generate the files "Male.json" and "Female.json" (or .csv, if you selected that output format) in Immuta S3. If more columns were placed after "sex," then "Male" and "Female" would display as directories and the files would become the values of the newly selected column.
This step is similar to the "Map" step in a MapReduce job: you are "mapping" your data into chunks that downstream analysts can then process by accessing these specific directories in Immuta S3. Those analysts effectively act as the interactive reduce step against the data. Note, however, that some column types (such as float, double, and decimal) are not available to select in the directory structure.
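To make the partitioning behavior concrete, here is a small Python illustration of how a directory column splits records into per-value files, mirroring the "sex" example above. This is not Immuta code; Immuta produces these files on demand at read time rather than writing them up front, and the data is made up.

```python
# Illustration only: how a "sex" directory column would split records into
# Male.json and Female.json, as described above.
import json
from collections import defaultdict

rows = [
    {"name": "Ada",   "sex": "Female", "age": 36},
    {"name": "Alan",  "sex": "Male",   "age": 41},
    {"name": "Grace", "sex": "Female", "age": 45},
]

partitions = defaultdict(list)
for row in rows:
    partitions[row["sex"]].append(row)

for value, records in partitions.items():
    with open(f"{value}.json", "w") as f:  # produces Female.json and Male.json
        json.dump(records, f, indent=2)
```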
Option 2: Select Output Format
This step is specific to the Immuta S3 access pattern. You tell Immuta how to convert the result set from the database into a format relevant to a filesystem.
Each file in Immuta S3 is actually a result set from a SQL statement that is fired on demand at read time. Since Immuta knows the structure of a SQL result set, it can convert it and present it to the user. File formats currently supported by Immuta include CSV and JSON; a brief illustration of this conversion follows the steps below.
- Click the Edit button in the Output Format section.
- Select CSV (default) or JSON, and then click Apply.
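As a rough illustration of the conversion described above, the sketch below renders a small result set (rows of named columns) as CSV, the default format. The data is made up, and this is not how Immuta implements the conversion internally.

```python
# Illustration only: a SQL-style result set serialized as CSV,
# roughly what a .csv file in Immuta S3 would contain.
import csv
import io

result_set = [
    {"order_id": 1, "country": "US", "total": 42.50},
    {"order_id": 2, "country": "DE", "total": 13.75},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=result_set[0].keys())
writer.writeheader()
writer.writerows(result_set)
print(buffer.getvalue())
```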
Option 3: Select Columns
This is an optional step that allows you to decide which columns to include in the data source.
- Click the Edit button in the Columns section.
- By default, all columns are selected. Deselect a column by clicking the checkbox in its top left corner.
- When necessary, convert column types by clicking the Type dropdown menu. Note that incorrectly converting a column type will break your data source at query time.
- Click Apply.
Further Considerations: Not Null Constraints
These constraints help Business Intelligence (BI) tools send better queries.
- For SQL query-backed data sources, select Not NULL in the Nullable dropdown menu for any columns that have this constraint in the source database.
- For table-backed data sources, this constraint is automatically detected.
Option 4: Create Event Time Column
An Event Time column denotes the time associated with records returned from this data source. For example, if your data source contains news articles, the time that the article was published would be an appropriate Event Time column.
- Click the Edit button in the Event Time section.
- Select the column(s).
- Click Apply.
Selecting an Event Time column enables:
- More statistics to be calculated for this data source, including the most recent record time, which is used to determine the freshness of the data source.
- The creation of time-based restrictions in the Policy Builder.
Option 5: Determine Latency
To adjust this optional setting,
- Click Edit in the Latency section.
- Complete the Set Time field, and then select MINUTES, HOURS, or DAYS from the subsequent dropdown menu.
- Click Apply.
This setting impacts the following behaviors:
- How long Immuta waits before refreshing cached data by querying the native data source. For example, if you only load data once a day in the native source, this setting should be greater than 24 hours. If data is constantly loaded in the native source, you need to decide how much data latency is tolerable versus how much load you want on your data source. This is only relevant to Immuta S3; SQL always interactively queries the native database.
- How often Immuta checks for new values in a column that is driving row-level redaction policies. For example, if you are redacting rows based on a country column in the data, and you add a new country, it will not be seen by the Immuta policy until this period expires.
- How long noisy results are cached for a Differential Privacy policy.
Option 6: Add Data Source Tags
Adding tags to your data source allows users to search for it by tag and allows Governors to apply Global policies to it.
To add tags,
- Click the Edit button in the Data Source Tags section.
- Begin typing in the Search by Tag Name box to select your tag, and then click Add.
Tags can also be added after you create your data source, from the Overview tab or the Data Dictionary tab of the Data Source details page.