Azure Data Engineer Associate Certification Study Notes
Azure Databricks
Cluster Modes
- Standard
    
- Terminate after 120 seconds
 - Python, R, Scala, SQL
 - For single users
 
 - High Concurrency
    
- Do not terminate automatically
 - Python, R, SQL
 - For multiple users
 - Fault isolation
 - Task preemption for fair usage for all users
 - Table Access Control - Restrict access to data to right people
 
 - Both clusters can have credential passthrough - AD
 - Cluster retention period : 30 days after termination. Pin cluster for > 30 days. Max 20 clusters can be pinned.
 - Cluster cloning: Permissions, installed libraries, attached notebooks not cloned
 - Pool : Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances.
 - Job : Way of running notebook immediately / by schedule
 
Access Control
** Only available on Premium Plan
- Cluster access control - Default: All users can create / modify clusters unless admin enables ‘Cluster access control’
 - Pool access control - Default: All users can create / modify pools unless admin enables ‘Pool access control’
 - Jobs access control - Default: All users can create / modify jobs unless admin enables ‘Jobs access control’
 - Table access control - Control access to data stored in cluster managed tables Default: All users have access to all data stored in a cluster’s managed tables unless an administrator enables table access control for that cluster.
 - Workspace access control - Control access to create / modify notebooks, folders. Default: All users can create / modify workspace objects unless admin enables ‘Workspace access control’
 - Token-based auth: For REST APIs
 - Conditional Access
 - Data Lake Storage Credential Passthrough - Use AD Identity used in Databricks for Data Lake
 
Security
- To access APIs, use access tokens
 
Azure Storage
Tiers
Cool - At least 30 days. Milliseconds to retrieve. 
Archive - At least 180 days. Hours to retrieve.
Rehydration options
- Standard Priority: 15 hours
 - High Priority: 1 hour
 
Blob storage not supported by ZRS or GZRS 
SAS does not support Azure Files 
Azure AD does not support Azure Files (REST), Azure Tables
Accessing Storage
- Account SAS: Access to multiple resources in 1 or more storage services
 - Service SAS: Access to specific storage account resource
 - User delegation SAS: Secured with Azure AD credentials.
 - Primary/Secondary Storage: Full access to storage acct, including ability to modify/delete resources on it.
 
*** If access key is regenerated, Account SAS, Service SAS and Shared Key will be impacted
Azure Data Lake
Access Control
- RBAC : Container Level
 - ACL : Folder Level
 - Shared Key : ‘Super-User’ access to all operations on all resources
 - SAS
 
Others
Recommended file size: >256 MB
Azure Managed Disk Types
Used with Azure VMs
- Ultra SSD
    
- Disk Size: 65,536 GiB
 - Throughput: 2,000 MiB/s
 - IOPS: 160,000
 
 - Premium SSD
    
- Disk Size: 32,767 GiB
 - Throughput: 900 MiB/s
 - IOPS: 20,000
 
 - Standard SSD
    
- Disk Size: 32,767 GiB
 - Throughput: 750 MiB/s
 - IOPS: 6,000
 
 - Standard HDD
    
- Disk Size: 32,767 GiB
 - Throughput: 500 MiB/s
 - IOPS: 2,000
 
 
SQL Server
Encryption Types
- Always Encrypted: Encrypt sensitive data in client apps without revealing encryption keys to DB engine, providing separation between data owners and data managers.
 - Column-Level encryption - Uses function on server. Data can be seen by DBA.
 
Backup Types
- Full Backup
 - Differential Backup: Only backup what is different from previous full backup
 - Partial Backup : Only Primary Filegroup backed up
 - Transaction Logs : Holds all transactions
 - Tail-end of transaction logs : Log records that have yet to be backed up
 
Azure SQL DB Managed Instance
Backup
No geo-replication for Azure SQL Database Managed Instance
Auditing
To audit Azure SQL Database Managed Instance:
1) Create SAS 2) T-SQL: Create Credentials where SECRET is SAS 3) T-SQL: Create Server Audit
Azure CosmosDB
SLAs
Write - Single / Read - Single: 99.99%, 99.99% 
Write - Single / Read - Multi: 99.99%, 99.999% 
Write - Multi / Read - Multi: 99.999%, 99.999%
Azure SQL Data Warehouse
Backup
- Automatic Restore Points : No configuration required. 7 days retention period. RPO: 8 hours
 - User-Defined Restore Points: Manual trigger snapshots. Max 42 user-defined restore points. 7 day retention period. Can use Azure Portal to create.
 - When data warehouse deleted, final snapshot created which will be retained for 7 days only
 - Geo-Backups happen once per day to paired data center. RPO: 24 hours.
 - On restore, firewall rules need to be re-created.
 - Recommended file size : 256MB to 2GB
 
Azure SQL
Service Tiers
Basic, Standard, Premium 
Storage Size: 2 GB, 1 TB, 4 TB 
Max DTUs: / eDTUs per DB 5, 3000, 4000 
Max Backup Retention: 7 days, 35 days, 35 days 
IO Latency: 5ms(read) & 10ms(write), 2 ms(read/write) \
Basic, Standard, General - Compute/Storage resources separated Premium, Business Critical - Compute/Storage integrated
Hyperscale - DB Size up to 100 TB. Use when DB size > 4 TB
** In-memory OLTP only supported by Premium tier 
** Premium tier integrates storage / compute resources & replicates together 
** Use contained users for AAD authentication to Azure SQL
A contained database user based on an Azure AD identity, is a database user that does not have a login in the master database, and which maps to an identity in the Azure AD directory that is associated with the database. Eg: To create contained user: CREATE USER [bob@contoso.com] FROM EXTERNAL PROVIDER; 
Before creating contained user, need to set admin to AD User through UI Portal. 
** Dynamic Scalability : Manually change resource limits, etc without downtime \
Auditing vs Diagnostics
Auditing: Retain an audit trail of selected events. You can define categories of database actions to be audited. Report on database activity. You can use pre-configured reports and a dashboard to get started quickly with activity and event reporting. Analyze reports. You can find suspicious events, unusual activity, and trends. \
Diagnostics: Performance monitoring. CPU percentage, usage etc
Backup
- Long-term retention not available on Azure SQL DB Managed Instance - Up to 10 years
 - Geo-Redundancy : Cheaper option for backup - if DB ok to be offline for a period of time
 - Geo-Replication: More expensive option
 - Automated Backups - Up to 35 days only
 - When under TDE, offline files are still encrypted
 - Point-in-time restore = Full Backup + Differential Backup + Transaction Log Backup
 - Full Backup: Weekly, Differential Backup: Every 12 Hours, Transaction Logs: Every 5-10 minutes
 
Azure Functions
Plans
Consumption - Scale Out automatically. Only pay when functions running. 
Premium - Require constant running of functions. Require more power. VNet Connectivity. 
Dedicated (App Service) Plan - When there are under-utilized VMs running in other App Service instances. Manual scaling.
Database - General
Optimization
Choosing Distribution column:
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
- Column where heavily used in JOINs
 - Not in WHERE clause
 - Not a DATE column
 - Has many unique values (Column should have at least 60 unique values)
 - Does not have NULLs, or only few NULLs
 
Choosing Partition key: 
https://docs.microsoft.com/en-us/azure/cosmos-db/partitioning-overview
- Choose a partition key that has a wide range of values and access patterns that are evenly spread across logical partitions.
 - Choose a partition key that spreads the workload evenly across all partitions and evenly over time.
 - Candidates for partition keys might include properties that appear frequently as a filter in your queries.
 
Use non-clustered indexes to improve analytics performance https://docs.microsoft.com/en-us/azure/sql-database/sql-database-in-memory
Migration
Azure Database Migration Service - Migrating with minimal disruptions. Source DB remains online.
Azure AD
- Privileged Identity Management - JIT privileged access to resources / AD
 - Identity Protection - Automate detection / remediation of identity-based risks, investigate risks, export risk data to 3rd party utilities.
 
- 
                        
                        Previous
Azure IOT Hubs vs Event Hubs - What are the differences? - 
                        
                        Next
Azure Streaming Analytics - Export / Import Streaming Analytics Jobs