Healthcare Revenue Cycle Management (RCM) Data Engineering Project on Azure
This document outlines an end-to-end data engineering project for Healthcare Revenue Cycle Management (RCM) using the Azure Data Engineering stack. It covers the architecture, data sources, pipelines, transformations, and best practices for building a robust data pipeline to support RCM reporting and analytics.
Overview of RCM
Revenue Cycle Management (RCM) is the financial process used by hospitals to manage the administrative and clinical functions associated with patient care, from appointment scheduling to final payment collection. The goal is to ensure the hospital remains financially healthy while providing quality care.
Simplified RCM Process
- Patient Visit:
  - Collect patient and insurance details to determine payment responsibility (e.g., insurance covers $5,000 of a $20,000 bill).
- Service Delivery:
  - Medical services are provided to the patient.
- Billing:
  - The hospital generates a bill based on the services provided.
- Claims Review:
  - Insurance companies review claims, which may be fully paid, partially paid, or declined.
- Payments and Follow-ups:
  - Follow up with insurers or patients for outstanding payments.
- Tracking and Improvement:
  - Monitor Key Performance Indicators (KPIs) to optimize financial processes.
Key Aspects of RCM
- Accounts Receivable (AR):
  - Focuses on collecting payments from patients and insurers.
  - Objectives:
    - Maximize cash collection.
    - Minimize the collection period (e.g., AR is typically 93% collectible at 30 days, 85% at 60 days, and 73% at 90 days).
  - KPIs (see the sketch after this list):
    - AR > 90 Days: Percentage of AR older than 90 days (e.g., $200K overdue out of $1M total AR = 20%).
    - Days in AR: Average number of days to collect AR (e.g., at $10K of charges per day, $400K of AR = 40 days).
- Accounts Payable:
  - Manages payments owed by the hospital to vendors or suppliers.
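
Both KPIs reduce to simple arithmetic. A minimal Python sketch using the illustrative figures above:

```python
# Illustrative AR KPI calculations using the example figures above.

# AR > 90 Days: share of receivables older than 90 days.
total_ar = 1_000_000            # total accounts receivable ($)
ar_over_90 = 200_000            # portion older than 90 days ($)
pct_over_90 = ar_over_90 / total_ar * 100        # -> 20.0

# Days in AR: how many days of revenue are tied up in receivables.
avg_daily_revenue = 10_000      # average charges per day ($)
outstanding_ar = 400_000        # current AR balance ($)
days_in_ar = outstanding_ar / avg_daily_revenue  # -> 40.0

print(f"AR > 90 days: {pct_over_90:.0f}% | Days in AR: {days_in_ar:.0f}")
```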
Patient Payment Risks
Patients are more likely to bear payment responsibility in scenarios like:
- Low insurance coverage.
- Private clinics or dental treatments.
- High deductibles.
Data Engineering Objectives
The data engineering pipeline aims to:
- Ingest data from multiple sources (EMR, Claims, APIs).
- Transform and store data in a structured format (fact and dimension tables).
- Enable reporting teams to generate KPIs for AR optimization.
Data Sources
The pipeline processes data from the following sources:
- Electronic Medical Records (EMR):
  - Source: Azure SQL Database.
  - Tables:
    - Patients
    - Providers
    - Departments
    - Transactions
    - Encounters
  - Databases: `trendytech-hospital-a`, `trendytech-hospital-b` (e.g., Manipal Hospital, Columbia Asia).
- Claims Data:
  - Source: Flat files (CSV) delivered monthly to the ADLS Gen2 landing zone.
  - Represents insurance company data.
- National Provider Identifier (NPI) Data:
  - Source: Public API.
  - Provides unique identifiers for healthcare providers.
- ICD and CPT Codes:
  - ICD Codes: Standardized diagnosis codes (via API).
  - CPT Codes: Procedure codes (flat files in the landing zone).
  - Difference: ICD codes describe diagnoses, while CPT codes describe procedures.
Solution Architecture
The pipeline follows the Medallion Architecture with the following layers:
- Landing: Raw data ingestion (flat files, APIs).
- Bronze: Source-of-truth data stored in Parquet format.
- Silver: Cleaned, enriched, and conformed data (Common Data Model, SCD Type 2, Delta tables).
- Gold: Aggregated data in fact and dimension tables for reporting (Delta tables).
Data Flow
```mermaid
graph TD
    A[Landing: Flat Files, APIs] --> B[Bronze: Parquet Files]
    B --> C[Silver: Delta Tables, CDM, SCD2]
    C --> D[Gold: Fact & Dimension Tables]
    D --> E[Business Users: Reporting]
    C --> F[Data Scientists, Analysts: ML & Analytics]
    B --> G[Data Engineers: Source of Truth]
```
Azure Services Used
- Azure Data Factory (ADF): Orchestrates data ingestion and pipeline execution.
- Azure Databricks: Handles data processing and transformations.
- Azure SQL Database: Stores EMR data.
- Azure Data Lake Storage (ADLS) Gen2: Stores raw, bronze, silver, and gold data.
- Azure Key Vault: Secures credentials and secrets.
- Azure Databricks Unity Catalog: Manages metadata (replacing Hive metastore).
Storage Structure
ADLS Gen2 account: `ttadlsdev`

- Containers/Folders:
  - `landing/`: Raw flat files (e.g., Claims, CPT).
  - `bronze/`: Parquet files (source of truth).
  - `silver/`: Delta tables (cleaned, enriched, SCD2).
  - `gold/`: Delta tables (fact and dimension tables).
  - `configs/`: Metadata-driven pipeline configuration (e.g., `emr/load_config.csv`).
Configuration File (`configs/emr/load_config.csv`)
```csv
database,datasource,tablename,loadtype,watermark,is_active,targetpath
trendytech-hospital-a,hos-a,dbo.encounters,Incremental,ModifiedDate,0,hosa
trendytech-hospital-a,hos-a,dbo.patients,Incremental,ModifiedDate,0,hosa
trendytech-hospital-a,hos-a,dbo.transactions,Incremental,ModifiedDate,0,hosa
trendytech-hospital-a,hos-a,dbo.providers,Full,,0,hosa
trendytech-hospital-a,hos-a,dbo.departments,Full,,0,hosa
trendytech-hospital-b,hos-b,dbo.encounters,Incremental,ModifiedDate,0,hosb
trendytech-hospital-b,hos-b,dbo.patients,Incremental,Updated_Date,1,hosb
trendytech-hospital-b,hos-b,dbo.transactions,Incremental,ModifiedDate,0,hosb
trendytech-hospital-b,hos-b,dbo.providers,Full,,0,hosb
trendytech-hospital-b,hos-b,dbo.departments,Full,,0,hosb
```
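
For intuition, here is a minimal Python sketch of how each config row drives the load. In the project, ADF's Lookup and ForEach activities implement this logic natively; the snippet is illustrative only and assumes `1` means active:

```python
import csv

# Illustrative only: ADF's Lookup + ForEach implement this natively.
with open("load_config.csv") as f:
    for row in csv.DictReader(f):
        if row["is_active"] != "1":
            continue  # skip tables flagged inactive (assumes 1 = active)
        if row["loadtype"] == "Full":
            query = f"SELECT * FROM {row['tablename']}"
        else:  # Incremental: filter on the watermark column
            query = (f"SELECT * FROM {row['tablename']} "
                     f"WHERE {row['watermark']} >= '<last_fetched_date>'")
        print(row["targetpath"], query)
```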
Pipeline Design (Part 1)
Linked Services
- Azure SQL DB: Connects to EMR databases.
- ADLS Gen2: Accesses landing, bronze, silver, and gold folders.
- Delta Lake: Connects to Databricks Delta tables.
- Key Vault: Stores credentials securely.
Datasets
- Azure SQL DB:
  - Parameters: Database name, schema name, table name.
- Delimited Text:
  - Path: `configs/emr/load_config.csv`.
- Parquet:
  - Bronze layer output files.
- Databricks Delta Lake:
  - Schema and table names for the audit and silver/gold layers.
Audit Table
Stores metadata about data loads in a Delta table.
```sql
CREATE SCHEMA IF NOT EXISTS audit;

CREATE TABLE IF NOT EXISTS audit.load_logs (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    data_source STRING,
    tablename STRING,
    numberofrowscopied INT,
    watermarkcolumnname STRING,
    loaddate TIMESTAMP
);
```
ADF Pipeline Activities
- Lookup Activity:
  - Reads `configs/emr/load_config.csv` to iterate over entities (e.g., encounters, patients).
- ForEach Activity:
  - Iterates over each config entry (10 tables).
  - Checks whether the target Parquet file exists at `bronze/<targetpath>/<tablename>.parquet`.
  - If it exists, moves it to `bronze/<targetpath>/archive/<year>/<month>/<day>/<tablename>.parquet`.
- Copy Activity:
  - Full Load:
    - Query: `SELECT * FROM <tablename>`.
    - Writes to `bronze/<targetpath>/<tablename>.parquet`.
    - Logs to the audit table: `INSERT INTO audit.load_logs(data_source, tablename, numberofrowscopied, watermarkcolumnname, loaddate) VALUES ('<datasource>', '<tablename>', <rowscopied>, '<watermark>', utcNow())`
  - Incremental Load (a notebook-style sketch of the watermark logic follows this list):
    - Checks `audit.load_logs` for the last load date: `SELECT COALESCE(CAST(MAX(loaddate) AS DATE), '1900-01-01') AS last_fetched_date FROM audit.load_logs WHERE data_source = '<datasource>' AND tablename = '<tablename>'`
    - Queries the source with the watermark: `SELECT *, '<datasource>' AS datasource FROM <tablename> WHERE <watermark> >= '<last_fetched_date>'`
    - Writes to `bronze/<targetpath>/<tablename>.parquet`.
    - Logs to the audit table (same as the full load).
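
For intuition, here is a hypothetical Databricks-notebook equivalent of the incremental path. In the project, ADF's Lookup, Copy, and audit-logging activities perform these steps; the server, path, and secret names below are illustrative:

```python
# Hypothetical notebook equivalent of the incremental Copy activity.
datasource, tablename, watermark = "hos-a", "dbo.encounters", "ModifiedDate"

# 1. Fetch the last load date from the audit table (defaults to 1900-01-01).
last = spark.sql(f"""
    SELECT COALESCE(CAST(MAX(loaddate) AS DATE), '1900-01-01') AS last_fetched_date
    FROM audit.load_logs
    WHERE data_source = '{datasource}' AND tablename = '{tablename}'
""").first()["last_fetched_date"]

# 2. Pull only rows modified since the watermark.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net;"
                     "databaseName=trendytech-hospital-a")
      .option("query", f"SELECT *, '{datasource}' AS datasource "
                       f"FROM {tablename} WHERE {watermark} >= '{last}'")
      .option("user", dbutils.secrets.get("tt-hc-kv", "sql-user"))          # hypothetical secret name
      .option("password", dbutils.secrets.get("tt-hc-kv", "sql-password"))  # hypothetical secret name
      .load())

# 3. Land in bronze and record the load in the audit table.
df.write.mode("overwrite").parquet(
    "abfss://bronze@ttadlsdev.dfs.core.windows.net/hosa/encounters.parquet")
spark.sql(f"""
    INSERT INTO audit.load_logs (data_source, tablename, numberofrowscopied,
                                 watermarkcolumnname, loaddate)
    VALUES ('{datasource}', '{tablename}', {df.count()}, '{watermark}', current_timestamp())
""")
```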
Pipeline Execution
- Sequential vs. Parallel:
  - Initially sequential for stability.
  - Later parallelized for performance (Part 2).
Pipeline Design (Part 2)
Bronze Layer Enhancements
- Claims and CPT Data:
  - Ingested from `landing/` (CSV) to `bronze/` (Parquet).
- NPI and ICD Codes:
  - Fetched via API and written directly to `bronze/` (Parquet); see the sketch after this list.
- Note:
  - NPI codes in the dataset may not start with '1' (a data discrepancy).
  - API data may produce nulls during joins due to the limited number of records.
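
A minimal sketch of the API-to-bronze step for NPI data (the endpoint parameters, selected fields, and output path are assumptions; the actual notebook may differ):

```python
import requests

# Query the public NPI Registry API (parameters are illustrative).
resp = requests.get(
    "https://npiregistry.cms.hhs.gov/api/",
    params={"version": "2.1", "state": "CA", "limit": 20},
    timeout=30)
resp.raise_for_status()
results = resp.json().get("results", [])

# Keep a few fields and land them in bronze as Parquet.
rows = [(str(r.get("number")), r.get("basic", {}).get("organization_name"))
        for r in results]
df = spark.createDataFrame(rows, schema="npi string, org_name string")
df.write.mode("overwrite").parquet(
    "abfss://bronze@ttadlsdev.dfs.core.windows.net/npi_extract/npi.parquet")
```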
Silver Layer
- Transformations:
  - Cleaning: Remove invalid or incomplete records (e.g., via an `is_quarantined` flag).
  - Common Data Model (CDM): Standardize the schema across sources.
  - SCD Type 2 (SCD2): Track historical changes with `inserted_date`, `modified_date`, and `is_current` columns (see the merge sketch after this list). Example for Patients:

    ```
    -- Existing record
    patient_key: 101, name: Sumit, address: Hyderabad (Himayat Nagar)
    inserted_date: 2022-11-11, modified_date: 2022-11-11, is_current: true

    -- Update arrives
    patient_key: 101, name: Sumit, address: Hyderabad (Tarnaka)
    inserted_date: 2024-01-01, modified_date: 2024-01-01, is_current: true

    -- Previous record is closed out
    modified_date: 2024-01-01, is_current: false
    ```

- Tables:
  - SCD2: Patients, Transactions, Encounters, Claims, CPT, NPI, ICD.
  - Full Load: Providers, Departments.
- Output: Delta tables in `silver/`.
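
A minimal PySpark sketch of this SCD2 pattern on Delta, assuming a `silver.patients` table and an incoming bronze snapshot, with `address` as the only tracked column (the project's notebooks may implement this differently):

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.patients")
incoming = spark.read.parquet(
    "abfss://bronze@ttadlsdev.dfs.core.windows.net/hosa/patients.parquet")

# Step 1: close out current versions whose tracked attributes changed.
(target.alias("t")
    .merge(incoming.alias("s"),
           "t.patient_key = s.patient_key AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.address <> s.address",          # any tracked column differs
        set={"is_current": "false",
             "modified_date": "current_timestamp()"})
    .execute())

# Step 2: append new versions -- brand-new keys plus the rows just closed out
# (their keys no longer match a current row after step 1).
current = spark.table("silver.patients").where("is_current = true")
new_rows = (incoming.join(current, "patient_key", "left_anti")
    .withColumn("inserted_date", F.current_timestamp())
    .withColumn("modified_date", F.current_timestamp())
    .withColumn("is_current", F.lit(True)))
new_rows.write.format("delta").mode("append").saveAsTable("silver.patients")
```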
Gold Layer
- Filters:
  - `is_current = true`
  - `is_quarantined = false`
- Structure:
  - Fact Tables: Numeric metrics (e.g., transaction amounts, claim amounts).
  - Dimension Tables: Supporting entities (e.g., Patients, Providers, Departments).
- Output: Delta tables in `gold/` for reporting (see the sketch below).
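
A minimal sketch of a gold-layer build applying those filters (the fact table and its columns are illustrative):

```python
# Illustrative gold build: only current, non-quarantined silver rows
# feed the reporting tables. Column names are hypothetical.
spark.sql("""
    CREATE OR REPLACE TABLE gold.fact_transactions AS
    SELECT patient_key, provider_key, amount, paid_amount, service_date
    FROM silver.transactions
    WHERE is_current = true AND is_quarantined = false
""")
```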
Best Practices and Enhancements
- Azure Key Vault:
  - Created: `tt-health-care-kv`.
  - Secrets stored for Azure SQL DB, ADLS Gen2, and Databricks.
  - Access policy configured via an Azure App Registration for ADF and Databricks.
  - Databricks secret scope created at `https://adb-2871789655161196.16.azuredatabricks.net#secrets/createScope`.
  - Example: `dbutils.secrets.get('tt-hc-kv', 'tt-adls-access-key-dev')` (see the fuller snippet below).
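
With the scope in place, a notebook can authenticate to ADLS Gen2 without hard-coding keys; a minimal sketch using the account-key pattern (other auth methods, such as a service principal, work similarly):

```python
# Pull the storage key from the Key Vault-backed scope and register it with
# Spark so abfss:// paths on the ttadlsdev account resolve.
spark.conf.set(
    "fs.azure.account.key.ttadlsdev.dfs.core.windows.net",
    dbutils.secrets.get("tt-hc-kv", "tt-adls-access-key-dev"))

df = spark.read.parquet(
    "abfss://bronze@ttadlsdev.dfs.core.windows.net/hosa/patients.parquet")
```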
- Improved Naming Conventions:
  - Standardized folder paths: `bronze/<targetpath>/<tablename>.parquet`.
  - Archive structure: `bronze/<targetpath>/archive/<year>/<month>/<day>/<tablename>.parquet`.
- Parallel Pipeline Execution:
  - Updated the ForEach activity to run in parallel for faster processing.
- Active/Inactive Flag:
  - `is_active` in `load_config.csv` enables or disables processing of a table.
- Unity Catalog:
  - Replaced the Hive metastore with Unity Catalog.
  - Structure: `catalog_name.schema_name.table_name`.
- Retries:
  - Added retry logic to ADF activities for robustness (e.g., against API failures).
Data Flow Diagram
```mermaid
graph TD
    A[EMR: Azure SQL DB] -->|ADF| B[Bronze: Parquet]
    C[Claims: CSV] -->|ADF| B
    D[CPT: CSV] -->|ADF| B
    E[NPI/ICD: API] -->|ADF| B
    B -->|Databricks| F[Silver: Delta Tables, CDM, SCD2]
    F -->|Databricks| G[Gold: Fact & Dimension Tables]
    H[Key Vault] -->|Secrets| A
    H -->|Secrets| B
    H -->|Secrets| F
    I[Unity Catalog] -->|Metadata| F
    I -->|Metadata| G
```
Notes and Considerations
- Data Quality:
  - All data is generated with the Faker module, which may introduce discrepancies.
  - Joins (e.g., on NPI or ICD) may produce nulls due to limited API data.
- SCD2 Implementation:
  - Ensures historical tracking for slowly changing dimensions (e.g., patient address changes).
- Gold Layer:
  - Optimized for reporting KPIs such as AR > 90 Days and Days in AR.
- Future Improvements:
  - Enhance error handling and logging.
  - Implement data validation checks before promoting data to the silver/gold layers.
GitHub Integration
- Store pipeline configurations, scripts, and notebooks in a GitHub repository.
- Credentials:
  - Username and personal access token stored in Key Vault.
- Use GitHub for version control and collaboration.
This architecture provides a scalable, secure, and efficient pipeline for RCM data processing, enabling hospitals to optimize their financial operations through actionable insights.