UJWAL KANDI - Resume

Professional Summary

Site Reliability & Observability Engineer with 3+ years of experience and an M.S. in Information Technology & Management from UT Austin. Excel at transforming operational challenges into scalable solutions that drive business continuity and cost optimization across production systems. Architect monitoring and incident response frameworks that reduce MTTR by 40%+, prevent 60%+ of incidents before impact, and maintain 99.9% SLA compliance. Combine infrastructure automation with AI/ML for intelligent edge diagnostics, delivering measurable ROI through reliability engineering. Proficient in AWS and Azure across 16+ production environments, with demonstrated impact in disaster recovery, distributed systems optimization, and operational efficiency improvements of 35%+.

Experience

Apple Inc

Metadata Operations Engineer (Contract via Welo Data)

Jan 2025 – Present

Austin, TX

Optimized mission-critical metadata ingestion pipelines ensuring 99.9%+ uptime and <2s latency for Apple TV's global content catalog using automated validation, error handling, and rollback procedures across XML workflows
Established monitoring and alerting for metadata pipeline health to detect and triage data quality issues, reducing feed failures by 30% and improving platform stability
Implemented structured debugging and anomaly detection frameworks to identify root causes of batch job failures, reducing mean time to resolution (MTTR) by 25% through systematic RCA and runbook development
Designed disaster recovery procedures for critical metadata systems, ensuring continuity and compliance with Apple’s SLA requirements

Sports Excitement

Data Operations Engineer

Aug 2024 – Jan 2025

Austin, TX

Built and maintained production ETL pipelines using Apache Airflow (DAG orchestration, monitoring, and failure recovery), supporting 50+ daily batch jobs with 99.9% reliability
Optimized Azure Data Lake Storage architecture, reducing query latency by 35% and improving downstream ML analytics throughput
Automated incident notifications via Airflow trigger jobs and alerting workflows, achieving <5min response times within SLA
Implemented data quality checks and load-testing frameworks, reducing post-deployment failures by 30%

Dover Fueling Solutions

Site Reliability Engineer

Jun 2024 – Aug 2024

Austin, TX

Enhanced operational observability by implementing a three-pillar monitoring strategy: configured Grafana dashboards, Azure APIM log analytics integration, and Kafka metrics exporters; reduced anomaly detection time from 15 min to 3 min and MTTR by 40%
Designed and deployed an agentic AI system for edge device diagnostics on resource-constrained hardware (SQLite + Ollama), enabling field technicians to resolve issues autonomously, reducing support tickets by 35% and improving MTTR for critical faults to <10 min
Used LangChain to prototype multi-agent configurations and LangSmith to evaluate agent performance, response accuracy, and token efficiency in edge inference scenarios while maintaining diagnostic accuracy under a strict 2 GB memory constraint
Implemented alert tuning and intelligent escalation workflows to reduce alert fatigue by 45% while maintaining 99.9%+ critical incident detection rate; documented best practices in runbooks for on-call rotations

Stronghold Investment Management

AI Engineer - Capstone Project

Jan 2024 – May 2024

Austin, TX

Designed and implemented a custom AI agent on Azure AI Studio, integrating OCR and LLM (Mixtral) models to enhance ownership verification processes and automate complex document workflows
Reduced document processing costs by 20% by implementing an automated document intelligence system
Conducted model response evaluations across multiple Azure-hosted models using custom test datasets and task-specific tuning to benchmark performance (accuracy, coherence, latency) across query types
Structured prompt design evaluations within Azure AI Studio to refine templates and optimize response quality with prompt tuning for document analysis workflows

Epsilon

Production Support Engineer

Apr 2022 – Jul 2023

Bengaluru, India

Orchestrated production incident response lifecycle (detection → triage → resolution → RCA → prevention) across 16+ client systems, achieving 99.9% SLA compliance and reducing MTTR by 40% through proactive monitoring and automated remediation
Designed and maintained comprehensive monitoring infrastructure: integrated CloudWatch, Kibana, and Dynatrace; established intelligent baselines, thresholds, and anomaly detection algorithms that prevented 60% of potential incidents before customer impact
Authored 20+ runbooks, SOPs, and postmortem documents, fostering a blameless RCA culture; reduced repeat incidents by 50% through systematic root cause prevention and knowledge transfer
Implemented an Infrastructure as Code disaster recovery framework: automated failover scripts for ETL pipelines using CloudFormation, reducing recovery time objective (RTO) from 2 hr to 30 min and maintaining 99.9% uptime during critical incidents
Optimized Matillion ETL job orchestration (SLA schedules, retry logic, maintenance windows), improving batch completion efficiency by 35% and reducing operational overhead by 40%
Resolved critical data load failures within Epsilon’s data mart by optimizing complex batch processes, improving system performance by 89% and ensuring consistent data availability across 99.9% of daily loads