Building scalable and resilient AI-driven cloud infrastructure is a challenge that requires more than just technical expertise: it demands strategic foresight, automation, and a deep understanding of failure mitigation. In this interview, Aditya Bhatia, Principal Software Engineer at Splunk (Cisco), shares insights from his journey at Yahoo, Apple, and Splunk, covering lessons in Kubernetes automation, AI-driven cloud transitions, and leadership in high-pressure environments. He also discusses the evolving role of engineers in an AI-powered future and how enterprises can build infrastructure that withstands inevitable system failures.
From Yahoo to Apple to Splunk (Cisco), your career has been a journey through some of the most innovative tech companies. What key lessons have you learned about building scalable and resilient AI and cloud infrastructure at an enterprise level?
Over the years, working at Yahoo, Apple, and now Splunk (Cisco), I’ve learned that building scalable and resilient AI and cloud infrastructure is as much an art as it is a science. At Yahoo, where I first started working on cloud services and CI/CD automation, I quickly realized that scalability isn’t just about throwing more servers at a problem; trust me, that just leads to a bigger, more expensive problem. Instead, I learned the importance of automation and standardization, which not only make systems more efficient but also keep engineers from spending their weekends firefighting.
At Apple, working on distributed ML frameworks for Siri TTS, I got my first real taste of how unpredictable AI workloads can be. One moment, everything is running smoothly; the next, a job crashes, and you’re suddenly debugging logs at 2 AM. That experience taught me the value of fault-tolerant design and proactive failure handling. Things like checkpointing, speculative execution, and autoscaling aren’t just nice-to-haves; they’re what keep large-scale AI systems from becoming expensive science experiments.
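To make the checkpointing idea concrete, here is a minimal sketch of crash-tolerant checkpointing in plain Python. The file layout, the stand-in training step, and the checkpoint interval are illustrative assumptions, not the frameworks used at Apple:

```python
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step, state, path=CKPT):
    """Write atomically: a crash mid-write must not corrupt the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

def load_checkpoint(path=CKPT):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps=100, ckpt_every=10):
    """Toy training loop: after a crash, it resumes from step N instead of step 0."""
    step, state = load_checkpoint()
    while step < total_steps:
        state = {"loss": 1.0 / (step + 1)}  # stand-in for a real training step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step, state
```

The `os.replace` call is the detail that matters: writing to a temp file and renaming keeps the previous good checkpoint intact even if the process dies mid-write.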
Now at Splunk, where we turn data into doing and observability is a core part of the DNA, I’ve come to appreciate that you can’t fix what you can’t measure. It doesn’t matter how well you design an AI or cloud system; if you don’t have real-time monitoring, logs, and metrics, you’re flying blind. I’ve also had to embrace the fact that security isn’t only for security teams, especially as I worked on automating FedRAMP IL2 compliance (because nothing says fun like compliance automation on an already-built product, right?). The biggest lesson here? Security and scalability should be baked into the architecture from the start, not duct-taped on later.
And of course, if there’s one overarching trend I’ve seen across all these experiences, it’s the shift toward cloud-native architectures. Whether it’s Kubernetes, serverless, or AI-driven automation, the industry is moving toward flexible, scalable infrastructure that can handle the unpredictable nature of modern workloads. At Splunk, I lead distributed workflow orchestration on Kubernetes, ensuring that our systems can gracefully handle the chaos that comes with scale.
At the end of the day, scalability and resilience aren’t just about technology; they’re about strategy, culture, and designing for failure before failure happens. If I’ve learned anything, it’s that the best way to build truly scalable AI and cloud systems is to embrace automation, assume things will break, and always, always have good observability, because nothing humbles you faster than an outage in production.
With the rise of Kubernetes-based infrastructure, how do you see the balance between automation and human oversight evolving? What are some critical challenges companies still face in fully leveraging cloud-native architectures?
Kubernetes-based infrastructure is revolutionizing how we scale infrastructure, but let’s be honest: automation is wonderful… until it isn’t. I’ve used automation to eliminate countless manual hours spent on the same repetitive tasks, streamline deployments, and build more efficient systems overall. But building such systems also involves collecting enough relevant metrics and data from the underlying systems so that if things go haywire, there’s a human in the loop.
Companies still face some critical challenges when trying to fully leverage cloud-native architectures. First, observability and debugging at scale are still hard. Kubernetes gives you flexibility, but when something goes wrong in a multi-cluster deployment, good luck sifting through logs spread across multiple microservices, GPUs, and networking layers. Without strong observability in place, you’re basically playing detective in the dark.
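A common first step toward that kind of observability is emitting structured logs with a shared correlation ID, so events scattered across microservices can be stitched back together in a log backend. A minimal sketch; the service names and fields here are invented for illustration:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("svc")

def log_event(service, message, trace_id, **fields):
    """Emit one JSON log line; the shared trace_id lets a backend join events across services."""
    record = {"service": service, "trace_id": trace_id, "message": message, **fields}
    log.info(json.dumps(record))
    return record

def handle_request():
    """Mint a trace_id once at the edge and propagate it through every hop."""
    trace_id = str(uuid.uuid4())
    log_event("gateway", "request received", trace_id, path="/predict")
    log_event("scheduler", "job queued", trace_id, gpu_pool="a100")
    log_event("worker", "inference done", trace_id, latency_ms=42)
    return trace_id
```

Grepping (or querying) for one `trace_id` then reconstructs the full path of a single request across gateway, scheduler, and worker.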
Even with great observability, cost remains a major challenge. Just because Kubernetes lets you auto-scale workloads doesn’t mean you should! I’ve seen companies burn through cloud budgets at an alarming rate, only to realize later that half their compute power was idling away doing nothing. At Splunk, I worked on an initiative to run our cloud workloads on more efficient compute resources in AWS, saving the company $3M annually. Automation needs to be paired with intelligent cost management and governance; otherwise, we end up with a very expensive science project instead of a scalable platform.
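As a hedged illustration of what cost-aware automation can look like, here is a toy idle-node detector. The threshold, sample counts, and node names are assumptions; a real system would pull these utilization samples from a metrics backend rather than a dictionary:

```python
def idle_candidates(utilization, threshold=0.10, min_samples=3):
    """Return nodes whose average utilization is below threshold.

    utilization: {node_name: [samples in 0.0-1.0]}.
    Nodes with too few samples are skipped: don't act on thin data.
    """
    flagged = []
    for node, samples in utilization.items():
        if len(samples) >= min_samples and sum(samples) / len(samples) < threshold:
            flagged.append(node)
    return sorted(flagged)

# Illustrative input: one mostly idle node, one busy node, one with too little data.
usage = {
    "gpu-node-1": [0.02, 0.05, 0.01],  # mostly idle: scale-down candidate
    "gpu-node-2": [0.80, 0.75, 0.90],  # busy: keep
    "gpu-node-3": [0.04],              # too few samples to decide
}
```

In practice the flagged list would feed a downscaling or right-sizing workflow, ideally with a human sign-off before anything is terminated.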
Security is another major hurdle. Kubernetes expands the attack surface, and many companies are still struggling with proper RBAC policies, secret management, and network security in highly dynamic environments. The flexibility Kubernetes provides can be a double-edged sword if security and compliance aren’t baked in from day one. At Splunk, working on automated FedRAMP IL2 compliance, I learned that security can’t be an afterthought; it has to be built into the automation framework itself.
In the end, automation should handle the known, while humans handle the unexpected. The best cloud-native infrastructure strikes the right balance: automating what should be automated while keeping humans in the loop for strategic decision-making, security, and optimization. Companies that get this balance right will truly unlock the full potential of cloud-native architectures, while those that don’t will either struggle with inefficiency or, worse, learn the hard way when automation fails in production.
AI and automation are fundamentally reshaping enterprise operations. What do you think are the most overlooked aspects when enterprises transition to AI-driven cloud infrastructure?
AI and automation are reshaping enterprise operations at an incredible pace, but let’s be real: most enterprises assume flipping the AI switch magically solves everything. In reality, the transition to AI-driven cloud infrastructure is full of hidden pitfalls, and the most overlooked aspects usually come down to data readiness, cost efficiency, and trust in AI-driven decision-making.
First, garbage in, garbage out still holds true. Many companies rush to deploy AI models without ensuring their data pipelines are clean, structured, and actually useful. AI isn’t a magic wand; if the data is biased, inconsistent, or lacks proper governance, no amount of fancy ML algorithms will fix it. I’ve seen enterprises pour millions into AI projects, only to realize their biggest bottleneck was the lack of a scalable data ingestion and processing strategy.
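A simple way to enforce that data readiness is a validation gate in front of the model, rejecting records before they poison training or inference. This sketch assumes a flat record schema with made-up field names; real pipelines would also check types, ranges, and freshness:

```python
def validate_records(records, required_fields=("user_id", "timestamp", "value")):
    """Split a batch into clean and rejected records before it reaches the model.

    Rejected records carry a reason, so data-quality issues are observable
    rather than silently dropped.
    """
    clean, rejected = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.get(f) is None]
        if missing:
            rejected.append({"record": rec, "reason": f"missing {missing}"})
        else:
            clean.append(rec)
    return clean, rejected

# Illustrative batch: one good record, one with a missing timestamp.
batch = [
    {"user_id": 1, "timestamp": 1700000000, "value": 3.2},
    {"user_id": 2, "timestamp": None, "value": 1.1},
]
```

Tracking the rejection rate over time is often the cheapest early-warning signal that an upstream producer has changed its schema.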
Second, cost efficiency in AI-driven cloud infrastructure is still a wild west. Kubernetes and cloud providers make it easy to spin up large-scale AI workloads, but without proper guardrails, those GPU clusters start burning cash faster than a high-frequency trading bot on caffeine. At Splunk, I worked on an initiative to optimize cloud resource utilization, saving the company $3M annually by right-sizing workloads and automating compute selection. Enterprises often underestimate the cost of inefficiencies, assuming AI automation will “optimize itself”; without cost-aware automation, companies end up with an expensive science project instead of a sustainable AI platform.
Finally, trust and reliability in AI-driven decision-making is, I think, the most critical and most difficult problem to solve. AI automation is not only about running the scripts created by AI; it is also about ensuring the right changes are applied when no human is in the loop. Many companies assume that AI will make the right decisions based on general observations, but those decisions may not work for use cases that are more specific and differ for each company and team. The best AI deployments should be reliable, interpretable, and should come with guardrails to ensure that automation enhances stability rather than introducing new risks.
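One pattern for those guardrails is an approval gate that lets low-risk remediations run automatically while routing risky or unknown actions to a human. A minimal sketch; the action names and risk tiers are illustrative assumptions, not a real policy:

```python
# Illustrative risk tiers; in practice these would come from a reviewed policy file.
AUTO_APPROVE = {"restart_pod", "clear_cache"}
NEEDS_HUMAN = {"delete_volume", "scale_to_zero", "rotate_credentials"}

def gate_action(action, approved_by=None):
    """Decide whether an AI-proposed remediation may run.

    Low-risk actions execute automatically; risky ones wait for a named
    human approver; anything unknown is rejected outright.
    """
    if action in AUTO_APPROVE:
        return "execute"
    if action in NEEDS_HUMAN:
        return "execute" if approved_by else "pending_review"
    return "rejected"  # never run actions the policy has not classified
```

The key design choice is the default: an action the policy has never seen is rejected, not executed, so automation fails closed rather than open.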
Ultimately, enterprises that blindly jump into AI-driven cloud infrastructure without addressing data quality, cost governance, and AI reliability are setting themselves up for a rude awakening. The companies that succeed will be the ones that balance automation with intelligent human oversight, build scalable data strategies, and ensure AI-driven decisions are both explainable and trustworthy.
Given your experience mentoring and judging hackathons, what qualities or innovations in AI and cloud projects tend to stand out the most to you? What advice would you give to early-career engineers aiming to break into this space?
The best hackathon projects aren’t the ones that just look impressive for a two-day demo; they’re the ones that have the potential to become real products. What stands out to me the most in AI and cloud projects is when teams focus on solving a real problem with innovation and simplicity rather than just chasing the latest tech trends. The most successful projects use AI and cloud technologies as tools, not just buzzwords, to create solutions that are efficient, scalable, and easy to use.
Innovation in hackathons isn’t about complexity; it’s about finding the simplest, most elegant way to solve a hard problem. I’ve seen projects that leverage AI for automation in cloud workflows, build lightweight AI inference systems on edge devices, or rethink how Kubernetes manages ML models, all by keeping the solution focused, clean, and easy to scale. The teams that win and go beyond the hackathon stage are the ones that don’t over-engineer but instead focus on what truly adds value.
For early-career engineers, my biggest advice is to focus on fundamentals and solving real problems, not just following trends. Instead of starting with the latest buzzword technology, start with the problem itself, then pick the best technology to solve it efficiently. The best engineers don’t force AI, blockchain, or any trending tech into their projects just for the sake of it; they treat technology as a tool, not the end goal. True innovation comes from understanding the problem deeply and using the simplest, most effective solution to solve it at scale.
Leadership in technology is more than just technical expertise; it’s also about vision and execution. What has been your approach to leading engineering teams effectively, particularly in high-pressure, mission-critical environments?
That’s right, leadership in technology is significantly more than just technical expertise. It’s all about balancing agility with resilient deliverables. As a Principal Engineer leading a team of seven engineers, my focus is on setting the right culture and technical standards, which allows us to move faster without breaking things along the way.
First, clarity is everything. High-pressure situations demand precise execution, and that starts with well-defined priorities. My team follows Agile methodologies, ensuring we have tight feedback loops through daily stand-ups, sprint planning, and retrospectives. For critical changes, my team and I always begin with a one-pager or ERD. This sets a clear design direction from the start, and making the right design decisions early prevents costly rework later. During an incident, uncertainty causes anxiety, so everyone must understand the intent behind the team’s decisions, why they matter, and how they fit into the broader system.
Second, I believe in building a robust engineering ecosystem that supports efficiency at scale. That means designing systems with multi-stage testing environments covering unit, integration, acceptance, performance, UAT, and even chaos testing. We don’t just ship code; we battle-test it. The goal? Find as many failures as possible before they find us in production. It’s all about removing ambiguity, automating what we can, and ensuring our CI/CD pipelines always deliver well-tested changes quickly, so that engineers spend more time solving problems and less time debugging deployment issues.
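That multi-stage pipeline can be sketched as a simple gate that only ships a change when every stage passes in order. The stage names mirror the ones mentioned above; the checks themselves are stand-ins for real test suites:

```python
# Pipeline stages run in this order; a failure stops the rollout immediately.
STAGES = ["unit", "integration", "acceptance", "performance", "uat", "chaos"]

def run_pipeline(checks):
    """Run stages in order; a change ships only if every stage passes.

    checks: {stage_name: zero-arg callable returning True/False}.
    A missing check counts as a failure: unconfigured stages must not ship.
    """
    for stage in STAGES:
        check = checks.get(stage)
        if check is None or not check():
            return {"shipped": False, "failed_at": stage}
    return {"shipped": True, "failed_at": None}
```

Treating a missing stage as a failure is deliberate: the pipeline fails closed, the same principle as the guardrails discussed earlier.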
Third, execution isn’t just about tools; it’s about engineering culture. Code reviews aren’t just checkboxes; they’re knowledge-sharing sessions. I encourage everyone on my team to review every other member’s code. Engineers aren’t just writing code; they’re designing solutions that will live and evolve beyond them. I foster a collaborative, high-trust environment where engineers feel ownership over their work but also know they have support when things go sideways.
And lastly, leadership in high-stakes environments is about staying composed under pressure. Things will break from time to time, and that’s okay too! My learning from such experiences has been that every incident is an opportunity to learn, strengthen our systems, and put enough safeguards in place so that we don’t make the same mistakes again. The end goal is continuous improvement, tending toward perfection.
The intersection of AI, cloud, and automation is rapidly redefining the future of work. What shifts do you foresee in the roles and skills required for engineers in the next 5 to 10 years?
The next 5 to 10 years will see a fundamental shift in engineering roles and required skills as AI, cloud, and automation continue to reshape the landscape. While access to information and AI-powered development tools is making coding easier, the core skills of critical thinking, problem-solving, and system design will remain invaluable. The role of an engineer will evolve far beyond just writing code; it will encompass market research, product strategy, and full-stack development, all augmented by AI.
I think traditional software engineers will evolve into “product builders,” blending engineering, design, and business thinking. AI-generated code will handle routine programming tasks, allowing engineers to focus on architecture, usability, and market fit. Future software engineers won’t just be coding; they’ll be building entire product experiences, optimizing workflows, and integrating AI-driven decision-making into every aspect of the software lifecycle.
Yes, code generation, testing, and infrastructure management will be highly automated. Engineers will spend less time debugging syntax errors and more time orchestrating AI-driven systems. This will blur the lines between engineering, design, and business strategy. Engineers will need to understand user behavior, market trends, and the product lifecycle to build solutions that are not only technically sound but also commercially viable.
Also, with AI generating and optimizing code, testing and security will require a new approach. Automation will play a key role: engineers will need to design automated testing suites that validate AI-generated outputs, ensuring robustness, security, and compliance.
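One way to make that concrete is to treat AI-generated code like untrusted input and run it against an invariant suite before it ships. A toy sketch, where the “generated” implementation is just a stand-in and the invariants are the ones any correct sort must satisfy:

```python
def generated_sort(xs):
    """Stand-in for an AI-generated implementation under review."""
    return sorted(xs)

def validate_sort_impl(impl, cases):
    """Check invariants of a sort regardless of who (or what) wrote it.

    Returns the list of inputs that violated an invariant; empty means pass.
    """
    failures = []
    for xs in cases:
        out = impl(list(xs))
        ok = (
            len(out) == len(xs)                           # no dropped/added elements
            and all(a <= b for a, b in zip(out, out[1:])) # non-decreasing order
            and sorted(xs) == sorted(out)                 # same multiset of values
        )
        if not ok:
            failures.append(xs)
    return failures
```

Because the suite checks properties rather than exact outputs, the same harness can vet many regenerated variants of the implementation without being rewritten each time.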
In the end, computer science is all about solving complex problems with computers, and that’s not going away even with AI. Critical thinking and problem-solving skills, which are core to the field, will still remain in demand. Engineers who can break down complex problems and design elegant solutions will be in the highest demand.
In your blog and conference contributions, you emphasize digital resilience. How can enterprises build a more resilient AI-driven infrastructure in a world increasingly vulnerable to system failures?
In the industry, as AI workloads scale rapidly, system failures are inevitable, and digital resilience becomes a key metric that will make or break businesses. Enterprises investing in AI-driven infrastructure must ensure that their systems are fault-tolerant, scalable, and capable of recovering from failures gracefully. I’ve explored this topic extensively in my research paper, Fault-Tolerant Distributed ML Frameworks for GPU Clusters: A Comprehensive Review, as well as in my Medium blog and on my website, where I discuss key strategies for making AI infrastructure more resilient to failures.
AI models aren’t just computationally expensive; they can break easily. A single GPU failure can cause hours of training time to be lost if there are no proper checkpointing mechanisms in place. In my research paper, I discuss the role of distributed training strategies extensively, covering how AI systems can recover from node failures, memory leaks, and hardware crashes without restarting from scratch.
In my Medium blog, I outline how Kubernetes-based AI workloads face new challenges in multi-cluster, multi-cloud deployments. Applications built on deep learning models such as LLMs need high compute, resilient data pipelines, and reliable networks, but all of these dependency requirements also increase the points of failure. To address these risks, it’s essential to focus on observability, tracing, and alerting to detect such failures and resolve them with automation. For example, implementing chaos testing of AI models, which deliberately introduces failures in staging environments, ensures that infrastructure is resilient before it reaches production.
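A minimal flavor of that chaos testing: deliberately inject failures into a dependency in a controlled, seeded run and assert that the retry logic still delivers an acceptable success rate. The failure rate, retry count, and dependency are illustrative assumptions:

```python
import random

def flaky_call(fail_rate, rng):
    """Stand-in for a network dependency that chaos testing makes unreliable."""
    if rng.random() < fail_rate:
        raise ConnectionError("injected failure")
    return "ok"

def call_with_retries(fail_rate, retries=5, rng=None):
    """The resilience property under test: bounded retries around a flaky dependency."""
    rng = rng or random.Random()
    for _ in range(retries):
        try:
            return flaky_call(fail_rate, rng)
        except ConnectionError:
            continue
    raise RuntimeError("dependency unavailable after retries")

def chaos_test(runs=100):
    """Inject a 50% per-call failure rate and count end-to-end successes.

    Seeding the RNG makes the chaos run reproducible, so a regression in the
    retry logic shows up as a deterministic drop in the success count.
    """
    rng = random.Random(42)
    ok = 0
    for _ in range(runs):
        try:
            if call_with_retries(0.5, retries=5, rng=rng) == "ok":
                ok += 1
        except RuntimeError:
            pass
    return ok
```

With 5 retries against a 50% failure rate, each call fails end-to-end only about 3% of the time, so the staging assertion can demand a high overall success count while still exercising the failure paths constantly.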
Companies that prioritize AI resilience will be the ones that can scale efficiently, reduce downtime, and build AI systems that succeed.