Leonard Bertelli from FPT Discusses the Shift from Reactive Monitoring to Proactive Insights

Once regarded as a niche area of engineering, observability has evolved into an essential function for businesses managing complex, distributed, and AI-driven systems. However, numerous challenges remain, including siloed telemetry, outdated dashboards, and cultural inertia that perpetuates a reactive monitoring mindset.
In a dialogue with Information Security Media Group, Leonard Bertelli, Senior Vice President of Enterprise and AI Solutions at FPT Americas, elaborates on the evolving nature of observability, the unique challenges arising from AI workloads, and the necessity of aligning both culture and technology for organizational progress.
Bertelli brings over two decades of experience in IT leadership and enterprise architecture, showcasing a robust background in modernization, cloud adoption, and scalable technology solutions for Fortune 500 businesses.
Edited excerpts follow:
Historically, what have been the key challenges in achieving true observability in enterprise databases, and how have these shaped current architectures?
The journey toward effective enterprise observability has encountered significant hurdles. One major challenge is the isolation of signals, where logs, metrics, and traces reside in disparate systems, hindering engineers from identifying root causes of issues. Google’s experience prior to 2010 highlights this problem, as their development of the Dapper tracing system was essential for debugging across distributed systems.
High cardinality poses another significant barrier: when a dimension—whether a database column or a metric label such as a user ID—contains a very large number of unique values, every unique combination must be tracked separately, which complicates observability. This was particularly evident in early iterations of tools like Prometheus, which struggled with the sheer number of label combinations. Companies such as Honeycomb emerged specifically to address high-cardinality observability data.
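To make the cardinality problem concrete, here is a minimal sketch of why per-user labels multiply into enormous time-series counts; the label names and counts are invented for illustration, not taken from any specific deployment:

```python
# Illustrative only: each unique label combination becomes its own time series,
# so cardinalities multiply. Label names and counts are assumptions.
from math import prod

label_cardinality = {
    "endpoint": 50,      # distinct API routes
    "status_code": 10,   # distinct HTTP statuses
    "user_id": 100_000,  # per-user label: the usual culprit
}

# Worst-case series count if every combination is observed.
series_count = prod(label_cardinality.values())
print(series_count)  # 50 * 10 * 100_000 = 50,000,000 series
```

Dropping the per-user label alone would cut the worst case to 500 series, which is why high-cardinality dimensions are usually moved into traces or events rather than metrics.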
Additionally, static dashboards can act as impediments when they fail to update correctly. Netflix identified this issue during its Chaos Monkey experiments, which revealed that conventional dashboards did not adequately capture emergent failures within distributed frameworks.
Observability has frequently been reactive rather than proactive—do you observe a cultural or technological gap that has hindered enterprises from progressing beyond mere monitoring?
The divide between monitoring and genuine observability stems from both cultural and technological dimensions. Enterprises often find themselves entrenched in reactive practices because legacy tools are ill-equipped for modern systems, and organizational cultures have been slow to adapt toward a proactive, collective approach to reliability.
Many organizations continue to regard observability as an add-on, implementing dashboards and alerts post-deployment. This reactive mindset trains engineers to respond to outages rather than to design systems that can explain their own behavior. In siloed environments, operations teams own monitoring while developers ship features, a disconnect that impedes the shift to proactive observability, in which insights are built in throughout the development lifecycle.
Today’s complex systems—encompassing microservices, cloud-native architectures, and particularly AI workloads—produce a deluge of high-cardinality, high-dimensional telemetry data. Traditional monitoring methods often fail to identify “unknown unknowns,” focusing instead on threshold-based alerts rather than contextual insights. Without real-time correlations of logs, traces, and metrics, organizations remain locked in a reactive cycle.
Mature organizations are evolving by embedding observability into CI/CD processes and embracing platforms that prioritize correlation, causality, and explainability, moving away from reliance on static monitoring. Nevertheless, lasting change requires cultural alignment.
What specific observability blind spots do AI systems introduce that conventional tools overlook?
One significant blind spot is model drift, where changes in data invalidate the underlying assumptions. Microsoft’s Tay chatbot, for example, demonstrates this principle, as infrastructure monitoring showed operational uptime; only semantic observability of outputs could have signaled the model’s drift toward problematic behaviors.
Unseen technical debt or complexities within code can further undermine observability. Machine learning systems often experience silent failures, with retraining processes and feedback loops creating fragile dependencies that standard monitoring tools might miss.
Opacity in predictions presents another challenge, particularly when systems produce outcomes that users cannot readily comprehend. A loan approval model might function but still yield biased results, which conventional monitoring would fail to detect. The compromised recruitment algorithm at Amazon stands as an example of a system functioning well in terms of infrastructure but being semantically faulty due to bias in its training data.
As AI workloads generate exponentially more telemetry data, when does “observing” transition from being an enabler to a computational burden?
Several indicators often signal this inflection point. A collapsed signal-to-noise ratio arises when teams indiscriminately capture data without a strategic framework, rendering observability pipelines ineffective due to an excess of redundant information. Furthermore, the exponential scaling of telemetry data can lead to a significant rise in storage and computational costs, compelling teams to spend more on maintaining observability tools than on managing AI workloads themselves. Additionally, cognitive overload occurs when vast amounts of information overwhelm teams, hampering rather than enhancing response capabilities.
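One common countermeasure at this inflection point is deliberate sampling rather than indiscriminate capture. The sketch below is a generic pattern, not FPT's approach; the rates and field names are assumptions. It keeps every error but deterministically samples routine traces by trace ID, so all spans of one trace share the same keep/drop decision:

```python
# Hypothetical sketch: cost-aware head sampling for telemetry volume control.
import hashlib

def keep_span(trace_id: str, sample_rate: float = 0.1, is_error: bool = False) -> bool:
    """Always keep errors; hash the trace ID so sampling is deterministic
    and every span of a given trace gets the same decision."""
    if is_error:
        return True
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < sample_rate * 10_000

print(keep_span("trace-123", is_error=True))  # True: errors are never dropped
```

Hashing the trace ID, rather than rolling a random number per span, avoids the broken partial traces that make sampled telemetry hard to reason about.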
Paradoxically, AI is also being leveraged to enhance observability. How does AI-driven anomaly detection contrast with traditional pattern recognition, particularly in predictive capabilities?
Traditional pattern recognition identifies issues based on preconceived parameters. In contrast, AI-enhanced detection dynamically adapts to evolving systems, facilitating real-time correlation and predicting failures ahead of their occurrence. This shift transforms observability from a reactive methodology into a forward-thinking ability.
Standard monitoring relies heavily on fixed thresholds and established deviations, which may successfully address known issues but falter when systems behave unpredictably. Tools like Nagios or Zabbix exemplify this reactive approach. Conversely, AI and machine learning models engage with dynamic, high-dimensional telemetry—encompassing logs, traces, and unstructured signals—allowing for baseline evolution based on workload variations and user behavior. By correlating data across multiple layers, AI can uncover anomalies that basic rules might miss.
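The contrast can be sketched in a few lines: a static, Nagios-style threshold misses a latency spike that a rolling-baseline rule flags. The data and limits here are invented for demonstration:

```python
# Illustrative comparison: fixed threshold vs. adaptive rolling baseline.
# Latency values and the static limit are synthetic.
import statistics

def fixed_threshold_alert(value, limit=500.0):
    """Static rule: fires only above a preset limit."""
    return value > limit

def rolling_zscore_alert(history, value, z=3.0):
    """Adaptive rule: fires when a value deviates from the recent baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > z * stdev

latency_ms = [100, 102, 98, 101, 99, 103, 100, 97]
spike = 180.0  # well under the static limit, but far from the baseline

print(fixed_threshold_alert(spike))             # False: static rule misses it
print(rolling_zscore_alert(latency_ms, spike))  # True: baseline rule catches it
```

Production anomaly detectors are far more sophisticated—seasonal baselines, multivariate correlation—but the core difference is the same: the baseline is learned from the system's own recent behavior rather than fixed in advance.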
AI’s ability to anticipate failures hinges on recognizing early indicators, such as latency drift and growing memory pressure. This shifts observability from a firefighting paradigm to one centered on prevention.
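As a simple instance of turning an early indicator into a prediction, a trend fit over recent memory samples can estimate when a limit would be crossed. This is a deliberately minimal sketch with synthetic numbers, standing in for the richer forecasting models real platforms use:

```python
# Hedged sketch: projecting an early indicator forward. A least-squares
# slope over hourly memory samples estimates time until the limit is hit.
# All numbers are synthetic.
def hours_until_limit(samples_mb, limit_mb=8192.0):
    """Return estimated hours of headroom, or None if memory is not growing."""
    n = len(samples_mb)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples_mb) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples_mb)) \
        / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return None
    return (limit_mb - samples_mb[-1]) / slope

usage = [4000, 4100, 4210, 4290, 4400, 4500]  # steady ~100 MB/hour growth
print(round(hours_until_limit(usage)))  # 37: about a day and a half of headroom
```

An alert raised at this point—days before exhaustion—is what distinguishes prevention from firefighting.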
How can organizations address the paradox where poor observability data misguides the AI systems intended to enhance observability?
Acknowledging observability data as a primary asset rather than a secondary concern is crucial. Ensuring data hygiene is essential since inaccuracies can lead to misguided analyses, erroneous conclusions, and detrimental business choices.
Signal prioritization must be approached judiciously; an unconsidered ranking of alerts, metrics, or logs could mislead AI-driven observability systems. AI models often learn from historically prioritized metrics, which may cause them to emphasize traditional signals, like CPU usage, at the expense of new, critical patterns such as memory leaks. This scenario exemplifies how bias might escalate, leading to neglect of evolving failure modes.
Implementing feedback loops for AI is vital, where models are retrained with human input to identify false positives and root cause findings, enabling the system to discern legitimate issues. Validating data from multiple sources is equally important; reliance on singular data streams can introduce blind spots. Correlating information across different logs, traces, and metrics mitigates the risk of being misled by incomplete or corrupted data.
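The multi-source validation idea can be sketched simply: an anomaly is escalated only when two independent telemetry pipelines corroborate it. The service names and pipeline outputs below are hypothetical:

```python
# Illustrative sketch: require corroboration across telemetry sources before
# promoting an anomaly to an incident. Service names are invented.
metric_anomalies = {"checkout-svc", "auth-svc"}       # flagged by metrics pipeline
error_log_services = {"checkout-svc", "billing-svc"}  # flagged by log pipeline

# Escalate only anomalies confirmed by a second, independent signal;
# single-source hits go to a review queue instead of paging anyone.
confirmed = metric_anomalies & error_log_services
needs_review = metric_anomalies ^ error_log_services

print(sorted(confirmed))     # ['checkout-svc']
print(sorted(needs_review))  # ['auth-svc', 'billing-svc']
```

Even this trivial intersection rule illustrates the governance point: if one team's telemetry pipeline is empty or corrupted, corroboration fails in a visible way rather than silently biasing the AI's conclusions.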
The resolution of this paradox lies in treating observability data as a mission-critical dataset. Ensuring quality, minimizing noise, and validating data across diverse sources while maintaining human oversight is essential. If one team’s telemetry dominates while another’s remains scant, the AI could systematically misidentify priority incidents. Governance is increasingly significant in observability; similar to data governance in analytics, “observability governance”—defining critical metrics, ensuring consistency, and monitoring data drift—is essential in current operational landscapes.