Working on complex systems: What I learned working at Google

Hello! Today, let’s discuss a fascinating topic: complex systems.
Introduction
Throughout my career, I’ve worked in many complicated environments. For instance, I worked on optimizing driver-passenger matching in ride-hailing at one of Uber’s competitors. That context, like others, was technically challenging. Yet nothing comes close, in terms of complexity, to my current experience at Google: two years there have refined my perception of what complexity really means.
In this post, we will break down the very concept of complexity. We will take a step back to understand what makes certain environments complex rather than merely complicated, and then explore patterns for navigating complex systems effectively.
Complicated vs. Complex
Understanding the distinction between complicated and complex problems is crucial because each requires a fundamentally different approach:
Complicated problems are intricate but predictable. They follow structured, repeatable solutions. For example, filing taxes is complicated, but it’s a structured, conventional problem since the process remains mostly the same year after year.
Complex problems are intricate but unique. They require adaptive and often novel solutions. For example, climate change mitigation is a complex problem: it demands new, adaptive solutions, and existing methods alone can’t address its evolving challenges.
Back to software engineering: at the Uber competitor, one of the challenges was to efficiently find the nearest driver for a passenger. This was far from trivial, but it wasn't complex per se. Indeed, many solutions exist, such as applying geohashing, and implementing one of them was the right approach.
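To make this concrete, here is a minimal sketch of the geohashing idea (a toy encoder with illustrative coordinates; a production system would use a library and also probe neighboring cells): nearby locations share a hash prefix, so hashes can serve as coarse buckets for candidate lookups.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Encode (lat, lon) into a geohash string; nearby points share a prefix."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True
    while len(chars) < precision:
        # Alternate between refining longitude and latitude, one bit at a time.
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits = bits << 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits per base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Bucket drivers by coarse geohash cell, then look up candidates in the
# passenger's cell (a real system would also probe the 8 neighbor cells).
drivers = {"d1": (48.8566, 2.3522), "d2": (48.8584, 2.2945), "d3": (40.7128, -74.0060)}
buckets = defaultdict(list)
for driver_id, (lat, lon) in drivers.items():
    buckets[geohash(lat, lon, precision=4)].append(driver_id)

passenger = (48.8606, 2.3376)
print(buckets[geohash(*passenger, precision=4)])  # drivers sharing the coarse cell
```

The point is that a known, well-understood technique maps directly onto the problem: a hallmark of complicated, not complex, work.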
At Google, I work as a Site Reliability Engineer (SRE), focusing on the systems powering Google’s ML infrastructure. Here, I consider the challenges genuinely complex, as new paradigms and scheduling approaches are required, especially at Google’s scale.
Recognizing whether a system is complicated or complex is really important. As mentioned, complicated systems are by definition repeatable, while complex systems require unique, customized approaches. Applying an off-the-shelf solution to a complex problem is therefore unlikely to produce effective results.
Characteristics of Complex Systems
In this section, we will discuss five common characteristics that help identify complex systems. Not all complex systems share every characteristic, but they tend to exhibit at least some of the following.
Emergent Behavior
Emergent behavior arises when a system's overall behavior cannot be predicted solely by analyzing its individual components in isolation.
For example, Gemini producing unexpected results was an emergent behavior. While I can't disclose the root cause, this behavior was nearly impossible to foresee by analyzing all the different components separately.
This is one possible characteristic of complex systems: they behave in ways that can hardly be predicted just by looking at their parts, making them harder to debug and manage.
Delayed Consequences
Another possible characteristic of complex systems is delayed consequences: actions don't always have immediate effects, and consequences may only become apparent much later.
For example, deploying a new version of a system might introduce a subtle issue that only appears days or even weeks later. This delay complicates debugging since identifying the root cause becomes much harder compared to immediate impacts.
In complex systems, relying solely on immediate feedback can create a false sense of stability, leading to major surprises when an issue finally emerges. Keeping in mind that consequences may take time to surface is essential when working in such environments.
Local vs. Global Optimization
In complex systems, optimizing one part doesn’t necessarily improve the whole system, and in some cases, it can even make things worse.
Unlike in non-complex systems, where improving one part generally leads to positive gains, complex systems are much more difficult to reason about. The components interact in non-obvious ways, and local optimizations can create ripple effects that are difficult to predict, sometimes leading to negative outcomes at the system level.
This highlights a key trait of complex systems: the whole is more than the sum of its parts. As a result, local gains don’t always translate into global improvements, and in some cases, they can even degrade the overall system.
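As a toy illustration with made-up numbers: suppose stage A compresses its payload to improve its own network metric, but the downstream stage now pays a decompression cost.

```python
# A toy two-stage pipeline (hypothetical numbers) showing how a local
# win can become a global loss.

# Baseline: stage A sends an uncompressed payload, stage B processes it.
a_network_ms, b_cpu_ms = 40, 10
baseline_total = a_network_ms + b_cpu_ms  # 50 ms end to end

# "Optimization": stage A compresses the payload. Its own metric
# (network time) improves, but A pays to compress and B to decompress.
a_network_ms, a_compress_ms, b_cpu_ms = 15, 12, 35
optimized_total = a_network_ms + a_compress_ms + b_cpu_ms  # 62 ms

print(baseline_total, optimized_total)  # 50 -> 62: local win, global loss
```

Stage A's dashboard shows a clear improvement, while end-to-end latency quietly regresses.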
Hysteresis
Hysteresis describes how a system's past state continues to influence its behavior, even after the original cause is removed.
A real-world example to illustrate hysteresis is traffic congestion: even after a road accident is cleared, delays persist because vehicles remain clustered. Similarly, in distributed systems, failures can cause cascading slowdowns, even after the root issue is fixed. Indeed, dependent systems may take time to recover for various reasons, such as caches, retries, or queued requests.
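To see this dynamic, here is a toy discrete-time queue with made-up rates: capacity is halved during an incident, and even after it recovers at t=10, delay stays elevated until the backlog drains.

```python
# Toy discrete-time queue: an incident halves capacity for a while.
# Even after capacity recovers (the "root cause" is fixed), the
# backlog keeps latency elevated until it fully drains: hysteresis.

arrival_rate = 90        # requests arriving per tick
normal_capacity = 100    # requests the system can serve per tick
backlog = 0

for t in range(30):
    capacity = 50 if 5 <= t < 10 else normal_capacity  # incident window
    backlog = max(0, backlog + arrival_rate - capacity)
    # Rough latency proxy: how many ticks of queued work are waiting.
    queue_delay = backlog / capacity
    print(f"t={t:2d} capacity={capacity:3d} backlog={backlog:4d} delay={queue_delay:.2f}")
```

The incident lasts 5 ticks, yet the delay takes 20 ticks to return to normal: the system's past keeps shaping its present.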
In complex systems, simply fixing the root cause is not always enough. Therefore, it's crucial to assess whether a system is prone to hysteresis and, if so, anticipate its effects.
Nonlinearity
In complex systems, small changes can produce disproportionately large or unpredictable effects.
For example, in queueing theory, latency grows predictably with load at moderate utilization. However, as a queue approaches saturation, even a small increase in requests can cause response times to spike dramatically.
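For intuition, the classic M/M/1 queueing model (a textbook result, not anything specific to the systems I work on) makes this concrete: the expected time in the system is W = 1/(μ − λ), which blows up as utilization λ/μ approaches 1.

```python
# M/M/1 queue: expected time in system W = 1 / (mu - lambda).
# As utilization rho = lambda / mu approaches 1, latency explodes.

mu = 100.0  # service rate: requests/second the server can handle

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = rho * mu                     # arrival rate
    w_ms = 1.0 / (mu - lam) * 1000.0   # expected time in system, in ms
    print(f"utilization={rho:.2f} -> latency={w_ms:7.1f} ms")
# 0.50 -> 20 ms, 0.90 -> 100 ms, 0.99 -> 1000 ms: the last few percent
# of utilization cost far more than all the rest combined.
```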
Complex systems often reach tipping points where behaviors shift suddenly, making past trends unreliable for prediction. This nonlinearity means that traditional linear assumptions, where inputs map predictably to outputs, aren't always effective for designing, testing, and reasoning about complex systems.
Summary
To summarize this section, complex systems:
Are difficult to understand just by looking at their parts separately.
Don’t always show their effects right away; consequences can be delayed.
Don’t always improve as a whole when one part is optimized; local changes can sometimes make things worse.
Can keep being influenced by past states, even after the original cause is gone.
Can react to small changes with big or unexpected effects.
Note that scale alone doesn’t make a system complex: even small systems can exhibit complex behaviors like emergence or nonlinearity.
Patterns for Navigating Complex Systems
Given these characteristics, how can we operate effectively in complex environments? Below are some strategies that I personally found effective.
Reversibility
When dealing with complex systems, we should favor reversible decisions whenever possible, meaning changes that can be undone if they don't work out.
Amazon's one-way vs. two-way doors framework captures this idea quite well:
One-way doors represent irreversible decisions that require careful deliberation.
Two-way doors represent reversible decisions, allowing us to move fast and iterate with lower risk.
In many contexts, especially in complex systems, favoring two-way doors leads to better outcomes because we can experiment, learn, and refine rather than overengineering upfront.
That being said, not all decisions should be reversible. For example, some choices like security policies or compliance-related changes require upfront commitment. The key is knowing when to optimize for speed and iteration versus when to be deliberate and careful.
Think Beyond Immediate Metrics
Because complex systems don't always respond predictably to local optimizations, defining the right metrics for success is probably just as important as the changes we make. Indeed, focusing too much on isolated, local metrics can create a false sense of success while masking unintended negative consequences elsewhere in the system.
To avoid this, before making a change, we should define both local and global metrics to get a holistic view of system health. This ensures that we measure impact beyond the immediate area of focus and consider the system as a whole.
Well-chosen metrics shouldn't just confirm the success of a local change; instead, they should help us make better decisions and ensure meaningful improvements at the system level, not just isolated areas.
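As a sketch of what this can look like (hypothetical metric names and numbers), a change evaluation might pair the local metric it targets with global guardrails that must not regress:

```python
# Hypothetical evaluation: a change must improve its local target
# metric without regressing system-wide guardrail metrics.

change_impact = {
    "cache_hit_rate": +0.12,        # local metric the change targets (up is good)
    "e2e_p99_latency_ms": +8.0,     # global guardrail (up is a regression)
    "error_rate": +0.0002,          # global guardrail (up is a regression)
}

# Maximum regression tolerated on each global guardrail.
guardrails = {"e2e_p99_latency_ms": 5.0, "error_rate": 0.0001}

def ship_decision(impact: dict, limits: dict) -> bool:
    breaches = [m for m, limit in limits.items() if impact.get(m, 0.0) > limit]
    if breaches:
        print(f"hold: local win but guardrails breached: {breaches}")
        return False
    return True

ship_decision(change_impact, guardrails)  # the cache win hides global regressions
```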
Innovation
As discussed, complex systems often demand unique solutions. Since conventional strategies may not always apply, we must be willing to think outside the box and embrace innovation.
I recall one of my first meetings at Google. Someone presented a problem that seemed absurdly complex, especially given the scale. My immediate reaction was: "This is impossible". But then, a teammate said: "But we're Google, we should be able to manage it!".
That remark stuck with me. Obviously, not every company has Google's resources, but I think the mindset is what truly matters. When facing a complex problem, we should assume it's solvable, then break it down, experiment, and iterate until we find a path forward.
One may find this section cliché, but again, complex problems demand unconventional thinking. In many cases, being open to innovative solutions when facing a complex problem isn’t just helpful, it’s necessary.
Controlled Rollout
When deploying changes in complex systems, we should rely on proven best practices to minimize risk. These include:
Feature flags: Enable or disable functionality dynamically without deploying new code, allowing for safe experimentation and quicker rollbacks.
Canary release: A limited rollout to a small, controlled subset of production, well suited for environments with only a few production instances.
Progressive rollouts: Gradually increasing the scope of a rollout, best suited for large-scale production setups with multiple clusters or regions.
Shadow testing: Running a change in parallel with production traffic without impacting real users. This helps validate the correctness of a change before enabling it.
By leveraging these techniques, we reduce the blast radius of failures, improve confidence in our changes, and enable faster iteration.
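To illustrate the first two techniques above, here is a minimal percentage-based feature flag (hypothetical flag name; in practice the rollout table would live in a dynamic config service, not in code). Hashing the user ID gives each user a stable bucket, so the canary population stays consistent as the percentage is dialed up.

```python
import hashlib

# Minimal percentage-based feature flag: each user hashes to a stable
# bucket in [0, 100), so the same user consistently sees the same
# variant, and the rollout can be widened without a redeploy.
ROLLOUT_PERCENT = {"new_scheduler": 5}  # canary: 5% of users

def is_enabled(flag: str, user_id: str) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable per (flag, user) pair
    return bucket < ROLLOUT_PERCENT.get(flag, 0)

if is_enabled("new_scheduler", user_id="user-42"):
    ...  # new code path
else:
    ...  # existing code path
```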
Observability
Observability is one of the main pillars of operating complex systems. My working definition of observability (mainly inspired by Observability Engineering) is the following:
You can understand any state of your system (no matter how novel or bizarre) by slicing and dicing high-cardinality and high-dimensionality telemetry data without needing to ship new code.
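In practice, one common way to get there is to emit wide, structured events with many dimensions per request, so questions you didn't anticipate can still be answered later by slicing on any field. A minimal sketch with hypothetical field names:

```python
import json
import time
import uuid

def handle_request(user_id: str, model: str, region: str) -> None:
    start = time.monotonic()
    time.sleep(0.05)  # stand-in for the real work
    event = {
        # One wide event per request. High-cardinality fields such as
        # request_id and user_id are what make novel questions
        # answerable later without shipping new code.
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "model": model,
        "region": region,
        "build": "2024-06-01-rc2",  # hypothetical build label
        "duration_ms": (time.monotonic() - start) * 1000,
        "status": "ok",
    }
    print(json.dumps(event))  # in practice: ship to your telemetry pipeline

handle_request("user-42", model="model-a", region="us-central1")
```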
Without observability:
Systems become more fragile as unknown issues remain hidden until they cause real impact.
Debugging unexpected failures becomes significantly harder.
Innovation is slowed down due to a lack of efficient feedback loops.
In complex environments, where unknowns are inevitable, observability is essential. It enables teams to navigate uncertainty, experiment more safely, and maintain short feedback loops to continuously improve systems.
Without proper observability, changes remain opinions rather than informed decisions.
Simulation
Predicting the behavior of complex systems is rarely simple, and, sometimes, nearly impossible.
I recall a case where we spent considerable time designing a change, carefully backing every assumption with data. Yet, due to unaccounted-for factors such as lurking variables, the change was ultimately ineffective.
Sometimes, instead of relying solely on predictions, a more effective approach can be to simulate a change before rolling it out. There are multiple ways to leverage simulation testing, including:
Replaying past events: If we design a system to record all of its inputs, we can replay past events against the new version and analyze its impact. This allows us to validate changes in a controlled manner, reducing uncertainty and improving decision-making in complex systems.
Deterministic simulation testing: Instead of relying on real-world data, we can create controlled, repeatable simulations that model system behavior under specific conditions. This allows us to test how a system reacts under various conditions in a fully deterministic way.
Note that the ideas presented in this section also rely heavily on observability.
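To sketch the second idea: the core trick of deterministic simulation testing is to route every source of randomness (and, in real implementations, time and network behavior) through a single seeded RNG, so any run can be replayed exactly from its seed. A toy version:

```python
import random

def simulate(seed: int, steps: int = 1000) -> int:
    """Toy deterministic simulation: all randomness flows from one seeded
    RNG, so any run can be reproduced exactly from its seed."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(steps):
        # Inject faults (dropped messages, delays, ...) from the RNG
        # instead of from the real network or clock.
        if rng.random() < 0.01:  # 1% injected fault per step
            failures += 1
    return failures

# Explore many seeds; when one exposes a bug, that exact run can be
# replayed and debugged deterministically.
for seed in range(5):
    print(seed, simulate(seed))

assert simulate(1234) == simulate(1234)  # same seed, same run
```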
Machine Learning
In complex environments, rule-based approaches often reach their limits because anticipating every scenario is impractical. In these contexts, ML can become particularly effective.
Indeed, unlike static heuristics, ML models can continuously adapt based on feedback loops and learn from real-world data rather than relying on rigid, predefined logic.
This allows systems to:
Detect emerging patterns that weren't explicitly programmed.
Adapt dynamically to changes without requiring constant human intervention.
Make probabilistic decisions rather than relying on strict if-else conditions.
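As a toy illustration of these points, an unsupervised model can learn what normal telemetry looks like instead of relying on hand-tuned if-else thresholds. This sketch uses scikit-learn's IsolationForest on synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic "normal" telemetry: (latency_ms, error_rate) pairs.
normal = np.column_stack([
    rng.normal(100, 10, 1000),      # latency centered around 100 ms
    rng.normal(0.01, 0.002, 1000),  # error rate centered around 1%
])

# Learn the shape of normal behavior instead of hardcoding thresholds.
model = IsolationForest(contamination=0.01, random_state=42).fit(normal)

observations = np.array([
    [105, 0.011],  # typical
    [180, 0.002],  # high latency despite low errors
    [102, 0.060],  # normal latency but elevated errors
])
print(model.predict(observations))  # 1 = normal, -1 = anomaly
```

A static if-else rule on latency alone would miss the third case; the model flags it because the combination is unusual.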
Strong Team Collaboration
Last but not least, I believe that in complex environments, more than anywhere else, strong team collaboration is an absolute necessity. For instance, clearly conveying why a change is complex, discussing available options, and debating trade-offs with teammates are all critical skills.
In complex systems, there's often no single right answer. Therefore, a team that collaborates effectively and navigates ambiguity together can make a huge difference, ultimately leading to stronger decision-making.
Final Thoughts
Again, complicated problems can be solved with repeatable solutions, whereas complex systems require adaptability and a different way of thinking. This is why recognizing whether a system is complicated or complex is so important: it shapes how we should approach problem-solving.
However, in many environments, systems are neither purely complicated nor purely complex. Some parts can follow structured, predictable solutions, while others require adaptability and novel approaches. The key is learning to recognize when adaptability is needed and when structured solutions are enough.
💬 I hope this post has helped you recognize the characteristics of complex environments and provided you with practical patterns to navigate them effectively. Did any of these patterns resonate with you? What other strategies have you used in complex environments? Let me know in the comments.
❤️ If you made it this far and enjoyed the post, please consider giving it a like.
Feeling overwhelmed by the endless stream of tech content? At The Coder Cafe, we serve timeless concepts with your coffee. Written by a Google SWE and published author, we help you grow as an engineer, one coffee at a time.

📣 This post is part of a series written in collaboration with Antithesis, an autonomous testing platform. They are not sponsoring this post—I reached out to them because I was genuinely intrigued by what they were building and ended up falling in love with their solution. We will dive deeper into their solutions in a future post. In the meantime, feel free to check out their website or their great blog.
📚 Resources
Sources
Jeff Bezos explains one-way door decisions and two-way door decisions.