We have become increasingly aware that we are living in an interconnected and interdependent world. This awareness is usually traced back to the study of “complex systems”, a field whose vocabulary includes terms such as complex adaptive systems, emergence, chaos, non-equilibrium systems, network effects, and so on.
However, while there is a degree of comfort to be taken from acknowledging that many of the systems we are part of, or create, are so-called “complex systems,” this does not in and of itself help us to manage the complexity. So the question is: are there principles and practices that can help one to mitigate and manage complexity?
More specifically, it can be clearly seen that the development of software systems, within the context of a rapidly changing technological landscape and shifting market forces, constitutes a complex system. However, are there approaches that work with the complexity, rather than against it?
In soft terms there are some insights that do apply to working with complex systems, or in complex environments. To quote some points from the VUCA overview:
“In environments where uncertainty is pervasive, our traditional risk-based analysis of the future breaks down. The only way to respond to this is to perform multiple simulations and experiments that will allow us to explore how things will really play out on the ground and to maintain a diverse and complementary system that is capable of responding to a number of different possible environmental conditions.”
“We may not be able to intervene or directly control the outcome to events, but we can manage the initial conditions, the tools, protocols and connections.”
“Instead of trying to describe and understand the event by describing its properties, systems thinking reasons backward. By first having an overview to the environment, we can understand a system through its connections to other systems.”
But how can we see these notions applying more directly to building software?
I capture some of these ideas in what could be termed “second order management.” The idea is that, in order to elicit the emergence of a particular outcome, one does not attempt to directly manage the implementation work (or, as tends to happen, micro-manage the implementers). Rather, one puts in place a number of mechanisms that constrain the context in which the software is being developed. That is, one introduces tools, protocols and connections that link the outputs to the environment in which the software exists, while supporting multiple concurrent experiments.
More concretely we can consider the following mechanisms and see how they might help to influence the outcome:
- control surfaces: formalise the notion that at any one point in time our software tends to be the union of a number of disjoint code paths that cover the different experiments. As such, we explicitly track the points in our code base where we may branch to select one behaviour over another in order to experiment with a different implementation. Making this explicit frees developers to evolve the system as a sequence of many, potentially concurrent, experiments, while still ensuring that the team has an explicit handle on the number of experiments it is controlling at any one point in time, and visibility into which code paths should be deprecated and removed.
- unit tests: formalise the notion that our code does not always behave as expected when we simply think and read through the implementation. Rather, we need to explicitly test subsets of our code base. Making this explicit gives developers a clear place to encode their understanding of steady-state cases, edge cases and other use cases, and then to demonstrate in a repeatable manner that their implementation does in fact do what it needs to do.
- probers: formalise the notion that things in production do not always behave the same way as they do in trivialised and synthetic development environments. Instead we explicitly build up checks and validations that continually operate against the production system and test the sanity, correctness and performance of the system. Again, this gives the developer a place to encode demonstrations of the behaviour of the live system, and additionally it becomes a natural place from which to extract performance characteristics from the live system for later analysis.
- deployment descriptors: formalise the notion that things in production are constructed out of a number of separate pieces. These pieces are often encoded as binary artefacts that then need to be pushed into production and allocated resources and knowledge of some subset of the other components of the system. This gives the developer a concrete place in which to encode the granularity of the pieces that can be meaningfully changed in isolation, while also making it possible to document the dependencies that these subsystems have on each other or on infrastructural elements. This then also provides a summarised view of the structure of the complete system.
- telemetry: formalise the notion that the performance characteristics of systems often cannot be calculated in closed form or forecast with sufficient accuracy, and that it is instead necessary to take an empirical approach to the behaviour of the system. The developer can then identify quantifiable metrics that can be captured from the production environment for later analysis. This analysis may be automated to detect deviations from predefined bounds that indicate the need for the system to alert an operator to intervene manually. Alternatively, the analysis may be done out of band by the development team in order to verify their own predictions, or to formulate new hypotheses for future experiments.
- process validation: formalise the notion that complex software systems are built out of many subsystems where the behaviour of the whole depends on the correct causal data flow between the parts. To this end, these data flows can be modelled outside of the core production code paths, and then the production system can capture complete traces of processes. These process traces can then be verified after the fact against the process models. Again, this gives the development team an explicit place to capture orchestration complexity that spans many subsystems, while having this “place” be distinct and separate from the implementation of the subsystems themselves.
- operational closure: formalise the notion that all system elements have a complete life-cycle, from instantiation, through operation, to reclamation. Any element that can be created must also have a reciprocal mechanism by which it can be shut down so that its resources can be reclaimed. The developer should feel comfortable in the knowledge that they need to code components in a manner that closes the loop. This can be verified when unit tests set up and tear down fixtures, or when probers allocate and deallocate isolated contexts.
- work sequencing: formalise the notion that any given implementation effort is ultimately the result of a number of asynchronous yet interdependent undertakings by different members of the team, where tasks may come to light at arbitrary points in the development effort (i.e. the development effort itself constitutes a complex system). Coordination of this, as with other high-concurrency systems with dependencies, is better managed as an event-driven system than by attempting to preordain a fixed plan. This enables the development team to enumerate well-defined goals, while still ensuring that the implementation path is flexible yet sufficiently communicated for the purposes of coordination.
- capability models: formalise the notion that not all systems should be operated in an open and trusting manner; rather, certain subsystems should be restricted from accessing other subsystems. For user-facing systems, some subset of the capabilities would then ultimately be coupled to user credentials. This gives the developer a means by which to control and restrict access to subsystems even if the full authentication and authorisation model (from an external perspective) has not been finalised.
- governors: formalise the notion that systems may need to be continually monitored and managed in order to remain healthy. The governor needs to respond to change requests that alter the requirements of the system (i.e. updates to deployment descriptors and the components and SLAs therein). The governor also needs to automate the process of driving corrective measures (e.g. triggering restarts or escalating to outer systems and operators). In order to maintain a view of the behaviour of the system, the governor may leverage other second-order management subsystems (e.g. pull from telemetry, monitor probers, monitor deployment descriptors, manipulate control surfaces, etc.). That is, the governor provides an explicit place for pulling together the various aspects of system oversight.
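To make the first of these mechanisms a little more tangible, here is a minimal sketch of what a control surface registry might look like. It is only one possible reading of the idea: the names `ControlSurface`, `ControlSurfaceRegistry`, `live_experiments` and `removal_candidates`, as well as the discount example, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ControlSurface:
    """One explicit branch point: a named experiment with selectable variants.

    Hypothetical structure, for illustration only.
    """
    name: str
    variants: Dict[str, Callable[[int], int]]
    active: str
    deprecated: List[str] = field(default_factory=list)

    def run(self, value: int) -> int:
        # Dispatch to whichever variant is currently selected.
        return self.variants[self.active](value)


class ControlSurfaceRegistry:
    """Tracks every branch point so the team can see which experiments are live."""

    def __init__(self) -> None:
        self._surfaces: Dict[str, ControlSurface] = {}

    def register(self, surface: ControlSurface) -> None:
        if surface.active not in surface.variants:
            raise ValueError(f"unknown variant {surface.active!r}")
        self._surfaces[surface.name] = surface

    def select(self, name: str, variant: str) -> None:
        surface = self._surfaces[name]
        if variant not in surface.variants or variant in surface.deprecated:
            raise ValueError(f"variant {variant!r} is not selectable")
        surface.active = variant

    def deprecate(self, name: str, variant: str) -> None:
        surface = self._surfaces[name]
        if variant not in surface.variants:
            raise ValueError(f"unknown variant {variant!r}")
        if variant == surface.active:
            raise ValueError("cannot deprecate the active variant")
        if variant not in surface.deprecated:
            surface.deprecated.append(variant)

    def live_experiments(self) -> List[str]:
        # A surface with more than one non-deprecated variant is still an open experiment.
        return sorted(
            name for name, s in self._surfaces.items()
            if len(s.variants) - len(s.deprecated) > 1
        )

    def removal_candidates(self) -> Dict[str, List[str]]:
        # Deprecated variants are the code paths that should eventually be deleted.
        return {n: list(s.deprecated) for n, s in self._surfaces.items() if s.deprecated}


# Usage: two competing discount implementations behind one control surface.
registry = ControlSurfaceRegistry()
discount = ControlSurface(
    name="checkout.discount",
    variants={
        "flat": lambda cents: cents - 100,
        "percent": lambda cents: cents * 9 // 10,
    },
    active="flat",
)
registry.register(discount)
registry.select("checkout.discount", "percent")
registry.deprecate("checkout.discount", "flat")
print(registry.live_experiments(), registry.removal_candidates())
```

The point of the sketch is not the dispatch itself but the bookkeeping around it: the number of open experiments and the list of code paths awaiting removal can be read off the registry without having to inspect the implementation.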
As can be seen, none of the above mechanisms directly defines the design or implementation of the concrete system that is under development. Rather, each constrains the developed system in ways that make it easier to maintain an ongoing handle on the complexity, and exposes various perspectives from which to interpret and understand the actual structure and implementation of the system.
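As a rough illustration of how two of these mechanisms might sit alongside a system without defining it, the following sketch pairs a telemetry store with a governor that checks predefined bounds. Everything here is an assumption made for the example: the `Telemetry`, `Bound` and `Governor` names, the sliding-window average, and the `latency_ms` metric are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Deque, Dict, List, Optional


@dataclass(frozen=True)
class Bound:
    """Predefined acceptable range for one metric (hypothetical structure)."""
    metric: str
    low: float
    high: float


class Telemetry:
    """Keeps a sliding window of the most recent samples per metric."""

    def __init__(self, window: int = 5) -> None:
        self._window = window
        self._samples: Dict[str, Deque[float]] = {}

    def record(self, metric: str, value: float) -> None:
        # Older samples fall out of the deque once the window is full.
        self._samples.setdefault(metric, deque(maxlen=self._window)).append(value)

    def average(self, metric: str) -> Optional[float]:
        samples = self._samples.get(metric)
        return mean(samples) if samples else None


class Governor:
    """Pulls from telemetry and drives a corrective action when a bound is violated."""

    def __init__(
        self,
        telemetry: Telemetry,
        bounds: List[Bound],
        on_violation: Callable[[str, float], None],
    ) -> None:
        self._telemetry = telemetry
        self._bounds = bounds
        self._on_violation = on_violation

    def check(self) -> List[str]:
        violated = []
        for bound in self._bounds:
            avg = self._telemetry.average(bound.metric)
            if avg is None:
                continue  # No data captured yet for this metric.
            if not (bound.low <= avg <= bound.high):
                violated.append(bound.metric)
                # Escalation is delegated: restart, alert an operator, etc.
                self._on_violation(bound.metric, avg)
        return violated


# Usage: the production code only records samples; oversight lives elsewhere.
telemetry = Telemetry(window=3)
governor = Governor(
    telemetry,
    bounds=[Bound("latency_ms", low=0.0, high=200.0)],
    on_violation=lambda metric, value: print(f"escalate: {metric} averaged {value:.1f}"),
)
for sample in (100.0, 120.0, 110.0, 400.0, 500.0):
    telemetry.record("latency_ms", sample)
governor.check()
```

Note that the system being observed only ever calls `record`; the bounds and the corrective action are held in a separate place, which is the kind of separation the mechanisms are meant to provide.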
An interesting side effect of many of these mechanisms is that they ultimately tend to facilitate greater locality in decision making and isolation of implementation. That is, they tend to work with the fact that in complex systems we expect or desire to have the complexity spread across the full scale of the system rather than expecting to be able to move it completely into a top level controller (as with very homogeneous coherent systems), or delegating it fully to the lowest elements (as with very heterogeneous random systems). That is, in some sense these mechanisms should allow for complexity to be exhibited and managed at multiple scales, as expected with correlated networked systems.
Ultimately, the second order approach helps one to decouple the intent and strategic decisions from the implementation. This means that a team of developers is freer to explore the search space for a solution in a manner that works well for the team, while still having some surety that the result will meet short-term and long-term requirements.