When designing a stateful cloud native application, and especially a stateful cloud native network application, you must determine which data model matches the functional requirements of the system. There is often confusion about vocabulary within the telco and networking space when it comes to data (for instance, there are stateful protocols such as TCP/IP and stateful data such as that which is saved into a database). The confusion is exacerbated when these concepts are combined with cloud native non-functional requirements, such as resilience and availability. The telecommunications space is exceptionally difficult to navigate because requirements are usually outlined in RFPs rather than in an iterative agile development cycle. The following nine questions are often overlooked when considering cloud native statefulness, so we provide them here.

Correctness

When talking about stateful applications, correctness comes into play. Traditionally, relational database management systems (RDBMSs) handle the correctness of data, but with the low-latency (1 ms) requirements of telecommunication network components, some RDBMSs may not be up to the task.

1) How safe must your system be?

The difference between a cloud native system and a traditional application can be described as the difference, on a spectrum, between safety and liveness. Safety is related to correctness in that it describes stopping something undesirable from happening; in other words, it puts a constraint on the system. Liveness is the ability of the system to return to a desired state. Cloud native systems opt for the liveness side of the spectrum, otherwise known as anti-fragility, over safety.

2) Do you require strong consistency?

There are some problems, oftentimes in the financial space, that require strong consistency. The double spending problem, or making sure that a resource isn’t used twice, is a specific example of this.
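One common way to enforce this kind of strong consistency is a compare-and-swap style conditional write: a debit only succeeds if nothing else has modified the record since it was read. The sketch below is illustrative only; the `Account` class and its fields are assumptions, not a specific database API.

```python
# Minimal sketch: preventing double spending with a compare-and-swap
# style conditional write. Account and its fields are illustrative,
# not taken from any particular database API.

import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.version = 0          # bumped on every successful write
        self._lock = threading.Lock()

    def conditional_debit(self, amount, expected_version):
        """Debit only if no other writer has intervened since we read."""
        with self._lock:
            if self.version != expected_version:
                return False      # another writer won; caller must retry
            if self.balance < amount:
                return False      # insufficient funds: no double spend
            self.balance -= amount
            self.version += 1
            return True

acct = Account(balance=100)
v = acct.version
assert acct.conditional_debit(100, v) is True   # first spend succeeds
assert acct.conditional_debit(100, v) is False  # replay is rejected
```

The version check is what makes spending the same 100 units twice impossible, even if two clients read the same initial balance.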

3) Can you tolerate write skew?

When strong consistency is not required, relaxing the constraints on correctness can often improve other properties, such as performance. Write skew occurs when data is written based on a decision made from an outdated premise. This can sometimes be tolerated because the data can be fixed after the write.

4) How do you address write skew?

The decision to allow write skew should be based on whether the resolution of the write skew is good enough for your domain. For instance, for a phone call where packets are sometimes received out of order, a solution that drops the late packets (i.e., first write wins) is often tolerable. For a shopping cart that has many items added at the same time, where the writes were received out of order or delayed, a solution that merges the items into the cart as they are received (e.g., one that uses Conflict-free Replicated Data Types) is also often tolerable.
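The shopping-cart merge can be sketched with the simplest CRDT, a grow-only set (G-Set), whose merge is set union. The class and item names below are illustrative assumptions, not a production cart implementation.

```python
# Sketch of a grow-only set (G-Set) CRDT used to merge shopping-cart
# adds that arrive out of order at different replicas.

class GSetCart:
    def __init__(self, items=()):
        self.items = set(items)

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # Merge is set union: commutative, associative, and idempotent,
        # so replicas converge regardless of delivery order or retries.
        self.items |= other.items

# Two replicas receive different adds concurrently.
a, b = GSetCart(), GSetCart()
a.add("book")
b.add("headphones")
a.merge(b)
b.merge(a)
assert a.items == b.items == {"book", "headphones"}
```

Because union is order-insensitive, no add is lost no matter which replica's write arrives first; removing items would require a richer CRDT such as an OR-Set.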

Performance

Performance is mandatory for network applications in the telco space. A common baseline requirement is for network applications to run at 'line speed', meaning the software should not interfere with the speed of the network cards, which can reach 400G at this point. This puts heavy requirements on the bottleneck of the system, which is often the data.

5) Are there latency-sensitive parts of the system where the updates can be applied in any order, such as incrementation?

Latency-sensitive data whose updates can be applied in any order can often be addressed with merge solutions such as Conflict-free Replicated Data Types (CRDTs). Incrementing and decrementing a resource counter are examples of this.
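The increment/decrement case maps onto a PN-Counter CRDT: each replica tracks its own increment and decrement totals, and merging takes the element-wise maximum. This is a minimal sketch with assumed names, not a library API.

```python
# Sketch of a PN-Counter CRDT: per-replica increment and decrement
# tallies that merge by taking element-wise maxima, so updates can be
# applied and replicated in any order.

class PNCounter:
    def __init__(self, replica_id):
        self.rid = replica_id
        self.incs = {}   # replica id -> total increments at that replica
        self.decs = {}   # replica id -> total decrements at that replica

    def increment(self, n=1):
        self.incs[self.rid] = self.incs.get(self.rid, 0) + n

    def decrement(self, n=1):
        self.decs[self.rid] = self.decs.get(self.rid, 0) + n

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other):
        # max() per replica makes merge idempotent: re-delivering the
        # same state never double-counts an update.
        for rid, n in other.incs.items():
            self.incs[rid] = max(self.incs.get(rid, 0), n)
        for rid, n in other.decs.items():
            self.decs[rid] = max(self.decs.get(rid, 0), n)

a, b = PNCounter("a"), PNCounter("b")
a.increment(3)
b.increment(2)
b.decrement(1)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 4   # replicas converge
```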

6) Is your data write heavy?

Write-heavy data gains latency benefits from write-friendly data models, such as append-only datastores. Read-heavy data gains latency benefits from columnar databases, if your data model supports them.
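The write-friendly append-only model can be sketched as a log of immutable writes plus an index to the latest offset per key; writes are constant-time appends and never rewrite old data. The store below is an illustrative toy, not any particular datastore's API.

```python
# Sketch of an append-only datastore: every write is an O(1) append to
# an immutable log, and reads follow an index to the latest offset.

class AppendOnlyStore:
    def __init__(self):
        self.log = []      # full history of (key, value) writes
        self.index = {}    # key -> offset of that key's latest write

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]

store = AppendOnlyStore()
store.put("session", "active")
store.put("session", "closed")
assert store.get("session") == "closed"
assert len(store.log) == 2   # the old write is retained, not overwritten
```

Keeping the full history is also what enables compaction, replication, and replay in real log-structured systems.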

7) How fast should your responses be?

For network applications (and especially network components that are heavily depended upon), latency of 1 millisecond, or even latency measured in microseconds, can often be the requirement. This is rarely a requirement for traditional DBMSs. So when designing these systems you really need to ask whether the system requires 1 ms response times. What are the latency requirements at the 99th percentile? Do you have any hard real-time guarantees?
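A percentile target is checked against measured samples, not averages; a tail outlier can violate a hard real-time guarantee while leaving the p99 intact. The sketch below uses a simple nearest-rank percentile, with a hypothetical 1 ms (1000 µs) budget as discussed above.

```python
# Sketch: checking a 99th-percentile latency target against measured
# samples. The 1 ms budget and the sample values are illustrative.

def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Latencies in microseconds: 99 fast responses and one slow outlier.
samples = [500] * 99 + [5000]
p99 = percentile(samples, 99)
assert p99 == 500            # p99 meets a 1 ms (1000 us) budget...
assert max(samples) > 1000   # ...but the worst case blows it
```

This is why the section asks about both the 99th percentile and hard real-time guarantees separately: they are different requirements.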

8) Will the users be distributed geographically?

Geographic distribution is a challenge and an opportunity. The challenge comes from finding an implementation that is smart enough to replicate and partition data geographically. The opportunity comes from the latency reduction gained by keeping data close to the application.

9) How secure must your system be?

A compromised node can wreak havoc on your consensus algorithm if it does not have Byzantine fault tolerance. If a compromised node requests to become the leader and proceeds to act destructively, it will bring your system down.

Conclusion

If you are designing, or in the market to purchase, a stateful cloud native system, and especially a stateful cloud native network application, you would do well to ask some of these questions. The design of the system will vary greatly based on the answers.

Community post by W. Watson from CNCF’s cloud native network function working group
Source CNCF
