System Characteristics
About
System characteristics define the fundamental properties that determine how a system performs, scales, and handles failures. These characteristics help architects design robust, scalable, and fault-tolerant systems.
1. Scalability
Scalability is the ability of a system to handle increasing amounts of work by adding resources. A scalable system ensures that performance does not degrade as demand grows.
Types of Scaling
Vertical Scaling (Scaling Up)
Increasing the capacity of a single server (e.g., adding more CPU, RAM, or disk).
Has a physical limit—hardware can only be upgraded so much.
Example: Upgrading a database server from 32GB RAM to 128GB RAM.
Horizontal Scaling (Scaling Out)
Adding more servers to distribute the load.
Often preferred in cloud-based architectures for better redundancy.
Example: Adding multiple web servers behind a load balancer.
Auto-Scaling
Dynamically adding or removing resources based on demand.
Used in cloud environments (AWS Auto Scaling, Kubernetes Horizontal Pod Autoscaler).
Challenges in Scalability
Data consistency across multiple nodes.
Load balancing efficiently.
Database sharding complexities.
2. Availability
Availability refers to the percentage of time a system remains operational and accessible. It is usually expressed as a percentage (e.g., 99.99%), often called “nines” of availability.
Availability Levels
99% (Two nines)
~3.65 days
~7.2 hours
99.9% (Three nines)
~8.76 hours
~43.8 minutes
99.99% (Four nines)
~52.6 minutes
~4.38 minutes
99.999% (Five nines)
~5.26 minutes
~26.3 seconds
Methods to Improve Availability
Redundancy: Deploying backup servers to avoid single points of failure.
Failover Mechanisms: Switching to standby resources if the primary system fails.
Load Balancing: Distributing traffic across multiple servers.
Replication: Keeping multiple copies of data to avoid data loss.
Trade-offs
High availability often comes at the cost of complexity and additional resources.
3. Reliability
Reliability is the ability of a system to perform correctly and consistently over time without failures. A reliable system minimizes unexpected downtimes and data inconsistencies.
Factors Affecting Reliability
Hardware Failures: Server crashes, disk failures.
Software Bugs: Memory leaks, race conditions, deadlocks.
Network Failures: Packet loss, connection timeouts.
Techniques to Improve Reliability
Error Handling and Recovery: Implementing retry mechanisms and circuit breakers.
Data Replication: Ensuring backups exist in case of failures.
Testing Strategies: Unit tests, integration tests, and chaos engineering.
Difference Between Availability and Reliability
Aspect
Availability
Reliability
Focus
Ensuring system is operational
Ensuring system works correctly over time
Metric
Uptime percentage (e.g., 99.99%)
Mean Time Between Failures (MTBF)
Example
A website is up 99.99% of the time
A website never crashes due to software bugs
4. Fault Tolerance
Fault tolerance is the system's ability to continue operating even when components fail. A fault-tolerant system does not crash completely due to failures.
Types of Faults
Transient Faults: Temporary network failures, server timeouts.
Intermittent Faults: Occasional hardware failures.
Permanent Faults: Hardware crashes, disk corruption.
Fault Tolerance Mechanisms
Redundant Components: Standby servers, multiple database replicas.
Graceful Degradation: Partial functionality when some services fail.
Self-Healing Systems: Detecting and automatically recovering from failures.
Example
A fault-tolerant database might use leader-follower replication. If the leader node fails, a follower takes over automatically.
5. Consistency
Consistency ensures that all clients see the same data at any given time.
Types of Consistency
Strong Consistency: Every read receives the latest write.
Eventual Consistency: Data is updated eventually but might be inconsistent for a short time (used in NoSQL databases).
Causal Consistency: Guarantees that causally related updates appear in the correct order.
Trade-offs: CAP Theorem
According to CAP Theorem, a distributed system can only provide two out of three properties:
Consistency (C) – All nodes return the same data.
Availability (A) – The system remains responsive.
Partition Tolerance (P) – The system can function even when network partitions occur.
Example:
SQL databases prioritize Consistency and Partition Tolerance (CP).
NoSQL databases prioritize Availability and Partition Tolerance (AP).
6. Durability
Durability ensures that once a transaction is committed, it remains permanently stored even in case of failures.
Durability Mechanisms
Write-Ahead Logging (WAL): Logging every write operation before applying it.
Data Replication: Copying data to multiple locations.
Snapshots and Backups: Periodic data dumps to prevent data loss.
Example
A bank transaction that deducts money from one account and adds it to another must be durable. If a power outage occurs after the deduction, the system must ensure that the addition is completed when it restarts.
Comparison
Characteristic
Definition
Key Considerations
Scalability
Ability to handle increased load
Vertical vs. Horizontal Scaling
Availability
Uptime percentage
Redundancy, Failover, Load Balancing
Reliability
Correct and consistent performance over time
Error Handling, Testing, Replication
Fault Tolerance
System's ability to function despite failures
Redundant components, Self-healing systems
Consistency
Ensures all users see the same data
CAP Theorem, Strong vs. Eventual Consistency
Durability
Data remains intact after crashes
Write-Ahead Logging, Data Replication
Last updated
Was this helpful?