Riak is a truly fault-tolerant system: no aspect of it involves a single point of failure. Since no machines are special or central, users of Riak can decide exactly how fault-tolerant they would like their application to be.
Designed for availability without headaches
Anyone who has worked in Web operations knows that scalability and availability are not the same thing; there are many ways to build a system that can grow to match capacity without necessarily making it any more tolerant of failures.
The designers of Riak don't just care about distributed systems; we also have a background and interest in good operations. As a result, Riak was built from the beginning to allow operators to provide high availability and fault tolerance much more easily than with a traditional RDBMS or any system with "master" servers and other single points of failure.
The basic architecture of Riak is shared-nothing from a data point of view. All coordination between nodes is done via asynchronous messages, with no underlying shared state. This means that if a given node in a Riak cluster hangs, crashes, or catches fire, the other nodes can continue without incident. As no data is ever persistently stored on only one node, clients will generally not even notice that this has occurred.
In situations like the above (where a node becomes unreachable for any reason), the system compensates using hinted handoff. This means that neighbors of the errant node receive the messages it would otherwise have handled, so that features such as replication are not degraded in the window before that node is repaired or replaced.
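The hinted-handoff behavior described above can be illustrated with a small in-memory sketch. It is not Riak's implementation: the `Node` class, the `write` and `handoff` functions, and the explicit list of fallback nodes are all hypothetical names chosen for illustration. The idea shown is only that a neighbor accepts a write on behalf of an unreachable node, remembers who the write was meant for, and passes it on once that node is back.

```python
class Node:
    """Hypothetical stand-in for a cluster node (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # replicas this node owns
        self.hints = {}   # intended owner name -> {key: value} held on its behalf


def write(preference, fallbacks, key, value):
    """Store one replica per intended owner, diverting to a fallback if the owner is down."""
    spare = iter(fallbacks)
    for owner in preference:
        if owner.up:
            owner.data[key] = value
            continue
        # Owner unreachable: pick the next reachable fallback and record a hint.
        stand_in = next((n for n in spare if n.up), None)
        if stand_in is not None:
            stand_in.hints.setdefault(owner.name, {})[key] = value


def handoff(stand_in, owner):
    """Once the owner is reachable again, return the data held for it."""
    if owner.up:
        owner.data.update(stand_in.hints.pop(owner.name, {}))
```

In this sketch, a write issued while one owner is down still produces the full number of replicas; the diverted copy simply lives on a neighbor until `handoff` is called for the recovered node.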
Because it uses techniques such as consistent hashing, Riak has no master nodes, "name nodes", "key nodes", or any other similar central elements of the architecture. Every node in the system is structurally equivalent. This has great value operationally, as there's no system or set of systems that you have to be especially watchful or protective of; as long as a quorum of the overall network is available, you don't have to care which hosts go down.
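To make the role of consistent hashing concrete, here is a minimal sketch of how any node could work out where a key lives without asking a central authority. It assumes a ring of equally sized partitions over a SHA-1 hash space, assigned to nodes round-robin; the function names, the default of 64 partitions, and the replica count of 3 are assumptions made for this example, not a description of Riak's actual code.

```python
import hashlib

RING_BITS = 160  # size of a SHA-1 digest in bits


def build_ring(nodes, num_partitions=64):
    """Assign equally sized partitions to nodes round-robin (assumed policy)."""
    return [nodes[i % len(nodes)] for i in range(num_partitions)]


def partition_for(key, num_partitions=64):
    """Map a key to a partition index using only the key's hash."""
    digest = int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")
    # Scale the digest from [0, 2**160) down to [0, num_partitions).
    return digest * num_partitions // 2**RING_BITS


def preference_list(ring, key, n=3):
    """Return the owners of the n consecutive partitions starting at the key's partition."""
    start = partition_for(key, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(n)]
```

Because the mapping depends only on the key and the shared ring layout, every node computes the same answer independently, which is why no "name node" is needed in this kind of design.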
Because Riak exposes the consistency, availability, and partition-tolerance (CAP) controls to the application developer, the developer can work with operations like never before in tuning the relative availability of the whole system under different circumstances. This includes circumstances under which any traditional system would provide no choice but to become unavailable.
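The trade-off behind these controls can be sketched with simple replica arithmetic. The example below assumes the common formulation in which each object has N replicas, a read waits for R replies, and a write waits for W acknowledgements; the function names are hypothetical, and the overlap rule shown is the textbook condition for strict quorums rather than a claim about Riak's internals.

```python
def can_serve(reachable_replicas, required):
    """A request succeeds if at least `required` replicas can respond."""
    return reachable_replicas >= required


def quorums_overlap(n, r, w):
    """With strict quorums, R + W > N means every read set shares a replica with every write set."""
    return r + w > n
```

With N = 3 and two replicas unreachable, `can_serve(1, 1)` is true while `can_serve(1, 2)` is false: lowering R or W keeps the system answering at the cost of weaker consistency, and raising them does the reverse.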
Riak is also easy to administer. If you wish to add nodes to add capacity or remove nodes to either reclaim resources or repair broken systems, there is very little operational impact. You can leave the whole cluster running continuously, and the content will rebalance itself to best suit the new size of the cluster.