Relational databases have been the dominant data management model for decades, but some problems are better fits for non-relational (NoSQL) databases. Today, we have a variety of non-relational database management systems that offer a variety of data management features and performance benefits. Choosing a good database for your needs is a matter of aligning application requirements and database features while considering the overhead costs of managing a potentially new data management system.
The benefits of relational database management systems (RDBMS) include:
- support transactions;
- a well-defined query language;
- mature development and administration tools;
- a substantial set of best practices for data modeling and physical database implementation.
However, the rich set of features found in a modern RDBMS comes with drawbacks. For example, large numbers of concurrent users performing operations that need to read and write to disk can slow response times. Using a cache, such as memcached, is one way to improve overall application performance by shifting some of the read load to the cache, therefore reducing the read load on the database server.
Another common performance challenge with RDBMS is joining normalized tables. Data normalization is a set of principles for data modeling that reduces the potential for data anomalies, such as a customer having two concurrent primary phone numbers when only one is allowed. A highly normalized data model is a thing of beauty to a data modeler but potentially the start of a nightmare for a developer. These data models often require a large number of database joins which are computationally expensive and can generate substantial levels of I/O. Proper indexing and judicious use of optimizer hints can reduce join overhead, but there are limits to the benefits of these techniques. One approach is to denormalize the database. This reduces the need for expensive joins at the cost of duplicating data and possibly introducing data anomalies.
Key-value data stores allow developers to store arbitrary types of data using attributes that are defined as needed. Unlike relational databases, key-value data stores do not require a predefined schema with all attributes defined. This model is suited for applications storing simply structured data with the potential for a large number of frequently changing attributes. Examples of key value data stores include DynamoDB, Berkley DB and Redis. Even on low-end hardware, key-value data stores can achieve read and write rates of 100,000 I/Os per second.
Graph databases such as Neo4j use nodes and links between nodes as the basic building blocks. Networks are easily modeled with graph databases, making them suitable for social network analysis, workflow modeling, and other systems of linked or interacting entities. Graph databases allow one to easily create queries about paths and relationships while providing read performance comparable to other NoSQL databases. Write performance on graph databases may not achieve the same levels as other NoSQL databases so they might not be the best option for write intensive applications. InfiniteGraph and Sones are two other NoSQL graph database options.
Column store databases such as HBase andCassandra work well for big data applications that can benefit from MapReduce analysis. HBase’s distributed architecture is designed for applications storing up to billions of rows and millions of columns and may be a good option to replace a relational database that cannot support such large data sets. Cassandra uses a column family structure that includes support for column indexes and materialized views. Cassandra is a distributed database that can achieve linear increase in performance as nodes are added to the cluster. In spite of increased performance that comes with additional hardware, column data stores are not known for supporting complex queries and rapid query response times. Riak is another column store NoSQL database.