By Rick F. van der Lans
AerospikeDB, Akiban, Aster, Cassandra, CouchDB, HBase, MongoDB, Neo4j, NuoDB, ParAccel, and VoltDB are all database servers that did not exist ten years ago, and most did not even exist five years ago. Five years ago, the dominant database servers were IBM DB2, Microsoft SQL Server, MySQL, and Oracle. But the database server market has changed drastically. This is primarily due to a phenomenon that has become one of the hottest topics in the IT industry: big data.
No one knows exactly what the definition of big data is. In fact, it is probably the most poorly defined term the IT industry has ever had. Nevertheless, we all have a feeling of what it means: it refers to systems that store or process vast amounts of data, amounts we would not have thought possible just a few years ago. In a way, the word big implies an amount of data that is unusual for an organization. These big data applications are in most cases analytics-driven. They are developed to improve the analytical strength of an organization. For example, big data allows organizations to optimize their performance and business processes, to increase customer loyalty and customer care, to reduce development costs, and to improve the monetization of their data.
To return to the database server market: some of the requirements of these big data systems are too much for classic database servers, or, if we do develop these systems with classic database servers, the solution becomes too expensive, maybe even unaffordable.
This was a signal for many new vendors to develop database servers that are suitable for big data systems and available at a reasonable price. These new database servers can be divided into three groups: NoSQL, NewSQL, and analytical SQL database servers.
NoSQL database servers: As the name suggests, NoSQL database servers support neither SQL nor pure relational concepts. This group can be further divided into four subcategories: key-value stores (MemcacheDB, HamsterDB, Oracle Berkeley DB, LevelDB, Redis, AerospikeDB, and Riak), document stores (Apache CouchDB, Apache Jackrabbit, MongoDB, OrientDB, Terrastore, and RavenDB), column-family stores (Apache Cassandra, HBase, Hypertable (modeled after Google's Bigtable), and Amazon SimpleDB), and graph stores (AllegroGraph, FlockDB, HyperGraphDB, InfiniteGraph, and Neo4j).
Most of them support a data model and API that make it easier for applications to store and retrieve data. Most support an aggregate data model: data can be stored in nested and hierarchical structures, records in the same table can have different structures, and when a new record is inserted, a value can be added for which no column has been defined yet. Another commonality is that they don't support all the features usually found in classic SQL systems, such as data security, data integrity, and high-level query languages. But because of their non-relational data model, their lean-and-mean architecture, and their different transaction mechanisms, they offer high transaction performance, high scalability, high availability, and low storage costs. This makes them very suitable for particular big data systems.
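The aggregate data model described above can be sketched in a few lines. This is a minimal, product-independent illustration (the collection, field names, and helper function are invented for the example): each record is a self-contained, possibly nested document, and two records in the same collection need not share a structure.

```python
# One "collection" of aggregates; no schema is declared anywhere.
orders = []

# A record with nested, hierarchical structure.
orders.append({
    "order_id": 1001,
    "customer": {"name": "Alice", "city": "Rome"},
    "lines": [
        {"product": "disk", "qty": 2},
        {"product": "cable", "qty": 5},
    ],
})

# A later record adds a "discount" value for which no column was ever
# defined, and omits the "lines" field entirely -- no schema change needed.
orders.append({
    "order_id": 1002,
    "customer": {"name": "Bob"},
    "discount": 0.10,
})

def total_quantity(order):
    """Sum the quantities across an order's nested line items, if any."""
    return sum(line["qty"] for line in order.get("lines", []))

print(total_quantity(orders[0]))  # -> 7
print(total_quantity(orders[1]))  # -> 0
```

The point of the sketch is the flexibility: the application, not the database, decides what each aggregate contains, which is precisely what classic SQL systems with fixed table schemas do not allow.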
A special product that must be mentioned here is Hadoop. Hadoop is not a database server but a framework that combines a highly scalable distributed file system (HDFS) with a MapReduce-based processing architecture, and it underpins several of the NoSQL database servers mentioned above. For that reason it was not included in the lists above.
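The MapReduce architecture mentioned above can be sketched without any cluster at all. This is a minimal single-machine illustration of the programming model only (the word-count task and function names are invented for the example); real Hadoop distributes the same three phases across many machines.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group of values into one result per key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big database"])))
print(counts)  # -> {'big': 2, 'data': 1, 'database': 1}
```

Because map and reduce operate on independent keys, each phase can run in parallel on different nodes, which is what gives Hadoop its scalability.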
NewSQL database servers: The second group of database servers is called NewSQL. Note that these products don’t support a different kind of SQL. They do support traditional SQL. They are “new” SQL database servers because internally they are different. Examples of NewSQL products are Akiban, Clustrix, GenieDB, JustOneDB, MemSQL, NuoDB, ScaleBase, TransLattice, VMware SQLFire, and VoltDB.
NewSQL products support most of the functionality usually found in classic SQL systems. Like the NoSQL database servers, they aim for high transaction performance, high scalability, and high availability. They implement this by using different internal, shared-nothing architectures and by making efficient use of internal memory. But their strength, of course, is that they support SQL, which has a number of practical advantages:
• minimal training costs for developers because they already understand SQL
• many existing development and reporting tools support SQL and are thus able to extract data from these new products
• integration of data in NewSQL and classic SQL systems is easy
Although they all support rich SQL dialects and can all execute complex SQL queries, their real strength is running transactions and simple queries fast. These query features make them very suitable for what some call real-time analytics.
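The practical advantage of supporting standard SQL can be shown concretely. In this hedged sketch, SQLite merely stands in for any SQL engine, and the table and data are invented for the example; the point is that the query text itself is ordinary SQL that would run unchanged on a classic or a NewSQL system.

```python
import sqlite3

# Any SQL engine would do here; SQLite is used only because it ships
# with Python and needs no server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("ACME", 10.0), ("ACME", 12.0), ("INIT", 7.5)],
)

# A simple, real-time-analytics-style query of the kind NewSQL systems
# are optimized to run fast. Nothing about it is vendor-specific.
row = conn.execute(
    "SELECT COUNT(*), AVG(price) FROM trades WHERE symbol = ?",
    ("ACME",),
).fetchone()
print(row)  # -> (2, 11.0)
```

This is why the training-cost and tooling advantages listed above hold: developers and reporting tools that already speak SQL need learn nothing new.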
Analytical SQL database servers: While the NewSQL products focus on transactions combined with simple real-time analytics, the analytical SQL database servers aim at analyzing big data. Some are designed specifically as sandboxes for data scientists, where the most complex queries can be run on massive amounts of data and still offer good performance. Examples of products that belong to this category are Actian VectorWise, Exasol, EMC/Greenplum, HP/Vertica, IBM/Netezza, InfoBright, Information Builders HyperStage, Kognitio, Oracle Exalytics, ParAccel, SAP HANA, Teradata Appliances, and the Aster Database. These are all SQL database servers designed and optimized for analytical processing on large sets of data.
To summarize, big data has enormous potential. The prominent McKinsey Global Institute indicated in its June 2011 report, "Big data: The next frontier for innovation, competition, and productivity," that big data has the potential to increase the value of the US health care industry by $300 billion, to increase the value of Europe's public sector administration by €250 billion, to decrease manufacturing (development and assembly) costs by up to 50%, to increase service providers' revenue by $100 billion through global personal location data, and to increase US retailers' net margins by 60%.
The "big database servers" mentioned in this article make it possible to develop and run big data systems. Whether "big" refers to a high number of transactions or to a massive amount of data to be analyzed, database servers now exist that are designed and optimized for these application areas. They are big data ready. In other words, the need for big data systems has opened up a whole new market of database servers. Without any doubt, some of these will not survive the battle that is about to start, but some will endure and will become as well known in the future as some of the classic SQL systems are today. My recommendation is to check out some of these systems and their potential.
Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, and database technology. He is Managing Director of R20/Consultancy, an internationally renowned lecturer, a writer for several websites, including the well-known B-eye-Network.com, and the author of numerous technical articles. His best-selling books, including "Introduction to SQL" and "The SQL Guide to Oracle", have been translated into many languages and have sold over 100,000 copies. His most recently published book is "Data Virtualization for Business Intelligence Systems".
Rick F. van der Lans will present in Rome, for Technology Transfer, the seminars "Data Virtualization" on June 13, 2013, and "The New Generation of Database Technology" on June 14, 2013.