Optimizing Database Infrastructure for Complex Systems: Exploring the Transition from Oracle to Cassandra and the Multidatabase Approach at Netflix

Netflix’s streaming platform may have begun as a small video streaming application, but it has evolved into a complex system that integrates advanced features, such as video recommendations based on user preferences, and it continues to evolve today. When Netflix launched its streaming service in 2007, the Oracle relational database it relied on was known for its data integrity and scalability; however, as the platform grew, the Oracle database struggled to keep up with massive data volumes and traffic loads.

Netflix’s transition from the Oracle relational database to the distributed NoSQL database Cassandra represents a significant shift in its database infrastructure. For several years during the transition, Netflix ran its legacy Oracle system alongside the new Cassandra system to assess differences in performance and scalability (Carpenter, 2022). RDBMS vs Cassandra identifies a key difference between Cassandra’s flexible NoSQL model and Oracle’s more rigid relational model: “In Cassandra, relationships are represented using collections. In RDBMS, there are concept of foreign keys, joins etc. In Cassandra, column is a unit of storage. In RDBMS, column represents the attributes of a relation.” (RDBMS vs Cassandra, n.d.). In other words, relational databases enforce referential integrity and data consistency across structured tables, whereas Cassandra uses collections such as sets, lists, and maps to store related data together in a single row. Collections eliminate the need for joins, which lets Cassandra retrieve related data in a single query and provides the flexibility needed to meet high scalability and performance demands (Hammink, n.d.). These features benefit Netflix specifically because video streaming entails high traffic and massive data volumes. By storing related data together so that it can be read in a single query, Netflix gains performance at the cost of some control over data integrity. For a streaming platform, this small loss of control is worth the performance boost, whereas in an accounting system, where data integrity reigns supreme, it may not be. This suggests that companies must understand the trade-offs between database solutions to make informed decisions about their database infrastructure.
The transition from Oracle to Cassandra also supports the use of a microservice architecture, in which the application is organized by purpose and each service is paired with the database best suited to its specific task.
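
The contrast between joins and collections described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for a relational database such as Oracle, a plain dictionary stands in for a Cassandra row that holds collections, and the table names, column names, and sample data are all hypothetical. It is not Netflix's schema and it does not use a real Cassandra driver.

```python
import sqlite3

# Relational style: two tables linked by a foreign key, read back with a join.
# SQLite is an assumed stand-in for a relational database such as Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce foreign keys
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE watched (
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        title   TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'Ada');
    INSERT INTO watched VALUES (1, 'Title A'), (1, 'Title B');
""")


def relational_lookup(user_id):
    """Return (name, titles) by joining the two tables."""
    rows = conn.execute(
        "SELECT u.name, w.title FROM users u "
        "JOIN watched w ON w.user_id = u.user_id "
        "WHERE u.user_id = ? ORDER BY w.title",
        (user_id,),
    ).fetchall()
    return rows[0][0], [title for _, title in rows]


def orphan_insert_rejected():
    """Return True if the database refuses a row that points at a missing user."""
    try:
        conn.execute("INSERT INTO watched VALUES (99, 'Title Z')")
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.rollback()
    return False


# Wide-row style: related data lives together in one row as collections.
# A dict is an assumed stand-in for a Cassandra row with list and set columns.
wide_rows = {
    1: {"name": "Ada", "watched": ["Title A", "Title B"], "devices": {"tv", "phone"}},
}


def wide_row_lookup(user_id):
    """Return (name, titles) from a single row, with no join."""
    row = wide_rows[user_id]
    return row["name"], list(row["watched"])
```

Both lookups return the same name and titles, but only the relational version rejects a row that references a user who does not exist. In the wide-row version, nothing stops the application from writing inconsistent data, which is the integrity trade-off the paragraph above describes.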

While Cassandra has proven to be a robust and scalable database solution for Netflix, graph databases, such as Neo4j, are designed to handle highly connected data and complex relationships (Hunger et al., 2016). A Review of Graph Databases shows that anyone researching database infrastructures must recognize that companies often use multiple databases for different purposes: “Netflix adopted JanusGraph + cassandra + elasticsearch as their graph database infrastructure” (Johhan, 2020). Although the details of Netflix’s infrastructure are proprietary, this suggests that even if Netflix uses Cassandra as its primary database for storing and managing streaming data, its database infrastructure is far more complex than a single system. For example, JanusGraph might run on top of Cassandra to provide advanced graph-based querying and analysis capabilities that allow Netflix to model and analyze complex relationships such as user preferences, content recommendations, and social connections (Janusgraph, n.d.), and Netflix might use Elasticsearch on AWS to enhance its search capabilities and help generate personalized recommendations for millions of users (Turnbull, 2016). This suggests that simply building a few databases cannot replace in-depth research into the complex multi-database infrastructures behind the enormous systems that deliver extraordinary user capabilities at today’s Fortune 100 companies.
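
To make the idea of graph-based recommendations more concrete, the following is a minimal sketch of a two-hop traversal: from a user to the titles they watched, to other users who watched the same titles, and on to titles the first user has not yet seen. It is written in plain Python as a stand-in for a graph query; the user names, titles, and scoring rule are hypothetical, and nothing here reflects how Netflix or JanusGraph actually implements recommendations.

```python
from collections import Counter

# Hypothetical viewing graph: each user is connected to the titles they watched.
watched = {
    "alice": {"Title A", "Title B"},
    "bob": {"Title A", "Title B", "Title C"},
    "carol": {"Title B", "Title D"},
}


def recommend(user, graph, limit=3):
    """Suggest unseen titles, weighted by how much other viewers overlap with user."""
    seen = graph[user]
    scores = Counter()
    for other, titles in graph.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # shared titles link the two users
        if overlap == 0:
            continue
        for title in titles - seen:  # only titles the user has not watched
            scores[title] += overlap
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:limit]]
```

With the sample data, recommend("alice", watched) ranks "Title C" ahead of "Title D", because bob shares two titles with alice while carol shares only one. A graph database expresses this kind of relationship-following query directly, which is harder to do efficiently with joins or with single-row lookups.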


References:

Carpenter, J. (2022, December 13). Why CIOs need to understand Apache Cassandra. CIO. https://www.cio.com/article/415227/why-cios-need-to-understand-apache-cassandra.html#:~:text=The%20company%20launched%20its%20streaming,data%20was%20housed%20in%20Cassandra.

Hammink, J. (n.d.). An introduction to Apache Cassandra®. Aiven. https://aiven.io/blog/an-introduction-to-apache-cassandra

Hunger, M., Boyd, R., & Lyon, W. (2016, February 15). RDBMS & Graphs: Why relational databases aren’t always enough. Neo4j. https://neo4j.com/blog/rdbms-graphs-why-relational-databases-arent-enough/

Janusgraph. (n.d.). Hackolade. https://hackolade.com/help/JanusGraph.html

Johhan. (2020, March 31). A review of graph databases. Open Source Distributed Graph Database. https://www.nebula-graph.io/posts/review-on-graph-databases

RDBMS vs Cassandra. (n.d.). Javatpoint. https://www.javatpoint.com/rdbms-vs-cassandra#:~:text=Cassandra%20is%20used%20to%20deal,RDBMS%20has%20fixed%20schema.

Turnbull, D. (2016, September 9). High-quality recommendation systems with Elasticsearch. OpenSource Connections. https://opensourceconnections.com/blog/2016/09/09/better-recsys-elasticsearch/


