Have you ever wondered how companies like Netflix or Amazon “know” what movies or products to recommend to you? Or how researchers identify variants in human genomes? Companies and individuals typically derive these insights from data, using tools from data science and machine learning. As data continues to grow in scale and complexity, Apache Spark has emerged as a leading computational platform to tackle these data-driven applications.
Apache Spark was developed at UC Berkeley’s AMPLab in 2009, and is now the most active open-source project in big data. Spark is a cluster computing framework coupled with a suite of tools for processing streaming, graph, and structured data, as well as an integrated machine learning library, MLlib. Thanks to its expressive, high-level APIs, Spark enables developers to quickly write concise code. Furthermore, Spark is fault-tolerant: by tracking the lineage of transformations performed during the course of a program, Spark can recompute lost results and recover from machine failure. Finally, Spark is fast! In 2014, Spark was the fastest system to sort 100 TB of data (1 trillion records) and won the Daytona GraySort competition.
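To make the lineage idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `ToyDataset` class and its `simulate_failure` method are hypothetical names invented for this illustration. The point is only that a dataset which remembers the transformations that produced it can rebuild a lost result from the original input.

```python
# Toy model of lineage-based recovery. This is NOT Spark's API;
# ToyDataset and simulate_failure are hypothetical names for illustration.
class ToyDataset:
    """A dataset that records how it was derived, not just its contents."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # durable input data
        self._lineage = tuple(lineage)  # ordered (kind, function) steps
        self._cache = None              # computed result, which may be lost

    def map(self, fn):
        # Record the step; nothing is computed yet.
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # If the result is missing, replay the recorded lineage from the source.
        if self._cache is None:
            data = self._source
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
        return list(self._cache)

    def simulate_failure(self):
        # Pretend the machine holding the computed result went down.
        self._cache = None


squares_of_evens = (
    ToyDataset(range(10))
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
)

before = squares_of_evens.collect()
squares_of_evens.simulate_failure()
after = squares_of_evens.collect()  # rebuilt by replaying the lineage
print(before == after)
```

In this sketch the recovered result matches the original because the same recorded steps are replayed over the same input. A real cluster framework has to do this per partition and across machines, which the toy model does not attempt.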
Numerous organizations have adopted Spark, including Netflix, Uber, MyFitnessPal, and Amazon. Spark is also used in research settings for applications such as astronomy image processing, exploratory analyses in computational neuroscience, and human genome sequence analysis. Let’s explore a few of these use cases.
- Netflix provides not only movies but also unique recommendations for each user. With more users and more data, Netflix is able to make better personalized recommendations. By integrating Spark Streaming, Netflix was able to run recommendation experiments on 8x as many users, 5-9x faster, using the same number of nodes or fewer. Spark Streaming allowed Netflix to gain immediate insights about user engagement, continuously update their machine learning models, and provide more real-time movie recommendations.
- MyFitnessPal (an Under Armour company) is an online health and fitness community that allows people to track their daily diet and exercise to achieve their fitness goals. Spark enabled data scientists at MyFitnessPal to build a reliable food database. They used Spark to build their data pipeline, parallelize data processing, remove duplicate entries submitted by users, and create a list of ‘Verified Foods.’ Spark provided a ten-fold speed improvement over their previous data pipeline implementation and increased team productivity.
- Deborah Siegel from the Northwest Genome Center and the University of Washington, along with Denny Lee from Databricks, has accelerated the process of genome variant analysis (i.e., finding differences between genome sequences) by sequencing genomes in parallel using Spark. Identifying variations in human genome sequences is a crucial step towards personalized medicine, both in helping to identify people predisposed to common diseases and in providing treatments tailored to an individual’s genetic profile. Spark is a vital component in speeding up this process.
You don’t have to work in industry or be a researcher to use Apache Spark: it’s open source and free for anyone to use! If you’re interested in learning more about Apache Spark, data science, machine learning, distributed computing, or any combination thereof, check out the Data Science and Engineering with Apache Spark XSeries. This 5-course series is sponsored by Databricks, the company founded by the team that created Apache Spark, and will teach students how to perform data science and data engineering tasks at scale using Spark. Students will be presented with an integrated view of data processing, and will gain hands-on experience building and debugging Spark applications using the Databricks platform.