The IT team at Arity, an Allstate subsidiary that collects and sells driving data, is nearing the end of a project to load more than a trillion miles of driving data into Amazon S3. Progress accelerated notably after the company switched from Apache Spark to Starburst, a move that helped it overcome problems with processing speed and cost.
Arity’s database holds more than 2 trillion miles of driving data compiled from over 50 million drivers, and primarily serves auto insurers, retailers, and mobile app developers. The data is used for various applications, including customer identification by insurers and real-time driver monitoring through applications like Life360. Arity’s geolocation data also gives state departments of transportation (DOTs) a way to analyse traffic patterns and improve road safety and infrastructure without conducting on-site assessments, which can be hazardous and costly.
As interest from DOT agencies surged, Arity recognised the need to automate its data delivery process. This led the company to reconsider its technology stack. “Traditionally, we use Spark and AWS EMR clusters,” said Reza Banikazemi, Arity’s director of system architecture. He elaborated on the need for improvement, noting, “For this particular project, with about six years’ worth of driving data, over a petabyte, the cost and runtime were big challenges.”
In the initial stages, Arity’s engineers processed the dataset with Spark, optimising routines written in Scala, the language in which Spark itself is written. During a proof of concept earlier this year, they tested their approach on a sample dataset and recorded a processing time of 45 minutes for the initial load, an impractical duration for a project of this magnitude.
Banikazemi also raised concerns about cost. “Every time you run a job, you’ve got to boot up the cluster,” he said, pointing to the expense that remained even when using Amazon EC2 Spot instances. EMR clusters also failed frequently mid-job, which complicated the process further. Attempts to use Amazon Athena for serverless querying were equally disappointing, as it proved unreliable with large queries.
In search of alternatives, Arity discovered Starburst, which offers a managed Trino service known as Galaxy. Tested on the same sample data, the service cut the processing time from 45 minutes to just four-and-a-half, a tenfold improvement that led to a swift decision to adopt Starburst for the ongoing project. “It was almost like a no-brainer when we saw those initial results,” Banikazemi commented.
Now operating within Arity’s virtual private cloud on AWS, Starburst handles the initial data load and the subsequent processing required for backfilling. The change has also moved data querying from complex programming to standard SQL, making it accessible to data analysts rather than exclusively to data engineers. “Something that we needed engineering to do now we can give to our professional services people and sales engineers,” Banikazemi noted.
The transition to Starburst has saved hundreds of thousands of dollars in EMR processing costs while also meeting the security and privacy requirements vital to Arity’s operations. “At the end of the day, Starburst hit all the marks,” Banikazemi said, confirming that the company achieved its data objectives both economically and on time.
As the project reaches completion, it underscores a shift in data processing methodologies within the industry, highlighting the increasing reliance on advanced technologies like Starburst to leverage data effectively and affordably.
Source: Noah Wire Services