Scaling for 10x User Growth

comma ai
Mar 8, 2021

Weekly openpilot users grew from fewer than 275 per week in the second half of 2018 to more than 2,750 per week at the end of 2020. On some days, openpilot users now drive 100,000 miles, and every week they drive well over 500,000 miles. That is hundreds of terabytes of data per month that we process and make available to our users in explorer, cabana, and the connect app on Android/iOS.

Providing all of our user-facing services costs less than $0.003 per mile.
Let’s pull back the curtain so you can see how we built new services while decreasing the cost of processing all this data.

Video Transcoding

We love H.265 for its great quality and compression, but only H.264 is widely supported by web browsers, and we want to use a lower bitrate for streaming. Transcoding everything uploaded to us wastes a lot of money, since most video is never watched. Transcoding on the fly produces large, rapid bursts of work, so significant compute resources have to sit idle most of the time to keep users from waiting for video to buffer. When we didn’t have very many users, either of these solutions was acceptable. The cost-effective solution, however, was to encode the video twice (H.265 and H.264 simultaneously) while recording on the device. Now the video that you watch in our apps is uploaded almost in real time. It’s great when the solution is simpler and eliminates significant amounts of code!

Data Retention

Renting cloud storage is expensive, and with user growth the cost can easily grow quadratically: each week more users upload data, and each week's uploads are stored on top of everything already kept. We used to hoard data because we thought it was super valuable. Then we realized we had more data than we knew what to do with. So, we implemented a short-term retention policy for large, high-quality video and raw logs. We wanted to give users the option to look back much further (up to 1 year with comma prime), so we added small, streaming-quality video and decimated logs. If you have some drives you want to keep forever, we also implemented functionality for preserving a limited number of routes.
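To make the growth argument concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the user count grows linearly and that every user uploads one unit of data per week; the function name and parameters are hypothetical, not part of our pipeline. Keeping everything forever makes the stored total grow quadratically, while a fixed retention window makes it grow linearly.

```python
def stored_units(weeks, new_users_per_week, retention_weeks=None):
    """Return the data stored at the end of each week.

    Assumptions (illustrative only): the user count grows by
    `new_users_per_week` every week, and each user uploads one unit of
    data per week. `retention_weeks=None` means nothing is ever deleted.
    """
    # Weekly upload volume grows linearly with the user count.
    uploads = [new_users_per_week * (w + 1) for w in range(weeks)]
    stored = []
    for w in range(weeks):
        # With a retention window, only the most recent weeks are kept.
        start = 0 if retention_weeks is None else max(0, w + 1 - retention_weeks)
        stored.append(sum(uploads[start:w + 1]))
    return stored


if __name__ == "__main__":
    print("keep forever:  ", stored_units(8, 1))
    print("4-week window: ", stored_units(8, 1, retention_weeks=4))
```

Since storage is billed on what is currently stored, the bill follows the same curve as the stored total in either case.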

Large volumes of data for research purposes are not stored in the cloud because used servers and slow spinning disks are cheap. Petabytes of data require a distributed file system, so we created minikeyvalue instead of using any of the complex alternatives, which have many features we have no use for.

Data Processing Pipeline

When you upload video and logs from driving, we generally have the files processed and available within minutes. We used to use dedicated virtual machines that scaled horizontally as the load changed. Now we use spot instances at a small fraction of the original cost. However, your cloud provider can pull these virtual machines out from under you at any time without notice (because someone else is willing to pay more). So how do you lower the probability of losing all of your instances to nearly zero? Cloud providers generally need to have a large buffer of capacity, and they have many different virtual machine configurations with different CPU and memory sizes. So, don’t just spin up virtual machines of a single configuration type; spin up lots of them from many different configuration types! Assuming the probability of losing virtual machines of each type is independent, the probability of losing everything goes down significantly with every additional configuration type you spin up. Entire companies have been founded (and acquired by cloud providers) based on this idea.
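The independence argument can be sketched in a few lines of Python. The reclaim probability below is a made-up illustrative number, not a measured rate from any cloud provider, and the function name is hypothetical. If each configuration type is reclaimed with probability p in some time window, independently of the others, the probability of losing all n types at once is p to the power n.

```python
def prob_lose_all(p_reclaim, n_types):
    """Probability that every configuration type is reclaimed at once.

    Assumes each type is reclaimed independently with the same
    probability `p_reclaim` (an illustrative simplification).
    """
    if not 0.0 <= p_reclaim <= 1.0:
        raise ValueError("p_reclaim must be between 0 and 1")
    if n_types < 1:
        raise ValueError("n_types must be at least 1")
    return p_reclaim ** n_types


if __name__ == "__main__":
    # Hypothetical 10% chance of losing any single type in a given window.
    for n in (1, 2, 4, 8):
        print(f"{n} type(s): {prob_lose_all(0.1, n):.8f}")
```

In practice reclaim events are unlikely to be fully independent (a regional capacity crunch can hit several types at once), so the real benefit is smaller than this simple model suggests, but the direction holds.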

SSH Jump Server

We have always had SSH access to our internal company-owned devices in our vehicles. We used to run OpenVPN and it wasn’t great. We figure anything we use every day is something our users would also want, so we decided to offer it as a service. How can we provide an SSH jump server that gives only you access to your device, even if it cannot accept incoming connections while on a mobile cellular network? We have the device make a secure websocket connection (outbound) to the jump server, and then when you SSH into the jump server it forwards traffic through the websocket to the SSH port of your device (after validating the connection is authenticated with your public key). We only allow asymmetric key authentication and we do not have SSH access to your device (only you hold the private key).
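The core trick, having the device dial out so that no inbound port is needed, can be sketched with Python's asyncio. This is a simplified illustration and not our implementation: it uses plain TCP in place of a secure websocket, pairs a single device with a single client, and omits the public key validation entirely. The `Relay` class, `pipe` helper, and `demo` function are hypothetical names introduced for the example.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(4096):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


class Relay:
    """Pairs one outbound device connection with one incoming client.

    Simplifications: plain TCP instead of a secure websocket, and no
    authentication of either side.
    """

    def __init__(self):
        self.device = None  # (reader, writer) of the waiting device tunnel

    async def on_device(self, reader, writer):
        # The device connected outbound to us; hold the tunnel open.
        self.device = (reader, writer)

    async def on_client(self, reader, writer):
        if self.device is None:
            writer.close()  # no device is waiting
            return
        dev_reader, dev_writer = self.device
        self.device = None
        # Forward bytes in both directions until either side closes.
        await asyncio.gather(pipe(reader, dev_writer), pipe(dev_reader, writer))


async def demo():
    """Round-trip four bytes through the relay on localhost."""
    relay = Relay()
    dev_srv = await asyncio.start_server(relay.on_device, "127.0.0.1", 0)
    cli_srv = await asyncio.start_server(relay.on_client, "127.0.0.1", 0)
    dev_port = dev_srv.sockets[0].getsockname()[1]
    cli_port = cli_srv.sockets[0].getsockname()[1]

    # The "device" dials out to the relay (stands in for the websocket).
    d_reader, d_writer = await asyncio.open_connection("127.0.0.1", dev_port)
    while relay.device is None:
        await asyncio.sleep(0.01)

    # The "user" connects to the relay and sends bytes toward the device.
    c_reader, c_writer = await asyncio.open_connection("127.0.0.1", cli_port)
    c_writer.write(b"ping")
    await c_writer.drain()

    # The device side (standing in for sshd) replies with the bytes reversed.
    received = await d_reader.readexactly(4)
    d_writer.write(received[::-1])
    await d_writer.drain()
    reply = await c_reader.readexactly(4)

    c_writer.close()
    d_writer.close()
    dev_srv.close()
    cli_srv.close()
    return reply


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real deployment the device end of the tunnel would be wired to the local SSH port, and the relay would refuse to pair a client with a device until the client's public key had been checked.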

The Future Looks Bright

What will be needed in the future for the next 10x user growth? We will be building new services to support new features such as maps, navigation and vehicle security. We could further decentralize our current services by utilizing peer-to-peer connections with WebRTC. Many other things will need improvements to continue to scale horizontally.

Join the team

Are you interested in building out our services and underlying infrastructure? If you think you have what it takes to lead and implement our next 10x user growth, apply for a job!

Greg Hogan,
Head of Infrastructure @ comma.ai
