Put on the GPU tag
Over the past decade, the GPU has become an essential component in the system stack of modern data analytics, data science, and machine learning. With the vision (or slogan) of “General Purpose GPU” (GPGPU), building a software company on GPUs looked promising.
Our startup was founded in May 2017. Yes, it was positioned as a GPU software company. At that time, our slogan was “Reinvent Data Science”. Part of the reason was that, like many startups, we didn’t know exactly what we would build with GPUs. Around the same time, several software companies announced the “GPU Open Analytics Initiative” (GOAI)[1].
Although people had been using GPUs to train deep learning models and mine cryptocurrency since 2010, that was more for academic research or blockchain. In 2017, PyGDF[2] and GPU databases[3] were the hottest GPU technologies in business analytics (or big data), and GOAI covered both of them. I saw GOAI as a sign that GPUs were arriving in business analytics. Honestly, I worried for a long time about us not being part of that initiative.
MapD was the pioneer of GPU databases and is probably still the technology leader. MapD inspired many teams. Our first attempt was also a GPU database, with distributed scalability (according to our initial plan). But the team soon ran into a fundamental database problem. It is well described in the first comment of the Hacker News thread[3]:
“Most traditional SQL databases are I/O-bound, meaning that the cost of pulling index pages from disk into memory overwhelms everything else, and adding more CPU doesn’t do much good.”
When vendors talk about GPGPU, they do not mean that GPUs can serve general-purpose requirements. They only mean that a GPU is good for more than graphics and can help other compute-intensive workloads as well. However, most big data scenarios are I/O-sensitive: they are either I/O-intensive or spend most of their time moving data between nodes across the network. If the GPU is an F1 car, the straights on which it can fully accelerate are too short.
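The F1 analogy is essentially Amdahl's law: the part of a query that the GPU cannot touch (disk and network I/O) caps the overall gain. Here is a minimal sketch of that arithmetic. The function name and the fractions below are illustrative assumptions of mine, not measurements from our system.

```python
def overall_speedup(accelerated_fraction, accelerator_speedup):
    """Amdahl's law: end-to-end speedup when only part of the work is accelerated.

    accelerated_fraction: share of the original runtime the GPU can speed up (0..1).
    accelerator_speedup:  how much faster that share runs on the GPU (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    # The untouched share keeps its original cost; the accelerated share shrinks.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / accelerator_speedup
    return 1.0 / remaining


# Hypothetical workloads: share of runtime that is pure computation.
for fraction in (0.2, 0.5, 0.9):
    gain = overall_speedup(fraction, 10.0)
    print(f"{fraction:.0%} compute-bound, 10x GPU -> {gain:.2f}x overall")
```

Under these assumed numbers, a query that spends only 20% of its time on computation gains about 1.22x from a GPU that is 10x faster at the compute part, and can never exceed 1.25x no matter how fast the GPU gets.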
I was not sure whether users had compelling reasons to adopt GPUs in a data system. The fanciest applications of GPU databases were spatial and temporal data analysis and geo-visualization powered by server-side GPU rendering. Remember, it was 2018; 5G, IoT, and cloud gaming were hot topics, so server-side GPU rendering seemed pretty cool. We built a prototype without much technical difficulty. Then the new problem was finding out who had spatial and temporal data and was eager to analyze it. We could find only a few such users, and all of them were either big companies or government-authorized companies in the GIS industry. We, a small startup, had no way to build a scalable business by serving them.
In late 2018, Nvidia announced the RAPIDS project[4]. RAPIDS was an upgrade to, and a replacement for, GOAI. I guess Nvidia wanted another win like CUDA[5].
I thought life would get even harder for GPU software companies after that.
Remove the GPU tag
During a demo, users asked us whether they could store arrays in the GPU database. The arrays were actually embeddings generated by machine learning models. After some research, our intuition told us that this new type of data deserved serious treatment.
So we changed our product to focus on embeddings. And we finally found a solid GPU use case: building approximate nearest neighbor (ANN) indexes for embeddings. While an index is being built, the embeddings are transferred into GPU memory, so there is much less I/O overhead, and the GPU can be 4x to 5x faster than the CPU. (That is still less than 10x.) Moreover, we designed a hybrid index, with which similarity search runs on both the CPU and the GPU.
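To give a feel for what an ANN index over embeddings does, here is a toy inverted-file (IVF) style index in pure Python: vectors are grouped around a few centroids at build time, and a query scans only the closest groups. This is a generic textbook technique that I picked for illustration, not our hybrid index or our GPU code, and every name in it (`build_ivf_index`, `search`, `n_lists`, `n_probe`) is hypothetical.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vec):
    """Position of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: squared_l2(centroids[i], vec))


def build_ivf_index(vectors, n_lists, n_iters=5, seed=0):
    """Cluster vectors with a few k-means rounds, then bucket ids by centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(n_iters):
        groups = [[] for _ in range(n_lists)]
        for v in vectors:
            groups[nearest(centroids, v)].append(v)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(idx)
    return centroids, lists


def search(index, vectors, query, k, n_probe):
    """Return ids of up to k nearest vectors, scanning only n_probe buckets."""
    centroids, lists = index
    order = sorted(range(len(centroids)), key=lambda i: squared_l2(centroids[i], query))
    candidates = [idx for c in order[:n_probe] for idx in lists[c]]
    return sorted(candidates, key=lambda idx: squared_l2(vectors[idx], query))[:k]


# Made-up data: 500 random 8-dimensional "embeddings".
data_rng = random.Random(42)
embeddings = [[data_rng.random() for _ in range(8)] for _ in range(500)]
ivf = build_ivf_index(embeddings, n_lists=10)
print(search(ivf, embeddings, embeddings[0], k=3, n_probe=2))
```

The build step is dominated by distance computations over data that is already in memory, which is the kind of work a GPU handles well. A smaller `n_probe` makes search faster but can miss true neighbors; setting `n_probe` equal to `n_lists` degrades to an exact scan.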
We received positive feedback on our work. But the most common user question was, “Could I run this project without a GPU?” We had created something people wanted to use, and it was not about the GPU. We were free of the GPU, which was a relief. At the end of 2019, we removed the GPU dependency from our project.
The end
After that, I still regularly checked on the GPU software ecosystem. I was curious whether anyone would find a path forward for GPU software companies. Some of my observations:
- Spark 3.x started to support GPUs (2020).
- People from the GPU software ecosystem founded a new startup, Voltron Data: “Joining Forces for an Arrow-Native Future” (2021). GPU was not highlighted.
- MapD has rebranded twice, and GPU is no longer mentioned.
- In the end, Nvidia failed to acquire Arm: “NVIDIA and SoftBank Group Announce Termination of NVIDIA’s Acquisition of Arm Limited” (2022).
- So far, there are 70+ RAPIDS repos on GitHub. I have the feeling that RAPIDS is a massive demo. You can build many things with it as long as you spend lots of effort.
References
- GOAI announcement: Data Science And Deep Learning Application Leaders Form GPU Open Analytics Initiative (2017)
- Getting Started with GPU Computing in Anaconda (2017)
- Hacker News thread: MapD Open Sources GPU-Powered Database (2017)
- NVIDIA Introduces RAPIDS Open-Source GPU-Acceleration Platform for Large-Scale Data Analytics and Machine Learning (2018)
- Wikipedia: CUDA
- Paper in Pattern Recognition: GPU implementation of neural networks (2004)
- Who introduced GPU to deep learning?
- The History and Future of Bitcoin Mining
- Wikipedia: OpenCL