How We Chose a New Database for Our Analytics Needs

As an analytics company, how you store data can make or break the product. Not only does it have a direct impact on product performance, but it also affects development efficiency and product philosophy.

Before you ask, of course we (it was just me and Buğra at the time) didn’t think this stuff through when we started HockeyStack. We just went ahead and chose the only database that we both knew how to use. No joke, it was one of the fastest decisions we’ve ever made. When building an MVP, development speed is more important than how robust and refined the product is. At that stage, you will only see scalability problems in your dreams. The fact that we had to start looking for a more scalable database kind of validated our product.

In this post, I will take you through the actual steps the team followed while choosing the new home of our analytics data. Spoiler alert: we chose ClickHouse, and we’ll be sharing more about our experience. This article relies heavily on the engineering team’s internal messages from this process, so you will actually see what was going through our minds at each step.

Current Solution: MongoDB

MongoDB is a document database, which means it stores data in JSON-like documents. This is a more intuitive way to think about data for beginners (I mean us, not you).

{
     "first_name": "Tom",
     "email": "[email protected]",
     "cell": "765-555-5555",
     "likes": [
        "fashion",
        "spas",
        "shopping"
     ],
     "businesses": [
        {
           "name": "Entertainment 1080",
           "status": "Bankrupt"
        },
        {
           "name": "Swag for Teens"
        }
     ]
  }

Folks at MongoDB list the advantages of this data model as follows:

  • An intuitive data model that is fast and easy for developers to work with.
  • A flexible schema that allows for the data model to evolve as application needs change.
  • The ability to horizontally scale out.

For the first 7-8 months of HockeyStack, we were fairly pleased with Mongo. Its aggregation framework might seem basic at first sight, but for most use cases you can replicate even the most complex SQL queries with aggregation pipelines. Plus, it was easy to get started with, and the docs are straightforward and filled with good examples. When we got stuck on certain operations, we saw that the team at MongoDB had updated the product and made aggregation pipelines even more powerful and versatile. But as with all good things, it didn’t last forever.

Performance

As we gained traction, high-end customers with millions of sessions started facing serious performance issues with database queries. Our initial reaction was to scale up our servers (and our server costs) and implement caches to stabilize performance. This worked at the beginning. In fact, after adding the main cache of the data tracking system, our CPU usage decreased by approximately 82 percent! We went to bed reassured that day (and only that day), but after a couple more big websites signed up, it was back to square one.

Connection

If you know a little bit about MongoDB, you can guess how we managed multi-tenant analytics data in it. Keep in mind that we had to make this decision at the very beginning of the journey, when the product wasn’t even called HockeyStack yet. The short answer is: we separated everything. And by everything, I mean everything. Every customer had a separate database, so anytime they queried something or had a new session, we had to call the useDb() method from Mongoose. This works fine if you have a small number of databases, but when you have more than 5000 websites, things quickly get out of hand.

After countless hours of optimizing aggregation pipelines, the engineering team started pulling their hair out and seriously considering a switch to a more performance-oriented solution for our storage needs.

A New Database for a New HockeyStack

Full disclosure, no one on the engineering team was an SQL expert when we first put forward the idea of a database migration. I myself didn’t even know the difference between WHERE and HAVING! But as always, the internet came to the rescue. Online resources not only helped us migrate the database queries but also gave us the confidence to explore the edges of this new SQL land. At this stage, a great trick I found was reading other people’s SQL queries to understand how everything fits together and which optimizations people chose over others. After a few rough months of rigorous SQL training, I can proudly say that I now have some idea about the difference between WHERE and HAVING.
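Since that distinction ended up mattering for almost every query we rewrote, here is a minimal sketch of it using a hypothetical sessions table (the names are made up for illustration): WHERE filters individual rows before grouping, while HAVING filters the aggregated groups afterwards.

    -- Hypothetical sessions table, just to show the difference.
    SELECT
        website_id,
        count(*) AS session_count
    FROM sessions
    WHERE created_at >= '2022-01-01'   -- WHERE runs before grouping, row by row
    GROUP BY website_id
    HAVING count(*) > 1000;            -- HAVING runs after aggregation, per group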

Back to the topic. Looking at the analytics market, we noticed the heavy use of relational databases as opposed to document-oriented ones. This led us to a handful of database solutions. Of course we looked at more options, but none of them survived a quick chat within the engineering team.

SingleStore

Previously known as MemSQL, SingleStore set out to become the “all-in-one” database for all your storage needs. It can store both row-oriented and column-oriented data. It first caught our attention on Hacker News, where a lot of people weighed in on database performance discussions, and we didn’t come across many bad reviews about its performance. The website itself also felt very refreshing, and the focus on speed was exactly what we were looking for. Its comparison pages further reassured us about making the switch.

The original Slack channel we opened for this migration was called “single-store-migration”. That’s how much we hoped we could use it. Since it had a cloud-based option, we didn’t really think about how it would work on our own server. Then we realized that SingleStore doesn’t actually work well with ARM architecture.

PostgreSQL

One of the most widely used databases. There is a tremendous amount of material on the internet about the caveats of Postgres, and with extensions, vanilla Postgres can be made a lot faster. But what really impressed us was the number of built-in functions as well as the support for UDFs. So naturally, we gravitated towards it. But oh boy, then we encountered the monstrosity that went by the name of pgAdmin 4…

Words cannot describe the boiling fury I’ve held back as I interacted with this patience-testing, sanity-wasting piece of shitware. I would’ve shown you the queries I wrote in it but wait, pgAdmin has decided to freeze on its own and therefore not save any of my work from the past hour. I could write a whole blog post about my mental strife with this crime against humanity, but that’s for another day. If you are curious, we then moved to DBeaver and then to DataGrip. And if anyone from DBeaver is reading this, just know that you guys are on thin fricking ice.

But all in all, for the first month of this migration, Postgres seemed like a great option. Its query optimizer felt like a relief: after the first time we ran a query, subsequent runs sped up substantially. The UDF support made the funnels page so fast that I couldn’t even believe it at first. But as with all good things, it didn’t last for— well, you get the idea. This is sort of the summary of our experience with databases.
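To give a sense of what those UDFs looked like (without reproducing our actual funnel logic), here is a minimal sketch of a Postgres user-defined function; the sessions table and its columns are hypothetical, purely for illustration.

    -- Hypothetical helper: count a website's sessions in a date range.
    CREATE OR REPLACE FUNCTION session_count(site integer, day_from date, day_to date)
    RETURNS bigint AS $$
        SELECT count(*)
        FROM sessions
        WHERE website_id = site
          AND created_at >= day_from
          AND created_at <  day_to;
    $$ LANGUAGE sql STABLE;

    -- Once created, it can be used like any built-in function:
    SELECT session_count(42, '2022-01-01', '2022-02-01');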

We were continuously comparing the performance of the queries with our old setup, and at the beginning, Postgres was actually performing much better than Mongo. But after finishing up the queries on the main dashboard page, I wanted to give it a test ride. The screenshots below are the exact ones that I sent to the engineering team afterwards:

[Screenshot: the result from MongoDB]
[Screenshot: the result from Postgres]

What the hell is going on here? The system that forced us to embark on a complete database migration is actually performing better than the new solution!? This was our reaction as well, omitting the swear words. Of course, it didn’t stay this way. Remember the query optimizer I mentioned? After a couple of runs, Postgres came down to around 3 seconds, but that was still enough to scare us for good. Whatever system we chose in the end, we had to be able to trust its performance fully. This time, back to square two.

TimescaleDB

After the horrifying results of vanilla Postgres, we also looked at Postgres extensions that supposedly made queries much faster. I don’t want to sling mud here; there are companies using extensions like TimescaleDB or Citus at scale. But when we tried it for ourselves, we didn’t really see that much of a difference. It was probably a mistake on our end, and we didn’t really investigate it afterwards, but the general feeling at this stage was that we had to find a shiny new toy that had no connection to Postgres. Plus, the blog posts and forum threads we were reading kept pointing us to an exciting and exotic database that had made a name for itself thanks to its speed.

Here Comes the ClickHouse

I don’t know how many times I tried setting ClickHouse up on my local machine. I lost count after I ran out of funny names for the new installations. But our disappointment in Postgres forced us to try it one more time. After setting it up through another app’s Docker container dependencies, it didn’t take me long to understand what was meant by ClickHouse-fast.

When we first started this migration journey, I was very skeptical of the idea of column-oriented data storage. During our research, we saw that many companies were using ClickHouse, but I also knew that I wouldn’t believe the hype until I saw it with my own eyes on my own computer screen. After a couple of hours of playing with the basics and trying to get data insertion to work, I wrote a couple of SQL queries to test it out on our own data. Let me tell you, it was like magic. Everything just worked, but so much faster. I don’t have a screenshot of those initial results, but I didn’t even need one. It was clear that we had found a match made in data heaven. AND it worked with ARM architecture, so suck it, SingleStore.

As we got deeper into the ClickHouse rabbit hole, we saw an already amazing piece of software with even greater potential. Some of the features that really made us “click” with it were the extensive support for URL and array functions, the ease of materialized views (more on that later), and the amazing webinars and articles by Altinity. But the biggest feature it lacked, in our opinion, was (and still is) decent support for UDFs.
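To show what we mean by URL and array functions, here is a hedged sketch against a hypothetical events table; the schema is invented for illustration, but domain(), path(), groupArray(), and arrayDistinct() are real ClickHouse functions.

    -- Top pages from the last 7 days, with the distinct referrers that led to them.
    SELECT
        domain(url)                         AS site,
        path(url)                           AS page,
        count()                             AS pageviews,
        arrayDistinct(groupArray(referrer)) AS referrers
    FROM events
    WHERE event_date >= today() - 7
    GROUP BY site, page
    ORDER BY pageviews DESC
    LIMIT 10;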

OK, But How Can I Make It Faster?

Yep, that’s what we asked ourselves too. But it wasn’t because the system kept crashing under the heavy load of a thousand queries, nor because it sometimes threw timeout errors when queries just couldn’t finish in time. No, we had already seen, and were quite sure, that ClickHouse was fast. But when a company is so focused on performance, trying to find ways to make it faster actually becomes a fun job. More articles about how we actually managed that will come, but I just wanted to mention materialized views for a second here. They are one of the best features of ClickHouse. Of course, other databases also have some version of them, but with ClickHouse they are much more prominent due to the ease of use. Creating a mat view and connecting it to other tables is a piece of cake, and the performance improvement is the cherry on top.
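As a taste of that ease of use, here is a hedged sketch (not our production setup) of a materialized view that keeps a daily pageview rollup of a hypothetical events table up to date on every insert.

    -- Target table that stores the pre-aggregated rows.
    CREATE TABLE daily_pageviews
    (
        event_date Date,
        website_id UInt32,
        pageviews  UInt64
    )
    ENGINE = SummingMergeTree
    ORDER BY (website_id, event_date);

    -- The materialized view acts like an insert trigger on the raw events table:
    -- every inserted batch is aggregated and pushed into daily_pageviews.
    CREATE MATERIALIZED VIEW daily_pageviews_mv TO daily_pageviews AS
    SELECT
        toDate(timestamp) AS event_date,
        website_id,
        count() AS pageviews
    FROM events
    GROUP BY event_date, website_id;

When the dashboard needs daily pageviews, it can read the small daily_pageviews table instead of scanning the raw events, which is where most of the speedup comes from.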

Final Words

A complete database migration is a hard task, with an emphasis on the “r”. In fact, it was probably the hardest task we’ve ever undertaken at HockeyStack. Sometimes it can feel like the data accuracy problems will never end, and sometimes a mistake in SQL formatting will make you question all your life choices. But at the end of the day, when you find the right database, everything will finally feel like it’s in its right place. Oh, by the way, congratulations if you think you got all the song references 😉