Author’s note: Hacker News comments here.
The Slack messages are coming in hard and fast. Why is the cluster slow? How is gross revenue smaller vs net revenue? When will the marketing columns get added? Why does the dashboard not match the enterprise accounts table? How long will the backfill take? You feel like quitting.
Getting technical people to solve problems faced by business people is not new. What’s new is how every part of the business now runs on data, and thus the growing need for data experts to keep things moving. The tricky part is when context goes ‘over the wall’ - the consumer finds an issue with the data but the producer is not familiar with the domain to see it’s a problem. Over time, information transfer between the two sides becomes the bottleneck and issues pile up.
In proposing a solution, Stitch Fix has the boldly-titled blog post Engineers Shouldn’t Write ETL. The idea here is the engineering team creates ‘Lego blocks’ that consumers then assemble into end-to-end data pipelines. This tends to work better when consumers are themselves technical; in any case a lot of upfront investment is needed. More and more startups get created every day with the promise to make this process effortless, but for now you’re the data engineer and you need to figure it out.
Maybe this isn’t what you signed up for. Maybe there was some bait-and-switch; you were hired to write Python but due to ‘critical initiatives’ you end up writing SQL (side note: SQL is Turing complete). Maybe the manager who hired you left and the new one feels the team should have a wider scope. Maybe the CTO left and sales/marketing/finance/ops now has the CEO’s ear.
My friends, I feel your pain. We can’t turn back time, but perhaps we can reflect on the more interesting things to work on moving forwards.
Tell your customers you’ll speed up their queries and they’ll love you for it. I’ve worked with the following relational databases (or if you’re pedantic, RDBMS) in production: MySQL, Hive, Vertica, Snowflake, Redshift, Postgres. The common thread? Complaints about slow queries.
Relational databases have been around for a while, and are among the most ‘battle-tested’ software around. If your query is slow, odds are it’s your query and not the database.
<aside> ⚡ How do we speed up our queries?
</aside>
Databases used in data warehouses tend to be columnar-store; in this section we’ll focus on Redshift. The simplest way to make your queries run faster is to select only the columns you need. Let’s say you have a table where each row represents an event.
CREATE TABLE events (
event_id VARCHAR
event_ts TIMESTAMP
user_id VARCHAR
browser_id VARCHAR
);