tutorial · 2026-06-13 · duckbert · 12 min read

Shaping the Roster: SQL's Core Clauses, the DuckDB Way

Who scored highest? Which team came out ahead? How many games did each duck play? We've got a database full of ducks — let's learn the SQL to ask.

Pondville's weekend duck league finished its season — four ducks, two teams, six game days, twenty-four rows. You could read all of them top to bottom, but reading isn't answering, and most questions a coach asks don't want every row anyway. Show me the top three scorers. Show me only Mallards. Show me how many games each duck played.

A handful of clauses shape almost every analytical answer: SELECT picks columns and ORDER BY sorts rows on the shaping side, WHERE and HAVING on the filter side, GROUP BY with aggregate functions (and FILTER) on the collapse side, plus LIMIT and CASE WHEN. These are the universal core; every SQL dialect has them in some form. On top of them DuckDB layers a few ergonomic shortcuts that do the same work with far fewer keystrokes. Every widget below mirrors a real query against the bootstrap table, and the SQL blocks are live — edit one, hit Run, and the result reshapes in place.

bootstrap.sql

CREATE OR REPLACE TABLE players(name, team, score, played_on, weather) AS
SELECT * FROM (VALUES
  ('Eider',    'Mallards', 92, DATE '2026-04-01', 'sunny'),
  ('Duckbert', 'Mallards', 88, DATE '2026-04-01', 'sunny'),
  ('Pippin',   'Pintails', 79, DATE '2026-04-01', 'sunny'),
  ('Mira',     'Pintails', 95, DATE '2026-04-01', 'sunny'),
  ('Eider',    'Mallards', 88, DATE '2026-04-08', 'sunny'),
  ('Duckbert', 'Mallards', 94, DATE '2026-04-08', 'sunny'),
  ('Pippin',   'Pintails', 90, DATE '2026-04-08', 'sunny'),
  ('Mira',     'Pintails', 88, DATE '2026-04-08', 'sunny'),
  ('Eider',    'Mallards', 90, DATE '2026-04-15', 'rain'),
  ('Duckbert', 'Mallards', 91, DATE '2026-04-15', 'rain'),
  ('Pippin',   'Pintails', 85, DATE '2026-04-15', 'rain'),
  ('Mira',     'Pintails', 99, DATE '2026-04-15', 'rain'),
  ('Eider',    'Mallards', 86, DATE '2026-04-22', 'windy'),
  ('Duckbert', 'Mallards', 89, DATE '2026-04-22', 'windy'),
  ('Pippin',   'Pintails', 77, DATE '2026-04-22', 'windy'),
  ('Mira',     'Pintails', 91, DATE '2026-04-22', 'windy'),
  ('Eider',    'Mallards', 91, DATE '2026-04-29', 'sunny'),
  ('Duckbert', 'Mallards', 87, DATE '2026-04-29', 'sunny'),
  ('Pippin',   'Pintails', 82, DATE '2026-04-29', 'sunny'),
  ('Mira',     'Pintails', 78, DATE '2026-04-29', 'sunny'),
  ('Eider',    'Mallards', 84, DATE '2026-05-06', 'rain'),
  ('Duckbert', 'Mallards', 95, DATE '2026-05-06', 'rain'),
  ('Pippin',   'Pintails', 88, DATE '2026-05-06', 'rain'),
  ('Mira',     'Pintails', 97, DATE '2026-05-06', 'rain')
) t(name, team, score, played_on, weather);

SELECT — pick your columns

Pondville's table has five columns, but say we only care about three — each duck's name, its score, and when it played. SELECT is how we ask for exactly those: name the columns we want and the engine returns only those, nothing else (LIMIT 6 just trims the preview — LIMIT and row ordering each get their own section soon):

select-named.sql

SELECT name, score, played_on
FROM players
LIMIT 6;

Naming three columns is fine. Want every column? SELECT * hands back the whole row. But naming forty-nine out of fifty in some warehouse table is annoying, and that's exactly the case SELECT * EXCLUDE solves: start from *, then list what you don't want.

select-exclude.sql

SELECT * EXCLUDE (played_on)
FROM players
LIMIT 6;

Click columns in the widget below to flip them on/off — the SQL rewrites itself between * and * EXCLUDE.

ORDER BY — rows in the order you asked for

We've got the columns we want, but the rows arrive in no particular order — whatever order the engine happened to find them, which a query rewrite, a parallel scan, or a new DuckDB version can reshuffle next run. Say we want them ranked, highest score first. That's ORDER BY: name the keys to sort on and it turns that random pile into a leaderboard. If order matters at all, never leave it to chance; name the keys.

order-by.sql

SELECT name, team, score, played_on
FROM players
ORDER BY score DESC, name
LIMIT 8;

Mira's 99 tops it, then her 97 — but watch the 95s: Duckbert and Mira each scored one, and name breaks the tie, so Duckbert sorts above her. That second key earns its place. And DESC flips only the key it's attached to; the rest stay ASC.

ORDER BY ALL is the DuckDB shortcut: sort by every column in your SELECT list, left to right — handy when you want stable, eyeball-able output without naming keys. It stands alone (you can't tack on extra keys) and takes one direction for the whole list, so ORDER BY ALL DESC flips everything. Here that's team, then name, then score, so Duckbert's six games stack together at the top with Eider's right under them:

order-by-all.sql

SELECT team, name, score
FROM players
ORDER BY ALL
LIMIT 8;

The widget below starts preloaded with the score DESC, name keys from the first query in this section — click chips to recompose, and the SQL and result follow.

LIMIT — the top-N pattern

The leaderboard's sorted — but we rarely want all 24 rows, just the top three. LIMIT is how we say "stop after the first N," and bolted onto ORDER BY it becomes the top-N pattern behind every leaderboard and "best of" list. Top three scores in the league:

top-three.sql

SELECT name, team, score, played_on
FROM players
ORDER BY score DESC, name
LIMIT 3;

Mira takes two of the three slots — her 99 and her 97 — and Duckbert is the only other duck on the podium, with his 95. That's the same 95 tie from last section: the name tiebreaker is all that kept Mira from sweeping the podium. LIMIT N keeps the first N rows after the order is applied. Drop the ORDER BY and you get whichever three rows the engine happens to return, which is almost never what you want.

Slide N below and the cut line slides down the leaderboard. The struck-through rows are still in the table — they just didn't survive the LIMIT.

WHERE — keep only the rows you need

Say we don't want every duck in the result — only some of them. Which Mallards scored 90 or more? WHERE is how we ask for exactly those rows: keep only the ones whose values pass a test. And it cuts first — before ORDER BY, before GROUP BY, before LIMIT — so every other clause sees only the survivors.

where-basic.sql

SELECT name, score, played_on
FROM players
WHERE team = 'Mallards' AND score >= 90
ORDER BY score DESC, name;

Six rows survive — Duckbert's and Eider's strongest games, nothing else. Every other Mallard game, every single Pintail row, gone before the sort even ran.

The predicates you'll use most:

Comparison: =, <> (or !=), <, <=, >, >=
Logical: AND, OR, NOT
Set membership: team IN ('Mallards', 'Pintails')
Range: score BETWEEN 80 AND 89 (inclusive on both ends)
Pattern: name LIKE 'D%' (% = any string, _ = any single char)
Null test: score IS NULL / score IS NOT NULL — never score = NULL, which yields NULL (not true) and so matches nothing

where-compound.sql

SELECT name, score, played_on, weather
FROM players
WHERE weather IN ('rain', 'windy')
  AND score BETWEEN 85 AND 95
  AND name LIKE 'M%'
ORDER BY score DESC;

Three predicates stacked, and the whole season narrows to a single row: Mira's 91 on the windy April 22nd — the only M-name whose rain-or-wind game landed inside the 85–95 band. Each AND is one more sieve.

Toggle teams, weather, and the score predicate below to see the matching-row count and result table update live. With no chips selected the widget falls back to WHERE false — an empty set matches nothing, and you'll see zero rows.

Aggregate functions — many rows into one number

Every query so far has handed back rows — fewer columns, sorted, filtered, but always a list of games. What if the answer we're after isn't a list at all? What's the average score across the whole league? How many games were played all season? What was the highest anyone hit? None of those is about a single duck — each one boils a whole column down to one number. That's an aggregate function:

aggregate-whole-table.sql

SELECT COUNT(*) AS games,
       ROUND(AVG(score), 1) AS avg_score,
       MIN(score) AS lowest,
       MAX(score) AS highest
FROM players;

One row back, not 24. COUNT(*) tallies the rows, AVG averages the score column, MIN and MAX grab the extremes — the season ran to 24 games, averaging 88.5, from a low of 77 to a high of 99. Notice there's no GROUP BY anywhere: with no grouping, the whole table is the group, so each aggregate folds all 24 rows into a single summary value. One distinction to bank early: COUNT(*) counts every row, while COUNT(score) counts only rows where score isn't NULL — identical here, since the data has no gaps, but they diverge the moment a column does. COUNT, SUM, AVG, MIN, and MAX are the everyday five — DuckDB ships plenty more, but these answer most questions.

GROUP BY — collapse rows into summaries

That single number blended both teams together. Split it per team — the Mallards' average on one row, the Pintails' on another — and that's what GROUP BY does: the same aggregates as above, but computed once for each group instead of once for the whole table. It's the moment SQL starts answering the questions a coach actually asks — how is each team doing? how many games did each duck play?

Mechanically, GROUP BY buckets the rows by the key you name — one pile per team — then runs the aggregates once over each pile:

group-by-team.sql

SELECT team,
       COUNT(*) AS games,
       ROUND(AVG(score), 1) AS avg_score,
       MIN(score) AS lowest,
       MAX(score) AS highest
FROM players
GROUP BY team
ORDER BY team;

Twenty-four rows in, two out — one per team (the AS bits just name the output columns). The Mallards average 89.6 and run from 84 to 95; the Pintails sit a touch lower at 87.4 but swing wider, 77 all the way to 99. Three views share the same group-key + aggregate picker below.

FOLD colors the input grid by group and shows where each row lands. Toggle keys and the count multiplies: team alone gives two groups, add weather for 2 × 3 = 6 groups, push all four keys (team, name, weather, played_on) and you're back to 24 groups of one — the table itself.

ZOOM drills into a single group. Pick Mallards + SUM and you can read the formula off: twelve scores summed to 1075. The same twelve yield COUNT = 12, AVG ≈ 89.6, MIN = 84, MAX = 95 — five questions, one input set.

RANK flips the camera. Each group becomes a bar sorted by the chosen aggregate (the chart shows the top 12), and it shrinks or grows as you add or remove group keys.

COUNT(DISTINCT name) is the unique-values variant — per team it returns 2, the two ducks, where COUNT(*) returns their 12 games. Reach for it on "how many distinct ducks, or weather types?" questions.

Every column in the SELECT list that's not inside an aggregate function must appear in GROUP BY — otherwise the engine can't decide which value to show for a collapsed group. DuckDB's GROUP BY ALL automates that bookkeeping: it groups by every non-aggregated SELECT column for you.

group-by-all.sql

SELECT team, weather, COUNT(*) AS games, ROUND(AVG(score), 1) AS avg
FROM players
GROUP BY ALL
ORDER BY team, weather;

Six rows out — the 2 × 3 grid FOLD predicted.

FILTER — one aggregate, its own WHERE

Suppose we want one of those per-team numbers to count only part of the group. How many of each team's games were strong ones — 90 or more — while still seeing their total games and sunny-day average right alongside? FILTER (WHERE …) gives a single aggregate its own private WHERE, so one GROUP BY can answer several such questions in the same row:

filter-where.sql

SELECT team,
       COUNT(*) AS games,
       COUNT(*) FILTER (WHERE score >= 90) AS strong_games,
       ROUND(AVG(score) FILTER (WHERE weather = 'sunny'), 1) AS sunny_avg
FROM players
GROUP BY team
ORDER BY team;

Read it off per team: the Mallards played twelve, six of them strong, and averaged 90 on sunny days; the Pintails, also twelve, went five strong with a sunny average of 85.3 — three differently-filtered numbers, one pass over the table. Each predicate stays bolted to its own aggregate; the others don't even notice.

HAVING vs WHERE — the two filter stages

This time we want to filter on the result of the grouping, not the raw rows — keep only the groups whose summary passes a test. WHERE can't help here: it runs before the groups, and their averages, even exist. HAVING is the filter that runs after grouping — it throws out whole groups, where WHERE throws out input rows. The two feel similar but run at different stages of the pipeline, and the same test gives different answers depending on which one you use.

Question: which ducks averaged above 90 across the season?

having-avg.sql

SELECT name, COUNT(*) AS games, ROUND(AVG(score), 1) AS avg
FROM players
GROUP BY name
HAVING AVG(score) > 90
ORDER BY avg DESC;

Two ducks clear the bar: Mira at 91.3, Duckbert at 90.7. Eider (88.5) and Pippin (83.5) fall just short and drop out.

HAVING AVG(score) > 90 doesn't make sense in a WHERE clause — WHERE runs before GROUP BY and doesn't see aggregates. The convention: row-level predicates in WHERE, aggregate predicates in HAVING. Apply the same threshold both ways below and watch the answers diverge.

CASE WHEN — bucketing scores into tiers

One transform left: making raw scores easier to read at a glance. Is an 88 a strong game or a middling one? CASE WHEN sorts each value into a labelled bucket — strong, mid, weak — so we can count games by tier instead of squinting at numbers. It's SQL's if/elif/else: one answer per row, decided by the first branch that matches, and paired with GROUP BY it turns "raw scores" into "tier counts" in one query:

case-buckets.sql

SELECT
  CASE
    WHEN score >= 90 THEN 'strong'
    WHEN score >= 80 THEN 'mid'
    ELSE 'weak'
  END AS tier,
  COUNT(*) AS games
FROM players
GROUP BY tier
ORDER BY games DESC;

The season sorts into eleven strong games, ten mid, and three weak — Pippin's 77 and 79 and Mira's 78 are the only games that dip under 80. Branches evaluate top to bottom; the first match wins. ELSE is the catch-all — drop it and unmatched rows return NULL for that column.

One more convenience hiding in that query — a Postgres-ism DuckDB shares: GROUP BY tier groups by the SELECT alias, where standard SQL would make you repeat the whole CASE expression inside GROUP BY. Push the tier thresholds around below and watch the bucket counts redistribute.

The pipeline, end to end

The clause order behind all of this isn't trivia — it's the whole post in one line: FROM → WHERE → GROUP BY → HAVING → SELECT → DISTINCT → ORDER BY → LIMIT. The one stage we didn't drill is DISTINCT, which drops duplicate rows after the projection — SELECT DISTINCT team FROM players returns each team just once. Each clause hands a transformed row set to the next, and once you internalize the order most of the "why can't I reference X in Y?" questions answer themselves. The widget below shows that handoff for four canonical queries — click a scenario, watch the rows flow stage by stage.

What if grouping collapses too much?

GROUP BY team collapses Pondville's 24 rows into 2 — clean answer if you want team totals, lossy if you also want to see Eider's individual 92 in the same result. WHERE filters before grouping. HAVING filters after. Neither one lets you keep the raw rows and the team aggregate side by side in the same query.

That's exactly the gap window functions fill — same aggregates, but every input row stays put and gets its own view of the surrounding rows. Looking Through the WINDOW walks through OVER, PARTITION BY, frames, ranking, and row-to-row peeks.

Until then, drop your own table into Pondville's schema — a different sport, a different league, your sales numbers — and run the same clauses. Pick, filter, sort, slice, group, bucket: that's the toolkit, and you've now met all of it.