Summary

A practical SQL reference for data scientists covering the 90% of queries used day-to-day: SELECT/DISTINCT, CASE statements, GROUP BY/ORDER BY, WHERE/HAVING filtering, JOINs, aggregation functions (COUNT/SUM/MIN/MAX/AVG), CTEs, and window functions. Explicitly scoped to data scientist needs — not DBA-level SQL — and recommends BigQuery/PostgreSQL as starting flavors.

面向資料科學家的實用 SQL 參考,涵蓋日常 90% 查詢的核心語法:SELECT、CASE、GROUP BY/ORDER BY、WHERE/HAVING、JOIN、聚合函數、CTE 和視窗函數。明確定位為資料科學家使用級別,而非 DBA 級別的 SQL。

Key Points

  • Scope: data scientists need SQL for extraction + transformation before loading into Python/Pandas — not for database design or admin tasks
  • Flavors: Transact-SQL (MS SQL Server), GoogleSQL (BigQuery), and PostgreSQL are the most common; article recommends starting with T-SQL
  • Core operations covered: SELECT with DISTINCT, CASE for feature engineering, GROUP BY + ORDER BY, WHERE + HAVING for filtering, INNER/LEFT/RIGHT/FULL OUTER JOINs, COUNT/SUM/MIN/MAX/AVG aggregations
  • Intermediate next steps: CTEs (temporary named subqueries, heavy use in BigQuery) and window functions — both key for real data scientist workflows
  • Key insight on filtering: always use WHERE with date partitions to control query cost in BigQuery/cloud warehouses

Insights

The article’s scoping is its most practical feature: explicitly saying “you don’t need to be an expert, you need to be able to extract, manipulate, and analyse data” avoids the common tutorial trap of teaching 200 SQL commands when 20 cover 90% of real work.

The cost-awareness point about BigQuery’s column-based pricing is easy for SQL beginners to miss. Cloud data warehouses price by bytes scanned, so selecting only needed columns and always including date partition filters can reduce costs by orders of magnitude on large tables.

CTEs are underused by beginners but become the dominant pattern in production analytics SQL — they function like named subqueries that can be referenced multiple times in the same query, making complex transforms readable.

Connections

Raw Excerpt

Being able to extract, manipulate, and analyse data using SQL should be enough for the majority of data scientists’ tasks.