Select Queries

Introduction

SELECT: fundamental SQL statement retrieving data from tables. Declarative: specify what data needed, not how. Query optimizer: determines execution strategy. Core operation: all database queries build on SELECT.

Syntax simple: SELECT columns FROM table WHERE conditions ORDER BY columns. Semantically rich: filters, aggregations, joins, subqueries enable complex analysis. Foundation: essential for all database work.

Performance critical: an unoptimized SELECT can consume excess I/O and CPU and slow other workloads. Index usage, plan choice, cost estimation: main determinants of speed. Understanding SELECT execution: key to database performance.

"SELECT is the query foundation: retrieve, transform, aggregate data. Declarative simplicity hides complex optimization. Mastery: understanding execution plans, indexes, and performance tuning." -- SQL essentials

Basic SELECT Statement

Syntax

SELECT column1, column2, ...
FROM table_name
WHERE conditions
ORDER BY columns
LIMIT/OFFSET count

Execution Order

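Logical order (differs from the written order): FROM (which table), WHERE (filter rows), GROUP BY (form groups), HAVING (filter groups), SELECT (choose columns), DISTINCT (remove duplicates), ORDER BY (sort), LIMIT/OFFSET (restrict result count). Understanding order: predicts query behavior, e.g. why WHERE cannot reference aggregates while HAVING can.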

Example Query

SELECT employee_id, name, salary
FROM employees
WHERE department='IT' AND salary>50000
ORDER BY salary DESC
LIMIT 10;

Result: top 10 highest-paid IT employees
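The example query can be tried end to end with a small sketch. It assumes SQLite through Python's standard sqlite3 module; the table contents are hypothetical sample data, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(employee_id INTEGER, name TEXT, department TEXT, salary INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "IT", 72000),
        (2, "Ben", "IT", 48000),         # fails salary > 50000
        (3, "Chloe", "Finance", 90000),  # fails department = 'IT'
        (4, "Dev", "IT", 65000),
    ],
)

# Logical order: FROM -> WHERE -> SELECT -> ORDER BY -> LIMIT.
top_it = conn.execute(
    """
    SELECT employee_id, name, salary
    FROM employees
    WHERE department = 'IT' AND salary > 50000
    ORDER BY salary DESC
    LIMIT 10
    """
).fetchall()
print(top_it)
```

Only rows that pass WHERE are sorted, and LIMIT is applied last, so fewer than 10 rows come back when fewer qualify.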

NULL Handling

NULL: unknown/missing value. Comparisons with NULL: yield NULL (unknown), neither true nor false. Special predicates: IS NULL, IS NOT NULL. Aggregates: ignore NULLs (except COUNT(*), which counts rows). Sorting: NULLs first or last (DBMS-dependent).

Case Sensitivity

Keywords: case-insensitive (SELECT, select, Select equivalent). Column names: usually case-insensitive (configuration-dependent). String values: case-sensitive (depends on collation).

Projection (Column Selection)

Column Selection

Specify columns: SELECT col1, col2, col3. All columns: SELECT *. Projection: fewer columns may reduce I/O and memory. Preference: name only the columns needed.

Column Aliases

SELECT
 employee_id AS emp_id,
 name AS employee_name,
 salary * 1.1 AS projected_salary
FROM employees;

Expressions and Functions

Select columns, computed values, function results. Arithmetic: +, -, *, /. String functions: UPPER, LOWER, SUBSTRING. Date functions: DATEADD, DATEDIFF (SQL Server names; other systems differ). Type casting: CAST (standard), CONVERT (vendor-specific).

Aggregate Functions

COUNT(), SUM(), AVG(), MIN(), MAX() reduce multiple rows to single value. Covered deeper in aggregation section. Basic: SELECT COUNT(*) FROM table.

CAST and Type Conversion

SELECT
 CAST(salary AS DECIMAL(10, 2)),
 CAST(hire_date AS VARCHAR(20))
FROM employees;

Filtering with WHERE

Basic Conditions

Equality: column=value. Comparison: >, <, >=, <=, <>. Logical: AND, OR, NOT. BETWEEN: column BETWEEN a AND b. IN: column IN (v1, v2, v3). LIKE: pattern matching.

WHERE Clause Evaluation

Evaluated per row. True: row included. False or NULL (unknown): row excluded. Evaluation order: not guaranteed; the optimizer may reorder predicates, so do not rely on AND/OR short-circuiting to guard against errors (e.g. division by zero).

Example Conditions

WHERE salary > 50000
 AND (department='IT' OR department='Finance')
 AND hire_date >= '2020-01-01'
 AND status IS NOT NULL;

LIKE Pattern Matching

% wildcard: zero or more characters. _ wildcard: exactly one character. Example: LIKE 'A%' (starts with A). Case sensitivity: DBMS- and collation-dependent. Performance: a prefix pattern ('A%') can use an index; a leading wildcard ('%A') usually forces a scan.
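A short sketch of the two wildcards, assuming SQLite via Python's sqlite3 module. The names and the names_like helper are hypothetical. SQLite's default LIKE is case-insensitive for ASCII letters, which illustrates the DBMS-dependence noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("Adam",), ("Ann",), ("Bea",), ("Al",)],  # hypothetical names
)

def names_like(pattern):
    """Return names matching a LIKE pattern, sorted (hypothetical helper)."""
    rows = conn.execute(
        "SELECT name FROM employees WHERE name LIKE ? ORDER BY name",
        (pattern,),
    )
    return [r[0] for r in rows]

starts_with_a = names_like("A%")   # 'A' followed by anything
three_letters = names_like("___")  # exactly three characters
second_is_d = names_like("_d%")    # any first character, then 'd'
lowercase_a = names_like("a%")     # same as 'A%' under SQLite's defaults
print(starts_with_a, three_letters, second_is_d, lowercase_a)
```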

IN vs. OR

IN (v1, v2, v3): cleaner for multiple values. OR: equivalent but verbose. Optimization: most optimizers treat IN like the equivalent OR chain; index usage possible for both. Preference: use IN for multiple equality checks.

NULL Conditions

IS NULL: test for NULL. IS NOT NULL: test for non-NULL. Important: NULL = NULL and NULL <> NULL both evaluate to NULL (unknown), never true, so WHERE col = NULL matches no rows. Comparison operators (=, <>) return NULL with NULL operands.
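The difference between = NULL and IS NULL is easy to check in a small sketch. It assumes SQLite via Python's sqlite3 module, where SQL NULL appears as Python None; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "active"), ("Ben", None), ("Chloe", "inactive")],
)

def names(where):
    """Return names for a fixed WHERE fragment (hypothetical helper)."""
    sql = "SELECT name FROM employees WHERE " + where + " ORDER BY name"
    return [r[0] for r in conn.execute(sql)]

# A comparison with NULL yields NULL, which sqlite3 returns as None.
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

eq_null = names("status = NULL")          # never true: no rows
is_null = names("status IS NULL")         # the row with a missing status
not_active = names("status <> 'active'")  # Ben is excluded: NULL <> 'active' is unknown
print(null_eq_null, eq_null, is_null, not_active)
```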

Sorting with ORDER BY

Single Column Sorting

ORDER BY salary ASC; -- Ascending (default)
ORDER BY salary DESC; -- Descending

Multi-Column Sorting

ORDER BY department ASC, salary DESC;

Result: rows grouped by department, within each department sorted by salary (highest first)

Sorting by Expression

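ORDER BY YEAR(hire_date) DESC (YEAR is vendor-specific; use EXTRACT or the local equivalent elsewhere). Order by computed values or functions. Not restricted to selected columns, except with SELECT DISTINCT.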

NULL Ordering

NULLs first or last (DBMS-dependent). SQL Server, MySQL, SQLite: NULLs first in ASC. PostgreSQL, Oracle: NULLs last in ASC. Explicit: NULLS FIRST / NULLS LAST (PostgreSQL, Oracle, SQLite 3.30+).
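Where NULLS FIRST / NULLS LAST is unavailable, a CASE expression in ORDER BY is one common workaround. The sketch below assumes SQLite via Python's sqlite3 module, which sorts NULLs first in ascending order by default; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 50000), ("Ben", None), ("Chloe", 70000)],
)

# SQLite default: NULL sorts before every other value in ASC order.
default_asc = [
    r[0] for r in conn.execute("SELECT name FROM employees ORDER BY salary ASC")
]

# Workaround: sort on a 0/1 "is NULL" flag first, then on the value.
nulls_last = [
    r[0]
    for r in conn.execute(
        """
        SELECT name FROM employees
        ORDER BY CASE WHEN salary IS NULL THEN 1 ELSE 0 END, salary ASC
        """
    )
]
print(default_asc, nulls_last)
```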

Performance Impact

Sorting: expensive (O(n log n) typical). Without index on sort column: requires sorting entire result. With index: may use index order (fast). Query optimizer: chooses strategy.

LIMIT and OFFSET

SELECT ... ORDER BY ... LIMIT 10; -- First 10 rows
SELECT ... ORDER BY ... LIMIT 10 OFFSET 20; -- Rows 21-30
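Page-based access follows from the two forms above: the offset is (page - 1) * page_size. A minimal sketch, assuming SQLite via Python's sqlite3 module; fetch_page and the 25 sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(i, f"emp{i}") for i in range(1, 26)],  # 25 rows: ids 1..25
)

def fetch_page(page, page_size=10):
    """Return the employee_ids on a 1-based page (hypothetical helper)."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT employee_id FROM employees "
        "ORDER BY employee_id LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [r[0] for r in rows]

page_1 = fetch_page(1)  # ids 1..10
page_3 = fetch_page(3)  # ids 21..25 (last, partial page)
page_4 = fetch_page(4)  # past the end: empty
print(page_1, page_3, page_4)
```

The ORDER BY matters: without a deterministic sort, the rows that land on each page are not guaranteed to be stable between calls.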

DISTINCT and Deduplication

Remove Duplicates

SELECT DISTINCT department FROM employees;

Result: unique departments (each once, duplicates removed)

Multi-Column DISTINCT

SELECT DISTINCT department, salary FROM employees;

Result: unique (department, salary) combinations
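Single-column and multi-column DISTINCT can be compared on the same data. A small sketch, assuming SQLite via Python's sqlite3 module and hypothetical rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("IT", 50000),
        ("IT", 50000),       # exact duplicate of the row above
        ("IT", 60000),
        ("Finance", 50000),
        ("Finance", 50000),  # exact duplicate
    ],
)

# One column: each department appears once.
departments = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT department FROM employees ORDER BY department"
    )
]

# Two columns: uniqueness applies to the (department, salary) pair.
pairs = conn.execute(
    "SELECT DISTINCT department, salary FROM employees "
    "ORDER BY department, salary"
).fetchall()
print(departments, pairs)
```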

Implementation

Hash-based: build hash table of seen values, output unique. Sort-based: sort, consecutive duplicates removed. Cost: O(n) hash or O(n log n) sort. Expensive: avoid if possible.

DISTINCT with ORDER BY

ORDER BY applied after DISTINCT. Can only sort by selected columns. Common: SELECT DISTINCT department FROM employees ORDER BY department.

Performance Consideration

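DISTINCT: must examine every candidate row before the result is complete, so it generally cannot terminate early. Cost: O(n) with hashing, O(n log n) with sorting. Avoid: when rows are already unique (e.g. the primary key is selected).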

COUNT(DISTINCT)

SELECT COUNT(DISTINCT department) FROM employees;

Result: number of unique departments (e.g. 5, even if the table holds 100 employee rows)

Aggregation Functions

Basic Aggregates

COUNT(*): number of rows. SUM(column): sum of values. AVG(column): average. MIN/MAX: minimum/maximum value. Reduce many rows to single value.

Example Aggregation

SELECT
 COUNT(*) AS total_employees,
 AVG(salary) AS avg_salary,
 SUM(salary) AS total_payroll,
 MIN(salary) AS min_salary,
 MAX(salary) AS max_salary
FROM employees;

NULL Handling in Aggregates

Aggregates ignore NULLs: COUNT(column) excludes NULLs (use COUNT(*) for row count). SUM(salary): sums non-NULL values. AVG: SUM / COUNT over non-NULL values only. Empty input or all NULLs: SUM, AVG, MIN, MAX return NULL; COUNT returns 0.

COUNT Variations

COUNT(*): all rows (including NULLs). COUNT(column): non-NULL values. COUNT(DISTINCT column): unique non-NULL values. Difference: important for correct results.
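The three COUNT forms give different answers as soon as a column contains NULLs or duplicates. A minimal sketch, assuming SQLite via Python's sqlite3 module and hypothetical salaries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [(100,), (None,), (300,), (100,)],  # one NULL, one duplicate value
)

count_all, count_col, count_distinct, total, average = conn.execute(
    """
    SELECT COUNT(*), COUNT(salary), COUNT(DISTINCT salary),
           SUM(salary), AVG(salary)
    FROM employees
    """
).fetchone()

# No qualifying rows: COUNT gives 0, SUM gives NULL (None in Python).
empty_count, empty_sum = conn.execute(
    "SELECT COUNT(*), SUM(salary) FROM employees WHERE salary > 1000"
).fetchone()
print(count_all, count_col, count_distinct, total, average)
```

AVG divides by the number of non-NULL values (3 here), not by the row count (4).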

FILTER Clause

SELECT
 COUNT(*) AS all_employees,
 COUNT(*) FILTER (WHERE salary > 50000) AS well_paid
FROM employees;
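FILTER is standard SQL but not supported everywhere. On systems without it, a CASE expression inside the aggregate is a common equivalent. The sketch below assumes SQLite via Python's sqlite3 module and hypothetical rows, and uses only the CASE form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 40000), ("Ben", 60000), ("Chloe", 80000), ("Dev", None)],
)

# CASE yields 1 for matching rows and NULL otherwise; COUNT skips the NULLs.
all_employees, well_paid = conn.execute(
    """
    SELECT COUNT(*) AS all_employees,
           COUNT(CASE WHEN salary > 50000 THEN 1 END) AS well_paid
    FROM employees
    """
).fetchone()
print(all_employees, well_paid)
```

The row with a NULL salary counts toward all_employees but not well_paid, because NULL > 50000 is unknown.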

Window Functions

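Advanced: aggregates computed over a window of related rows without collapsing them (each input row stays in the result). Syntax: SUM(...) OVER (PARTITION BY ... ORDER BY ...). Uses: running totals, moving averages. Separate section (covered in advanced SQL).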
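As a brief preview, the sketch below computes a running total and a per-department average while keeping every row. It assumes SQLite 3.25 or newer (the first release with window functions) via Python's sqlite3 module; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "IT", 100), (2, "IT", 200), (3, "Finance", 300)],
)

windowed = conn.execute(
    """
    SELECT employee_id,
           salary,
           SUM(salary) OVER (ORDER BY employee_id)    AS running_total,
           AVG(salary) OVER (PARTITION BY department) AS dept_avg
    FROM employees
    ORDER BY employee_id
    """
).fetchall()
for row in windowed:
    print(row)
```

Unlike GROUP BY, the result still has three rows; each one carries the aggregate computed over its window.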

Grouping with GROUP BY

Basic Grouping

SELECT department, COUNT(*) AS num_employees
FROM employees
GROUP BY department;

How GROUP BY Works

Partition rows: by grouped column(s). Within each group: apply aggregates. Result: one row per group (each group reduced by aggregate). Powerful: summarization.

Multiple Columns

SELECT department, salary_level, COUNT(*) AS count
FROM employees
GROUP BY department, salary_level;

Restrictions with GROUP BY

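SELECT list: grouped columns or aggregates only. Non-grouped columns without an aggregate: ambiguous (many values per group), rejected by standard SQL unless functionally dependent on the grouped columns. MySQL: enforced when ONLY_FULL_GROUP_BY is enabled (default since 5.7.5).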

NULL in GROUP BY

GROUP BY department: NULLs grouped together. One group for all NULLs. Handled consistently. Useful: identify missing department assignments.
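A quick check that NULLs form a single group, assuming SQLite via Python's sqlite3 module and hypothetical rows (SQL NULL appears as Python None):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Ana", "IT"),
        ("Ben", "IT"),
        ("Chloe", None),  # no department assigned
        ("Dev", None),    # no department assigned
        ("Eve", "Finance"),
    ],
)

# Map each department (including None) to its row count.
counts = dict(
    conn.execute(
        "SELECT department, COUNT(*) FROM employees GROUP BY department"
    ).fetchall()
)
print(counts)
```

The None key holds the count of employees with no department, even though NULL = NULL is not true in a WHERE clause.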

GROUP BY Expressions

SELECT YEAR(hire_date) AS hire_year, COUNT(*) AS count
FROM employees
GROUP BY YEAR(hire_date);

Filtering Groups with HAVING

WHERE vs. HAVING

WHERE: filters rows before grouping. HAVING: filters groups after aggregation. Different purposes: row-level vs. group-level filtering.

Example

SELECT department, AVG(salary) AS avg_sal
FROM employees
WHERE hire_date >= '2020-01-01' -- Filter rows
GROUP BY department
HAVING AVG(salary) > 60000; -- Filter groups

HAVING Clause Logic

After GROUP BY: each group created. Aggregates computed per group. HAVING evaluated per group: condition true, group kept; false or NULL, group dropped. Final: result set with matching groups.
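The interaction of WHERE and HAVING can be traced on a handful of rows. The sketch assumes SQLite via Python's sqlite3 module, hypothetical data, and hire dates stored as ISO-8601 text so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees "
    "(name TEXT, department TEXT, salary INTEGER, hire_date TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Ana", "IT", 70000, "2021-03-01"),
        ("Ben", "IT", 80000, "2019-05-01"),   # removed by WHERE
        ("Chloe", "IT", 60000, "2022-07-01"),
        ("Dev", "Finance", 50000, "2021-01-15"),
        ("Eve", "Finance", 55000, "2020-06-01"),
    ],
)

# WHERE drops Ben before grouping, so IT averages 65000, not 70000.
recent_high_avg = conn.execute(
    """
    SELECT department, AVG(salary) AS avg_sal
    FROM employees
    WHERE hire_date >= '2020-01-01'
    GROUP BY department
    HAVING AVG(salary) > 60000
    """
).fetchall()

# Without WHERE, the same HAVING sees a different IT average.
all_high_avg = conn.execute(
    """
    SELECT department, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 60000
    """
).fetchall()
print(recent_high_avg, all_high_avg)
```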

Complex HAVING Conditions

HAVING COUNT(*) > 5 AND AVG(salary) > 50000

Performance Implication

WHERE reduces rows before grouping (less work for aggregation). HAVING: runs after every group is built and aggregated. Use WHERE for any condition on plain columns. HAVING: only for aggregate-based conditions.

Aliases in HAVING

Some systems: allow aliases (SELECT AVG(salary) AS avg_sal... HAVING avg_sal > 50000). Standard SQL: requires full expression. Portability: use expression.

Joins (Basic Overview)

Basic Join Syntax

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

Join Types

INNER JOIN: matching rows only. LEFT JOIN: left table all rows + matching right. RIGHT JOIN: right table all rows + matching left. FULL OUTER: all rows from both. CROSS JOIN: Cartesian product.
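The join types differ only in how unmatched rows are treated. A minimal sketch, assuming SQLite via Python's sqlite3 module and hypothetical rows. RIGHT and FULL OUTER JOIN are left out because older SQLite versions lack them; a RIGHT JOIN is shown as a LEFT JOIN with the tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [(1, "Engineering"), (2, "Sales"), (3, "Legal")],  # Legal has no employees
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 1), ("Ben", 2), ("Chloe", None)],  # Chloe has no department
)

inner = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "INNER JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

left = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "LEFT JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

# Same rows as employees RIGHT JOIN departments, written as a LEFT JOIN.
dept_side = conn.execute(
    "SELECT d.dept_name, e.name FROM departments d "
    "LEFT JOIN employees e ON e.dept_id = d.dept_id ORDER BY d.dept_id"
).fetchall()

cross_count = conn.execute(
    "SELECT COUNT(*) FROM employees CROSS JOIN departments"
).fetchone()[0]
print(inner, left, dept_side, cross_count)
```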

Join Condition

ON clause: specifies join predicate. Example: ON employees.dept_id = departments.dept_id. Can be complex: ON a.id=b.id AND a.type=b.type.

Table Aliases

Shorten names: FROM employees AS e, or FROM employees e. Required when joining same table twice. Improves readability.

Join Performance

Join strategy: nested loop, hash join, sort-merge. Optimizer: chooses based on table size, indexes. Indexes: critical for join performance. Detailed in query optimization section.

Multiple Joins

SELECT e.name, d.dept_name, m.name AS manager
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
LEFT JOIN employees m ON e.manager_id = m.employee_id;

Subqueries

Scalar Subquery

SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

IN Subquery

SELECT name
FROM employees
WHERE dept_id IN (SELECT dept_id FROM departments WHERE location='NYC');

EXISTS Subquery

SELECT d.dept_name
FROM departments d
WHERE EXISTS (SELECT 1 FROM employees e WHERE e.dept_id = d.dept_id);

Subquery in FROM (Derived Table)

SELECT department, avg_sal FROM (
 SELECT department, AVG(salary) AS avg_sal
 FROM employees
 GROUP BY department
) AS dept_stats
WHERE avg_sal > 50000;

Correlated Subqueries

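Subquery references outer query. Conceptually evaluated once per outer row: can be expensive, though optimizers may decorrelate it into a join. Example: SELECT name FROM employees e1 WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e1.dept_id).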
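A correlated subquery can often be rewritten as a join against a derived table that computes the aggregate once per group. The sketch below assumes SQLite via Python's sqlite3 module and hypothetical rows, and checks that both forms return the same employees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", 1, 100), ("Ben", 1, 200), ("Chloe", 2, 300), ("Dev", 2, 500)],
)

# Correlated form: the inner query refers to e1.dept_id from the outer row.
correlated = [
    r[0]
    for r in conn.execute(
        """
        SELECT e1.name
        FROM employees e1
        WHERE e1.salary > (SELECT AVG(e2.salary)
                           FROM employees e2
                           WHERE e2.dept_id = e1.dept_id)
        ORDER BY e1.name
        """
    )
]

# Join form: department averages are computed once, then joined back.
joined = [
    r[0]
    for r in conn.execute(
        """
        SELECT e.name
        FROM employees e
        JOIN (SELECT dept_id, AVG(salary) AS avg_sal
              FROM employees
              GROUP BY dept_id) d ON d.dept_id = e.dept_id
        WHERE e.salary > d.avg_sal
        ORDER BY e.name
        """
    )
]
print(correlated, joined)
```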

Performance Consideration

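Subqueries: sometimes less efficient than joins, depending on the optimizer. Optimizer: may convert to join (then cost is equivalent). Correlated: can be very expensive if executed per row. Guideline: check the plan; rewrite as a join when the subquery is the bottleneck.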

Query Execution and Optimization

Query Optimizer

Parser: checks syntax, builds abstract syntax tree. Optimizer: generates candidate plans (multiple valid execution strategies), estimates the cost of each from table statistics, chooses lowest. Plan: execution blueprint (operators, order).

Execution Plan Steps

Parse: syntax check. Compile: query plan generation. Optimize: choose best execution. Execute: run plan. Fetch: return results. Cost: parsing/optimization amortized for repeated queries when plans are cached (prepared statements, plan cache).

Index Usage

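Indexed column in WHERE: index scan or seek possible (fast for selective conditions). Non-indexed: table scan (slow on large tables). Multiple indexes: optimizer chooses by estimated cost. Low selectivity: optimizer may prefer a table scan even when an index exists. No usable index: full table scan inevitable.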

Join Order

Multiple tables: join order matters. Hash join: build the hash table on the smaller input, probe with the larger. Indexed join column: enables index nested-loop join. Optimizer: picks order by estimated cost (not always intuitive).

EXPLAIN Plans

EXPLAIN SELECT ... FROM ...;

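Output: operation sequence, estimated rows, cost, index usage. Syntax and output format vary: EXPLAIN (PostgreSQL, MySQL), EXPLAIN QUERY PLAN (SQLite), EXPLAIN PLAN FOR (Oracle).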
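The effect of an index on the chosen plan can be observed directly. The sketch below assumes SQLite via Python's sqlite3 module, whose variant of the command is EXPLAIN QUERY PLAN; the index name idx_emp_dept and the plan_details helper are hypothetical, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")

def plan_details(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT name FROM employees WHERE department = 'IT'"

# No index yet: the plan is expected to describe a scan of the table.
plan_before = plan_details(query)

conn.execute("CREATE INDEX idx_emp_dept ON employees (department)")

# With the index: the plan is expected to name idx_emp_dept.
plan_after = plan_details(query)

print(plan_before)
print(plan_after)
```

Comparing the plan before and after a change is the core loop of query tuning: the query text stays the same while the access path changes.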

Query Tuning

Examine plan: identify bottlenecks. Add indexes: for WHERE/JOIN columns. Rewrite queries: expose optimizer opportunities. Iterative: measure improvements.
