Banner & Argos, explained simply

Reading a Banner Table Name — The Seven-Letter Code

You see `SPRIDEN`, `SFRSTCR`, `NBBPOSN`, `GOBEACC` every day. They look like random seven-letter strings. They are not. Each one is a road map — and once you can read it, you can guess what domain any Banner table belongs to without opening a data dictionary.

banner · table-naming · convention

STV* and GTV* — Banner's Code Dictionaries

Banner stores codes — not names, not descriptions, not the words a human reads. `SGBSTDN_MAJR_CODE_1 = 'BIO'`, `SFRSTCR_RSTS_CODE = 'RE'`, `STVTERM_CODE = '202610'`. The code is compact, efficient, and completely opaque. The translation is in a second set of tables — the STV* and GTV* dictionaries — and if you don't know they exist, you're reading a foreign language without the dictionary.

banner · validation-tables · stvterm

Schemas — Which Drawer the Table Lives In

You type `SELECT * FROM gobeacc` in your SQL editor. Oracle returns `ORA-00942: table or view does not exist`. The table definitely exists — you saw it in BSS. The problem is not whether it exists. The problem is which drawer it lives in.

banner · oracle · schema

Effective Dating — Why Banner Never Forgets

A student changes majors. Banner does not cross out the old one and write the new one on top. It lays a new row on top of the old one and dates it. If your query does not specify which layer you want, Banner hands you all of them — and your report is silently wrong.

banner · effective-dating · sgbstdn

Argos, X-Rayed — The DataBlock, the Report, the Parameters

Everyone calls it 'a report.' But what you see on screen — the columns, the headers, the dropdowns at the top — is only one of three components layered behind the glass. X-ray the thing, and you see a structure that nobody taught you explicitly: the DataBlock, the Report, and the Parameters. Three subsystems, one device, each invisible to the end user.

argos · datablock · report

TERM Codes — The Academic Timestamp Banner Uses Everywhere

You see `'202610'` in every WHERE clause you write. You have used `MAX(sgbstdn_term_code_eff)` a hundred times. But nobody ever told you why the format was chosen, why it sorts correctly without casting, or what `STVTERM` actually holds. The term code is not a magic number. It is ISO 8601 adapted to academic time — and the format IS the feature.

banner · term-code · stvterm

Track B · The canonical joins

Joining by PIDM — SPRIDEN and the Universal Key

Every report that displays a person's name uses the same three-line SQL incantation. It looks like boilerplate. It is not. Each condition earns its place — and if you move any of them to the wrong clause, you change what the word LEFT means.

banner · pidm · spriden

TERM_CODE + CRN — The Registration Compound Key

You write `JOIN ssbsect ON ssbsect_crn = sfrstcr_crn`. The query runs. It returns rows — five times more than expected. The CRN looked global. It is not. CRN is unique only WITHIN a term, and you just joined across every term that ever reused it.

banner · term-code · crn

The MAX() Subquery — Getting the Row That's Current

You will write this pattern a hundred times in your Banner career. Four lines of SQL that look like noise the first time you see them, and like the only thing holding the report together every time after. It is the most important SQL idiom in the entire Banner codebase, and once you can read it in your sleep, every effective-dated table in the ERP opens up.

banner · sql-pattern · effective-dating

The Double SPRIDEN — Naming Two People in One Query

You need a student's name and their advisor's name on the same row. Both live in SPRIDEN. You join SPRIDEN once and try to get both — and Oracle returns the same name twice. The fix is not a different table. The fix is a second alias.

banner · spriden · double-join

The Security Audit Join — GURACLS Done Right

An auditor asks: 'Show me everyone who has the STUDENT_RECORDS access class.' The answer lives in a single table — GURACLS. But GURACLS doesn't know anyone's name. It only knows user IDs. To answer the auditor's question, you need a three-table chain, and if you miss the active-account filter, the report includes people who left in 2018.

banner · guracls · gobeacc

Catalog vs Section — SCBCRSE and SSBSECT

SCBCRSE has a column called `eff_term`. SSBSECT has a `term_code`. They look related — so people join them. And when they do, three catalog versions of the same course silently multiply the result by three, and a 2020 transcript retroactively shows the 2024 course title. The join needs a bound, not just an equality.

banner · scbcrse · ssbsect

Track C · From generic SQL to Banner

Banner Runs on Oracle — The Dialect You Will Meet

SQL is a standard. Oracle's version of it has its own vocabulary — small differences scattered through every query, none hard, none avoidable. You can't read Banner SQL for ten minutes without meeting `SYSDATE`, `NVL`, `DUAL`, `||`, `ROWNUM`, and `DECODE`. Learn them once, and the dialect becomes the language.

oracle · banner · sql-dialect

From SQL Server to Oracle — Translating Your Instincts

You know how to write SQL. You've written hundreds of queries on SQL Server. Then you open a Banner DataBlock and see `SYSDATE`, `NVL`, `ROWNUM`, `DUAL`, `||` — and every instinct you have about what to type is a half-second wrong. The skill carries. The syntax doesn't. Here is the translation.

oracle · sql-server · dialect-translation

From Oracle to PostgreSQL — the Banner SaaS Migration

Ellucian's cloud Banner targets PostgreSQL, not Oracle. Every Argos DataBlock you write today in Oracle SQL will eventually run against a PostgreSQL database. Some of the SQL translates mechanically. Some doesn't. And one difference — `'' = NULL` — will silently change what rows your query returns without raising an error.

oracle · postgresql · dialect-translation

From (+) to ANSI — Retiring Oracle's Old Outer Join

You open an older Banner SR report and see `WHERE a.x = b.x(+)`. It looks like a typo. It is not. It is Oracle's pre-ANSI outer join syntax — the stick-shift of the SQL world. It still runs, but PostgreSQL won't accept it, and the modern world has moved on. Here is the translation.

oracle · ansi-join · legacy

Track D · The craft of Argos

Argos Parameters — `:main_`, `:lcl_`, `:dbn_`

Every Argos report is a building full of rooms, and every parameter is a microphone. The question is never 'does this parameter exist?' It is always 'can this room hear it?' The three prefixes — `:main_`, `:lcl_`, `:dbn_` — are the three answers to that question.

argos · parameters · scope

How Argos Assembles Your Query — Filters on the WHERE

You type `:main_DD_term_code` in your DataBlock SQL, the user picks 'Fall 2026' from a dropdown, and Oracle runs the query. What happens between the click and the execution is not parameter binding — it is string substitution, like a mail merge. The distinction explains every performance surprise, every silent breakage, and every 'it worked yesterday' your Argos users have ever reported.

argos · parameters · substitution

Seven Patterns Every Argos Report Needs

You have written the same WHERE clause a hundred times. Required filter, optional filter, multi-checkbox, date range, partial-text search, toggle, cascading dropdown. You debug the NULL edge case and the empty-selection syntax error from scratch every time. You don't need to. There are exactly seven patterns. Learn them once, copy them forever.

argos · parameters · where-clause

Shared DataBlocks — One SQL, Many Reports

You have two reports that need the same underlying data — a summary and a detail view, both backed by the same financial aid transactions. You could write two DataBlocks. Two SQL bodies. Two sets of filters. Two copies of business logic that will drift apart the first time someone updates one and forgets the other. Or you could write one DataBlock with a discriminator column and let the consumer reports filter their slices. That is the shared-DataBlock pattern, and it is how Waubonsee's FAID1084 and FAID1006 work.

argos · datablock · union-all

Track E · Where intuition fails

The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN

A report told to list every student lists only some — and the LEFT JOIN that was supposed to keep them is spelled out, correct, and innocent.

joins · left-join · where-clause

SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap

You join to SPRIDEN, run the query, and scan the output. The names look right. The row count is wrong. You have just shipped a report with phantom duplicates — and the error is invisible because every column looks correct except the number at the bottom of the page.

banner · spriden · change-ind

PHRHIST Without DISP — In-Progress vs Posted Payroll

You sum `PHRHIST_GROSS` for the fiscal year and the number looks right. It matches what you remember from the last payroll run. It is wrong. You have included rows from the payroll that is still being calculated — rows that look identical to posted rows in every column except one. The bank calls them 'pending.' Banner calls the column `PHRHIST_DISP`.

banner · phrhist · disposition

LISTAGG Overflow — The List That Silently Truncates

You run a security report listing every role per user. The output looks fine — every user has a role list, every list looks plausible. But the user with 80 roles has only 47 in your output. The rest were truncated. No error fired. No warning appeared. You have shipped a report with missing data, and the only way to discover it is to count the commas by hand.

banner · listagg · oracle

Soft Deletes — The Rows That Aren't Really Gone

You withdraw a student in Banner. The row in SGBSTDN does not disappear — it gets a status code. You drop a registration. SFRSTCR keeps the row with a drop flag. You delete a security role. The audit log keeps an entry with AUDIT_ACTION = 'D'. Banner does not hard-delete. The rows stay in the table forever. Every report that does not filter them out is silently counting ghosts.

banner · soft-delete · sfrstcr

The Effective-Date Trap — Joining to Yesterday's Row

You run a report: 'Fall 2022 enrollment by current major.' The row count is right. The CRNs match. Every student has exactly one major. What nobody told you is that the major is from today — not from Fall 2022. You used the unbounded MAX-effective subquery from B3, and it silently tagged every historical registration with present-tense labels. The report is a history book whose author walked into the archive and swapped all the old placards for new ones.

banner · effective-dating · sgbstdn

The `> 0` Trap — The Filter That Drops Reversals

You add `AND phrhist_gross > 0` to your payroll report. The intent is defensive: exclude zero rows, count only real amounts. The effect is the opposite of defensive. You have silently dropped every payroll reversal — every void, every adjustment, every back-out. Your 'total gross earnings' now includes money that was keyed by mistake and reversed the next day. The filter that was supposed to protect the report broke it.

banner · phrhist · tbraccd

Track F · From Banner to a warehouse

What Waubonsee Actually Reports Today — and Where the Warehouse Should Land First

Before you draw your first star, look at what the campus already prints every week. The Argos folder will tell you which warehouse to build first — and the answer is not the one you expected.

argos · evidence · warehouse-strategy

Why a Warehouse? — OLTP, OLAP, and the Cost of Asking Banner the Wrong Question

Banner registers a student in milliseconds — that is its job. Ask it how enrollment shifted over the last five years, and the same engine will contend for the very rows the registrar is touching right now. One database cannot be optimal for both tasks.

warehouse · oltp · olap

Facts, Dimensions, Measures — The Multidimensional View

Every report you have ever written follows the same hidden grammar: a number, sliced by context. You have been thinking in facts and dimensions your whole career. You just never called them that.

warehouse · kimball · facts

The Star Schema — One Fact, Many Dimensions, and the Grain

A star schema is not a diagramming convention. It is a mechanical guarantee: every dimension is exactly one JOIN away from the fact. No exceptions, no shortcuts, no climbing branches.

warehouse · kimball · star-schema

Slowly Changing Dimensions — Keeping History When Attributes Change

A dimension says what something *is*. But things change. If you overwrite the old value, you rewrite history. If you keep every version, you need a way to tell them apart. The three choices are the difference between a warehouse you trust and one you quietly stop using.

warehouse · kimball · scd

ETL from Banner — Moving Data on a Schedule, with Windmill

A warehouse that is not fed fresh data every night is not a warehouse. It is a museum. The difference between the two is a scheduled, repeatable, monitored ETL pipeline — and that pipeline is the only part of the system Banner users ever actually feel.

warehouse · etl · windmill

The Semantic Layer — Where Argos, Power BI, and Dashboards Sit

The warehouse is not the product. The warehouse is the kitchen. The product is the menu — the single curated view of the data that every report writer, every dashboard, every Argos DataBlock consumes. That menu is called the semantic layer, and if you skip it, every consumer rebuilds it from scratch in their own head.

warehouse · semantic-layer · power-bi

The Three Fact-Table Patterns — Transaction, Periodic, Accumulating

A fact table holds measurements. But not all measurements behave the same way. The first design decision when you model a new star is not which columns to include. It is which of three canonical patterns the fact table follows — and picking wrong means building a star that cannot answer the questions the business needs to ask.

warehouse · kimball · fact-patterns

Factless Fact Tables — Events and Coverage

Some of the most valuable questions a warehouse can answer have no numbers in them: 'Which students registered for this course?' 'Which classrooms sat empty this term?' 'Which admitted applicants never enrolled?' A fact table with no measures sounds like a contradiction. It is not. It is the cleanest answer to the 'what happened' and 'what did not happen' questions that dollars-and-hours fact tables cannot touch.

warehouse · kimball · factless

Track G · Building the Waubonsee warehouse (playbook)

Track H · DataBlock architecture & engineering decisions

One DataBlock Per Report, or One for Many? The Decision Framework

An Argos shop with 500 reports and 500 DataBlocks has a maintenance problem. An Argos shop with 500 reports and 80 DataBlocks has a complexity problem. Neither answer is wrong. But the choice between them — one DataBlock per report, or one DataBlock serving many — is the most consequential architectural decision a Banner reporting team makes after choosing Argos itself. Here is the framework for making it deliberately.

argos · datablock · architecture

Finding Consolidation Candidates — Programmatic Similarity Across the Catalog

Waubonsee's Argos catalog has ~670 DataBlocks. Some of them are near-duplicates of each other — same SQL shape, same tables, same fields, one filter different. Finding them by hand means eyeballing 670 × 669 / 2 ≈ 224,000 pairs. A similarity tool can scan the whole catalog in seconds and surface the top candidates ranked by what matters. Here is how it works, what it found, and what to do with the list.

argos · datablock · consolidation

Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone

The decision to consolidate has been made. The candidates have been identified. Now comes the part where things actually break — rewiring consuming reports to a new DataBlock without the numbers drifting, without a user opening a report to wrong totals, without an emergency rollback nobody has practiced. The safe pattern is not one big swap. It is five sequential phases, every one reversible, and a rule that the old DataBlocks stay alive until the new one has earned every consumer's trust.

argos · datablock · consolidation

When 1:1 Wins — The Case for One DataBlock Per Report

H1 framed the debate neutrally. H2 surfaced the consolidation candidates. H3 wrote the careful migration recipe. This article steps back from the neutrality and makes the contrarian case: in most Argos catalogs, **one DataBlock per report is the right default**. Not because consolidation is wrong — it is sometimes right — but because the costs of consolidation are systematically underestimated, and the benefits of 1:1 are systematically undersold. Here is the defense.

argos · datablock · architecture

Running Argos Similarity v2.3 — the operational guide

H2 explains the architecture. This article tells you how to actually use the tool — what to run, what to read first, what to ignore, and what the tool quietly cannot see. Follow the recommended workflow (orphans first, low clusters next, low pairs after that, most high-cost pairs never) and the tool's output becomes a backlog you can act on in a single sprint instead of a thousand-row spreadsheet that nobody opens.

argos · datablock · consolidation

Track I · Beyond direct SQL — Ethos & the integration layer

What Ethos actually is — one stack, three products, one spec, two brand names

Ellucian renamed Ethos to 'Ellucian Platform' in 2026 — but the airport still lands the same planes through the same gates.

ethos · integration · eedm

EEDM REST mechanics — passport, boarding pass, version-pinned gate

Your passport never goes through the gate. You exchange it once at security for a boarding pass that expires in five minutes — and re-exchange whenever it does.

ethos · eedm · rest

GUIDs vs PIDM — the impedance Banner SQL writers feel first

When an Ethos response lands on your desk, you can't join it on PIDM. You need GORGUID first.

ethos · guid · pidm

When Ethos, when SQL — the decision frame for the next 3-5 years

Direct SQL is your private courier — fast, knows your roads, never crosses borders. Ethos is international cargo — slow, paperwork-heavy, but reaches anywhere. In Banner SaaS the back door has no key, so you ship by cargo.

ethos · sql · argos

Transcript import end-to-end — customs at the EEDM port

An incoming transcript is cargo at customs. Every course needs a tariff classification — check the schedule first, file a new one if needed, then issue the import permit. Ethos exposes both desks.

ethos · transcript · transfer-credit

Track A · What is it, really

PIDM — The Number Behind Every Person

7 min read · banner · pidm · spriden · joins · foundation

The hook

Every person in Banner has two names: the one you see, and the one the database uses. The one you see is printed on pay stubs, class rosters, vendor checks — a readable string like "Smith, John A." or a visible 8-digit Banner ID like 00123456. The one the database uses is a number you were never meant to know about. It is an internal surrogate: NUMBER(8), system-generated, invisible to end users, and utterly non-negotiable. It is called PIDM — Personal ID Master — and it is the single most important column in every SQL query you will ever write against Banner.

The everyday analogy

When you sign up for a library card, the librarian types your name and address into the system and the computer assigns you patron number #54287. That number is meaningless to you. You carry a physical card with your name on it. The librarian greets you by name. The receipt prints your name. But internally, every book you have ever checked out, every fine you have ever paid, every inter-library loan you have ever requested, is recorded against patron #54287 — not against "Jane Smith."

A library patron card on a wooden counter; behind it, a card-catalog drawer of borrowing history — every slip keyed on patron number #54287, not the name. The name on the card is what the world sees; the number is what the system uses.

Then you get married and change your name. The card gets re-issued with "Jane Cortez" on it. The address on file changes. Maybe the visible 16-digit barcode on the card even gets replaced when the library switches to a new card design. None of those changes touch your borrowing history. Patron #54287 is still patron #54287. The librarian re-types your current name into the system, the receipts now say "Cortez," but the loan records — every book you checked out before the name change and every book you will check out after — still join on the same patron number. Silently, invisibly, perfectly.

That patron number is PIDM. Your last name is SPRIDEN_LAST_NAME. Your visible library card barcode is SPRIDEN_ID. The librarian's re-typing your new name is an insert into SPRIDEN — the old name row gets flagged as historical (SPRIDEN_CHANGE_IND = 'N' for name change), the new name row is the current one (SPRIDEN_CHANGE_IND IS NULL). Banner uses PIDM because names change, visible IDs can be corrected or re-issued, and Social Security Numbers are things you actively try to avoid joining on. The invariant — the one thing that must never shift under you — has to be a number nobody else knows about.

What it really is

PIDM is Banner's internal surrogate key for every person and non-person entity in the system. It is NUMBER(8), generated once by Banner when an entity is first inserted, and it never changes and is never reused for the life of that entity. A student, an employee, a vendor, an applicant — every entity gets exactly one PIDM, and that PIDM stays with it forever.

SPRIDEN is the translation table. It sits between the raw PIDM and the human-readable world. It stores one row per name or ID version per entity, so a person who has changed names (marriage, divorce, legal name change) appears in SPRIDEN multiple times. All but the current row carry a SPRIDEN_CHANGE_IND value: 'N' for a name change, 'I' for an identification change (a typo correction in the visible ID, a merge). The current row, the one that represents the person's name and ID right now, is the row where SPRIDEN_CHANGE_IND IS NULL.

SPRIDEN_ENTITY_IND tells you what kind of entity the PIDM represents: 'P' for a person, 'C' for a company or corporation. PIDMs referenced by vendor records (FTVVEND) are often 'C' entities, because a vendor may be a business rather than an individual. Filter on this column every time you join to SPRIDEN for people, or you risk mixing companies into your student roster.
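
To see the version history for one entity, a sketch like the following lists every SPRIDEN row for a PIDM and labels each one using only the indicator values described above. The bind variable :pidm is a placeholder for illustration, not a Banner or Argos name, and any SPRIDEN_CHANGE_IND value other than NULL, 'N', or 'I' is simply reported as 'other'.

```sql
-- List every SPRIDEN version for one entity and label each row.
-- :pidm is a hypothetical bind variable; supply the PIDM you are inspecting.
SELECT s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name,
       s.spriden_entity_ind,
       CASE
         WHEN s.spriden_change_ind IS NULL THEN 'current'
         WHEN s.spriden_change_ind = 'N'   THEN 'previous name'
         WHEN s.spriden_change_ind = 'I'   THEN 'previous ID'
         ELSE 'other'
       END AS row_status
FROM   spriden s
WHERE  s.spriden_pidm = :pidm
-- Sort the current row first, historical rows after it.
ORDER  BY CASE WHEN s.spriden_change_ind IS NULL THEN 0 ELSE 1 END;
```

If the query returns more than one row labeled 'current' for the same PIDM, that is worth investigating before trusting any report that joins to SPRIDEN.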

SPRIDEN at the center — the one table that holds the current name and visible ID. Rings of Banner tables (SGBSTDN, SFRSTCR, PEBEMPL, NBRJOBS, FTVVEND) all reach back to the same PIDM, regardless of the role the person is playing.

The same PIDM is shared across roles. PIDM 38201 might be a student in SGBSTDN (the student base table), an employee in PEBEMPL (the HR employee table), AND a vendor in FTVVEND — all simultaneously. The student-worker who also sells handmade crafts to the bookstore is one person, one PIDM, three roles. Every table that records anything about a person joins back to SPRIDEN on PIDM — not on name, not on the visible Banner ID, not on SSN. The visible ID (SPRIDEN_ID) is for humans: registrars type it, Argos prompts ask for it, reports display it. PIDM is for joins: fast, stable, anonymous.
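
As a minimal sketch of "ID for humans, PIDM for joins", the query below takes a visible Banner ID typed by a user, resolves it through SPRIDEN, and reaches the registration rows through PIDM. The bind variable :banner_id is hypothetical; the columns are the ones used elsewhere in this article.

```sql
-- The user supplies the visible ID; the join itself runs on PIDM.
-- :banner_id is a hypothetical bind variable (for example '00123456').
SELECT r.sfrstcr_term_code,
       r.sfrstcr_crn,
       r.sfrstcr_credit_hr
FROM   spriden s
JOIN   sfrstcr r
       ON r.sfrstcr_pidm = s.spriden_pidm     -- join on the surrogate key
WHERE  s.spriden_id         = :banner_id      -- filter on the human-facing ID
  AND  s.spriden_change_ind IS NULL           -- current SPRIDEN row only
  AND  s.spriden_entity_ind = 'P';            -- people, not companies
```

Because of the change-indicator filter, only the person's current ID matches here; an ID that was later corrected or re-issued returns no rows.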

See it — the diagram

The ring diagram makes the relationship visible. SPRIDEN sits at the center — the one table that knows a PIDM's current name, current visible ID, and current entity type. Radiating outward are the role-specific tables: student records, HR records, finance records. Every arrow points inward, toward the same PIDM. The diagram says what a thousand words of documentation would say: to ask a question about a person in Banner, you start at the role table, join to SPRIDEN on PIDM, and filter by change indicator and entity type. The pattern repeats identically across the entire ERP.

Show me the code

The canonical PIDM resolution — from a raw PIDM to the current name — is three lines of SQL and two filters you must never omit:

-- Get the CURRENT name and ID for every person (one row per PIDM).
-- To resolve a single person, add a predicate on s.spriden_pidm.
-- The two WHERE filters are not optional: without change_ind you get
-- duplicates (one row per historical name or ID version); without entity_ind
-- you may catch a company that shares the PIDM number space.
SELECT s.spriden_pidm,
       s.spriden_id,
       s.spriden_last_name || ', ' || s.spriden_first_name AS full_name
FROM   spriden s
WHERE  s.spriden_change_ind IS NULL
  AND  s.spriden_entity_ind = 'P';

Now use it in a real query — a course roster for a specific term. The student registration table (SFRSTCR) carries the PIDM; SPRIDEN provides the name:

-- Registrations for term 202610 (Spring 2026 in this example), one row per
-- student per CRN. Add a predicate on r.sfrstcr_crn for a single section's roster.
-- The fact table (sfrstcr) joins to spriden on PIDM, never on ID.
SELECT s.spriden_id           AS student_id,
       s.spriden_last_name    AS last_name,
       s.spriden_first_name   AS first_name,
       r.sfrstcr_crn          AS course_ref_number,
       r.sfrstcr_credit_hr    AS credit_hours
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = '202610';

The join is on PIDM — not on SPRIDEN_ID, not on SPRIDEN_LAST_NAME, not on any column a human types. The two filter conditions are inside the JOIN, not in a WHERE — they are part of the join contract, not an afterthought. When a query needs to resolve a second person — an advisor, a supervisor, a reporting manager — you join SPRIDEN a second time with a different alias. That pattern is covered in The Double SPRIDEN — Naming Two People in One Query.

Where intuition fails

Five lessons that every Banner SQL writer learns the hard way:

**Never join people by name.** There are multiple "Smith, John" records in any college database. Names have typos. Accents come and go across systems. Suffixes ("Jr.", "III") are inconsistent. A name is a label, not a key. PIDM is the key. Join on it every time.

**Never join people by SSN.** SSNs are regulated PII: every copy of an SSN in your query, your logs, your result set is an audit liability. Many Banner records have missing or placeholder SSNs (999-XX-XXXX patterns for international students, for example). PIDM exists specifically so you never have to touch SSN in a join. Use it.

**SPRIDEN_CHANGE_IND IS NULL is mandatory.** Without this filter, anyone who has ever changed names (marriage, divorce, legal correction) returns duplicate rows in your result set: one row for every historical name version. A roster of 25 students silently becomes 31 rows, and the duplicates look identical except for the name column. The Banner Semantic Search SQL Explainer flags this missing filter as a warning. SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap covers the full duplicate-name gotcha, with examples.

**A single PIDM wears multiple hats.** A student who works part-time on campus and also sells handmade goods to the bookstore is one person, one PIDM, and rows in SGBSTDN (student), PEBEMPL (employee), and FTVVEND (vendor) simultaneously. If you join all three tables naively on PIDM without filtering by role, you produce a Cartesian mess: every student row multiplied by every employee row multiplied by every vendor row. Join one role table at a time, or use EXISTS to check for role membership without multiplying rows.

**PIDM is not for display.** Users see the 8-digit Banner ID (SPRIDEN_ID). Your reports must translate PIDM back to SPRIDEN_ID before showing anything to a user. Never print a raw PIDM in a report, a dashboard, or an export. PIDM is an internal surrogate: it is not a student ID, not an employee ID, not a vendor number. Leaking it to the UI is a privacy concern, and it confuses users who expect to see the visible Banner ID they recognize. The join uses PIDM; the SELECT shows SPRIDEN_ID.
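
The EXISTS approach from the fourth lesson can be sketched as follows: list each person who is both a student and an employee, once, no matter how many rows they have in either role table. The column names sgbstdn_pidm and pebempl_pidm are assumed from Banner's table-prefix convention and are not shown elsewhere in this article; confirm them in your data dictionary before relying on this.

```sql
-- People who hold both a student role and an employee role.
-- EXISTS only tests for membership, so the result stays at one row per
-- person even when a role table holds several rows for the same PIDM.
-- Assumed column names: sgbstdn_pidm, pebempl_pidm.
SELECT s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name
FROM   spriden s
WHERE  s.spriden_change_ind IS NULL
  AND  s.spriden_entity_ind = 'P'
  AND  EXISTS (SELECT 1
               FROM   sgbstdn g
               WHERE  g.sgbstdn_pidm = s.spriden_pidm)
  AND  EXISTS (SELECT 1
               FROM   pebempl e
               WHERE  e.pebempl_pidm = s.spriden_pidm);
```

Note that the SELECT list shows SPRIDEN_ID rather than the PIDM, in line with the fifth lesson.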

The one-sentence takeaway

PIDM is Banner's internal person number. It never changes, it never repeats, and it is the only thing you should ever join people on.

Track A · What is it, really

Reading a Banner Table Name — The Seven-Letter Code

6 min read · banner · table-naming · convention · schema · prefix · ellucian

The hook

You see SPRIDEN, SFRSTCR, NBBPOSN, GOBEACC every day. They look like random seven-letter strings. They are not. Each one is a road map, and once you can read it, you can guess what domain any Banner table belongs to without opening a data dictionary.

The everyday analogy

Type a U.S. ZIP code into a search box — 60134. To anyone who knows the convention, that string encodes structure. The first digit (6) identifies a large region of the country: 0 for the Northeast, 1 for upper New York/Pennsylvania, 2 for the Mid-Atlantic, 6 for the Midwest including Illinois. The next two digits (01) narrow to a sectional center: northern Illinois. The last two digits (34) identify a local post office: Geneva, Illinois. Five characters, three levels of geographic precision, all encoded so that a sorter reading them left to right gets progressively more specific.

A postal sorter's wall map of the US showing ZIP code regions colored by first digit; below it, a sorted stack of envelopes with ZIP codes highlighted in coral; alongside, a small handwritten card showing `SPRIDEN` decoded into 'S = Student / PR = Person / IDEN = identification'.

Banner table names are the same trick applied to data domains. Seven characters, encoded so that a SQL writer reading them left to right gets progressively more specific. The first letter is the "ZIP region" — it tells you the system area: Student, Finance, Payroll, Position Control. The next two letters narrow to the application within that system. The remaining letters name the object the table stores. By the time you have read all seven, you know what domain the table belongs to, what application produced it, and what kind of record it holds — all without looking anything up.

Like ZIP codes, the convention is not perfect. Some prefixes are exceptions (the G for General is the cross-system catch-all, like ZIP codes that don't follow the regional grid cleanly). Some installations have custom tables that don't follow the rule at all (like a private courier service with its own routing codes). But the overwhelming majority of Banner tables follow the seven-letter rule, and learning to read it pays for itself the first day.

What it really is

Banner table names follow a positional convention — typically 7 characters, occasionally 6 or 8 for older or newer additions. Standard format: SAAOOOO.

Position 1 — System area (one letter). The most useful single character. S = Student (registration, academic history, admissions, course catalog, advising). F = Finance (general ledger, accounts payable, purchasing, grants). T = Accounts Receivable / Bursar (student finance). P = Payroll (wage history, deductions, pay events). N = Position Control (HR positions, labor distribution, jobs). R = Research / Financial Aid (RPRAWRD for award). G = General (cross-system: security, addresses, audit). A = Alumni / Advancement. W = Custom institutional tables (the convention at most installations; not part of Banner's distributed schema).
Positions 2-3 — Application within the system. Varies by area. In Student, the second letter names the module: F for Registration (SFRSTCR), G for General Student (SGBSTDN, SGRADVR), H for Academic History (SHRGRDE), C for Course catalog (SCBCRSE), P for Person (SPRIDEN, SPRADDR). The third letter usually marks the table type: B = base/master table (SGBSTDN, NBBPOSN), R = repeating/detail/rules table (SFRSTCR, SPRIDEN), V = validation lookup when paired with T (STVTERM, STVMAJR) or, in some names, a view. In Finance: TV for validation, TB for base tables.
Positions 4-7 — Object. Names the specific record. Common suffixes: IDEN (identification), EMPL (employee), STDN (student), POSN (position), TERM (term), CRSE (course).

There is a valuable sub-convention: any table name with TV in positions 2-3 (*TV*) is a validation lookup table. STVTERM = Student validation TERM. GTVZIPC = General validation ZIP Code. PTVPDIS = Payroll validation disposition. See STV* and GTV* — Banner's Code Dictionaries for the deep dive.

Four real Banner table names (SPRIDEN, SFRSTCR, NBBPOSN, FTVORGN) decoded into their three positional parts (system area / application / object), each part colored to match a system-area legend on the side.

The convention is consistent enough that you can decode a new table name in seconds. NBRJOBS → N = Position Control, BR = Base Rules, JOBS = jobs (the position-to-job assignment table). FTMFUND → F = Finance, TM = Transaction Management, FUND = fund. RPRAWRD → R = Research/Financial Aid, PR = Prospective, AWRD = award. You don't need a data dictionary to know which domain a table lives in.

See it — the diagram

Four Banner table names decoded into their three positional parts, each part color-coded: system area (coral), application (ink), object (medium gray). A system-area legend along the right side shows all eight system letters with their full names. SPRIDEN splits into S / PR / IDEN (Student → Person → identification). SFRSTCR splits into S / FR / STCR (Student → Faculty/Registration → student course registration). NBBPOSN splits into N / BB / POSN (Position Control → Base Banner → position). FTVORGN splits into F / TV / ORGN (Finance → validation table → organization). The visual makes the positional structure obvious: reading left to right, the domain gets more specific — exactly like the ZIP code on the envelope on the facing page.

Show me the code

Decode real Banner table names by walking through their prefixes:

SPRIDEN  →  S  = Student system
            PR = Person record
            IDEN = identification
         (current name + ID per person)

SFRSTCR  →  S  = Student
            FR = Faculty/Registration
            STCR = student course registration

NBBPOSN  →  N  = Position Control
            BB = Base Banner (position master)
            POSN = position

FTVORGN  →  F  = Finance
            TV = validation table
            ORGN = organization

PHRHIST  →  P  = Payroll
            HR = HR History
            HIST = history (per pay event)

GOBEACC  →  G  = General (cross-system)
            OB = Banner Object
            EACC = e-account (security user)

STVTERM  →  S  = Student
            TV = validation table
            TERM = term code lookup

SCBCRSE  →  S  = Student
            CB = Course Base
            CRSE = course

The exercise is the point: decoding the prefix tells you what domain, what application, what kind of record — without any data dictionary lookup. Once you've decoded ten of these, the eleventh is guessable.
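The same walk can be sketched as a few lines of Python. This is a minimal illustration, not a Banner utility: the function name `decode_table_name` and the `SYSTEM_AREAS` mapping are assumptions made for this example, built only from the system letters listed in this article.

```python
# System-area letters as listed in this article; W is a local custom convention.
SYSTEM_AREAS = {
    "S": "Student",
    "F": "Finance",
    "T": "Accounts Receivable",
    "P": "Payroll",
    "N": "Position Control",
    "R": "Research / Financial Aid",
    "G": "General",
    "A": "Alumni / Advancement",
    "W": "Custom (institutional)",
}


def decode_table_name(name: str) -> dict:
    """Split a Banner-style table name into its three positional parts."""
    name = name.strip().upper()
    if len(name) < 4 or not name.isalpha():
        raise ValueError(f"not a Banner-style table name: {name!r}")
    application = name[1:3]           # positions 2-3
    return {
        "system": SYSTEM_AREAS.get(name[0], "Unknown"),
        "application": application,
        "object": name[3:],           # positions 4 onward
        "is_validation": application == "TV",
    }


for table in ("SPRIDEN", "SFRSTCR", "NBBPOSN", "FTVORGN", "STVTERM"):
    parts = decode_table_name(table)
    print(table, parts["system"], parts["application"],
          parts["object"], parts["is_validation"])
```

The sketch only splits by position. It cannot tell you the schema or whether a custom table follows the rule, which is why the caveats in the next section still apply.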

Where intuition fails

The prefix does NOT always tell you the schema. GOBEACC starts with G (suggesting General data), and indeed it lives in the GENERAL schema — but the rule is not universal. Some G-prefixed tables live in SATURN because they predate the schema split. The only reliable schema lookup is ALL_TABLES.OWNER or the BSS schema search. See Schemas — Which Drawer the Table Lives In.

Custom tables don't always follow the rule. Institution-specific tables created by a Waubonsee developer might use any prefix — common conventions are W* or Z*, but local developers sometimes use Banner-style prefixes that accidentally collide with future Ellucian additions. Check the BSS schema search when you encounter an unfamiliar table.

Some prefixes overlap by accident. SPRADDR starts with S (Student system) but stores person addresses, which are cross-system in practice. The prefix tells you where the table was originally housed; the table's actual use may be broader. SPRIDEN itself is S-prefixed but is the cross-system person-identification table that every system joins through.

The 4-letter object suffix is not always a whole word. Most suffixes are compressed abbreviations (IDEN for identification, EMPL for employee, STDN for student, POSN for position, CRSE for course). A few are exact words (TERM for term). Learn the common ones and the rest become guessable; none are truly random.

**The *TV* sub-convention is the most reliable rule.** Any Banner name with TV in positions 2-3 is a validation lookup table. STV*, GTV*, PTV*, FTV*, NTV*, ATV* — the system-area letter changes but the TV middle holds across every system. This is the first thing to check when you see an unfamiliar table name.

The one-sentence takeaway

Banner table names are seven-letter positional codes. The first letter is the system area (S=Student, F=Finance, P=Payroll, N=Position Control, G=General, R=Research/Aid, T=AR, A=Alumni). Letters 2-3 narrow to the application. The rest name the object. Read one, and you know what domain it belongs to before you even look at the columns.

Track A · What is it, really

STV* and GTV* — Banner's Code Dictionaries

Banner stores codes, not names, not descriptions, not the words a human reads. SGBSTDN_MAJR_CODE_1 = 'BIO', SFRSTCR_RSTS_CODE = 'RE', STVTERM_CODE = '202610'. The code is compact, efficient, and completely opaque. The translation lives in a second set of tables, the STV* and GTV* dictionaries, and if you don't know they exist, you're reading a foreign language without the dictionary.

6 min read · banner · validation-tables · stvterm · stvmajr · gtv · lookup · dictionary

The hook

The everyday analogy

Open any college textbook to chapter 4. The text uses abbreviations and codes freely: "the FRC group," "Type II diabetes," "Class B amplifier," "the SR latch." The reader follows along, but every few pages they hit a code they don't recognize. They flip to the glossary at the back of the book. Alphabetical, one entry per term, one short definition each. "SR latch — set/reset bistable circuit." "FRC — Free Radical Capture." The text uses codes for brevity; the glossary defines them in one place; the reader joins one to the other in their head as they read.

An open college textbook with chapter 4 visible on the left page (text peppered with abbreviations like 'SR latch' and 'FRC group') and the alphabetical glossary on the right page (the same terms defined); a reader's finger marking the glossary entry.

Banner has the same structure. The data tables (SGBSTDN, SFRSTCR, PHRHIST) use codes everywhere — major codes, registration status codes, disposition codes, fund codes, organization codes. The validation tables (STVMAJR, STVRSTS, PTVPDIS, FTVFUND, FTVORGN) are the glossary: one row per code, with the description. To get a human-readable major name out of a student record, you join SGBSTDN to STVMAJR on the major code — the reader's eye flipping to the back of the book, made into a SQL JOIN.

The convention is so consistent that once you recognize the middle-letter pattern (TV in any Banner name), you know exactly what kind of table you are dealing with and how to use it. No guessing.

What it really is

**The *TV* convention** is Banner's most reliable naming rule. Any table name with TV in positions 2-3 is a validation table, a code dictionary. The system-area letter still applies: STV* = Student system validation, GTV* = General, PTV* = Payroll, FTV* = Finance, NTV* = Position Control, ATV* = Alumni.

A validation table typically has at least these standard columns:

<TABLE>_CODE — the primary key, the code value ('BIO', 'RE', '202610')
<TABLE>_DESC — the human-readable description ('Biology', 'Registered', 'Fall 2026')
<TABLE>_ACTIVITY_DATE — last modified timestamp
Often additional metadata: _VALID_*_IND flags (valid for Admissions, valid for Recruitment, etc.), sort order columns, parent-code columns for hierarchies, active/obsolete flags

**STV* (Student validation)** — the most numerous family. STVTERM (terms), STVMAJR (majors), STVRSTS (registration statuses), STVSTST (student statuses), STVRELG (religious affiliations), STVNATN (nations) — hundreds of these tables. See TERM Codes — The Academic Timestamp Banner Uses Everywhere for the deep dive on STVTERM.

**GTV* (General validation)** — cross-system code dictionaries. GTVZIPC (ZIP codes), GTVSDAX (cross-walk to external systems), GTVINSTITUTION (sister/parent institutions).

**PTV*, FTV*, NTV*, ATV*** — same pattern in their respective systems. PTVPDIS for payroll disposition (see PHRHIST Without DISP — In-Progress vs Posted Payroll). FTVORGN for finance organizations, FTVFUND for fund codes.

Left side: a SGBSTDN row showing major_code='BIO' (coral cell); right side: the STVMAJR validation row for 'BIO' showing desc='Biology' (coral cell); a coral JOIN arrow connecting the two cells, with JOIN stvmajr ON stvmajr_code = sgbstdn_majr_code_1 underneath.

The join pattern is universal. Every roster query, every financial summary, every enrollment report joins through at least one validation table to convert codes to descriptions. The validation table is usually small (a few hundred rows at most) and well-indexed on the code column — joins are inexpensive. The data table holds the code; the validation table holds the description; the SQL JOIN is the reader's finger on the glossary page.

See it — the diagram

A single SGBSTDN row on the left, one cell highlighted in coral: SGBSTDN_MAJR_CODE_1 = 'BIO'. A coral JOIN arrow arcs to the right, where a single STVMAJR row sits, one cell highlighted in the same coral: STVMAJR_DESC = 'Biology'. The join condition sits written beneath the arrow: JOIN stvmajr ON stvmajr_code = sgbstdn_majr_code_1. Below the diagram, a rendered result row shows what the user sees: "Student ID: 900123456, Last Name: Chen, Major: Biology" — the code is gone, the description is present. The visual formula is one diagram that generalizes to every _CODE column in Banner.

Show me the code

Look up a single code:

-- What does 'BIO' mean as a major code?
SELECT stvmajr_code, stvmajr_desc, stvmajr_valid_a_ind
FROM   stvmajr
WHERE  stvmajr_code = 'BIO';

Join a data table to its validation table for the human-readable label:

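-- Students by major NAME, not just code, as of term '202610'.
-- The join through STVMAJR is the "flip to the glossary" step.
SELECT s.spriden_id           AS student_id,
       s.spriden_last_name    AS last_name,
       m.stvmajr_desc         AS major
FROM   sgbstdn g
JOIN   spriden s
       ON  s.spriden_pidm        = g.sgbstdn_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN stvmajr m
       ON m.stvmajr_code = g.sgbstdn_majr_code_1
-- Keep each student's latest curriculum row on or before the term.
-- A bare "= '202610'" would keep only students whose row starts that term.
WHERE  g.sgbstdn_term_code_eff = (
       SELECT MAX(g2.sgbstdn_term_code_eff)
       FROM   sgbstdn g2
       WHERE  g2.sgbstdn_pidm = g.sgbstdn_pidm
         AND  g2.sgbstdn_term_code_eff <= '202610');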

List the recent and upcoming codes in a validation table (useful for populating an Argos dropdown):

-- Terms that overlap the window from one year ago to one year ahead: the dropdown's source list.
SELECT stvterm_code, stvterm_desc
FROM   stvterm
WHERE  stvterm_start_date <= SYSDATE + 365
  AND  stvterm_end_date   >= SYSDATE - 365
ORDER BY stvterm_code DESC;

Find which validation table serves a given code column: match the suffix. SGBSTDN_MAJR_CODE_1 → look for STVMAJR. SFRSTCR_RSTS_CODE → look for STVRSTS. The naming convention from Reading a Banner Table Name — The Seven-Letter Code makes this predictable: the column's code root and the validation table's object name are usually the same. The match is not always exact: PHRHIST_DISP maps to PTVPDIS, so when the obvious name is missing, search for a *TV* table with a similar root.
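The suffix-matching habit can be written down as a small heuristic. The sketch below is an assumption-laden illustration, not a Banner API: `guess_validation_table` is a hypothetical helper, and it assumes the lookup lives in the same system area as the data table and that the four letters before _CODE are the lookup's object name.

```python
import re
from typing import Optional

# <TABLE>_<ROOT>_CODE, optionally followed by suffixes such as _1 or _EFF.
_CODE_COLUMN = re.compile(r"([A-Z])[A-Z]+_([A-Z]{4})_CODE(?:_[A-Z0-9]+)*")


def guess_validation_table(column_name: str) -> Optional[str]:
    """Guess the *TV* table behind a code column, or None if the name does not fit."""
    match = _CODE_COLUMN.fullmatch(column_name.strip().upper())
    if match is None:
        return None                      # e.g. PHRHIST_DISP has no _CODE token
    system_letter, root = match.groups()
    return f"{system_letter}TV{root}"


for column in ("SGBSTDN_MAJR_CODE_1", "SFRSTCR_RSTS_CODE",
               "SGBSTDN_TERM_CODE_EFF", "PHRHIST_DISP"):
    print(column, "->", guess_validation_table(column))
```

Treat the result as a first guess to confirm in the BSS schema search. Columns whose lookup sits in another system area, or whose root is abbreviated differently, will fall outside this pattern.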

Where intuition fails

**Use LEFT JOIN, not JOIN, when joining to a validation table.** Some data rows carry codes that no longer exist in the validation table (codes retired without cleaning up the data). An inner join silently drops those rows. A LEFT JOIN keeps them and shows the description as NULL — which you can COALESCE to 'Unknown' for the report. Dropping rows because their code is stale is silent data loss.
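The difference is easy to see on two rows. The sketch below uses Python's built-in sqlite3 module as a stand-in for Oracle, with toy tables that borrow Banner's column names; the data is invented for illustration, and the join and COALESCE behave the same way in Oracle SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sgbstdn (sgbstdn_pidm INTEGER, sgbstdn_majr_code_1 TEXT);
    CREATE TABLE stvmajr (stvmajr_code TEXT PRIMARY KEY, stvmajr_desc TEXT);
    -- Student 2 carries a code that has no row in the glossary.
    INSERT INTO sgbstdn VALUES (1, 'BIO'), (2, 'ENGZZ');
    INSERT INTO stvmajr VALUES ('BIO', 'Biology');
""")

# Inner join: the student with the stale code silently disappears.
inner_rows = conn.execute("""
    SELECT g.sgbstdn_pidm, m.stvmajr_desc
    FROM   sgbstdn g
    JOIN   stvmajr m ON m.stvmajr_code = g.sgbstdn_majr_code_1
    ORDER BY g.sgbstdn_pidm
""").fetchall()

# Left join: every student is kept; a missing description becomes 'Unknown'.
left_rows = conn.execute("""
    SELECT g.sgbstdn_pidm, COALESCE(m.stvmajr_desc, 'Unknown')
    FROM   sgbstdn g
    LEFT JOIN stvmajr m ON m.stvmajr_code = g.sgbstdn_majr_code_1
    ORDER BY g.sgbstdn_pidm
""").fetchall()

print(len(inner_rows), "row(s) from the inner join:", inner_rows)
print(len(left_rows), "row(s) from the left join:", left_rows)
```

Comparing the two row counts is also a quick audit: any gap between them is the number of data rows carrying codes the validation table no longer knows.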

Validation tables sometimes have per-system validity flags. STVMAJR_VALID_A_IND says whether the major is valid for the Admissions system. STVMAJR_VALID_R_IND says whether it is valid for Recruitment. A code may be valid in one system and not another. If you are filtering for "available majors in the Recruitment dropdown," check the right _VALID_*_IND.

Some validation tables are hierarchical. STVMAJR has STVMAJR_DEPT_CODE (parent department) and STVMAJR_COLL_CODE (parent college). The validation table itself encodes the org-chart of majors. Use the parent columns for grouping reports without joining to a separate STVDEPT or STVCOLL table.

The same code can mean different things in different columns. 'A' in SGBSTDN_STST_CODE means "Active Student"; 'A' in PEBEMPL_EMPL_STATUS means "Active Employee"; 'A' in SFRSTCR_RSTS_CODE might mean "Approved" depending on the installation. The code is scoped to its own validation table. Never compare codes across columns without joining each through its own STV/GTV.

Inactive / obsolete codes still appear in old data. Banner does not retroactively rename codes when validation table entries are deactivated. A student from 2015 may still have SGBSTDN_MAJR_CODE_1 = 'ENGZZ' (an obsolete engineering placeholder) even though STVMAJR no longer marks 'ENGZZ' as valid. The join still finds the row; the _VALID_*_IND flag tells you it is inactive. See also Soft Deletes — The Rows That Aren't Really Gone.

The one-sentence takeaway

Nearly every Banner column ending in _CODE joins to a validation table, most often an STV* or GTV* one. The convention holds across systems: TV in positions 2-3 of any Banner table name means validation lookup. Join through them with LEFT JOIN to convert opaque codes into human-readable descriptions: the code is the data, the validation table is the glossary at the back of the book.

Track A · What is it, really

Schemas — Which Drawer the Table Lives In

You type SELECT * FROM gobeacc in your SQL editor. Oracle returns ORA-00942: table or view does not exist. The table definitely exists — you saw it in BSS. The problem is not whether it exists. The problem is which drawer it lives in.

7 min read · banner · oracle · schema · saturn · general · synonym · grant · namespace

The hook

The everyday analogy

Walk into the records room of any office that still keeps paper files. The wall is lined with filing cabinets. Each cabinet has multiple drawers, and each drawer has a label: "Personnel A-G," "Personnel H-M," "Personnel N-Z," "Vendor Contracts," "Old Tax Returns 2010-2019."

Now imagine you are sent to find the personnel file for "Margaret Chen." You walk in knowing two things: her name, and that you need her personnel file. You can find it because you know the file is in the personnel drawers, and you know alphabetical order. But if you didn't know it was in the "personnel" drawers, you might pull the "Vendor Contracts" drawer first, find nothing, and conclude "Margaret Chen has no file" — when really the file exists, you were just searching the wrong drawer.

A wall of vintage wooden filing cabinets with labeled drawers ('SATURN', 'GENERAL', 'PAYROLL', 'FIMSMGR'); one drawer pulled out showing folders (table names) inside; a small index card sticking out reading 'GOBEACC → GENERAL drawer'.

Banner is the records room. Each Oracle schema is a drawer. The drawers are labeled — SATURN, GENERAL, PAYROLL, FIMSMGR. The tables are the folders inside the drawers. To find a specific table, you need to know which drawer it lives in. A SELECT * FROM gobeacc is "open the default drawer and look for GOBEACC" — and if your default drawer is SATURN, the file is not there. The folder exists, but in another drawer. You need to either name the drawer explicitly (SELECT * FROM general.gobeacc) or have the filing system set up to look in multiple drawers automatically — Oracle synonyms, the records room's cross-reference index.

Like the records room, knowing which drawer holds what is half the job. The other half is having the keys to open the drawer — Oracle grants, the permissions that say which users can read which schemas.

What it really is

An Oracle schema is a namespace — a collection of tables, views, sequences, and other objects owned by a specific database user. Every table belongs to exactly one schema. The fully-qualified name is SCHEMA.TABLE.

The most common Banner schemas:

**SATURN** — the Student system. Most S*-prefixed tables (SPRIDEN, SGBSTDN, SFRSTCR, SCBCRSE, STVTERM, etc.) live here. The biggest schema by both row count and table count.
**GENERAL** — cross-system tables. GOBEACC (e-account security), GUBALOG (audit log), GOREMAL (email addresses), GTVNATN (nations), GTVZIPC (ZIP codes). The GOBEACC pitfall: its G prefix correctly points to GENERAL, but many newcomers assume it lives in SATURN alongside the person tables. Where no synonym exists, a query against saturn.gobeacc, or an unqualified one, fails with ORA-00942.
**PAYROLL** — payroll tables. PHRHIST (pay history), PEBEMPL (employee base). Some installations place these under different schema names; verify locally.
**FIMSMGR** — Finance Management. Finance and General Ledger tables.
**TAISMGR** — Accounts Receivable / Tax. TBRACCD (AR transactions), TBBDETC (detail codes).
**FINAID** — Financial Aid. RPRAWRD, RPBAWRD, etc. Some installations name this differently.

Synonyms are aliases that map an unqualified name to a fully-qualified one. CREATE PUBLIC SYNONYM gobeacc FOR general.gobeacc lets every user write SELECT * FROM gobeacc and have Oracle automatically resolve to general.gobeacc. Most Banner installations create public synonyms for the most-used tables, which is why SELECT * FROM sgbstdn "just works" — there's a synonym pointing to saturn.sgbstdn.

Grants are permissions. Your database user must have SELECT privileges on a table to read it. Banner ships with standard role-based grants (BAN_DEFAULT_M etc.), and reporting users typically have read access to most tables through these roles. If you get ORA-00942 and the table exists in BSS, the most likely causes are: (1) missing synonym in the current environment, or (2) missing grants for your user.

Five labeled boxes representing schemas (SATURN, GENERAL, PAYROLL, FIMSMGR, TAISMGR), each containing chips of the tables that live there; arrows showing cross-schema joins (e.g., SATURN.SPRIDEN ↔ GENERAL.GOBEACC on PIDM); a small synonym icon noting the unqualified name resolves via synonym.

**ALL_TABLES** is the Oracle data dictionary view that lists every table you have access to. SELECT owner, table_name FROM all_tables WHERE table_name = 'GOBEACC' returns the schema. The BSS schema search at bss.peopleworksservices.com is faster and more readable, but ALL_TABLES is the SQL fallback when you are already in the database.

See it — the diagram

Five labeled boxes arranged across the canvas, one per schema: SATURN (largest, filled with SPRIDEN, SGBSTDN, SFRSTCR, SCBCRSE, STVTERM chips), GENERAL (with GOBEACC, GUBALOG, GOREMAL, GTVZIPC chips), PAYROLL (with PHRHIST, PEBEMPL chips), FIMSMGR (Finance), TAISMGR (with TBRACCD, TBBDETC chips). A coral arrow labeled JOIN ON PIDM arcs from SATURN.SPRIDEN to GENERAL.GOBEACC — the cross-schema join that powers security reports. A small icon floating near GOBEACC marks "Resolved via PUBLIC SYNONYM → general.gobeacc." The visual says: the tables don't all live in one bucket; cross-schema joins are normal; synonyms make them practical.

Show me the code

Find which schema owns a table:

-- Oracle's data dictionary — the system catalog.
-- ALL_TABLES lists tables your current user can see.
SELECT owner, table_name
FROM   all_tables
WHERE  table_name = 'GOBEACC';
-- Returns: GENERAL  GOBEACC

Query a table by its fully-qualified name:

-- The schema prefix is the drawer name.
-- This works regardless of synonyms.
SELECT g.gobeacc_userid, g.gobeacc_username
FROM   general.gobeacc g
WHERE  g.gobeacc_status_ind = 'A';

Query the same table via synonym (the usual case):

-- A PUBLIC synonym makes the unqualified name work for everyone.
-- Most Banner installations have these for common tables.
SELECT gobeacc_userid, gobeacc_username
FROM   gobeacc                       -- resolved via synonym
WHERE  gobeacc_status_ind = 'A';

Cross-schema join (when the synonym is missing or for safety):

-- A security-audit query joining a SATURN table to a GENERAL table.
-- Both tables carry a schema prefix, so no synonym is required.
SELECT s.spriden_id, s.spriden_last_name, g.gobeacc_userid
FROM   saturn.spriden s
JOIN   general.gobeacc g
       ON g.gobeacc_pidm = s.spriden_pidm
WHERE  s.spriden_change_ind IS NULL
  AND  s.spriden_entity_ind = 'P';

The BSS schema search at bss.peopleworksservices.com lets you type any table name and returns its schema, description, and column list. Use it as the primary lookup — it is more reliable than guessing the schema from the table-name prefix (see Reading a Banner Table Name — The Seven-Letter Code for why the prefix alone is not enough).

Where intuition fails

The table-name prefix does NOT always tell you the schema. GOBEACC starts with G (suggesting General), and indeed it lives in GENERAL — but the rule is not universal. Some G-prefixed tables live in SATURN because they predate the schema split. The only reliable lookup is ALL_TABLES.OWNER or the BSS schema search.

Missing synonyms break old reports moving to new environments. A report that works in production (where a PUBLIC SYNONYM exists) breaks in test (where the synonym was never created). The error is ORA-00942 table or view does not exist — same as if the table genuinely didn't exist. Always test reports in the target environment before promoting.

**ORA-00942 does not distinguish "no such table" from "you don't have permission."** Oracle deliberately returns the same error to avoid revealing the existence of tables you can't read. If you know a table exists (via BSS) and your query errors with ORA-00942, the next thing to check is your user's grants on that schema.
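To make the two failure modes concrete, here is a toy model of the lookup in Python. It is a deliberate simplification and an assumption of this example, not Oracle's actual resolver: it ignores private synonyms and views, and `resolve`, `TABLE_OWNERS`, and the REPORT_USER account are hypothetical names.

```python
# Toy catalog: which schema owns which table (values taken from this article).
TABLE_OWNERS = {"GOBEACC": "GENERAL", "SGBSTDN": "SATURN"}

ORA_00942 = "ORA-00942: table or view does not exist"


class OraError(Exception):
    """Stand-in for the error Oracle raises."""


def resolve(name, current_schema, public_synonyms, grants):
    """Return SCHEMA.TABLE for a name, raising the same error for every failure."""
    name = name.upper()
    if "." in name:                                    # explicitly qualified
        owner, table = name.split(".", 1)
    elif TABLE_OWNERS.get(name) == current_schema:     # object in your own schema
        owner, table = current_schema, name
    elif name in public_synonyms:                      # fall back to a public synonym
        owner, table = public_synonyms[name].split(".", 1)
    else:
        raise OraError(ORA_00942)                      # no synonym: nothing found
    exists = TABLE_OWNERS.get(table) == owner
    readable = owner == current_schema or (owner, table) in grants
    if not (exists and readable):
        raise OraError(ORA_00942)                      # no grant: same message
    return f"{owner}.{table}"


synonyms = {"GOBEACC": "GENERAL.GOBEACC"}
grants = {("GENERAL", "GOBEACC")}

print(resolve("gobeacc", "REPORT_USER", synonyms, grants))
for syn, gr in (({}, grants), (synonyms, set())):      # missing synonym, then missing grant
    try:
        resolve("gobeacc", "REPORT_USER", syn, gr)
    except OraError as err:
        print(err)
```

Both failing calls print the identical message, which is the point: the error text alone cannot tell you whether to ask for a synonym or a grant.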

Cross-schema joins are not slow by themselves. A schema is only a namespace, and the optimizer works from per-table statistics no matter which schema owns the table. If a cross-schema query is slow, look for the usual causes: stale statistics, a missing index, or a join that multiplies rows. Run EXPLAIN PLAN and look for surprising full-table scans. Sometimes manually pre-filtering one side via a subquery helps.

Schema names sometimes differ by installation. Banner ships with standard schemas, but DBAs can rename them or create custom schemas for institution-specific data. Waubonsee may have a WAUBONSEE or similar schema for custom tables. The BSS schema search covers the institution-specific names alongside the Ellucian-standard ones.

The one-sentence takeaway

Banner's tables are organized into Oracle schemas — SATURN (Student), GENERAL (cross-system), PAYROLL, FIMSMGR (Finance), TAISMGR (AR), FINAID. To query a table from outside its schema, you need the schema prefix (GENERAL.GOBEACC) or a public synonym. ORA-00942 is not the same as "the table does not exist" — it means Oracle cannot find or access the table with the name you gave. Check the BSS schema search first.

Track A · What is it, really

Effective Dating — Why Banner Never Forgets

8 min read · banner · effective-dating · sgbstdn · scbcrse · nbrjobs · history

The hook

A student changes majors from Biology to Nursing. A course gets re-titled from "Introduction to Computing" to "Foundations of Digital Literacy." An employee gets a raise and a new job title. In a normal database, you would UPDATE the row and move on. Banner does not do that. Banner lays a new row on top of the old one, stamps it with the date the change took effect, and leaves the old row exactly where it was. If your query says SELECT * FROM sgbstdn WHERE sgbstdn_pidm = 38201 and stops there, Banner hands you every layer, every major that student ever declared, and your report is silently, arithmetically wrong. "Current" is not a column in Banner. It is a question you must learn to ask.

The everyday analogy

Drive past a road cut on a highway and you can see the rock laid down in layers, oldest at the bottom, newest at the top. Each stratum was deposited at a specific moment in geological time — a volcanic ash fall, a sea floor settling, a river flood plain — and once it solidified, nothing dug it back out. The next event simply laid a new layer on top of the previous one.

A roadside rock cut at golden hour: stratified layers of sedimentary rock in horizontal bands. One mid-stratum is highlighted in coral — 'as of Fall 2022.' Identity is the cliff; history is the layers; 'current' depends on the date you ask.

To answer the question "what was the surface of this hillside in 1850?" a geologist does not look at today's topsoil. They count down to the layer whose deposition predates 1850 and was not yet covered by anything younger. The most recent stratum whose date is on or before 1850 — that is the "current as of 1850" surface. The cliff is the identity. The layers are the history. What you call "the surface" depends entirely on the date you ask.

That is exactly how Banner stores history. A student's curriculum in SGBSTDN is a stack of strata: one row per declared major or program, each with SGBSTDN_TERM_CODE_EFF set to the term the change took effect. Course catalogs in SCBCRSE are strata of course definitions: the same course code might carry a different title and credit-hour count in 2024 than it did in an earlier year. Employee jobs in NBRJOBS are strata of pay rates, titles, and FTE status. The pattern is the same across every Banner master table that matters: identity is stable (see PIDM — The Number Behind Every Person), description is layered.

The mistake new Banner SQL writers make is treating these tables as if they were flat current-state snapshots. They run SELECT * FROM sgbstdn WHERE sgbstdn_pidm = ?, get back every stratum the student ever accumulated, and the report shows duplicated students, inflated headcounts, and majors the student abandoned three years ago. Banner did not lie. The query did not ask "current as of when."

What it really is

Effective dating is Banner's built-in mechanism for versioning descriptive attributes over time. When a value changes — a major, a title, a salary, an advisor assignment — Banner does not overwrite the old row. It inserts a new row with a higher effective-date value and leaves the old row intact.

The effective-date column itself varies by table, and Banner is not consistent about its name. On the student side, SGBSTDN uses SGBSTDN_TERM_CODE_EFF — a six-digit term code like '202610' for Fall 2026. On the catalog side, SCBCRSE uses SCBCRSE_EFF_TERM. On the HR side, NBRJOBS uses NBRJOBS_EFFECTIVE_DATE — an actual DATE column. Advisor assignments in SGRADVR use SGRADVR_TERM_CODE_EFF. Every table names the column differently, but the pattern is identical: a row is valid from its effective date forward, until a newer row with a higher effective date supersedes it.

"Current" is not stored anywhere. There is no IS_CURRENT = 'Y' flag on these tables. You compute "current" at query time by finding the row with the maximum effective date that is less than or equal to your target. For "right now," your target is today's term or today's date. For a historical report, your target is the term or date you are reporting on. The query is the same; only the cutoff value changes.

One student's SGBSTDN stack: three layered rows for three curriculum changes, each with its effective term. The MAX-effective layer on or before the target term is highlighted — that is the row your query must isolate.

A row's effective date is the date the change took effect in the real world — the term the student actually switched majors, the date the raise took effect. Banner also has _ACTIVITY_DATE columns on most tables, which record when the row was last touched by a form. Those are the audit trail — the date someone typed the change. They are not the effective date. Confusing the two produces reports where a change entered late appears to have happened on the entry date instead of the real-world effective date.

This is the source-side analog of the warehouse's Slowly Changing Dimension Type 2 pattern (see Slowly Changing Dimensions — Keeping History When Attributes Change). Banner is SCD Type 2 on its master tables — it versions by inserting, not by overwriting. The warehouse's job is to mirror that same layered history in its dimension tables, with surrogate keys so that fact rows can point to the correct historical version without a MAX() subquery on every join.

See it — the diagram

The stack diagram shows one student, three curriculum changes, three rows in SGBSTDN. The bottom row — effective term '202010' (Fall 2020) — declares "Biology." The middle row — effective term '202210' (Fall 2022) — switches to "Nursing." The top row — effective term '202410' (Fall 2024) — switches again to "Health Sciences." To ask "what was this student's major in Spring 2023 (term '202320')?" you walk down from the top to the first row whose effective term is ≤ '202320' — the Nursing row. The MAX-effective subquery does exactly that walk. The diagram makes the walk visible.

Show me the code

Here is the mistake the article exists to prevent. A student changed majors three times. This query asks for their curriculum and gets all three layers:

-- WRONG: returns every historical curriculum row for the student.
-- A student who changed majors 3 times appears 3 times in the result.
SELECT s.sgbstdn_pidm,
       s.sgbstdn_majr_code_1,
       s.sgbstdn_term_code_eff
FROM   sgbstdn s
WHERE  s.sgbstdn_pidm = 38201;

Three rows. Three different majors. If this query feeds an enrollment report, the student is counted three times.

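The canonical Banner fix is the MAX-effective subquery: find the row whose effective term is the greatest one on or before your target. For the student's current curriculum, the simplest form leaves the inner MAX() unbounded, which returns the latest row on file (including a change keyed for a future term, if one exists):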

-- RIGHT: the topmost stratum — the student's current curriculum.
-- See [[B3_effective_max]] for a deeper dive on this SQL idiom.
SELECT s.sgbstdn_pidm,
       s.sgbstdn_majr_code_1     AS current_major,
       s.sgbstdn_term_code_eff   AS effective_since
FROM   sgbstdn s
WHERE  s.sgbstdn_pidm = 38201
  AND  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm);

One row. The current major only. The subquery finds the highest effective term for this PIDM, and the outer query filters to that single row.

Now ask a historical question — what was this student's major as of Fall 2022?

-- The stratum that was on top as of Fall 2022 (term '202210').
-- Same pattern, with a cutoff on the inner MAX.
SELECT s.sgbstdn_majr_code_1 AS major_as_of_fall_2022
FROM   sgbstdn s
WHERE  s.sgbstdn_pidm = 38201
  AND  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm
         AND  s2.sgbstdn_term_code_eff <= '202210');

The inner MAX() is now bounded — it only considers terms up to and including Fall 2022. The outer query returns the Nursing row, because Nursing was the topmost stratum as of that term. Biology is below it; Health Sciences has not yet been deposited. The pattern is identical for NBRJOBS, where the column is a DATE instead of a term code — replace <= '202210' with <= DATE '2022-09-15' and the logic is unchanged.
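The date-based variant can be tried end to end without a Banner database. The sketch below is an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for Oracle, stores dates as ISO text (which compares correctly as strings), invents three job rows, and uses a hypothetical job_title column. A real NBRJOBS query would likely need to correlate on more key columns than the PIDM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nbrjobs (
        nbrjobs_pidm           INTEGER,
        nbrjobs_effective_date TEXT,    -- ISO dates stored as text in this sketch
        job_title              TEXT     -- hypothetical column for illustration
    );
    -- Invented history: three layers for one employee.
    INSERT INTO nbrjobs VALUES
        (38201, '2020-08-15', 'Lab Assistant'),
        (38201, '2022-07-01', 'Lab Coordinator'),
        (38201, '2024-01-10', 'Lab Manager');
""")

# Same MAX-effective shape as the SGBSTDN query, with a date cutoff.
AS_OF_SQL = """
    SELECT j.job_title
    FROM   nbrjobs j
    WHERE  j.nbrjobs_pidm = :pidm
      AND  j.nbrjobs_effective_date = (
           SELECT MAX(j2.nbrjobs_effective_date)
           FROM   nbrjobs j2
           WHERE  j2.nbrjobs_pidm = j.nbrjobs_pidm
             AND  j2.nbrjobs_effective_date <= :as_of)
"""


def title_as_of(pidm, as_of):
    """Return the job title in effect on the ISO date as_of, or None if none applies."""
    row = conn.execute(AS_OF_SQL, {"pidm": pidm, "as_of": as_of}).fetchone()
    return row[0] if row else None


print(title_as_of(38201, "2022-09-15"))   # cutoff falls between the second and third layer
print(title_as_of(38201, "2019-01-01"))   # cutoff predates every layer
```

When the cutoff predates every layer, the inner MAX() yields NULL, the equality matches nothing, and the query returns no row. That is the correct answer for "this person had no job on that date," so reports should expect it rather than treat it as an error.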

Where intuition fails

Four traps that catch even experienced SQL writers:

No effective-date filter = duplicates. The most common Banner SQL bug. A student with three curriculum changes appears three times. Headcounts are inflated. Totals are multiplied. The Banner Semantic Search SQL Explainer flags SGBSTDN queries that lack the MAX-effective pattern. The fix is always the same: add the correlated MAX() subquery on the effective-date column.

"Current major" depends on WHEN you mean. If a report asks "Fall 2022 enrollment by current major," it needs the major that was current in Fall 2022, not the major that is current today. Joining today's curriculum to a historical fact table is silent revisionism: the report looks correct but the labels are from the wrong stratum. The MAX() subquery needs the same cutoff as the report's time window. The Effective-Date Trap — Joining to Yesterday's Row covers this gotcha in full, with a worked audit example.

Term codes are strings, but they sort correctly. Banner term codes use the format 'YYYYTT' where TT encodes the term within the year: 10 for Fall, 20 for Spring, 30 for Summer at most installations. MAX() on a string column works because the format is lexicographically ordered: a higher year sorts later, and within a year a higher term code sorts later. Do not CAST to integer. Some legacy term formats break on numeric conversion, and the string sort has been reliable for decades.

Effective dating versions attributes, not balances. A student's major in SGBSTDN is effective-dated. The student's cumulative GPA in SHRTRCE is not — it is a transactional table where each row is an event (a term's grades calculated), not a version of a description. Do not apply the MAX-effective pattern to transactional tables expecting "the current balance." The patterns are different, and Track E covers the traps of treating one like the other.
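The first two traps above can be sketched in SQL. The first statement is a diagnostic that lists students with more than one SGBSTDN row. The second counts Fall 2022 enrollment by the major in effect in Fall 2022, by tying the MAX() cutoff to the registration's own term. Both are illustrations under assumptions: every SFRSTCR row is counted regardless of registration status, and the alias names are arbitrary.

```sql
-- Trap 1 diagnostic: students who would appear more than once
-- in any SGBSTDN query that lacks an effective-date filter.
SELECT s.sgbstdn_pidm,
       COUNT(*) AS version_count
FROM   sgbstdn s
GROUP BY s.sgbstdn_pidm
HAVING COUNT(*) > 1
ORDER BY version_count DESC;

-- Trap 2: headcount by the major that was current in the report's term.
-- The cutoff is the registration's term, not "today".
SELECT s.sgbstdn_majr_code_1,
       COUNT(DISTINCT r.sfrstcr_pidm) AS headcount
FROM   sfrstcr r
JOIN   sgbstdn s
       ON  s.sgbstdn_pidm = r.sfrstcr_pidm
WHERE  r.sfrstcr_term_code = '202210'
  AND  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm
         AND  s2.sgbstdn_term_code_eff <= r.sfrstcr_term_code)
GROUP BY s.sgbstdn_majr_code_1
ORDER BY headcount DESC;
```

Correlating the cutoff to r.sfrstcr_term_code rather than hard-coding it a second time keeps the curriculum lookup and the fact table on the same time window if the report's term ever changes.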

The one-sentence takeaway

Banner versions history by adding new rows with an effective date. The old rows stay. "Current" is not a column — it is a query you must write.

Track A · What is it, really

Argos, X-Rayed — The DataBlock, the Report, the Parameters

7 min read · argos · datablock · report · parameters · anatomy · foundation

The hook

Everyone calls it "a report." But what you see on screen — the columns, the headers, the dropdowns at the top — is only one of three components layered behind the glass. X-ray the thing, and you see a structure that nobody taught you explicitly: the DataBlock, the Report, and the Parameters. Three subsystems, one device, each invisible to the end user.

The everyday analogy

Hold up a modern smartphone and you see a sleek slab of glass and metal. The phone "just works" — you tap an app, the screen updates, you swipe and the page moves. The complexity is hidden behind the case.

Now imagine putting that same phone under a hospital X-ray. The image reveals three distinct subsystems layered behind the glass:

The battery and the motherboard at the back — the load-bearing part where the power and the compute live. Without these, nothing works. The user never sees them, but they are the whole reason the phone functions.
The screen and the speaker at the front — the user-visible output. Everything you read, everything you hear, comes from here. This is the layer the user thinks of AS the phone.
The buttons and the touchscreen sensors — the user-input layer. Volume up/down, the side button, the touchscreen surface. These are how the user controls the phone.

An X-ray of a modern smartphone showing three labeled subsystems: motherboard + battery (DataBlock), screen + speaker (Report), buttons + touchscreen (Parameters); a hospital-style light box illuminating the X-ray from behind.

Three subsystems, one device, each invisible to the user unless they X-ray it open. The user knows "I press this button, the screen does that" — they don't think about the motherboard pulling data and the speaker driver translating it into sound.

Argos works the same way. From the outside it looks like "a report." But X-ray the report and you see three components:

The DataBlock = the motherboard and battery. The SQL that retrieves the data, the parameter declarations that accept user inputs, the column-type metadata. Load-bearing, invisible to the end user.
The Report = the screen. The layout the user sees: the columns, the headers, the grouping, the page footers. This is the layer the user calls "the report."
The Parameters = the buttons. The dropdowns, edit boxes, date pickers at the top of the Report. The user's controls.

Knowing the X-ray view tells you where to look when something breaks. The SQL is wrong → DataBlock. The columns look bad → Report layout. The user can't enter the right value → Parameter widget config.

What it really is

Argos has three core building blocks. They are interlocked — each feeds into the next — but they are separately configurable, separately testable, and separately breakable.

The DataBlock — the container. This is where the query lives. A DataBlock holds: the SQL query (the load-bearing part — the only part Oracle ever sees); parameter declarations (name, type, widget binding, default value); column-type metadata (text, number, date, with display width and format hints); and an optional named DataBlock identifier (used for :dbn_* cross-references; see Argos Parameters — `:main_`, `:lcl_`, `:dbn_`). A DataBlock is the unit of REUSE — one DataBlock can feed multiple Reports (see Shared DataBlocks — One SQL, Many Reports). Change the DataBlock, and every consuming Report changes with it.

The Report — the layout. Consumes a DataBlock's output rows and formats them for the user. A Report holds: the column-display order, widths, and headers; grouping and sub-totalling rules; page header/footer text and images; export format (CSV, PDF, Excel, HTML); and sub-report bindings for banded child sections (each sub-report is itself a Report consuming a child DataBlock).

The Parameters — the user controls. The widgets at the top of the Report. Each Parameter: has a scope (:main_* / :lcl_* / :dbn_*; see Argos Parameters — `:main_`, `:lcl_`, `:dbn_`); has a widget type (Edit Box, Drop Down, Date, Date Range, Check Box, Multi-Checkbox, Radio Button); has an options query (for dropdowns/multi-checkboxes) that populates the available values; has a declared data type that drives substitution quoting (see How Argos Assembles Your Query — Filters on the WHERE); and has an optional default value and validation rules.

Three labeled boxes stacked vertically (DataBlock at the bottom with SQL + parameter declarations, Report in the middle with column layout + grouping, Parameters at the top with widget icons); arrows showing data flow downward (parameter values into DataBlock) and rows flowing upward (DataBlock output into Report).

The lifecycle of running a Report:

User clicks Run.
Argos reads the Parameter widget values the user entered.
Argos substitutes the values into the DataBlock's SQL (using the string-substitution mechanism from How Argos Assembles Your Query — Filters on the WHERE).
The substituted SQL is sent to Oracle.
Oracle returns rows.
The Report's layout formats the rows into the user's chosen output (PDF, Excel, etc.).
The user sees the formatted result.
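The substitution step in the lifecycle above can be pictured as a before-and-after pair. This is an illustrative sketch, assuming the user picked term 202610 and that the parameter's declared text type causes the value to be quoted. The exact text Argos sends to Oracle may differ.

```sql
-- As written in the DataBlock (parameter reference still in place):
SELECT r.sfrstcr_pidm,
       r.sfrstcr_crn
FROM   sfrstcr r
WHERE  r.sfrstcr_term_code = :main_DD_term_code;

-- As Oracle would receive it after substitution (illustrative only):
SELECT r.sfrstcr_pidm,
       r.sfrstcr_crn
FROM   sfrstcr r
WHERE  r.sfrstcr_term_code = '202610';
```

Pasting the second form into a SQL editor is a quick way to check whether a problem sits in the DataBlock's SQL or in the value the Parameter supplied.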

The same DataBlock can be wired to multiple Reports — a one-DataBlock-to-many-Reports relationship that Shared DataBlocks — One SQL, Many Reports explains in depth via the UNION ALL + discriminator pattern.

See it — the diagram

Three labeled boxes stacked vertically, reading bottom-to-top as data flows: DataBlock at the bottom (containing SQL code, parameter declarations, column-type metadata — coral background), Report in the middle (containing column layout, grouping rules, export formats — ink background), Parameters at the top (containing widget icons: a dropdown, an edit box, a date picker — coral accent). A downward coral arrow from Parameters to DataBlock reads "parameter values flow in via string substitution." An upward amber arrow from DataBlock to Report reads "rows flow out — Oracle result set." The visual is a data-flow diagram that doubles as an anatomy chart: three components, two flows, one report.

Show me the code

A single concrete example — a course-roster Argos object — shown as its three parts.

The DataBlock (the SQL + parameters):

-- DataBlock "CourseRosterByTerm" — holds the SQL and
-- declares two parameters: a term dropdown and an optional
-- subject filter.
SELECT r.sfrstcr_term_code,
       r.sfrstcr_crn,
       r.sfrstcr_subj_code,
       r.sfrstcr_crse_numb,
       s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = :main_DD_term_code
  AND  (r.sfrstcr_subj_code = :main_EB_subj_code
        OR :main_EB_subj_code IS NULL);

-- Parameters declared on this DataBlock:
--   :main_DD_term_code  (text, Drop Down, options from STVTERM)
--   :main_EB_subj_code  (text, Edit Box, optional default = NULL)

The Report (the layout that consumes the DataBlock):

Report "Course Roster"
  DataBlock: CourseRosterByTerm
  Columns shown:
    - Term  (sfrstcr_term_code) - hidden if only one term
    - CRN   (sfrstcr_crn) - width 80
    - Subject (sfrstcr_subj_code) - width 80
    - Course# (sfrstcr_crse_numb) - width 80
    - Student ID (spriden_id) - width 100
    - Student Name (combine last_name + ', ' + first_name) - width 200
  Grouping: by CRN (page break between courses)
  Page Footer: "Page {{page}} of {{pages}}"
  Export: CSV, PDF, Excel

The Parameters (what the user sees at the top of the Report):

Parameter 1: Term (dropdown)
  Widget: Drop Down
  Bound to: :main_DD_term_code
  Options query: SELECT stvterm_code, stvterm_desc
                 FROM stvterm
                 WHERE stvterm_start_date <= SYSDATE + 365
                 ORDER BY stvterm_code DESC

Parameter 2: Subject (optional)
  Widget: Edit Box
  Bound to: :main_EB_subj_code
  Default: (empty)
  Validation: 3-4 letter subject code

The three parts are separate but interlocked — the Parameter flows into the DataBlock's SQL via substitution, the DataBlock produces rows, the Report formats them. Change one, the others stay stable. That stability is the design: you can fix the SQL without touching the layout, or add a Parameter without rewriting the Report.

Where intuition fails

A "broken report" is rarely the Report layout's fault. Most user-reported problems ("the report shows wrong numbers," "I selected a term and got no rows") are DataBlock or Parameter issues, not layout issues. Diagnose in this order: Parameter (did the user's value reach the DataBlock correctly?) → DataBlock (does the SQL return the expected rows in a test query?) → Report (is the layout hiding or grouping rows in a misleading way?).

A DataBlock can be shared across Reports. Changing the DataBlock changes EVERY consuming Report. If you add or remove a column from the DataBlock, every Report that references that column breaks. Use the BSS Argos export feature or the Argos designer's "where is this DataBlock used?" view before editing. See Shared DataBlocks — One SQL, Many Reports.

Sub-reports are full Reports with their own DataBlocks. A banded child section inside a parent Report is itself a complete Report+DataBlock+Parameters structure, just nested. :lcl_* parameters defined on the child are invisible to the parent. See Argos Parameters — `:main_`, `:lcl_`, `:dbn_` for the scope rules.

Parameter options queries are separate from the main SQL. The dropdown population query (e.g., "list every active term") runs at REPORT-OPEN time, not at run time. If a new term is added to STVTERM after the user opens the Report, the new term won't appear in the dropdown until the Report is re-opened. Refresh fixes it.

Report formatting tricks have export-format consequences. A column hidden in the PDF export may still be visible in the CSV export (or vice versa). Page-footer images render in PDF but disappear in Excel. Test each export format independently before declaring the Report done.

The one-sentence takeaway

Every Argos report is three components: the DataBlock (SQL + parameter declarations — the load-bearing layer the user never sees), the Report (layout + formatting — what the user calls "the report"), and the Parameters (widgets — what the user controls). When something breaks, the X-ray tells you where to look: wrong data → DataBlock SQL; ugly layout → Report formatting; user can't enter the right value → Parameter widget config.

Track A · What is it, really

TERM Codes — The Academic Timestamp Banner Uses Everywhere

7 min read · banner · term-code · stvterm · sorting · foundation · academic-calendar

The hook

You see '202610' in every WHERE clause you write. You have used MAX(sgbstdn_term_code_eff) a hundred times. But nobody ever told you why the format was chosen, why it sorts correctly without casting, or what STVTERM actually holds beyond its description column. The term code is not a magic number. It is ISO 8601 adapted to academic time — and the format IS the feature.

The everyday analogy

ISO 8601 is the international standard for writing dates: YYYY-MM-DD. The order is not arbitrary. Year first, then month, then day — all fixed-width, all zero-padded. Why? Because that order means a computer can sort dates correctly with plain string comparison. No date library needed. No parsing. No casting.

"2026-09-15" < "2026-10-01" is true. The strings compare lexicographically from left to right. The year prefix matches, so the comparison falls to the month field, where 09 < 10. The format is engineered so the dumbest possible sort produces the right chronological order. A shelf of file folders labeled in ISO 8601 naturally stays in chronological order just by shelving them alphabetically. Nobody has to re-sort the shelf after adding a new folder. The label does the work.

Banner term codes use the same trick, adapted to academic time. The format is YYYYTT — four digits of academic year, two digits of term-within-year. '202610' < '202620' < '202630' < '202710' is true by plain string comparison. Fall 2026 ('202610') sorts before Spring 2027 ('202620') because the year prefixes match and 10 < 20. Summer 2027 ('202630') sorts before Fall 2027 ('202710') because 2026 < 2027, so the year prefix comparison resolves before the term suffix is ever examined. The MAX() subquery used on every effective-dated table (see Effective Dating — Why Banner Never Forgets) can operate on *_term_code_eff columns without ever calling a date function. Banner's designers picked this format on purpose: lexicographic sort equals chronological sort, for free.

A row of ISO-8601-dated file folders on a shelf, sorted naturally because the year-month-day prefix sorts correctly; one folder labeled '202610' shelved among them, showing Banner term codes use the same trick.

The trick has the same payoffs as ISO 8601: indexes work without type tricks, comparisons are universal across databases, there is no ambiguity about whether "10" means October or Fall, and human readers learn the pattern after seeing two examples. It has the same gotcha too: you must respect the format. The moment you CAST a term code to INTEGER, you lose the safety net. The strings are how the system is designed. Trust the strings.

What it really is

A term code is a six-character string in Banner that identifies a specific academic session. The format is YYYYTT:

**YYYY** — four-digit academic year. At most institutions, this is the calendar year of the primary Fall term. Fall 2026 is 202610, and the 2026 prefix anchors it to the academic year that runs Fall 2026 through Summer 2027.
**TT** — two-digit term-within-year identifier. The convention at most Banner installations is 10 = Fall, 20 = Spring, 30 = Summer. This is convention, not law — always verify against your STVTERM table.

**STVTERM** is the master lookup table. One row per term code. Its key columns:

| Column | What it holds |
| --- | --- |
| `STVTERM_CODE` | The `YYYYTT` string (PK) |
| `STVTERM_DESC` | Human-readable description ("Fall 2026") |
| `STVTERM_START_DATE` | Calendar start of the term |
| `STVTERM_END_DATE` | Calendar end of the term |
| `STVTERM_ACYR_CODE` | Academic year code — may use its own format, separate from the term code's YYYY prefix |
| `STVTERM_FA_PROC_YR` | Financial aid processing year — follows federal FA calendar rules, not the academic calendar |

Joining to STVTERM is how you turn a code like '202610' into a human-readable label, a date range, or the correct academic year for reporting.

Lexicographic sort = chronological sort. Because the format places YYYY first and both YYYY and TT are zero-padded and fixed-width, ORDER BY term_code returns terms in correct chronological order. MAX(term_code) returns the latest term. No casting needed. No date library called. This property is what makes the Banner MAX-effective subquery pattern from The MAX() Subquery — Getting the Row That's Current work on *_term_code_eff columns.

Anatomy of '202610': the YYYY prefix highlighted as academic year, the TT suffix highlighted as season (10=Fall), plus the STVTERM row that maps the code to its description, start date, end date, and academic year code.

Term codes appear in MANY columns across Banner, each with its own semantic:

Effective markers: SGBSTDN_TERM_CODE_EFF, SCBCRSE_EFF_TERM, SGRADVR_TERM_CODE_EFF — "the term this version took effect." (NBRJOBS is effective-dated too, but its marker is a DATE column rather than a term code.)
Transaction markers: SFRSTCR_TERM_CODE (registrations), SHRGRDE_TERM_CODE (grades), TBRACCD_TERM_CODE (accounts receivable) — "the term this event belongs to."
Admissions pipeline: SARADAP_TERM_CODE — "the term the applicant is applying to."

Each column answers a different question, but the format is universal. The time semantics differ — effective vs. transactional vs. target — but '202610' always means Fall 2026, everywhere.

The term code is the third foundational invariant in Banner. After PIDM (the person key, PIDM — The Number Behind Every Person) and effective dating (the version markers, Effective Dating — Why Banner Never Forgets), term codes are the time axis that every academic transaction lives on.

See it — the diagram

The sorting property is the whole point.

Four term codes sorted lexicographically left-to-right, with a parallel timeline showing they line up chronologically. The visual payoff: format-first design means string sort equals time sort.

Four codes laid out left to right — '202510', '202520', '202530', '202610' — with a parallel calendar timeline underneath. The string order and the calendar order are identical. This is not a coincidence. The format was chosen so that the string comparison '202530' < '202610' is true for the same reason September 30 comes before October 1 in ISO 8601: the year-month prefix dominates, and 2025 is less than 2026. The TT suffix only matters when the YYYY prefixes match — exactly when it should. The MAX() subquery in The MAX() Subquery — Getting the Row That's Current is the single most common consumer of this property: WHERE sgbstdn_term_code_eff = (SELECT MAX(s2.sgbstdn_term_code_eff) ...) works because MAX() on a VARCHAR column produces the chronologically latest term. The format earns that query its correctness.

Show me the code

**The simplest STVTERM query — turn a code into a label:**

SELECT stvterm_code,
       stvterm_desc,
       stvterm_start_date,
       stvterm_end_date
FROM   stvterm
WHERE  stvterm_code = '202610';
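The same lookup can return the academic year and financial aid year columns listed in the table above, which is safer than deriving either from the term code's first four digits. A minimal sketch, assuming the 10/20/30 convention so that the range covers one academic year:

```sql
-- Read both year columns from STVTERM instead of deriving them.
SELECT t.stvterm_code,
       t.stvterm_desc,
       t.stvterm_acyr_code,
       t.stvterm_fa_proc_yr
FROM   stvterm t
WHERE  t.stvterm_code BETWEEN '202510' AND '202530'
ORDER BY t.stvterm_code;
```

The values returned depend on how your installation populates these columns, so no expected output is shown.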

Sort terms chronologically — no casting needed:

-- Lexicographic sort = chronological sort, because of YYYYTT.
-- This is the foundation of every MAX(term_code) subquery.
SELECT stvterm_code, stvterm_desc
FROM   stvterm
WHERE  stvterm_code BETWEEN '202410' AND '202710'
ORDER BY stvterm_code;
-- Returns: 202410 (Fall 2024), 202420 (Spring 2025),
-- 202430 (Summer 2025), 202510 (Fall 2025), ...
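For contrast with the string-based filter above, here is a sketch of the same kind of range test written both ways against SFRSTCR. Whether an index exists on sfrstcr_term_code at your site is an assumption. The point is only that a function wrapped around the column prevents a plain index on that column from being used.

```sql
-- Preferred: compare strings to strings; a plain index on the column stays usable.
SELECT COUNT(*) AS registrations
FROM   sfrstcr r
WHERE  r.sfrstcr_term_code >= '202410';

-- Avoid: TO_NUMBER on the column hides it from a plain index,
-- and raises ORA-01722 (invalid number) if any stored value is not numeric.
SELECT COUNT(*) AS registrations
FROM   sfrstcr r
WHERE  TO_NUMBER(r.sfrstcr_term_code) >= 202410;
```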

Find the current term as of today:

SELECT stvterm_code, stvterm_desc
FROM   stvterm
WHERE  TRUNC(SYSDATE) BETWEEN stvterm_start_date AND stvterm_end_date;
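The current-term query above returns no rows during a gap between terms. One possible fallback, sketched here, returns the term in session if there is one and otherwise the next term to start. It assumes stvterm_end_date is populated for every row, that terms do not overlap in ways that matter to your report, and that any placeholder rows your installation keeps in STVTERM are filtered out separately.

```sql
-- Current term if one is in session, otherwise the next term to start.
-- Terms that have already ended are excluded; the earliest remaining start wins.
SELECT stvterm_code,
       stvterm_desc
FROM  (SELECT t.stvterm_code,
              t.stvterm_desc
       FROM   stvterm t
       WHERE  t.stvterm_end_date >= TRUNC(SYSDATE)
       ORDER BY t.stvterm_start_date)
WHERE  ROWNUM = 1;
```

The ordered inline view with ROWNUM = 1 is used instead of FETCH FIRST so the sketch also runs on older Oracle versions.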

Use a term code as an effective marker — the canonical pattern from The MAX() Subquery — Getting the Row That's Current:

-- Student's current curriculum, using term codes as version markers.
-- This works because MAX() on YYYYTT strings sorts correctly.
SELECT s.sgbstdn_pidm,
       s.sgbstdn_majr_code_1,
       s.sgbstdn_term_code_eff
FROM   sgbstdn s
WHERE  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm);

Where intuition fails

Five gotchas — even experienced Banner SQL writers trip on these:

**The TT digits are convention, not law — verify against STVTERM.** The 10/20/30 mapping (Fall/Spring/Summer) is the convention at most colleges, but some installations use different digits, and some inherited legacy data uses entirely different formats. Always eyeball STVTERM to confirm. If you write SQL that assumes SUBSTR(term, 5, 2) = '10' means Fall, document that assumption and validate it before shipping the report.

**Do not CAST term codes to INTEGER.** The strings sort correctly without it. Casting to integer defeats any index on the term column and breaks if your installation ever has non-standard 7-character or 8-character term codes in legacy data. The strings are the contract. Trust them.

**STVTERM_ACYR_CODE is NOT the same as the YYYY prefix.** A term code's first four digits identify the term's anchor year, but the academic year code is a separate column with its own format. Some installations use a six-digit academic year code (e.g. 202526 for the AY spanning Fall 2025 through Summer 2026). Reports that filter by academic year should join to STVTERM and use STVTERM_ACYR_CODE, not derive the year from the term code's first four digits.

Financial aid uses its own year. STVTERM_FA_PROC_YR is the FA processing year, which follows federal financial-aid calendar rules and can differ from the term's anchor year. Fall 2025 ('202510') is FA year 2526 — not 2025. If you are writing financial aid reports, never derive the FA year from the term code yourself; always read STVTERM_FA_PROC_YR.

"Current term" is not a column — it is a query. Banner has no IS_CURRENT_TERM = 'Y' flag on STVTERM. To find the current term, query STVTERM for the row whose date range includes SYSDATE. During inter-term gaps (between Spring end and Summer start), there may be zero matching rows. Handle "no current term" gracefully, or extend the query to find the nearest upcoming term.

The one-sentence takeaway

Banner term codes are YYYYTT strings engineered so that lexicographic sort equals chronological sort — the same trick ISO 8601 uses. Trust the strings. Join STVTERM for the human-readable label. Never cast to integer.

Track B · The canonical joins

Joining by PIDM — SPRIDEN and the Universal Key

5 min read · banner · pidm · spriden · join · canonical · foundation

The hook

The everyday analogy

Fly into a country and the customs officer holds out a hand for one thing: your passport. The form you filled out on the plane is different in every country — different boxes, different languages, different colors of ink — but the passport is the same. The officer matches your passport number to your arrival record, checks the photo against your face, and waves you through.

The customs office does not look up arriving passengers by name. Names are messy — spellings vary, transliterations differ, married/maiden distinctions. Names are how the passenger thinks of themselves; passport numbers are how governments track them. Every country, every airport, every customs counter joins to the same passport database on the same number — and then displays the current name from that database for the form they need to print.

A customs counter at an international airport, a passport held open on the desk with its number highlighted in coral; behind the officer, a wall of arrival-form templates in different languages, all sharing the same passport-number lookup.

Banner is the customs office and PIDM is the passport number. Every person-bearing Banner table — SFRSTCR (the registration form), PHRHIST (the payroll record), NBRJOBS (the job assignment), GOBEACC (the security badge) — holds the PIDM. To put a human-readable name on the report, you join to SPRIDEN (the passport database) on PIDM and pull the current name. The join is identical every time because the contract is identical: same passport, same translation.

A returning citizen presents the same passport as a visiting tourist — the passport database does not care WHY you are entering. A person playing multiple roles in Banner (student + employee + vendor) shows the same PIDM at every counter; the join pattern is unchanged whether the source table is a student record or an employee record. Three conditions, one pattern, every counter.

What it really is

The canonical SPRIDEN join has three conditions, and they all live INSIDE the ON clause:

**s.spriden_pidm = <source>.<col>_pidm** — the actual join key. The _pidm suffix is universal across person-bearing Banner tables: SFRSTCR_PIDM, SGBSTDN_PIDM, PHRHIST_PIDM, NBRJOBS_PIDM, GOBEACC_PIDM, FTVVEND_PIDM.
**s.spriden_change_ind IS NULL** — restrict to the CURRENT name row. SPRIDEN holds one row per name version per person; without this filter the join multiplies rows by every historical name change. See SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap.
**s.spriden_entity_ind = 'P'** — restrict to people, not companies. The PIDM space is shared with corporations ('C'); vendor records can leak into person rosters without this filter.

Why all three belong in ON, not WHERE: with INNER JOIN, filters in WHERE behave the same. But the moment someone changes the JOIN to LEFT JOIN (to include sources without a SPRIDEN row), an equality filter in WHERE such as spriden_entity_ind = 'P' rejects the NULL-extended rows and silently converts the LEFT JOIN back to an INNER. See The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN for the trap. Filters that belong to the join go in ON.
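The difference is easy to reproduce without touching Banner. The sketch below uses two throwaway inline tables built from dual (src and ident are hypothetical stand-ins for a source table and SPRIDEN) and counts the rows each variant returns. PIDM 102 deliberately has no identity row.

```sql
WITH src AS (
    SELECT 101 AS pidm FROM dual UNION ALL
    SELECT 102 AS pidm FROM dual
),
ident AS (
    SELECT 101 AS pidm, 'Rivera' AS last_name, 'P' AS entity_ind FROM dual
)
-- Filter in ON: the unmatched source row survives with NULL name columns.
SELECT 'filter in ON' AS variant, COUNT(*) AS row_count
FROM   src
LEFT JOIN ident i
       ON  i.pidm       = src.pidm
       AND i.entity_ind = 'P'
UNION ALL
-- Filter in WHERE: the NULL-extended row fails the equality test and is dropped.
SELECT 'filter in WHERE' AS variant, COUNT(*) AS row_count
FROM   src
LEFT JOIN ident i
       ON  i.pidm = src.pidm
WHERE  i.entity_ind = 'P';
-- filter in ON    -> 2
-- filter in WHERE -> 1
```

Both statements say LEFT JOIN, yet only the first behaves like one.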

Three source tables (SFRSTCR, PHRHIST, GOBEACC) on the left, each with a _pidm column highlighted; a single SPRIDEN box on the right with spriden_pidm highlighted; three CORAL arrows converging from the source tables to SPRIDEN, each labeled with the 3-condition ON clause.

Common SELECT choices from SPRIDEN: SPRIDEN_ID (the 8-digit visible Banner ID — what users recognize); SPRIDEN_LAST_NAME, SPRIDEN_FIRST_NAME, SPRIDEN_MI (current name components); SPRIDEN_SSN (sensitive — avoid unless audit-required). Never display the raw PIDM to users (see PIDM — The Number Behind Every Person gotcha 5).

When you need TWO different people in the same query (student + advisor, employee + supervisor), the pattern extends to two SPRIDEN joins with different aliases. See The Double SPRIDEN — Naming Two People in One Query.

See it — the diagram

Three source tables on the left — SFRSTCR (registrations), PHRHIST (payroll), GOBEACC (security accounts) — each with their _pidm column highlighted in coral. Three coral arrows converge from those columns to a single SPRIDEN box on the right, its spriden_pidm column highlighted. Each arrow carries the full 3-condition ON clause in small mono type below it. The visual says: three different source tables, three different report types, one PIDM, one SPRIDEN join pattern. The passport analogy made structural: same lookup, every counter.

Show me the code

The canonical join — course roster:

-- Course roster for a specific term: PIDM is the passport,
-- SPRIDEN translates it to a current name.
SELECT s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name,
       r.sfrstcr_crn,
       r.sfrstcr_credit_hr
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = '202610'
ORDER BY s.spriden_last_name, s.spriden_first_name;

Same pattern, different source — employee payroll line:

SELECT s.spriden_id,
       s.spriden_last_name || ', ' || s.spriden_first_name AS name,
       p.phrhist_year,
       p.phrhist_payno,
       p.phrhist_gross
FROM   phrhist p
JOIN   spriden s
       ON  s.spriden_pidm        = p.phrhist_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  p.phrhist_disp = 'P';

Vendor source — individuals vs companies:

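-- Vendors can be people ('P') or companies ('C'), so this source
-- widens the entity filter and shows which kind each row is.
SELECT s.spriden_id,
       s.spriden_entity_ind,
       s.spriden_last_name AS company_or_lastname,
       v.ftvvend_vend_code,
       v.ftvvend_active_ind
FROM   ftvvend v
JOIN   spriden s
       ON  s.spriden_pidm        = v.ftvvend_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  IN ('P', 'C')
WHERE  v.ftvvend_active_ind = 'Y';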

Where intuition fails

All three conditions in ON, never in WHERE. The single most copied bug. WHERE spriden_entity_ind = 'P' works for INNER JOIN but silently converts a LEFT JOIN back to INNER, because the NULL-extended rows fail the equality test. Put the change_ind and entity_ind filters in ON alongside the PIDM equality.

**Omitting entity_ind = 'P' lets corporations into person rosters.** Most reports never see corporations, but the moment a join chain touches a vendor table, the missing filter shows up as "Acme Office Supplies, Inc." in the student list. The filter is cheap insurance.

**LEFT JOIN spriden when SPRIDEN might be missing.** A source table can hold a PIDM that has been hard-deleted from SPRIDEN (rare but possible in legacy migrations). Inner-joining drops those rows silently. Use LEFT JOIN with COALESCE(spriden_last_name, 'UNKNOWN') if completeness matters.

**Joining by SPRIDEN_ID instead of SPRIDEN_PIDM.** The ID is the visible 8-digit number users recognize; the PIDM is the internal surrogate. The ID can change (corrections, re-issues); the PIDM cannot. Always join on PIDM. See PIDM — The Number Behind Every Person.

**Selecting SPRIDEN_SSN without need.** The Social Security Number is sensitive PII. Including it in a SELECT exposes it to logs, exports, screenshots, and people who should not see it. Default to not selecting it; require an explicit audit-trail justification.
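The LEFT JOIN advice above can be sketched as follows. It keeps every registration row for the term even when no current SPRIDEN row exists, and labels the gap instead of dropping it. The raw PIDM is selected here only so an analyst can follow up on orphaned rows; it is not meant for an end-user report.

```sql
-- Completeness-first roster: registrations survive a missing SPRIDEN row.
SELECT r.sfrstcr_pidm,
       r.sfrstcr_crn,
       COALESCE(s.spriden_id, 'UNKNOWN')        AS banner_id,
       COALESCE(s.spriden_last_name, 'UNKNOWN') AS last_name
FROM   sfrstcr r
LEFT JOIN spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = '202610';
```

All three SPRIDEN conditions stay in ON, and the WHERE clause filters only the source table, so the outer join is preserved.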

The one-sentence takeaway

The canonical SPRIDEN join is three conditions in ON: spriden_pidm = <source>_pidm AND spriden_change_ind IS NULL AND spriden_entity_ind = 'P'. All three belong in ON, never in WHERE. The pattern is identical across every person-bearing Banner table. One idiom, one pattern, every counter.

Track B · The canonical joins

TERM_CODE + CRN — The Registration Compound Key

You write JOIN ssbsect ON ssbsect_crn = sfrstcr_crn. The query runs. It returns rows — five times more than expected. The CRN looked global. It is not. CRN is unique only WITHIN a term, and you just joined across every term that ever reused it.

5 min read · banner · term-code · crn · sfrstcr · ssbsect · compound-key · join

The hook

The everyday analogy

Look at a boarding pass. The big number on it — UA 1234 — is the flight number. Type just UA 1234 into a flight-status app and the app asks "for which date?" Because flight number UA 1234 is run almost every day. The flight from Chicago to Denver on March 15 is a completely different flight from the one on March 16 — different crew, different aircraft, different passengers, different weather, different on-time history. The flight number identifies the ROUTE; the date identifies the SPECIFIC FLIGHT.

A boarding pass on a wooden desk showing flight number 'UA 1234' large in coral, date '2026-03-15' beside it in amber; alongside, a second boarding pass with same flight number 'UA 1234' but different date '2026-03-16' — two different flights, same number.

To find your seat, the airline's system needs BOTH: flight number AND date. The seat assignment on UA 1234 / 2026-03-15 is unrelated to the seat assignment on UA 1234 / 2026-03-16. The same is true of every operational record — the catering manifest, the fuel order, the gate assignment, the delay log. All keyed on the compound (flight number, date).

Banner's registration system has the same shape. CRN 12345 is the flight number — a Course Reference Number that identifies a specific section pattern. TERM_CODE '202610' is the date — the academic session that section was offered in. Together they identify one specific section: CRN 12345 in Fall 2026, with its specific instructor, meeting times, enrolled students, and grade roster. CRN 12345 in Spring 2027 is a different section — possibly the same course taught by the same instructor, possibly something entirely unrelated. The CRN is the route; the term is the date; you need both.

What it really is

CRN (SFRSTCR_CRN, SSBSECT_CRN) is a 5-digit number unique WITHIN A TERM but reused ACROSS TERMS. CRN 12345 in Fall 2026 has no relationship to CRN 12345 in Spring 2027 even if they share an instructor or a subject code.

TERM_CODE (SFRSTCR_TERM_CODE, etc.) is the academic session anchor — see TERM Codes — The Academic Timestamp Banner Uses Everywhere for the YYYYTT format.

Together they form the compound primary key of a course section. SSBSECT (the section master) has a composite PK on (ssbsect_term_code, ssbsect_crn). Every joining table — SFRSTCR, SHRGRDE, SSRMEET — references both.

Two table cards (SFRSTCR and SSBSECT) side by side; each card shows two highlighted cells — term_code and crn; two CORAL arrows connect the two pairs of cells between the cards; below, the ON clause with both equality conditions in mono.

The JOIN pattern is two equality conditions in the ON clause: one on _term_code, one on _crn. Same shape every time, just paired across different source tables.

Why CRNs are reused: the registration system has finite CRN space (5 digits = max 99,999) and re-uses CRNs across terms by design. A section that ran in Fall 2020 may have its CRN recycled to a completely different course in Fall 2026. Joining on CRN alone treats these as the same section — wrong.

To pull the section's details (title, meeting times, instructor), the canonical chain is SFRSTCR → SSBSECT → SCBCRSE. See Catalog vs Section — SCBCRSE and SSBSECT for the catalog-vs-section distinction and The MAX() Subquery — Getting the Row That's Current for the SCBCRSE effective-dating pattern.

See it — the diagram

Two table cards side by side — SFRSTCR on the left, SSBSECT on the right. Each card highlights two cells in coral: term_code and crn. Two coral arrows connect the matching pairs: one from sfrstcr_term_code to ssbsect_term_code, one from sfrstcr_crn to ssbsect_crn. Below, the ON clause is written in monospace: ON sect.ssbsect_term_code = r.sfrstcr_term_code AND sect.ssbsect_crn = r.sfrstcr_crn. The visual is the flight-number-plus-date pattern rendered as a SQL join: two columns in the ON, never one.

Show me the code

The simple roster — join SFRSTCR to SSBSECT on the compound key:

-- Roster for a specific section: TERM_CODE + CRN both required.
SELECT s.spriden_id,
       s.spriden_last_name,
       sect.ssbsect_subj_code,
       sect.ssbsect_crse_numb,
       sect.ssbsect_seq_numb,
       r.sfrstcr_credit_hr
FROM   sfrstcr r
JOIN   ssbsect sect
       ON  sect.ssbsect_term_code = r.sfrstcr_term_code
       AND sect.ssbsect_crn       = r.sfrstcr_crn
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = '202610'
  AND  r.sfrstcr_crn       = '12345';

The bug — CRN-only join (silent multiplication):

-- WRONG: joins on CRN alone, ignoring term_code.
-- If CRN 12345 has been reused in 3 prior terms, this returns
-- 4x the expected rows.
SELECT r.sfrstcr_pidm, sect.ssbsect_subj_code
FROM   sfrstcr r
JOIN   ssbsect sect ON sect.ssbsect_crn = r.sfrstcr_crn
WHERE  r.sfrstcr_term_code = '202610';
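
The multiplication can be reproduced without a Banner instance. The sketch below is a toy, not the real schema: it uses Python's built-in sqlite3 module, in-memory tables that borrow the SFRSTCR and SSBSECT names with only the columns the join needs, and invented rows in which CRN 12345 appears in four terms.

```python
import sqlite3

# Toy stand-ins for SFRSTCR and SSBSECT; only the columns the join needs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ssbsect (ssbsect_term_code TEXT, ssbsect_crn TEXT, ssbsect_subj_code TEXT,
                      PRIMARY KEY (ssbsect_term_code, ssbsect_crn));
CREATE TABLE sfrstcr (sfrstcr_term_code TEXT, sfrstcr_crn TEXT, sfrstcr_pidm INTEGER);
-- CRN 12345 is reused in four terms for four unrelated subjects.
INSERT INTO ssbsect VALUES ('202010','12345','HIST'), ('202210','12345','CHEM'),
                           ('202410','12345','MATH'), ('202610','12345','BIOL');
-- Two students registered in CRN 12345 for term 202610.
INSERT INTO sfrstcr VALUES ('202610','12345',1001), ('202610','12345',1002);
""")

# Correct: both halves of the compound key in ON.
compound_rows = con.execute("""
    SELECT r.sfrstcr_pidm, sect.ssbsect_subj_code
    FROM   sfrstcr r
    JOIN   ssbsect sect
           ON  sect.ssbsect_term_code = r.sfrstcr_term_code
           AND sect.ssbsect_crn       = r.sfrstcr_crn
    WHERE  r.sfrstcr_term_code = '202610'
""").fetchall()

# Wrong: CRN alone matches the section row from every term that reused it.
crn_only_rows = con.execute("""
    SELECT r.sfrstcr_pidm, sect.ssbsect_subj_code
    FROM   sfrstcr r
    JOIN   ssbsect sect ON sect.ssbsect_crn = r.sfrstcr_crn
    WHERE  r.sfrstcr_term_code = '202610'
""").fetchall()

print(len(compound_rows), "rows with the compound key")
print(len(crn_only_rows), "rows with CRN alone")
```

With this data the compound join returns one row per registered student, while the CRN-only join returns each student once per term that ever used the CRN, including subjects the student never took.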

Grade history for a section (the same compound key, applied here as two filters in WHERE):

SELECT s.spriden_id,
       g.shrgrde_grde_code_final,
       g.shrgrde_credit_hours
FROM   shrgrde g
JOIN   spriden s
       ON  s.spriden_pidm        = g.shrgrde_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  g.shrgrde_term_code = '202610'
  AND  g.shrgrde_crn       = '12345';

Where intuition fails

CRN-alone joins are the single most common multiplication bug in registration reporting. The query "looks" right — one extra condition in WHERE, one fewer condition in ON — but returns 2x, 5x, 10x the rows depending on how often the CRN has been reused. The BSS SQL Explainer flags SSBSECT JOIN ... ON crn without the term_code companion.

Both conditions belong in ON, not WHERE. Same lesson as Joining by PIDM — SPRIDEN and the Universal Key — putting AND ssbsect_term_code = sfrstcr_term_code in WHERE works for INNER JOIN but converts a LEFT JOIN to an effective INNER. See The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN.
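
A small experiment makes the ON-versus-WHERE difference visible. This is a sketch under stated assumptions: an in-memory SQLite database driven from Python's sqlite3 module, toy tables that reuse the SFRSTCR and SSBSECT column names, and an invented registration (CRN 99999) whose section row is missing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ssbsect (ssbsect_term_code TEXT, ssbsect_crn TEXT, ssbsect_subj_code TEXT);
CREATE TABLE sfrstcr (sfrstcr_term_code TEXT, sfrstcr_crn TEXT, sfrstcr_pidm INTEGER);
INSERT INTO ssbsect VALUES ('202610','12345','BIOL');
-- Student 1002 is registered in a CRN that has no section row for 202610.
INSERT INTO sfrstcr VALUES ('202610','12345',1001), ('202610','99999',1002);
""")

# Both key conditions in ON: the LEFT JOIN keeps the unmatched registration.
both_in_on = con.execute("""
    SELECT r.sfrstcr_pidm, sect.ssbsect_subj_code
    FROM   sfrstcr r
    LEFT JOIN ssbsect sect
           ON  sect.ssbsect_term_code = r.sfrstcr_term_code
           AND sect.ssbsect_crn       = r.sfrstcr_crn
    ORDER BY r.sfrstcr_pidm
""").fetchall()

# Term condition moved to WHERE: NULL = '202610' is not true, so the
# unmatched row is filtered out and the LEFT JOIN behaves like an INNER JOIN.
term_in_where = con.execute("""
    SELECT r.sfrstcr_pidm, sect.ssbsect_subj_code
    FROM   sfrstcr r
    LEFT JOIN ssbsect sect
           ON  sect.ssbsect_crn = r.sfrstcr_crn
    WHERE  sect.ssbsect_term_code = r.sfrstcr_term_code
    ORDER BY r.sfrstcr_pidm
""").fetchall()

print(both_in_on)
print(term_in_where)
```

The first result keeps student 1002 with a NULL subject; the second drops that student entirely.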

CRN format varies by installation. Most Banner sites use 5-digit numeric CRNs. Some sites use 4-digit. Older migrations sometimes have alphanumeric CRNs. The format does not change the compound-key rule — join on both columns regardless.

**SSRMEET (meeting times) has multiple rows per section.** A section gets one SSRMEET row per meeting pattern, so a MWF 9-10 lecture with a separate Thursday lab typically has two rows. Joining SSBSECT to SSRMEET on the (term, CRN) compound key therefore returns several meeting-time rows per section, which is expected. Aggregate or filter if you need one row per section.
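
As an illustration of collapsing meeting rows back to one row per section, here is a toy sketch. It assumes an in-memory SQLite database via Python's sqlite3 module and a heavily simplified SSRMEET with just the key columns and a begin time; the two meeting rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ssbsect (ssbsect_term_code TEXT, ssbsect_crn TEXT, ssbsect_subj_code TEXT);
CREATE TABLE ssrmeet (ssrmeet_term_code TEXT, ssrmeet_crn TEXT, ssrmeet_begin_time TEXT);
INSERT INTO ssbsect VALUES ('202610','12345','BIOL');
-- Two meeting patterns for one section: a lecture slot and a lab slot.
INSERT INTO ssrmeet VALUES ('202610','12345','0900'), ('202610','12345','1400');
""")

# The compound-key join, shared by both queries below.
join_sql = """
    FROM   ssbsect sect
    JOIN   ssrmeet m
           ON  m.ssrmeet_term_code = sect.ssbsect_term_code
           AND m.ssrmeet_crn       = sect.ssbsect_crn
"""

# Plain join: one output row per meeting pattern.
detail_rows = con.execute(f"""
    SELECT sect.ssbsect_crn, m.ssrmeet_begin_time
    {join_sql}
    ORDER BY m.ssrmeet_begin_time
""").fetchall()

# Aggregated: one output row per (term, CRN).
per_section = con.execute(f"""
    SELECT sect.ssbsect_term_code, sect.ssbsect_crn,
           COUNT(*)                  AS meeting_rows,
           MIN(m.ssrmeet_begin_time) AS first_begin_time
    {join_sql}
    GROUP BY sect.ssbsect_term_code, sect.ssbsect_crn
""").fetchall()

print(detail_rows)
print(per_section)
```

Grouping on both key columns keeps the compound-key rule intact even in the aggregate.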

Sub-sections (lab + lecture pairings) share a linked-section code. SSBSECT_LINK_IDENT connects a lecture section to its required labs. A query wanting both lecture AND linked labs follows this link in addition to the compound key — a separate, more advanced join pattern.

The one-sentence takeaway

CRN is unique within a term, not across terms. Every join involving CRN must include TERM_CODE as a second condition. The compound key is (term_code, crn) — two columns in ON, never one. CRN-alone joins silently multiply rows by the number of historical reuses.

Track B · The canonical joins

The MAX() Subquery — Getting the Row That's Current

8 min read · banner · sql-pattern · effective-dating · correlated-subquery · sgbstdn · scbcrse · nbrjobs

The hook

You will write this pattern a hundred times in your Banner career. Four lines of SQL — a self-join alias, a MAX() over an effective-date column, a correlation predicate, and an optional <= bound — that look like noise the first time you see them, and like the only thing holding the report together every time after. It is the single most important SQL idiom in the entire Banner codebase: the correlated subquery on MAX(effective_date). Master it once, and every effective-dated table — SGBSTDN, SCBCRSE, NBRJOBS, SGRADVR — opens up. Skip it, and your reports silently multiply rows and mislabel history. There is no middle ground.

The everyday analogy

Open the Wayback Machine at archive.org. Type a URL. Then pick a date — say, March 15, 2014. The Wayback Machine does not show you today's version of that website. It does not show you the oldest snapshot it has. It looks at every snapshot ever captured of that URL, filters to the ones whose capture date is on or before March 15, 2014, and shows you the most recent one of those. The snapshot that was current as of the date you asked about.

The Wayback Machine calendar: a URL typed in, a date pin stuck on March 15 2014, a vertical stack of dated snapshots behind — the one whose capture date is the most recent on-or-before the pin is highlighted in coral. The snapshot that was current at the moment you asked.

The mechanics inside that query are exactly what your Banner SQL is doing. The URL is the entity: a student's PIDM in SGBSTDN, a course's subject-plus-number in SCBCRSE, an employee's position in NBRJOBS. The snapshots are the rows in those tables, one per effective-date version, stacked like the geological strata from Effective Dating — Why Banner Never Forgets. The capture date on each snapshot is the effective-date column. The "find the most recent snapshot at or before X" operation is the correlated subquery:

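-- o = outer alias, i = inner alias (INNER and OUTER are reserved words, so they cannot be used as alias names).
o.eff_column = (
    SELECT MAX(i.eff_column)
    FROM   same_table i
    WHERE  i.entity_columns = o.entity_columns   -- same entity only
      AND  i.eff_column <= target_date           -- the "as-of" bound
)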

You are running a Wayback Machine over rows that look like a flat table. The correlated subquery is doing the entity filter ("only snapshots of THIS URL"), the date bound ("on or before THIS date"), and the aggregation ("the MAX of what remains") all in one go. That is why it looks busier than a normal WHERE clause. It is not noise. It is three operations compressed into four lines.

What it really is

The pattern has four pieces, and you can point to each one and say what it does:

The outer alias is the row you are testing. FROM sgbstdn s means every row in the table is a candidate. The subquery decides which one survives.

The inner alias is a self-join scoped to ONE entity. The subquery reads FROM sgbstdn s2, but its WHERE restricts s2 to rows that share the same entity as s. For SGBSTDN the entity is PIDM alone: s2.sgbstdn_pidm = s.sgbstdn_pidm. For NBRJOBS the entity is three columns, (pidm, posn, suff), because an employee can hold multiple positions, each with its own version history. For SCBCRSE the entity is (subj_code, crse_numb). Getting the entity correlation wrong is the most common mistake in this pattern.

**The MAX()** runs over the effective column of the scoped inner set. Among the rows that belong to this entity, which one has the highest effective date? That row is the current one, the top of the stratum stack.

**The <= bound** (optional) turns "current" into "as of." Without it, the MAX() returns the latest effective date on file for the entity, period. That is the row that is current today only as long as nothing is future-dated. With it (AND s2.eff_column <= target_date), the MAX() only considers rows whose effective date is on or before the target. That is how you answer "what was this student's major in Fall 2022?" instead of "what is this student's major right now?"
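
The four pieces can be exercised on a few rows. The following is a minimal sketch, assuming an in-memory SQLite database driven through Python's sqlite3 module, a three-column toy SGBSTDN, and invented students; the helper name `majors` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sgbstdn (sgbstdn_pidm INTEGER, sgbstdn_term_code_eff TEXT,
                      sgbstdn_majr_code_1 TEXT);
INSERT INTO sgbstdn VALUES (1001,'202010','BIOL'), (1001,'202210','CHEM'),
                           (1001,'202510','MATH'), (1002,'202310','HIST');
""")

def majors(as_of_term=None):
    """Return {pidm: major} from the latest row per student, optionally as of a term."""
    bound = "AND s2.sgbstdn_term_code_eff <= :term" if as_of_term else ""
    params = {"term": as_of_term} if as_of_term else {}
    sql = f"""
        SELECT s.sgbstdn_pidm, s.sgbstdn_majr_code_1
        FROM   sgbstdn s                                   -- outer alias
        WHERE  s.sgbstdn_term_code_eff = (
               SELECT MAX(s2.sgbstdn_term_code_eff)        -- the MAX()
               FROM   sgbstdn s2                           -- inner alias
               WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm     -- entity correlation
               {bound})                                    -- optional as-of bound
    """
    return dict(con.execute(sql, params).fetchall())

current = majors()                  # no bound: latest row on file
as_of_fall_2022 = majors("202210")  # bound: latest row effective by 202210

print(current)
print(as_of_fall_2022)
```

Student 1002 disappears from the as-of result because that student had no row effective on or before 202210; the subquery yields NULL and no outer row can equal it.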

Anatomy of the correlated subquery: outer alias, inner alias on the same table, the entity correlation predicates that scope the MAX to one entity, the MAX over the effective column, and the optional <= bound that turns 'current' into 'as-of.'

The pattern is read-time work. Every query that touches an effective-dated table re-computes the MAX(). In a warehouse with SCD Type 2 surrogate keys (see Slowly Changing Dimensions — Keeping History When Attributes Change), this work moves to load time — the fact row stores the surrogate key that was current at the fact's date, and the query does a plain equi-join. The Banner source has no such luxury. You pay the MAX() cost at query time because Banner stores history by stacking rows, not by giving you a pre-resolved current pointer.

See it — the diagram

The anatomy diagram labels each of the four pieces on a real subquery. The outer alias on the left, the inner alias scoped by entity on the right, the MAX() aggregating over the scoped set, and the <= bound slicing the set to a point in time. Once you can point to each piece and name it, the pattern stops looking like magic and starts looking like a tool. Every effective-dated Banner table uses the same tool; only the entity columns and the effective-date column name change.

Show me the code

Here is the pattern on three different tables. Notice the shape is identical; only the columns differ.

**Student curriculum — the SGBSTDN pattern.** Entity is PIDM alone. No as-of bound means "the latest version, period":

-- Current curriculum for every student.
-- Correlation: pidm only. No bound = the most recent version.
SELECT s.sgbstdn_pidm,
       s.sgbstdn_majr_code_1   AS major,
       s.sgbstdn_term_code_eff AS effective_term
FROM   sgbstdn s
WHERE  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm);

**Course catalog as it was at registration time — the SCBCRSE pattern.** Entity is (subj_code, crse_numb). The as-of bound is <= sr.sfrstcr_term_code — the catalog row that was current in the term the student registered, not the catalog row that is current today:

-- Course title and credits as they were WHEN THE STUDENT TOOK THE COURSE.
-- Subject and course number come from SSBSECT (chain: SFRSTCR -> SSBSECT -> SCBCRSE).
-- Correlation: (subj_code, crse_numb). Bound: <= the registration's term.
SELECT sr.sfrstcr_term_code,
       sr.sfrstcr_crn,
       sc.scbcrse_subj_code,
       sc.scbcrse_crse_numb,
       sc.scbcrse_title,
       sc.scbcrse_credit_hr_low
FROM   sfrstcr sr
JOIN   ssbsect sect
       ON sect.ssbsect_term_code = sr.sfrstcr_term_code
      AND sect.ssbsect_crn       = sr.sfrstcr_crn
JOIN   scbcrse sc
       ON sc.scbcrse_subj_code = sect.ssbsect_subj_code
      AND sc.scbcrse_crse_numb = sect.ssbsect_crse_numb
      AND sc.scbcrse_eff_term = (
          SELECT MAX(sc2.scbcrse_eff_term)
          FROM   scbcrse sc2
          WHERE  sc2.scbcrse_subj_code = sc.scbcrse_subj_code
            AND  sc2.scbcrse_crse_numb = sc.scbcrse_crse_numb
            AND  sc2.scbcrse_eff_term <= sr.sfrstcr_term_code);

The <= sr.sfrstcr_term_code bound is the piece that makes this historically correct. Without it, every registration row — Fall 2020, Spring 2022, Summer 2024 — joins to the 2024 catalog row. The course titles silently update to whatever they are today. The Effective-Date Trap — Joining to Yesterday's Row covers exactly this gotcha.
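
To see the bound change the answer, the sketch below runs the catalog join with and without it. Assumptions: an in-memory SQLite database via Python's sqlite3 module, toy tables carrying only the columns involved, invented rows for one course whose title changes in 202410, and a join routed through SSBSECT for subject and course number. The helper `titles_by_term` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE scbcrse (scbcrse_subj_code TEXT, scbcrse_crse_numb TEXT,
                      scbcrse_eff_term TEXT, scbcrse_title TEXT);
CREATE TABLE ssbsect (ssbsect_term_code TEXT, ssbsect_crn TEXT,
                      ssbsect_subj_code TEXT, ssbsect_crse_numb TEXT);
CREATE TABLE sfrstcr (sfrstcr_term_code TEXT, sfrstcr_crn TEXT, sfrstcr_pidm INTEGER);
INSERT INTO scbcrse VALUES ('ENGL','201','201010','Introduction to British Literature'),
                           ('ENGL','201','202410','Foundations of British Literature');
INSERT INTO ssbsect VALUES ('202010','11111','ENGL','201'), ('202610','12345','ENGL','201');
INSERT INTO sfrstcr VALUES ('202010','11111',1001), ('202610','12345',1002);
""")

def titles_by_term(bound):
    """Return {registration term: catalog title}; `bound` is spliced into the subquery."""
    sql = f"""
        SELECT sr.sfrstcr_term_code, sc.scbcrse_title
        FROM   sfrstcr sr
        JOIN   ssbsect sect
               ON  sect.ssbsect_term_code = sr.sfrstcr_term_code
               AND sect.ssbsect_crn       = sr.sfrstcr_crn
        JOIN   scbcrse sc
               ON  sc.scbcrse_subj_code = sect.ssbsect_subj_code
               AND sc.scbcrse_crse_numb = sect.ssbsect_crse_numb
               AND sc.scbcrse_eff_term = (
                   SELECT MAX(sc2.scbcrse_eff_term)
                   FROM   scbcrse sc2
                   WHERE  sc2.scbcrse_subj_code = sc.scbcrse_subj_code
                     AND  sc2.scbcrse_crse_numb = sc.scbcrse_crse_numb
                     {bound})
    """
    return dict(con.execute(sql).fetchall())

with_bound = titles_by_term("AND sc2.scbcrse_eff_term <= sr.sfrstcr_term_code")
without_bound = titles_by_term("")

print(with_bound)
print(without_bound)
```

With the bound, the 202010 registration keeps the older title. Without it, both registrations pick up the title from the newest catalog row.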

**Employee job — the NBRJOBS pattern.** Entity is (pidm, posn, suff) — one employee can hold multiple positions (primary job, chair stipend, overload), each with its own version history. The bound is <= SYSDATE because the effective column is a DATE, not a term code:

-- Current job assignment per position per employee.
-- Correlation: (pidm, posn, suff). Bound: <= today.
SELECT nj.nbrjobs_pidm,
       nj.nbrjobs_posn        AS position_code,
       nj.nbrjobs_suff        AS suffix,
       nj.nbrjobs_ann_salary  AS current_salary,  -- baseline Banner names this NBRJOBS_ANN_SALARY; verify locally
       nj.nbrjobs_effective_date
FROM   nbrjobs nj
WHERE  nj.nbrjobs_effective_date = (
       SELECT MAX(nj2.nbrjobs_effective_date)
       FROM   nbrjobs nj2
       WHERE  nj2.nbrjobs_pidm = nj.nbrjobs_pidm
         AND  nj2.nbrjobs_posn = nj.nbrjobs_posn
         AND  nj2.nbrjobs_suff = nj.nbrjobs_suff
         AND  nj2.nbrjobs_effective_date <= SYSDATE);

Three tables, three entity keys, one shape. Learn the shape, and you can apply it to any effective-dated Banner table without looking it up.
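
The entity key is the part of the shape that is easiest to get wrong, so it is worth watching it fail on purpose. This sketch assumes an in-memory SQLite database via Python's sqlite3 module, a toy NBRJOBS with ISO date strings standing in for Oracle DATE values, date('now') standing in for SYSDATE, and one invented employee holding two positions. The helper `current_jobs` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nbrjobs (nbrjobs_pidm INTEGER, nbrjobs_posn TEXT, nbrjobs_suff TEXT,
                      nbrjobs_effective_date TEXT);
-- One employee, two positions, each with its own version history.
INSERT INTO nbrjobs VALUES (2001,'F00100','00','2019-07-01'),
                           (2001,'F00100','00','2021-07-01'),
                           (2001,'S00200','00','2020-01-01');
""")

def current_jobs(correlation):
    """Run the MAX() pattern with the given correlation predicates."""
    sql = f"""
        SELECT nj.nbrjobs_posn, nj.nbrjobs_effective_date
        FROM   nbrjobs nj
        WHERE  nj.nbrjobs_effective_date = (
               SELECT MAX(nj2.nbrjobs_effective_date)
               FROM   nbrjobs nj2
               WHERE  {correlation}
                 AND  nj2.nbrjobs_effective_date <= date('now'))
        ORDER BY nj.nbrjobs_posn
    """
    return con.execute(sql).fetchall()

full_key = current_jobs("""nj2.nbrjobs_pidm = nj.nbrjobs_pidm
                 AND  nj2.nbrjobs_posn = nj.nbrjobs_posn
                 AND  nj2.nbrjobs_suff = nj.nbrjobs_suff""")
pidm_only = current_jobs("nj2.nbrjobs_pidm = nj.nbrjobs_pidm")

print(full_key)
print(pidm_only)
```

Correlating on the full key returns the current row for both positions. Correlating on pidm alone returns only the position whose latest date happens to be the employee's overall maximum, and the second position drops out without any error.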

Where intuition fails

Five lessons that will save you from the most common Banner SQL disasters:

Wrong correlation columns = wrong rows, no error message. For NBRJOBS, correlating only on pidm (forgetting posn and suff) makes the subquery return the MAX effective date across ALL of that employee's jobs. Only rows whose own effective date equals that overall maximum survive the equality test, so the employee's other positions silently drop out, or a stale row that happens to share the date slips in. The surviving salary looks right. The date looks plausible. The report is silently wrong. Always correlate on the full entity key: every column that defines a distinct version stream.

**< vs <= is a business decision, not a typo.** If a course catalog change is effective Fall 2022 ('202210') and a student registered in Fall 2022, does the new catalog row apply to that registration? <= says yes; < says no. Most Banner teams use <= (the change effective IN a term applies TO that term), but confirm the business rule with the Registrar's Office before embedding it in every query. Document the choice.

Term codes are strings, and that is fine. MAX() on SGBSTDN_TERM_CODE_EFF works because 'YYYYTT' sorts lexicographically in the correct chronological order. Do not CAST to integer inside the subquery: wrapping the column in a CAST can keep the optimizer from using an index on the effective column, and the string sort already gives the right order. The only edge case is installations that use non-standard term codes; know your own setup.

The correlated subquery can be slow on large tables. On a table with millions of rows, the repeated self-lookup implied by the correlated subquery can drag. An index on (entity_columns, eff_column) is critical; for NBRJOBS, that means (nbrjobs_pidm, nbrjobs_posn, nbrjobs_suff, nbrjobs_effective_date). If the query is still slow, rewrite to an analytic function: ROW_NUMBER() OVER (PARTITION BY entity_columns ORDER BY eff_column DESC) with WHERE rn = 1 in an outer query. The window-function form can resolve every entity in a single pass over the table, which often helps on large row counts in both Oracle and PostgreSQL; compare execution plans before switching.

Duplicate effective dates produce duplicate rows. If two rows for the same entity share the maximum effective date (a data quality bug that happens in NBRJOBS when a payroll run is botched and re-entered), the MAX() subquery returns both, and your row count is silently inflated. Add a tiebreaker: ROW_NUMBER() OVER (PARTITION BY entity ORDER BY eff DESC, activity_date DESC) with WHERE rn = 1, or fix the source data.
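
The duplicate-date case and the ROW_NUMBER() tiebreaker can be compared side by side. This is a sketch, assuming an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later is needed for window functions), a toy NBRJOBS with ISO date strings, and an invented re-entered row that shares the maximum effective date.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nbrjobs (nbrjobs_pidm INTEGER, nbrjobs_posn TEXT, nbrjobs_suff TEXT,
                      nbrjobs_effective_date TEXT, nbrjobs_activity_date TEXT);
INSERT INTO nbrjobs VALUES
  (2001,'F00100','00','2020-07-01','2020-06-15'),
  (2001,'F00100','00','2021-07-01','2021-06-10'),  -- first entry for the new date
  (2001,'F00100','00','2021-07-01','2021-06-20');  -- re-entered: same effective date
""")

# MAX() subquery: both rows that share the maximum date come back.
max_rows = con.execute("""
    SELECT nj.nbrjobs_effective_date, nj.nbrjobs_activity_date
    FROM   nbrjobs nj
    WHERE  nj.nbrjobs_effective_date = (
           SELECT MAX(nj2.nbrjobs_effective_date)
           FROM   nbrjobs nj2
           WHERE  nj2.nbrjobs_pidm = nj.nbrjobs_pidm
             AND  nj2.nbrjobs_posn = nj.nbrjobs_posn
             AND  nj2.nbrjobs_suff = nj.nbrjobs_suff)
""").fetchall()

# ROW_NUMBER() with a tiebreaker: exactly one row per entity.
ranked_rows = con.execute("""
    SELECT nbrjobs_effective_date, nbrjobs_activity_date
    FROM  (SELECT nj.*,
                  ROW_NUMBER() OVER (
                      PARTITION BY nbrjobs_pidm, nbrjobs_posn, nbrjobs_suff
                      ORDER BY nbrjobs_effective_date DESC,
                               nbrjobs_activity_date  DESC) AS rn
           FROM   nbrjobs nj)
    WHERE  rn = 1
""").fetchall()

print(len(max_rows), "rows from the MAX() subquery")
print(ranked_rows)
```

Using the activity date as the tiebreaker is itself a business choice: it assumes the most recently touched row is the one to trust.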

The one-sentence takeaway

Correlate on the full entity key. Take the MAX of the effective-date column. Add a <= bound for as-of queries. That is the whole pattern.

The Double SPRIDEN — Naming Two People in One Query

6 min read · banner · spriden · double-join · advisor · supervisor · alias · sgradvr

The hook

The everyday analogy

Open a wedding invitation. The names appear on the same line: "Margaret Chen and Daniel Park request the honor of your presence." Two people, one invitation, one line of text. To print this invitation, the stationer needed to look up TWO records in the same registry — the bride's record (for her current legal name) and the groom's record (for his). Same registry, two queries, two names typed side by side.

Now imagine the stationer made a mistake: looked up the bride's record TWICE and printed the result. The invitation would read "Margaret Chen and Margaret Chen request..." A visible bug. But in a SQL roster — student name and advisor name displayed side by side — the equivalent bug is silent unless someone notices the names are duplicated.

A calligraphed wedding invitation laid open with bride and groom names visible side by side; a stationer's reference card off to the side noting two lookup IDs (one per name) into the same family registry.

The fix is to consult the registry TWICE explicitly, with two different lookups, and label the results so they don't get confused. In SQL terms: join SPRIDEN twice, with two different aliases, one for each PIDM. The bride's lookup is s (student). The groom's lookup is ai (advisor identity). Each gets its own ON clause. Each returns its own name. The query knows which is which because the aliases label them.

The pattern extends to any "two people on one row" report: applicant + recruiter, employee + supervisor, donor + solicitor, vendor + buyer. Same registry, two lookups, two aliases.

What it really is

The recipe: when a query needs two different names from SPRIDEN, write TWO SPRIDEN joins, each with its own alias, each with the full 3-condition ON clause from Joining by PIDM — SPRIDEN and the Universal Key.

Picking aliases: the Banner Lego convention uses a short suffix that hints at the role:

s = the "main" person (student, employee, applicant)
ai = "advisor identity" (for the second-person lookup)
mi = "manager identity" (for supervisor lookups)
ri = "recruiter identity" (for applicant-recruiter)

Each alias gets its own ON clause with the same three conditions: pidm equality + change_ind IS NULL + entity_ind = 'P'. The conditions are NOT shared across aliases. Each join is a complete, independent lookup — copy-paste the filter set.

SELECT columns are disambiguated by alias prefix: s.spriden_last_name AS student_lname and ai.spriden_last_name AS advisor_lname. Without the prefix, Oracle raises ORA-00918 (column ambiguously defined).

Center: an intermediate table (SGRADVR) with two PIDM columns highlighted (sgradvr_pidm and sgradvr_advr_pidm); left and right: two separate SPRIDEN boxes labeled s (student) and ai (advisor identity); coral arrows from each intermediate PIDM column to the matching SPRIDEN box.

The intermediate join that supplies the second PIDM is often SGRADVR (advisor assignment), NBRJOBS (supervisor chain), or SARAPRSP (applicant prospect). Each holds both the main PIDM (in *_pidm) and the second person's PIDM (in *_advr_pidm, *_supv_pidm, etc.).

Primary-advisor filter: SGRADVR_PRIM_IND = 'Y' selects the student's PRIMARY advisor. Without it, the join returns one row per advisor — silently multiplying the result by the number of advisors a student has.

See it — the diagram

An intermediate table (SGRADVR) sits in the center with two PIDM columns highlighted in coral: sgradvr_pidm (the student) and sgradvr_advr_pidm (the advisor). On the left, a SPRIDEN box labeled s (student) with its spriden_pidm highlighted. On the right, a second SPRIDEN box labeled ai (advisor identity) with its spriden_pidm highlighted. A coral arrow arcs from sgradvr_pidm to the left SPRIDEN. A second coral arrow arcs from sgradvr_advr_pidm to the right SPRIDEN. Each arrow carries the 3-condition ON clause. The visual says: two lookups, same table, different aliases — the wedding invitation rendered as a SQL join graph.

Show me the code

Student + Primary Advisor — the canonical double SPRIDEN:

-- Student name + primary advisor name in one row, as of term 202610.
-- Two SPRIDEN aliases, each with its own 3-condition ON.
-- SGBSTDN and SGRADVR are effective-dated, so both use the bounded MAX() pattern
-- instead of an equality test on the effective term.
SELECT s.spriden_id           AS student_id,
       s.spriden_last_name    AS student_lname,
       s.spriden_first_name   AS student_fname,
       ai.spriden_last_name   AS advisor_lname,
       ai.spriden_first_name  AS advisor_fname,
       sv.sgradvr_advr_code   AS advisor_role
FROM   sgbstdn sb
JOIN   spriden s                        -- student identity
       ON  s.spriden_pidm        = sb.sgbstdn_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN sgradvr sv
       ON  sv.sgradvr_pidm           = sb.sgbstdn_pidm
       AND sv.sgradvr_term_code_eff  = (SELECT MAX(sv2.sgradvr_term_code_eff)
                                        FROM sgradvr sv2
                                        WHERE sv2.sgradvr_pidm = sv.sgradvr_pidm
                                          AND sv2.sgradvr_term_code_eff <= '202610')
       AND sv.sgradvr_prim_ind       = 'Y'
LEFT JOIN spriden ai                    -- advisor identity
       ON  ai.spriden_pidm        = sv.sgradvr_advr_pidm
       AND ai.spriden_change_ind  IS NULL
       AND ai.spriden_entity_ind  = 'P'
WHERE  sb.sgbstdn_term_code_eff = (SELECT MAX(sb2.sgbstdn_term_code_eff)
                                   FROM sgbstdn sb2
                                   WHERE sb2.sgbstdn_pidm = sb.sgbstdn_pidm
                                     AND sb2.sgbstdn_term_code_eff <= '202610');

Three things to notice:

Two SPRIDEN joins: s (student) and ai (advisor identity).
Each SPRIDEN alias has the full 3-condition ON clause.
LEFT JOIN on sgradvr and spriden ai is deliberate — students without a primary advisor still appear, with advisor columns NULL.
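
The three points above can be checked on a handful of rows. This sketch is a toy: an in-memory SQLite database via Python's sqlite3 module, cut-down SPRIDEN, SGBSTDN, and SGRADVR tables with only the join columns, invented people, and no effective dating, so that the two-alias mechanics stay in focus.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE spriden (spriden_pidm INTEGER, spriden_last_name TEXT,
                      spriden_change_ind TEXT, spriden_entity_ind TEXT);
CREATE TABLE sgbstdn (sgbstdn_pidm INTEGER);
CREATE TABLE sgradvr (sgradvr_pidm INTEGER, sgradvr_advr_pidm INTEGER,
                      sgradvr_prim_ind TEXT);
INSERT INTO spriden VALUES (1001,'Chen',NULL,'P'), (1002,'Lee',NULL,'P'),
                           (3001,'Park',NULL,'P');
INSERT INTO sgbstdn VALUES (1001), (1002);
INSERT INTO sgradvr VALUES (1001, 3001, 'Y');   -- Lee (1002) has no advisor row
""")

roster = con.execute("""
    SELECT s.spriden_last_name  AS student_lname,
           ai.spriden_last_name AS advisor_lname
    FROM   sgbstdn sb
    JOIN   spriden s                      -- first lookup: the student
           ON  s.spriden_pidm        = sb.sgbstdn_pidm
           AND s.spriden_change_ind  IS NULL
           AND s.spriden_entity_ind  = 'P'
    LEFT JOIN sgradvr sv
           ON  sv.sgradvr_pidm       = sb.sgbstdn_pidm
           AND sv.sgradvr_prim_ind   = 'Y'
    LEFT JOIN spriden ai                  -- second lookup: the advisor
           ON  ai.spriden_pidm       = sv.sgradvr_advr_pidm
           AND ai.spriden_change_ind IS NULL
           AND ai.spriden_entity_ind = 'P'
    ORDER BY s.spriden_last_name
""").fetchall()

print(roster)
```

Lee still appears, with a NULL advisor, because both the SGRADVR join and the second SPRIDEN join are LEFT JOINs.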

Employee + Supervisor — same pattern:

SELECT s.spriden_id          AS empl_id,
       s.spriden_last_name   AS empl_lname,
       mi.spriden_last_name  AS supv_lname
FROM   nbrjobs j
JOIN   spriden s
       ON  s.spriden_pidm        = j.nbrjobs_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN spriden mi
       ON  mi.spriden_pidm        = j.nbrjobs_supervisor_pidm
       AND mi.spriden_change_ind  IS NULL
       AND mi.spriden_entity_ind  = 'P'
WHERE  j.nbrjobs_effective_date = (SELECT MAX(j2.nbrjobs_effective_date)
                                   FROM nbrjobs j2
                                   WHERE j2.nbrjobs_pidm = j.nbrjobs_pidm
                                     AND j2.nbrjobs_posn = j.nbrjobs_posn
                                     AND j2.nbrjobs_suff = j.nbrjobs_suff  -- full entity key: (pidm, posn, suff)
                                     AND j2.nbrjobs_effective_date <= SYSDATE);

Where intuition fails

The two ON clauses are independent — copy/paste both filter sets. New writers sometimes write the second SPRIDEN join without the change_ind / entity_ind filters because "they're already on the first one." Wrong — each alias is its own join, each needs its own filter set. Otherwise the advisor or supervisor lookup returns duplicates from historical names.

Forgetting the alias prefix in SELECT raises ORA-00918 (column ambiguously defined). Oracle cannot guess whether spriden_last_name means the student's or the advisor's when both aliases are joined. Always prefix every column with its alias.

**SGRADVR_PRIM_IND = 'Y' is mandatory for "the advisor" reports.** Students can have multiple advisors. Without the primary-indicator filter, the join returns one row per advisor, silently multiplying the result. Same trap as missing change_ind — duplicates that look identical except in the advisor column.
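
Both multiplication traps described so far (an unfiltered second alias and a missing primary indicator) show up clearly on toy data. The sketch assumes an in-memory SQLite database via Python's sqlite3 module, simplified tables, one invented student with two advisors, and an advisor who has a historical name row. The helper `roster` and its flags are hypothetical, and the joins are INNER so the primary filter can sit in WHERE.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE spriden (spriden_pidm INTEGER, spriden_last_name TEXT,
                      spriden_change_ind TEXT, spriden_entity_ind TEXT);
CREATE TABLE sgradvr (sgradvr_pidm INTEGER, sgradvr_advr_pidm INTEGER,
                      sgradvr_prim_ind TEXT);
INSERT INTO spriden VALUES (1001,'Chen',NULL,'P'),
                           (3001,'Park',NULL,'P'),
                           (3001,'Parks','N','P'),   -- advisor's historical name
                           (3002,'Ruiz',NULL,'P');
INSERT INTO sgradvr VALUES (1001,3001,'Y'), (1001,3002,'N');
""")

def roster(advisor_filters=True, primary_only=True):
    """Student/advisor name pairs, with each safeguard switchable."""
    name_filter = ("AND ai.spriden_change_ind IS NULL "
                   "AND ai.spriden_entity_ind = 'P'") if advisor_filters else ""
    prim_filter = "AND sv.sgradvr_prim_ind = 'Y'" if primary_only else ""
    sql = f"""
        SELECT s.spriden_last_name, ai.spriden_last_name
        FROM   sgradvr sv
        JOIN   spriden s
               ON  s.spriden_pidm        = sv.sgradvr_pidm
               AND s.spriden_change_ind  IS NULL
               AND s.spriden_entity_ind  = 'P'
        JOIN   spriden ai
               ON  ai.spriden_pidm       = sv.sgradvr_advr_pidm
               {name_filter}
        WHERE  1 = 1 {prim_filter}
        ORDER BY ai.spriden_last_name
    """
    return con.execute(sql).fetchall()

correct = roster()
no_name_filters = roster(advisor_filters=False)
all_advisors = roster(primary_only=False)

print(correct)
print(no_name_filters)
print(all_advisors)
```

Each missing safeguard doubles the student's row in a different way: once through the advisor's old name, once through the second advisor.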

**LEFT JOIN vs INNER JOIN is a business decision.** "Students without an advisor" → LEFT JOIN preserves them with NULL advisor columns. INNER JOIN drops them. Pick the business rule explicitly and document it.

Three or more SPRIDEN joins are possible but rare. A query needing student + advisor + recruiter on one row uses THREE SPRIDEN aliases (s, ai, ri). Pattern scales; just keep the aliases clearly named.

The one-sentence takeaway

When a query needs two different people on one row, join SPRIDEN twice with two different aliases. Each alias gets its own full 3-condition ON clause. The intermediate table (SGRADVR, NBRJOBS) supplies the second PIDM. Use LEFT JOIN when the second person might be missing and filter SGRADVR_PRIM_IND = 'Y' for primary advisors.

The Security Audit Join — GURACLS Done Right

5 min read · banner · guracls · gobeacc · gubalog · security · audit · join

The hook

An auditor asks: "Show me everyone who has the STUDENT_RECORDS access class." The answer lives in a single table — GURACLS. But GURACLS doesn't know anyone's name. It only knows user IDs. To answer the auditor's question, you need a three-table chain, and if you miss the active-account filter, the report includes people who left in 2018.

The everyday analogy

In a multi-tenant office building, the security desk has a binder. Each page is one employee. Each page lists which floors their keycard opens: Floor 12 (Executive), Floor 8 (HR), Floor 5 (Warehouse). Some pages have one entry, some have a dozen. The binder is alphabetical by employee name.

Now imagine the building auditor walks in and asks: "Show me every employee who has access to Floor 12, the executive floor." The security officer cannot answer from the binder directly — the binder is keyed by employee, not by floor. The officer has to flip through every page, scan each employee's access list, and pull out the names whose lists include Floor 12.

An office security desk binder open to a page listing one employee's keycard access (Floor 12, Floor 8, Floor 5) with a small auditor's note in the margin asking 'who else can access Floor 12?'

Banner's security model is the same. GURACLS is the binder — one row per (userid, class) pair. To answer "who has the STUDENT_RECORDS class?" you scan GURACLS for rows where the class_code matches, then JOIN out to identify each user. The binder's USERID column is the entry point, not PIDM — so to get the user's name, you go USERID → GOBEACC_PIDM → SPRIDEN_LAST_NAME. Three-table chain to put one auditor's question into a one-page report.

The keycard analogy also captures the audit gotcha: just because a name is in the binder does not mean the person is still employed. Terminated employees stay in the binder until someone removes them. Banner has the same problem: inactive users keep their GURACLS rows until somebody runs the cleanup. Audit reports always join through to an active-account filter to exclude ghosts.

What it really is

**GURACLS** — the central security assignment table. One row per (userid, class_code). The class_code is the role; the userid is the person who holds it. A user with 12 access roles has 12 rows in GURACLS.

**GOBEACC** — the e-account / user-account table. One row per userid, with GOBEACC_PIDM linking back to the person in SPRIDEN. The userid is the security identity; the PIDM is the human identity; GOBEACC maps between them.

The class description typically lives in GTVCLAS or STVCLAS (validation tables) or GUBCLAS (the class definition table). Verify your local installation — the table name varies.

The canonical join chain:

GURACLS (the binder)
  → GOBEACC (userid to PIDM)
  → SPRIDEN (PIDM to current name)
GURACLS
  → class lookup (class_code to description)

Left to right: GURACLS box (binder of role assignments) → GOBEACC box (userid to PIDM mapping) → SPRIDEN box (PIDM to current name); arrows labeled with the join conditions; a side note showing the active-status filter at the GOBEACC step.

The active-user filter: GOBEACC_STATUS_IND = 'A' excludes terminated users. Without it, every former employee who once had access still appears.

**GUBALOG** is the audit log — a history of every permission grant and revoke. For "when did this user gain access?" or "who revoked this role?" the join extends to GUBALOG (see Soft Deletes — The Rows That Aren't Really Gone for the AUDIT_ACTION convention).

See it — the diagram

Three boxes in a chain, left to right. GURACLS on the left — the binder, keyed by userid — with two rows visible: (MCHEN, STUDENT_RECORDS) and (DPARK, STUDENT_RECORDS). A coral arrow labeled gobeacc_userid = guracls_userid arcs to the GOBEACC box in the center, which holds the userid-to-PIDM mapping and an active-status filter callout. A second coral arrow arcs from GOBEACC to SPRIDEN on the right, labeled with the 3-condition ON clause. The result at the far right shows two resolved names: "Margaret Chen" and "Daniel Park" — the user IDs translated into human identities.

Show me the code

"Who has the STUDENT_RECORDS access class — with names?":

-- Chain: GURACLS -> GOBEACC -> SPRIDEN, plus active filter.
SELECT g.guracls_userid,
       s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name,
       g.guracls_class_code,
       g.guracls_activity_date
FROM   guracls g
JOIN   gobeacc a
       ON  a.gobeacc_userid     = g.guracls_userid
       AND a.gobeacc_status_ind = 'A'
JOIN   spriden s
       ON  s.spriden_pidm        = a.gobeacc_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  g.guracls_class_code = 'STUDENT_RECORDS'
ORDER BY s.spriden_last_name, s.spriden_first_name;

Full access list per user (LISTAGG'd):

SELECT a.gobeacc_userid,
       s.spriden_last_name || ', ' || s.spriden_first_name AS name,
       LISTAGG(g.guracls_class_code, ', '
               ON OVERFLOW TRUNCATE '...' WITH COUNT)
         WITHIN GROUP (ORDER BY g.guracls_class_code) AS roles
FROM   gobeacc a
JOIN   spriden s
       ON  s.spriden_pidm        = a.gobeacc_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN guracls g
       ON g.guracls_userid = a.gobeacc_userid
WHERE  a.gobeacc_status_ind = 'A'
GROUP BY a.gobeacc_userid, s.spriden_last_name, s.spriden_first_name
ORDER BY a.gobeacc_userid;

The audit-trail question — when was access granted?

SELECT b.gubalog_userid,
       b.gubalog_audit_date,
       b.gubalog_audit_action
FROM   gubalog b
WHERE  b.gubalog_class_code  = 'STUDENT_RECORDS'
  AND  b.gubalog_audit_action <> 'D';

Where intuition fails

GURACLS is keyed by USERID, not PIDM. New writers sometimes try to join GURACLS directly to SPRIDEN on PIDM — no PIDM column exists on GURACLS. Always route through GOBEACC.

**GOBEACC_STATUS_IND = 'A' is the active-user filter.** Without it, every terminated employee from the last decade appears in the audit. The filter is cheap and the report's credibility depends on it.
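
The effect of the active-account filter can be checked on a toy copy of the chain. Assumptions in this sketch: an in-memory SQLite database via Python's sqlite3 module, table and column names taken from the queries in this article, three invented users, a status value of 'I' for the inactive account, and a second class code (FINANCE_QUERY) made up for contrast. The helper `holders` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE guracls (guracls_userid TEXT, guracls_class_code TEXT);
CREATE TABLE gobeacc (gobeacc_userid TEXT, gobeacc_pidm INTEGER,
                      gobeacc_status_ind TEXT);
CREATE TABLE spriden (spriden_pidm INTEGER, spriden_last_name TEXT,
                      spriden_change_ind TEXT, spriden_entity_ind TEXT);
INSERT INTO guracls VALUES ('MCHEN','STUDENT_RECORDS'), ('DPARK','STUDENT_RECORDS'),
                           ('JDOE','STUDENT_RECORDS'),  ('MCHEN','FINANCE_QUERY');
-- JDOE left the institution but still has a class assignment.
INSERT INTO gobeacc VALUES ('MCHEN',1,'A'), ('DPARK',2,'A'), ('JDOE',3,'I');
INSERT INTO spriden VALUES (1,'Chen',NULL,'P'), (2,'Park',NULL,'P'), (3,'Doe',NULL,'P');
""")

def holders(class_code, active_only=True):
    """Last names of users holding class_code, via GURACLS -> GOBEACC -> SPRIDEN."""
    status = "AND a.gobeacc_status_ind = 'A'" if active_only else ""
    sql = f"""
        SELECT s.spriden_last_name
        FROM   guracls g
        JOIN   gobeacc a
               ON  a.gobeacc_userid = g.guracls_userid
               {status}
        JOIN   spriden s
               ON  s.spriden_pidm       = a.gobeacc_pidm
               AND s.spriden_change_ind IS NULL
               AND s.spriden_entity_ind = 'P'
        WHERE  g.guracls_class_code = ?
        ORDER BY s.spriden_last_name
    """
    return [row[0] for row in con.execute(sql, (class_code,))]

audited = holders("STUDENT_RECORDS")
unfiltered = holders("STUDENT_RECORDS", active_only=False)

print(audited)
print(unfiltered)
```

The unfiltered list includes the departed user, which is exactly the ghost entry an auditor would flag.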

The class description table name varies. GTVCLAS, STVCLAS, GUBCLAS, or institution-specific. Check the BSS schema search to confirm your local lookup table.

**A user with no GURACLS rows is not "no access" — they may have access via group membership (GUBGRPS / GURGRPS).** Banner's security model has both direct user grants and group-mediated grants. A complete audit checks both.

GUBALOG is append-only — old grants accumulate. The audit log keeps every change forever. For "current access" reports, query GURACLS (the current state), not GUBALOG (the history). Use GUBALOG only for the audit-trail question.

The one-sentence takeaway

The security audit join chains GURACLS → GOBEACC → SPRIDEN. GURACLS is keyed by USERID, not PIDM — always route through GOBEACC. Add gobeacc_status_ind = 'A' to exclude terminated users. For the audit trail of who-granted-what-when, extend to GUBALOG with audit_action <> 'D'.

Catalog vs Section — SCBCRSE and SSBSECT

6 min read · banner · scbcrse · ssbsect · course-catalog · section · sfrstcr · join

The hook

SCBCRSE has a column called eff_term. SSBSECT has a term_code. They look related — so people join them. And when they do, three catalog versions of the same course silently multiply the result by three, and a 2020 transcript retroactively shows the 2024 course title. The join needs a bound, not just an equality.

The everyday analogy

Walk into a library and search the catalog for "Pride and Prejudice." The catalog returns one entry — the book's CATALOG record. Title, author, publication year, ISBN, Dewey class. The catalog entry tells you what the book IS. It does not tell you whether the library has the book available right now.

To borrow the book, you need a PHYSICAL COPY. The library may have three copies on the shelf — Copy #1 in good condition, Copy #2 with a torn cover, Copy #3 reserved for reference. Each copy has its own status (available, checked out, reserved, lost), its own due date if checked out, its own physical location. You borrow A COPY, not the catalog entry.

[Illustration: a single catalog card pulled from a wooden card-catalog drawer (the descriptive entry) beside three physical copies of the same book on a nearby shelf (the borrowable instances), each copy with its own status sticker.]

The catalog entry persists across all the copies. If the library buys a fourth copy next year, it gets the same catalog entry. If a copy is lost or weeded, the catalog entry stays — pointing now to fewer physical copies. The catalog describes the WORK; the copies are the lend-able INSTANCES.

Banner's course schema works the same way. SCBCRSE is the catalog entry — "ENGL 201, Introduction to British Literature, 3 credit hours." Every offering of ENGL 201 across every term shares the same catalog entry. SSBSECT is the physical copy — "ENGL 201 CRN 12345, Fall 2026, MWF 9-10, Smith." Students "borrow" sections (enroll in them), not catalog entries. The catalog tells you what the course IS; the section tells you when and where you can take it.

And like the library catalog, SCBCRSE is effective-dated. If the English department renames ENGL 201 in 2024 from "Introduction to British Literature" to "Foundations of British Literature," the catalog gets a new effective row. Sections offered before 2024 still link back to the older catalog version with the older title — see The Effective-Date Trap — Joining to Yesterday's Row for the join-time bound that makes this work correctly.

What it really is

**SCBCRSE** — the course catalog table.

Key: (subj_code, crse_numb, eff_term). Effective-dated.
Holds: title, credit hours, course level (UG/GR), subject, department, prerequisites (in SCRPREQ).
One LOGICAL course = many catalog rows over time.

**SSBSECT** — the section master table.

Key: (term_code, crn). NOT effective-dated.
Holds: subject code, course number, section sequence, schedule type, max enrollment, status.
One section = one row per (term, CRN). When the term ends, the row stays as history.

The join from section to catalog:

SSBSECT.ssbsect_subj_code = SCBCRSE.scbcrse_subj_code
AND SSBSECT.ssbsect_crse_numb = SCBCRSE.scbcrse_crse_numb
AND SCBCRSE.scbcrse_eff_term = (
      SELECT MAX(scbcrse_eff_term)
      FROM scbcrse sc2
      WHERE sc2.scbcrse_subj_code = SSBSECT.ssbsect_subj_code
        AND sc2.scbcrse_crse_numb = SSBSECT.ssbsect_crse_numb
        AND sc2.scbcrse_eff_term <= SSBSECT.ssbsect_term_code
    )

The <= SSBSECT.ssbsect_term_code bound is the key — catalog version current AS OF the section's term, not today. See The MAX() Subquery — Getting the Row That's Current.
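To watch the bound pick a version, here is a minimal self-contained sketch that swaps the real tables for a few made-up rows. The titles, terms, CRNs, and the shortened column names are illustrative assumptions, not Banner data:

```sql
-- Illustrative rows only: three catalog versions of ENGL 201 and
-- three sections offered in different terms.
WITH cat AS (
  SELECT 'ENGL' subj_code, '201' crse_numb, '202010' eff_term,
         'Introduction to British Literature' title FROM dual UNION ALL
  SELECT 'ENGL', '201', '202410', 'Foundations of British Literature' FROM dual UNION ALL
  SELECT 'ENGL', '201', '202610', 'British Literature I' FROM dual
),
sect AS (
  SELECT '202110' term_code, '12345' crn, 'ENGL' subj_code, '201' crse_numb FROM dual UNION ALL
  SELECT '202410', '23456', 'ENGL', '201' FROM dual UNION ALL
  SELECT '202510', '34567', 'ENGL', '201' FROM dual
)
SELECT s.term_code, s.crn, c.eff_term, c.title
FROM   sect s
JOIN   cat c
       ON  c.subj_code = s.subj_code
       AND c.crse_numb = s.crse_numb
       AND c.eff_term  = (SELECT MAX(c2.eff_term)
                          FROM   cat c2
                          WHERE  c2.subj_code = s.subj_code
                            AND  c2.crse_numb = s.crse_numb
                            AND  c2.eff_term <= s.term_code)
ORDER BY s.term_code;
-- 202110 / 12345 -> eff_term 202010
-- 202410 / 23456 -> eff_term 202410
-- 202510 / 34567 -> eff_term 202410 (the 202610 version is not yet in effect)
```

Each section returns exactly one row, and the 202110 section keeps the title it had when it ran.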

Companion tables: SIRASGN (instructor per section), SSRMEET (meeting times), SSRXLST (cross-listed sections).

Why students enroll in sections, not catalog entries: the catalog has no schedule, no instructor, no max enrollment, no CRN. Registration needs a specific time-and-place. The section is the lend-able copy.

See it — the diagram

One SCBCRSE row on the left, showing (subj_code='ENGL', crse_numb='201', eff_term='202110', title='Introduction to British Literature') — the catalog card, rendered as a database row, highlighted in coral. Three coral arrows arc from it to three SSBSECT rows on the right: ENGL 201 CRN 12345 in term 202210, ENGL 201 CRN 23456 in term 202310, ENGL 201 CRN 34567 in term 202410. All three sections point back to the SAME catalog entry via subj_code and crse_numb. A callout below reads "MAX-effective bound on scbcrse_eff_term <= ssbsect_term_code" — the piece that selects the right catalog version for each section's term. The visual says: one catalog entry, many sections; the join is on the course identity, but scoped by term.

Show me the code

A student's roster with course titles, joining all three tables:

SELECT r.sfrstcr_term_code,
       sect.ssbsect_crn,
       sect.ssbsect_subj_code,
       sect.ssbsect_crse_numb,
       sect.ssbsect_seq_numb,
       cat.scbcrse_title         AS course_title,
       cat.scbcrse_credit_hr_low AS credit_hours,
       r.sfrstcr_credit_hr       AS credit_hours_registered
FROM   sfrstcr r
JOIN   ssbsect sect
       ON  sect.ssbsect_term_code = r.sfrstcr_term_code
       AND sect.ssbsect_crn       = r.sfrstcr_crn
JOIN   scbcrse cat
       ON  cat.scbcrse_subj_code = sect.ssbsect_subj_code
       AND cat.scbcrse_crse_numb = sect.ssbsect_crse_numb
       AND cat.scbcrse_eff_term  = (
           SELECT MAX(c2.scbcrse_eff_term)
           FROM   scbcrse c2
           WHERE  c2.scbcrse_subj_code = cat.scbcrse_subj_code
             AND  c2.scbcrse_crse_numb = cat.scbcrse_crse_numb
             AND  c2.scbcrse_eff_term <= sect.ssbsect_term_code)
WHERE  r.sfrstcr_pidm      = 38201
  AND  r.sfrstcr_term_code = '202610';

The bug — joining straight to today's SCBCRSE (silent revisionism):

-- WRONG: no MAX-effective bound on SCBCRSE.
-- Every section matches EVERY catalog version of its course,
-- including versions that took effect after the section ran.
-- A course retitled in 2024 shows its new title on a 2020
-- registration, next to one duplicate row per catalog version.
SELECT r.sfrstcr_term_code, r.sfrstcr_crn, cat.scbcrse_title
FROM   sfrstcr r
JOIN   ssbsect sect
       ON sect.ssbsect_term_code = r.sfrstcr_term_code
      AND sect.ssbsect_crn       = r.sfrstcr_crn
JOIN   scbcrse cat
       ON cat.scbcrse_subj_code = sect.ssbsect_subj_code
      AND cat.scbcrse_crse_numb = sect.ssbsect_crse_numb;
-- bug: no eff_term bound, so all 3 catalog versions match every section

Where intuition fails

The MAX-effective bound on SCBCRSE is mandatory. Without the MAX(...) subquery bounded by <= sect.ssbsect_term_code, every section joins to every catalog version of that course. Three catalog versions = 3x the rows, and the extra rows carry titles the course did not have when the section ran: silent multiplication AND silent revisionism. See The Effective-Date Trap — Joining to Yesterday's Row.
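One way to see how much an unbounded join would fan out is to count catalog versions per section before writing the report. This is a diagnostic sketch that reuses the column names from the queries above; the term literal is only an example:

```sql
-- Diagnostic sketch: how many catalog versions does each section match
-- when the effective-term bound is left out?
SELECT sect.ssbsect_term_code,
       sect.ssbsect_crn,
       COUNT(*) AS catalog_versions_matched
FROM   ssbsect sect
JOIN   scbcrse cat
       ON  cat.scbcrse_subj_code = sect.ssbsect_subj_code
       AND cat.scbcrse_crse_numb = sect.ssbsect_crse_numb
WHERE  sect.ssbsect_term_code = '202610'
GROUP BY sect.ssbsect_term_code, sect.ssbsect_crn
HAVING COUNT(*) > 1          -- only sections that would be multiplied
ORDER BY catalog_versions_matched DESC;
```

Any row this returns is a section whose roster would be duplicated by the unbounded join.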

**SSBSECT_SUBJ_CODE and SCBCRSE_SUBJ_CODE are the same vocabulary but separate columns.** The join condition needs BOTH subj_code AND crse_numb — the catalog is keyed on the (subject, course number) pair, not on either alone.

**Section status (SSBSECT_SSTS_CODE)** flags inactive sections (cancelled, hidden, lab-only). A roster query that includes cancelled sections inflates totals. Filter ssbsect_ssts_code = 'A' (or your local "active" code) when "what is currently being offered" is the question.

**Cross-listed sections (SSRXLST)** are one logical class taught under multiple CRNs. Naive headcount queries double-count cross-listed enrollments. Recognize the cross-list via SSRXLST_XLST_GROUP and pick one representative CRN.
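A hedged sketch of a headcount that treats each cross-list group as one class. SSRXLST_XLST_GROUP comes from the text above; the join columns SSRXLST_TERM_CODE and SSRXLST_CRN are assumptions about the table's layout, and a real report would also filter on registration status:

```sql
-- Sketch: count each student once per logical class.
-- Sections in a cross-list group share a class key; all other
-- sections use their own CRN as the key.
SELECT r.sfrstcr_term_code,
       CASE
         WHEN x.ssrxlst_xlst_group IS NOT NULL
           THEN 'XL:'  || x.ssrxlst_xlst_group
         ELSE   'CRN:' || r.sfrstcr_crn
       END                            AS class_key,
       COUNT(DISTINCT r.sfrstcr_pidm) AS headcount
FROM   sfrstcr r
LEFT JOIN ssrxlst x                                     -- assumed join columns
       ON  x.ssrxlst_term_code = r.sfrstcr_term_code
       AND x.ssrxlst_crn       = r.sfrstcr_crn
WHERE  r.sfrstcr_term_code = '202610'
GROUP BY r.sfrstcr_term_code,
         CASE
           WHEN x.ssrxlst_xlst_group IS NOT NULL
             THEN 'XL:'  || x.ssrxlst_xlst_group
           ELSE   'CRN:' || r.sfrstcr_crn
         END;
```

Prefixing the key with 'XL:' or 'CRN:' keeps a group code from colliding with a CRN that happens to look the same.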

**SCBCRSE_CREDIT_HR_LOW vs SCBCRSE_CREDIT_HR_HIGH** — variable-credit courses have a low/high range. The student's actual credits are in SFRSTCR_CREDIT_HR, not in the catalog. Use SFRSTCR_CREDIT_HR for registration credit totals; use the SCBCRSE range only as a sanity bound.

The one-sentence takeaway

SCBCRSE is the course catalog (what the course IS — title, credits, subject). SSBSECT is the section master (a specific OFFERING — CRN, term, instructor). Students enroll in sections, not catalog entries. Join SSBSECT to SCBCRSE on (subj_code, crse_numb) with a MAX-effective bound of scbcrse_eff_term <= ssbsect_term_code to get the catalog version current AS OF the section's term.

Track C · From generic SQL to Banner

Banner Runs on Oracle — The Dialect You Will Meet

5 min read · oracle · banner · sql-dialect · sysdate · dual · rownum · nvl · decode

The hook

SQL is a standard. Oracle's version of it has its own vocabulary — small differences scattered through every query, none hard, none avoidable. You can't read Banner SQL for ten minutes without meeting SYSDATE, NVL, DUAL, ||, ROWNUM, and DECODE. Learn them once, and the dialect becomes the language.

The everyday analogy

An American spends a week in London and notices small differences in the same language. The lift, not the elevator. The lorry, not the truck. The biscuit (sweet, like a cookie), not the biscuit (savory, like small bread). The car park, not the parking lot. Everything is mostly the same — grammar, spelling of common words, conversational patterns — but the small differences are everywhere, and not knowing them produces small puzzlements at every turn.

[Illustration: a phrasebook open on a desk with two columns, 'American English' (elevator, truck, cookie) and 'British English' (lift, lorry, biscuit); alongside, a second phrasebook for SQL: 'generic SQL' (NOW(), TOP, ISNULL) vs 'Oracle SQL' (SYSDATE, ROWNUM, NVL).]

After a week the American has internalized the map. Lift = elevator. Lorry = truck. Take the lift to the second floor (which an American would call the third floor). The dialect becomes natural. The language was never unintelligible; it was just unfamiliar.

Oracle's SQL is the same kind of dialect. Most of the SELECT/FROM/WHERE/GROUP BY skeleton is identical to any other dialect. But scattered through every Oracle query are small idioms: SYSDATE (not NOW()), || (not + for concatenation), NVL (not ISNULL), DECODE (Oracle's CASE before CASE existed), ROWNUM (Oracle's TOP/LIMIT), the magical DUAL table. Each idiom is small. Together they make Oracle code look unmistakably Oracle.

Banner runs on Oracle. Every Banner report, every Argos DataBlock, every Banner Lego recipe in BSS is written in Oracle's dialect. Learn the idioms once, and the dialect becomes the language.

What it really is

Ten Oracle idioms a Banner writer meets every day:

**SYSDATE** — the current date and time. Used everywhere Banner needs "now": WHERE x.eff_date <= SYSDATE.
**DUAL** — a one-row, one-column system table you SELECT FROM when you need a result without a real source: SELECT SYSDATE FROM dual. Other dialects let you SELECT without a FROM; Oracle releases before 23ai require DUAL.
**ROWNUM** — a pseudo-column that numbers rows as they are produced. Used for "first N rows" via WHERE ROWNUM <= N. Modern Oracle (12c+) also supports FETCH FIRST N ROWS ONLY; older Banner code uses ROWNUM. ROWNUM is assigned before ORDER BY runs, so "top N by x" needs the ORDER BY in a subquery and the ROWNUM filter outside it.
**NVL(a, b)** — returns a if non-null, else b. Oracle also supports the standard COALESCE for more than two arguments.
**DECODE(expr, v1, r1, v2, r2, ..., default)** — Oracle's pre-CASE conditional. Older Banner SQL uses DECODE; newer uses CASE.
**|| (string concatenation)** — Oracle uses ||, not + (SQL Server): last_name || ', ' || first_name.
**TO_DATE, TO_CHAR, TO_NUMBER** — explicit type conversion. TO_DATE('2026-09-15', 'YYYY-MM-DD') parses a string to a date. TO_CHAR(SYSDATE, 'YYYY-MM-DD') formats a date back to a string. Format strings are Oracle-specific: YYYY, MM, DD for the date, and HH24, MI, SS for the time (minutes are MI, not MM).
**ADD_MONTHS(date, n) and MONTHS_BETWEEN(d1, d2)** — date arithmetic. ADD_MONTHS handles end-of-month wraparound. MONTHS_BETWEEN returns a fractional float (larger date first).
**INSTR(haystack, needle) and SUBSTR(s, start, length)** — 1-indexed string operations. INSTR returns 0 if not found. Both named differently from other dialects.
**TRUNC(date)** — drops the time portion, leaving midnight. WHERE TRUNC(activity_date) = DATE '2026-09-15'. Also works for numbers: TRUNC(123.456, 1) = 123.4.
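Most of the idioms in this list can be tried in one statement against DUAL, with no Banner tables involved. The literals are arbitrary examples chosen so the results are easy to check by eye:

```sql
-- A quick tour of the idioms above; runs on any Oracle database.
SELECT NVL(NULL, 'fallback')                     AS nvl_demo,        -- 'fallback'
       DECODE('RE', 'RE', 'Registered', 'Other') AS decode_demo,     -- 'Registered'
       'Smith' || ', ' || 'Ana'                  AS concat_demo,     -- 'Smith, Ana'
       INSTR('Smith, Ana', ',')                  AS instr_demo,      -- 6 (1-indexed)
       SUBSTR('SPRIDEN', 1, 3)                   AS substr_demo,     -- 'SPR'
       TO_CHAR(ADD_MONTHS(DATE '2024-01-31', 1),
               'YYYY-MM-DD')                     AS add_months_demo, -- '2024-02-29'
       TRUNC(123.456, 1)                         AS trunc_num_demo,  -- 123.4
       TO_CHAR(TRUNC(SYSDATE), 'HH24:MI:SS')     AS trunc_date_demo  -- '00:00:00'
FROM   dual;
```

The ADD_MONTHS line shows the end-of-month handling: January 31 plus one month lands on the last day of February rather than failing.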

Brief mention, deferred: the trailing (+) for outer joins — Oracle's legacy syntax covered in From (+) to ANSI — Retiring Oracle's Old Outer Join.

See it — the diagram

A side-by-side comparison table: left column "generic SQL / other dialects," right column "Oracle equivalent." Eight rows: NOW() → SYSDATE, ISNULL → NVL, + for concat → ||, TOP 10 → WHERE ROWNUM <= 10, GETDATE() → SYSDATE, LEN() → LENGTH(), CONVERT(varchar, ...) → TO_CHAR(...), SELECT expr (no FROM) → SELECT expr FROM dual. Each row links the familiar to the Oracle. The visual is the British/American phrasebook rendered as a SQL reference card — same language, different vocabulary, one-to-one once you have the mapping.

Show me the code

A typical Banner-flavored query, with seven idioms in one place:

-- SYSDATE, NVL, ||, TO_CHAR, TRUNC, ADD_MONTHS, ROWNUM
-- The ORDER BY sits in an inline view so that ROWNUM numbers and
-- limits the rows AFTER they are sorted.
SELECT ROWNUM AS line,
       t.*
FROM  (
  SELECT s.spriden_id,
         s.spriden_last_name || ', ' ||
           NVL(s.spriden_first_name, '(no first)') AS full_name,
         TO_CHAR(p.phrhist_year, 'FM9999')       AS fy,
         p.phrhist_gross,
         TRUNC(p.phrhist_activity_date)          AS last_touched
  FROM   phrhist p
  JOIN   spriden s
         ON  s.spriden_pidm        = p.phrhist_pidm
         AND s.spriden_change_ind  IS NULL
         AND s.spriden_entity_ind  = 'P'
  WHERE  p.phrhist_disp = 'P'
    AND  p.phrhist_activity_date >= ADD_MONTHS(SYSDATE, -12)
  ORDER BY p.phrhist_activity_date DESC
) t
WHERE  ROWNUM <= 100;

DUAL — the one-row table for expression evaluation:

SELECT SYSDATE, USER, 1 + 1 FROM dual;
-- Returns one row: current date, current user, the number 2.

DECODE vs CASE for the same logic:

-- DECODE (older Banner style):
SELECT DECODE(stvterm_code,
              '202610', 'Fall 2026',
              '202620', 'Spring 2027',
              '202630', 'Summer 2027',
              'Other') AS term_label
FROM stvterm;

-- Modern CASE equivalent:
SELECT CASE stvterm_code
         WHEN '202610' THEN 'Fall 2026'
         WHEN '202620' THEN 'Spring 2027'
         WHEN '202630' THEN 'Summer 2027'
         ELSE 'Other'
       END AS term_label
FROM stvterm;

Where intuition fails

**ROWNUM is applied BEFORE ORDER BY.** WHERE ROWNUM <= 10 ORDER BY x returns the first 10 rows the optimizer happens to produce, then sorts them — NOT the top 10 by x. To get "top 10 by x" wrap the query in a subquery and apply ROWNUM outside, or use FETCH FIRST 10 ROWS ONLY.
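A small sketch of the two safe forms, using ten generated numbers instead of a Banner table so it runs anywhere. The CONNECT BY row generator is only there to produce test rows:

```sql
-- Top 3 by n: sort inside the inline view, filter ROWNUM outside it.
WITH t AS (
  SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 10
)
SELECT n
FROM  (SELECT n FROM t ORDER BY n DESC)
WHERE  ROWNUM <= 3;
-- 10, 9, 8

-- Oracle 12c and later: the same result without the wrapper.
WITH t AS (
  SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 10
)
SELECT n
FROM   t
ORDER BY n DESC
FETCH FIRST 3 ROWS ONLY;
-- 10, 9, 8
```

Moving the ROWNUM filter into the same query block as the ORDER BY would keep whichever three rows arrive first and sort only those.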

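**NULL || 'something' behaves unexpectedly.** Oracle treats NULL as an empty string in concatenation, so the result is just 'something' where standard SQL would return NULL. The result is NULL only when every operand is NULL. A label built from optional columns can therefore come out as a stray ', ' instead of NULL. Explicit NVL on each nullable operand makes the intent visible.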

**'' (empty string) IS NULL in Oracle.** WHERE x = '' returns no rows because empty string is treated as NULL and NULL is never equal to anything. Use WHERE x IS NULL. This bites hard when migrating to PostgreSQL where '' and NULL are distinct — see From Oracle to PostgreSQL — the Banner SaaS Migration.
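Oracle's empty-string behavior can be confirmed from DUAL in a few seconds. A minimal sketch, with nothing Banner-specific in it:

```sql
SELECT CASE WHEN '' IS NULL THEN 'NULL' ELSE 'NOT NULL' END AS empty_is_null,   -- 'NULL'
       NVL(LENGTH(''), -1)                                  AS len_of_empty,    -- -1 (LENGTH('') is NULL)
       NULL || 'something'                                  AS concat_null,     -- 'something'
       CASE WHEN '' = '' THEN 'equal' ELSE 'not equal' END  AS empty_eq_empty   -- 'not equal'
FROM   dual;
```

The last column is the one that surprises people: '' = '' compares NULL to NULL, which is never true, so the ELSE branch fires.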

DUAL is special — don't query it for production data. Some installations have customized DUAL or security restrictions on it. Use DUAL only for evaluating expressions or constants.

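**Date format tokens are Oracle-specific, and MM always means month.** Oracle reads numeric tokens such as YYYY, MM, and DD the same way in upper or lower case, so mm is still the month; minutes are MI. A mask borrowed from another language, such as HH24:MM:SS, runs without error and prints the month where the minutes belong. Port the query to another database and the format strings need adjusting again.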
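One date-format slip is easy to reproduce: MM is the month token and MI is the minutes token, so a mask copied from an HH:MM:SS habit prints the month in the minutes position. A short sketch with a fixed timestamp:

```sql
SELECT TO_CHAR(ts, 'HH24:MI:SS') AS correct_time,    -- '14:05:00'
       TO_CHAR(ts, 'HH24:MM:SS') AS month_in_middle  -- '14:09:00' (09 is September)
FROM  (SELECT TO_DATE('2026-09-15 14:05:00',
                      'YYYY-MM-DD HH24:MI:SS') AS ts
       FROM   dual);
```

Neither call raises an error, which is why the wrong mask tends to survive until someone compares the output to a clock.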

The one-sentence takeaway

Oracle's SQL dialect differs from generic SQL in ~10 everyday idioms: SYSDATE, DUAL, ROWNUM, NVL, DECODE, ||, TO_DATE/TO_CHAR/TO_NUMBER, ADD_MONTHS/MONTHS_BETWEEN, INSTR/SUBSTR, and TRUNC(date). Learn the ten, and Banner SQL becomes readable.

Track C · From generic SQL to Banner

From SQL Server to Oracle — Translating Your Instincts

You know how to write SQL. You've written hundreds of queries on SQL Server. Then you open a Banner DataBlock and see SYSDATE, NVL, ROWNUM, DUAL, || — and every instinct you have about what to type is a half-second wrong. The skill carries. The syntax doesn't. Here is the translation.

5 min read · oracle · sql-server · dialect-translation · sysdate · nvl · rownum · identifiers

The hook

The everyday analogy

An American driver flies to London and rents a car at Heathrow. The car looks familiar: steering wheel, pedals, gear shift, windshield wipers. But the steering wheel is on the RIGHT side. Traffic keeps to the LEFT side of the road. The gear shift is operated with the LEFT hand. The turn signal lever is on the OPPOSITE side of the steering column, where the wiper lever sits at home.

The skill of driving translates perfectly. The driver knows how to brake, accelerate, signal, merge, parallel-park. But every motor pattern is mirrored. The first 30 minutes are exhausting — the driver consciously thinks about each action. After a day the new pattern starts to feel natural. After a week the driver is reaching for the gear shift with the correct hand without thinking.

[Illustration: view through a rental-car windshield showing the steering wheel on the right side of the cabin and the road ahead with left-side traffic; a small open guidebook on the passenger seat with translation notes.]

Moving from SQL Server to Oracle is the same kind of re-mapping. The SQL skill carries — joins, aggregations, window functions, subqueries all work the same way. But the syntax is mirrored everywhere. The instinct to type GETDATE() must be redirected to SYSDATE. The instinct to write TOP 10 col must be redirected to WHERE ROWNUM <= 10. Square brackets for quoted identifiers must be unlearned in favor of double quotes. The driver's skill is intact; the muscle memory needs re-training.

And like driving on the other side of the road, a few patterns will not translate by simple substitution — they have actually-different semantics. The empty string equals NULL in Oracle but not in SQL Server. Identity columns work fundamentally differently. Those are the gotchas that bite even after the rest of the muscle memory has been retrained.

What it really is

The translation table, by category:

Date/time functions: GETDATE() → SYSDATE. DATEADD(month, n, dt) → ADD_MONTHS(dt, n). DATEDIFF(month, d1, d2) → MONTHS_BETWEEN(d2, d1) — note the argument order is REVERSED AND the return type is float in Oracle, integer in SQL Server. GETUTCDATE() → SYS_EXTRACT_UTC(SYSTIMESTAMP).

NULL handling: ISNULL(a, b) → NVL(a, b). COALESCE(a, b, c) is the SAME in both databases.

String operations: + for concat → ||. LEN(s) → LENGTH(s). CHARINDEX(needle, haystack) → INSTR(haystack, needle) (argument order reversed!). SUBSTRING(s, start, len) → SUBSTR(s, start, len).

Result limiting: SELECT TOP 10 * → wrap in subquery with WHERE ROWNUM <= 10 OR use FETCH FIRST 10 ROWS ONLY (12c+).

Conversion: CONVERT(varchar, date_col, 23) → TO_CHAR(date_col, 'YYYY-MM-DD'). CAST(s AS INT) → TO_NUMBER(s) or CAST(s AS NUMBER).

Identifiers: [Square Brackets] → "Double Quotes".

Control flow: IIF(cond, a, b) → CASE WHEN cond THEN a ELSE b END.

SELECT without FROM: SQL Server allows SELECT GETDATE(). Oracle requires SELECT SYSDATE FROM dual.

Identity / auto-increment: SQL Server's IDENTITY(1,1) is column metadata. Oracle uses a SEQUENCE plus a BEFORE INSERT trigger (pre-12c) or GENERATED ALWAYS AS IDENTITY (12c+). Banner overwhelmingly uses the sequence+trigger pattern — look at the trigger code when reading DDL.
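A minimal sketch of the two Oracle patterns side by side. The table, sequence, and trigger names are hypothetical and chosen for illustration; this is not Banner DDL:

```sql
-- Pattern 1: sequence plus BEFORE INSERT trigger (works on older releases).
CREATE TABLE demo_note (
  demo_note_id  NUMBER        PRIMARY KEY,
  demo_note_txt VARCHAR2(100)
);

CREATE SEQUENCE demo_note_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER demo_note_bi
  BEFORE INSERT ON demo_note
  FOR EACH ROW
BEGIN
  -- Fill the key from the sequence only when the caller did not supply one.
  IF :NEW.demo_note_id IS NULL THEN
    SELECT demo_note_seq.NEXTVAL INTO :NEW.demo_note_id FROM dual;
  END IF;
END;
/

-- The column list omits the key; the trigger supplies it.
INSERT INTO demo_note (demo_note_txt) VALUES ('first row');

-- Pattern 2 (Oracle 12c and later): identity column, closest to IDENTITY(1,1).
CREATE TABLE demo_note_12c (
  demo_note_id  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  demo_note_txt VARCHAR2(100)
);
```

In the first pattern the column definition alone says nothing about auto-numbering, which is why the trigger source is the place to look.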

See it — the diagram

A three-column reference card: "Category," "SQL Server," "Oracle." Rows grouped by category. Date functions: GETDATE() / SYSDATE, DATEADD(month,...) / ADD_MONTHS(...), DATEDIFF(month,...) / MONTHS_BETWEEN(...). NULL handling: ISNULL / NVL, COALESCE / COALESCE (same). String ops: + / ||, LEN / LENGTH, CHARINDEX(x,y) / INSTR(y,x) (reversed). Result limiting: TOP 10 / WHERE ROWNUM <= 10 or FETCH FIRST. Conversion: CONVERT(varchar,...) / TO_CHAR(...). Identifiers: [brackets] / "quotes". The visual is the driver's guidebook on the passenger seat, rendered as a SQL reference card.

Show me the code

SQL Server version:

-- SQL Server: top 10 most recent payroll postings.
SELECT TOP 10
       LEN(s.spriden_last_name)              AS lname_length,
       ISNULL(s.spriden_first_name, '(blank)') AS first_name,
       p.phrhist_gross,
       CONVERT(varchar(10), p.phrhist_activity_date, 23) AS posted_dt
FROM   phrhist p
JOIN   spriden s
       ON  s.spriden_pidm        = p.phrhist_pidm
       AND s.spriden_change_ind  IS NULL  -- current name row only
       AND s.spriden_entity_ind  = 'P'    -- persons, not organizations
WHERE  p.phrhist_disp = 'P'
  AND  p.phrhist_activity_date >= DATEADD(month, -12, GETDATE())
ORDER BY p.phrhist_activity_date DESC;

Oracle (Banner) translation:

-- Oracle: same intent, dialect-translated. ROWNUM wrapped in
-- a subquery to honor the ORDER BY.
SELECT * FROM (
  SELECT LENGTH(s.spriden_last_name)              AS lname_length,
         NVL(s.spriden_first_name, '(blank)')      AS first_name,
         p.phrhist_gross,
         TO_CHAR(p.phrhist_activity_date,
                 'YYYY-MM-DD')                    AS posted_dt
  FROM   phrhist p
  JOIN   spriden s
         ON  s.spriden_pidm        = p.phrhist_pidm
         AND s.spriden_change_ind  IS NULL
         AND s.spriden_entity_ind  = 'P'
  WHERE  p.phrhist_disp = 'P'
    AND  p.phrhist_activity_date >= ADD_MONTHS(SYSDATE, -12)
  ORDER BY p.phrhist_activity_date DESC
) WHERE ROWNUM <= 10;

Substitutions: LEN→LENGTH, ISNULL→NVL, CONVERT→TO_CHAR, DATEADD→ADD_MONTHS, GETDATE→SYSDATE, TOP 10→ROWNUM-in-subquery.

Where intuition fails

**'' = NULL in Oracle but NOT in SQL Server.** A SQL Server query that used WHERE x = '' to find blank rows works fine. The same query in Oracle returns NO rows because Oracle treats '' as NULL and NULL is never equal to anything. Translate to WHERE x IS NULL. See From Oracle to PostgreSQL — the Banner SaaS Migration — PostgreSQL behaves like SQL Server here.

**ROWNUM is applied BEFORE ORDER BY.** SQL Server's SELECT TOP 10 ... ORDER BY x does the right thing. Oracle's WHERE ROWNUM <= 10 ... ORDER BY x returns the first 10 the optimizer scans and THEN sorts them. Wrap in a subquery, or use FETCH FIRST 10 ROWS ONLY.

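**DATEDIFF vs MONTHS_BETWEEN have reversed argument order AND different return types.** DATEDIFF(month, '2024-01-01', '2024-06-01') returns integer 5. MONTHS_BETWEEN(DATE '2024-06-01', DATE '2024-01-01') returns 5 as a NUMBER, with the later date FIRST. Argument order is the most common mistake. The two also count differently: DATEDIFF counts month boundaries crossed, while MONTHS_BETWEEN returns a fraction for a partial month, so they disagree whenever the days of the month differ.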
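The argument-order traps are easy to verify from DUAL. The sketch below also includes one possible way to reproduce DATEDIFF's boundary counting, by truncating both dates to the first of the month; that rewrite is a suggestion, not something the translation table prescribes:

```sql
SELECT MONTHS_BETWEEN(DATE '2024-06-01', DATE '2024-01-01') AS later_first,    -- 5
       MONTHS_BETWEEN(DATE '2024-01-01', DATE '2024-06-01') AS earlier_first,  -- -5
       -- DATEDIFF(month, '2024-01-31', '2024-02-01') is 1 in SQL Server:
       -- one month boundary crossed. Truncating to month starts matches it.
       MONTHS_BETWEEN(TRUNC(DATE '2024-02-01', 'MM'),
                      TRUNC(DATE '2024-01-31', 'MM'))       AS boundary_count, -- 1
       INSTR('SPRIDEN', 'RID')                              AS haystack_first, -- 3
       INSTR('RID', 'SPRIDEN')                              AS needle_first    -- 0 (not found)
FROM   dual;
```

Swapped arguments never raise an error in either function. The sign flips, or the search returns 0.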

Identity columns require a different mental model. SQL Server's IDENTITY(1,1) is column-level metadata. Oracle's sequence+trigger pattern is two separate objects, and the trigger must be present for INSERTs to populate the column. When reading Banner DDL, look at the trigger code — it is not in the column definition.

Stored procedure syntax is entirely different. SQL Server's T-SQL and Oracle's PL/SQL share almost no syntax. Translation here is a rewrite, not a substitution. Banner ships thousands of PL/SQL procedures; reading them requires a different mental model.

The one-sentence takeaway

Moving from SQL Server to Oracle is remapping muscle memory: GETDATE()→SYSDATE, TOP 10→ROWNUM/FETCH FIRST, ISNULL→NVL, +→||, LEN→LENGTH, CONVERT→TO_CHAR, DATEADD→ADD_MONTHS, CHARINDEX→INSTR. Most substitutions are 1:1. The non-1:1 mines: '' = NULL in Oracle, ROWNUM before ORDER BY, and reversed DATEDIFF/MONTHS_BETWEEN argument order.

Track C · From generic SQL to Banner

From Oracle to PostgreSQL — the Banner SaaS Migration

5 min read · oracle · postgresql · dialect-translation · banner-saas · migration · empty-string-null

The hook

Ellucian's cloud Banner targets PostgreSQL, not Oracle. Every Argos DataBlock you write today in Oracle SQL will eventually run against a PostgreSQL database. Some of the SQL translates mechanically. Some doesn't. And one difference — '' = NULL — will silently change what rows your query returns without raising an error.

The everyday analogy

An American moves to Spain for a new job. Their professional skills carry: they are still a competent project manager, still able to read a budget, still able to lead a meeting. But the daily life around their work changes in dozens of ways. The currency is euros, not dollars, a clean substitution. The address format puts the postcode in a different position, which is translatable. Speed limits are posted in km/h instead of mph, translatable but easy to misread.

And then there are the deeper differences. The work day starts and ends later. The default lunch is two hours, not thirty minutes. Negotiation styles favor relationship-first conversation. Healthcare is public and tax-funded, not employer-tied. These are not vocabulary substitutions — they are semantic shifts. The person who treats them as "just translate the words" will end up with problems that look like miscommunication but are actually genre-level mismatch.

[Illustration: an American passport stamped with a Spanish residency visa on a wooden desk beside a Spanish phrasebook open to a 'cultural differences' page; a half-unpacked moving box visible in the background.]

Moving Oracle SQL to PostgreSQL is the same shape of move. Most translations are mechanical: SYSDATE → CURRENT_TIMESTAMP, NVL → COALESCE, DECODE → CASE WHEN. Like swapping currencies. But scattered through Oracle's idioms are semantic differences — the deepest one being Oracle's treatment of empty string as NULL, which PostgreSQL does NOT share. An Oracle query that worked correctly for ten years because '' and NULL were interchangeable will silently break in PostgreSQL where they are distinct values. The query still RUNS. It just stops returning the right rows. That is the kind of bug you do not catch until users notice the report's numbers drifted.

The translation guide is essential. The warning list is critical.

What it really is

Translations by category, each marked mechanical (1:1 substitute) or semantic (rewrite needed):

Date/time (mostly mechanical): SYSDATE → CURRENT_TIMESTAMP (or NOW(), or LOCALTIMESTAMP). ADD_MONTHS(d, n) → d + n * INTERVAL '1 month'. MONTHS_BETWEEN(d1, d2) → no direct equivalent; for whole months, compute EXTRACT(YEAR FROM age(d1,d2))*12 + EXTRACT(MONTH FROM age(d1,d2)), which drops the fractional part that MONTHS_BETWEEN returns. TRUNC(date_col) → DATE_TRUNC('day', date_col)::date.
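The date translations can be tried in PostgreSQL without any tables. A small sketch with arbitrary literal dates; the Oracle result in the second comment is stated for comparison:

```sql
-- PostgreSQL
SELECT (DATE '2024-01-15' + 3 * INTERVAL '1 month')::date       AS add_months_equiv, -- 2024-04-15
       -- Oracle's ADD_MONTHS(DATE '2024-02-29', 1) gives 2024-03-31 (month end
       -- to month end); interval arithmetic keeps the day number instead.
       (DATE '2024-02-29' + INTERVAL '1 month')::date           AS month_end_case,   -- 2024-03-29
       EXTRACT(YEAR  FROM age(DATE '2024-06-01', DATE '2024-01-01')) * 12
     + EXTRACT(MONTH FROM age(DATE '2024-06-01', DATE '2024-01-01'))
                                                                 AS whole_months,     -- 5
       DATE_TRUNC('day', TIMESTAMP '2026-09-15 14:05:00')::date  AS trunc_equiv;      -- 2026-09-15
```

The month-end row is the one to watch in reports that key on the last day of a month.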

NULL handling (mostly mechanical): NVL(a, b) → COALESCE(a, b) — clean substitute. **SEMANTIC: '' = NULL in Oracle, '' ≠ NULL in PostgreSQL.** See the first gotcha under Where intuition fails.

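String operations (mostly mechanical): || for concat works in both, with one semantic catch: PostgreSQL returns NULL if any operand is NULL, where Oracle skips NULL operands. Wrap nullable operands in COALESCE, or use CONCAT(), which ignores NULLs in PostgreSQL. INSTR(s, sub) → POSITION(sub IN s) (note argument order change). SUBSTR → SUBSTRING or SUBSTR (PostgreSQL supports both).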

Conditional (mechanical): DECODE(...) → CASE WHEN ... THEN ... ELSE ... END.

Result limiting (mechanical): WHERE ROWNUM <= 10 → LIMIT 10 (add ORDER BY to make it deterministic). FETCH FIRST N ROWS ONLY works in both.

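Quote / case sensitivity (semantic): Oracle folds unquoted identifiers to UPPERCASE; PostgreSQL folds to LOWERCASE. SPRIDEN_PIDM in Oracle is spriden_pidm in PostgreSQL. Unquoted identifiers work on both because each database folds the reference the same way it folded the definition; a quoted "SPRIDEN_PIDM" matches only a column stored in that exact case. Cross-platform code should leave identifiers unquoted throughout.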

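**MERGE / UPSERT (semantic):** Oracle's MERGE INTO ... USING ... → PostgreSQL's INSERT ... ON CONFLICT (...) DO UPDATE SET .... Different syntax, similar semantics: a rewrite, not a substitution. PostgreSQL 15 and later also provide a MERGE statement, but an Oracle MERGE should still be reviewed rather than assumed to run unchanged.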

**(+) outer joins (semantic, mandatory rewrite):** PostgreSQL does NOT support (+). Every legacy Oracle query using (+) must be rewritten to ANSI JOIN before migration. See From (+) to ANSI — Retiring Oracle's Old Outer Join.

DUAL (mechanical removal): Oracle's SELECT SYSDATE FROM dual → PostgreSQL's SELECT CURRENT_TIMESTAMP (no FROM needed).

Sequence semantics (mechanical): Oracle's SEQ.NEXTVAL → PostgreSQL's nextval('seq'). Slightly different syntax; same semantics.

See it — the diagram

A three-column reference card: "Oracle," "PostgreSQL," and a "mechanical or semantic" flag in the third column. Mechanical rows in ink: SYSDATE/CURRENT_TIMESTAMP, NVL/COALESCE, DECODE/CASE, ROWNUM/LIMIT. Semantic rows in coral: (+)/ANSI JOIN (rewrite), ''=NULL/''≠NULL (audit every occurrence), MERGE/ON CONFLICT (rewrite). The visual says: most of the migration is a phrasebook; the flagged rows are where you stop translating and start auditing.

Show me the code

Oracle version:

-- Oracle: a typical Banner-flavored query. The ORDER BY sits in an
-- inline view so that ROWNUM keeps the 100 most recent rows.
SELECT * FROM (
  SELECT s.spriden_id,
         s.spriden_last_name || ', ' ||
           NVL(s.spriden_first_name, '(blank)')  AS full_name,
         TO_CHAR(p.phrhist_activity_date, 'YYYY-MM-DD') AS posted_dt,
         p.phrhist_gross
  FROM   phrhist p
  JOIN   spriden s
         ON  s.spriden_pidm        = p.phrhist_pidm
         AND s.spriden_change_ind  IS NULL
         AND s.spriden_entity_ind  = 'P'
  WHERE  p.phrhist_disp        = 'P'
    AND  p.phrhist_activity_date >= ADD_MONTHS(SYSDATE, -12)
  ORDER BY p.phrhist_activity_date DESC
)
WHERE  ROWNUM <= 100;

PostgreSQL translation:

-- PostgreSQL: NVL→COALESCE, SYSDATE→CURRENT_TIMESTAMP,
-- ADD_MONTHS→INTERVAL arithmetic, ROWNUM→LIMIT.
SELECT s.spriden_id,
       s.spriden_last_name || ', ' ||
         COALESCE(s.spriden_first_name, '(blank)') AS full_name,
       TO_CHAR(p.phrhist_activity_date, 'YYYY-MM-DD') AS posted_dt,
       p.phrhist_gross
FROM   phrhist p
JOIN   spriden s
       ON  s.spriden_pidm        = p.phrhist_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  p.phrhist_disp        = 'P'
  AND  p.phrhist_activity_date >= CURRENT_TIMESTAMP - INTERVAL '12 months'
ORDER BY p.phrhist_activity_date DESC
LIMIT 100;

Where intuition fails

**'' = NULL in Oracle but NOT in PostgreSQL — the most dangerous gotcha.** Oracle treats empty string '' as NULL. WHERE x = '' returns NO rows in Oracle (NULL is never equal to anything). The same query in PostgreSQL returns rows where x is literally an empty string. Conversely, WHERE x IS NULL in Oracle catches both NULL and empty string; in PostgreSQL it catches only NULL. Audit every IS NULL and = '' in Banner SQL before migration.
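A short PostgreSQL sketch of the split, using three literal values instead of a Banner column. The NULLIF form in the last column is one possible way to write a "blank or missing" test that should select the same rows on both databases:

```sql
-- PostgreSQL: one ordinary value, one empty string, one NULL.
WITH t(x) AS (VALUES ('abc'), (''), (NULL))
SELECT COUNT(*) FILTER (WHERE x = '')                AS eq_empty,       -- 1
       COUNT(*) FILTER (WHERE x IS NULL)             AS is_null,        -- 1
       COUNT(*) FILTER (WHERE NULLIF(x, '') IS NULL) AS blank_or_null   -- 2
FROM   t;
```

On Oracle the empty string and the NULL are the same stored value, so a plain IS NULL test already covers both; the NULLIF wrapper makes that intent explicit before the migration rather than after it.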

Unquoted identifier case-folding is opposite. Oracle folds to UPPERCASE; PostgreSQL folds to LOWERCASE. A column created as Spriden_Pidm becomes SPRIDEN_PIDM in Oracle and spriden_pidm in PostgreSQL. For cross-database code, leave identifiers unquoted everywhere; a quoted identifier matches only the exact case stored, and the stored case differs between the two databases.
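The folding rule can be seen directly in PostgreSQL with a throwaway table. The name demo_fold is hypothetical and exists only for this sketch:

```sql
-- PostgreSQL
CREATE TEMP TABLE demo_fold (Spriden_Pidm integer);

-- The unquoted name was folded to lowercase when the table was created.
SELECT column_name
FROM   information_schema.columns
WHERE  table_name = 'demo_fold';
-- spriden_pidm

-- Every unquoted spelling folds the same way, so all three resolve.
SELECT spriden_pidm, SPRIDEN_PIDM, Spriden_Pidm FROM demo_fold;

-- A quoted uppercase name is looked up exactly as written and is not found:
-- SELECT "SPRIDEN_PIDM" FROM demo_fold;
```

The commented-out statement is the failure case: on Oracle the quoted uppercase name is the one that works, which is why quoting makes a query less portable, not more.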

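**Interval arithmetic replaces ADD_MONTHS.** CURRENT_TIMESTAMP - INTERVAL '12 months' is clean and readable, and it is the idiomatic PostgreSQL form of ADD_MONTHS(SYSDATE, -12). The two differ at month ends: ADD_MONTHS maps the last day of a month to the last day of the target month, while PostgreSQL keeps the day number where it can. Check any report that keys on month-end dates.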

**MERGE syntax differs fundamentally.** Oracle's MERGE INTO target USING source ON (...) WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT becomes PostgreSQL's INSERT ... ON CONFLICT (...) DO UPDATE SET .... The semantics are similar but the syntax is a rewrite.
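A minimal PostgreSQL upsert, sketched against a hypothetical demo_term table rather than a Banner object, to show the shape the rewrite takes:

```sql
-- PostgreSQL; demo_term is illustrative only.
CREATE TEMP TABLE demo_term (
  term_code text PRIMARY KEY,
  term_desc text NOT NULL
);

INSERT INTO demo_term (term_code, term_desc)
VALUES ('202610', 'Fall 2026');

-- Upsert: insert the row, or update the description if the key exists.
-- EXCLUDED refers to the row that was proposed for insertion.
INSERT INTO demo_term (term_code, term_desc)
VALUES ('202610', 'Fall Semester 2026')
ON CONFLICT (term_code)
DO UPDATE SET term_desc = EXCLUDED.term_desc;

SELECT term_code, term_desc FROM demo_term;
-- 202610 | Fall Semester 2026
```

ON CONFLICT needs a unique constraint or index on the conflict column, here the primary key, so the target table's keys are part of the translation.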

**(+) outer joins are not supported.** PostgreSQL has no equivalent for Oracle's (+) syntax. Every Banner query using (+) must be rewritten to ANSI JOIN before it will run. See From (+) to ANSI — Retiring Oracle's Old Outer Join — audit your DataBlocks early; this is a common SaaS migration blocker.

The one-sentence takeaway

Oracle-to-PostgreSQL translation is mostly mechanical: SYSDATE→CURRENT_TIMESTAMP, NVL→COALESCE, DECODE→CASE, ROWNUM→LIMIT, ADD_MONTHS→+ INTERVAL. The semantic minefields: (1) '' = NULL in Oracle but '' ≠ NULL in PostgreSQL — audit every IS NULL and = ''; (2) unquoted identifiers fold to lowercase in PostgreSQL; (3) (+) outer joins are not supported — every one must be rewritten to ANSI JOIN before migration; (4) MERGE→INSERT ... ON CONFLICT is a rewrite, not a substitution.

Track C · From generic SQL to Banner

From (+) to ANSI — Retiring Oracle's Old Outer Join

You open an older Banner SR report and see WHERE a.x = b.x(+). It looks like a typo. It is not. It is Oracle's pre-ANSI outer join syntax — the stick-shift of the SQL world. It still runs, but PostgreSQL won't accept it, and the modern world has moved on. Here is the translation.

6 min read · oracle · ansi-join · legacy · plus-syntax · outer-join · banner-saas

The hook

The everyday analogy

An older driver learned to drive on a manual transmission: clutch in, shift to first, ease off the clutch while feathering the gas, shift to second at 15 mph, third at 25, fourth at 35. Every drive is an exercise in coordination — left foot, right foot, right hand on the gear lever, eyes on the tachometer, ears tuned for engine strain. The driver who masters this can squeeze every advantage out of the engine. But it takes years to internalize.

A younger driver learns on automatic. They never touch the clutch. The car decides when to shift. The driver focuses on steering and brake and throttle and traffic — the same outputs, far less mental overhead. The automatic transmission handles the gear math. It is the modern default. Driver's ed classes barely teach manual anymore.

Figure: a vintage stick-shift gear lever in worn leather on the left of the frame, a modern automatic shifter on the right; both attached to the same dashboard, suggesting two cars side by side; a small instructor's card between them reading 'same destination, different effort.'

Oracle's (+) outer-join syntax is the stick-shift of the SQL world. Before the ANSI JOIN standard, Oracle had its own way to express outer joins: a trailing (+) on the optional side of an equality predicate in the WHERE clause. It worked, it was Oracle-specific, and a generation of Banner SQL writers learned it as the default. The ANSI JOIN syntax (LEFT JOIN ... ON ...) is the automatic transmission — same output, less mental overhead, easier to read, easier to modify, supported by every modern database.

Most modern Banner SQL today uses ANSI JOIN. But the legacy code base — older SR reports, older DataBlocks — is full of (+). Reading them requires the stick-shift mental model. Maintaining them is fine. Porting them to PostgreSQL (the SaaS migration target) requires translating every (+) to ANSI JOIN — because PostgreSQL does not support the stick-shift at all.

What it really is

The (+) syntax marks the OPTIONAL side of an equality predicate. WHERE a.x = b.x(+) means "keep all rows of a, with matching b rows where they exist; otherwise b's columns are NULL." Functionally a LEFT JOIN.

The translation table:

WHERE a.x = b.x(+) → a LEFT JOIN b ON b.x = a.x
WHERE a.x(+) = b.x → a RIGHT JOIN b ON a.x = b.x (or refactor to LEFT for readability)
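WHERE a.x(+) = b.x(+) → no direct translation: Oracle rejects (+) on both sides of one predicate (ORA-01468). In ANSI syntax a full outer join is a FULL OUTER JOIN b ON a.x = b.x; legacy code emulates it with a UNION of two one-sided (+) queries.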

WHERE conditions on the OUTER side must move INTO the ON clause when rewriting. This is the same trap as The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN: WHERE a.x = b.x(+) AND b.y(+) = 'X' — the (+) on b.y is required to preserve outer-join semantics. The ANSI rewrite must move that filter inside ON: a LEFT JOIN b ON b.x = a.x AND b.y = 'X'. If you leave it in WHERE, the rewrite silently converts the LEFT JOIN back to an INNER.

Figure: three-row comparison; each row shows a legacy (+) form on the left, an arrow, and the ANSI JOIN equivalent on the right; rows for LEFT (a.x = b.x(+)), RIGHT (a.x(+) = b.x), and FULL OUTER (a.x(+) = b.x(+)).

**Limitations of (+):**

Cannot express FULL OUTER JOIN cleanly.
Cannot be combined with ANSI JOIN syntax in the same query.
Cannot use complex conditions — (+) must be on a column reference, not an expression.
Not supported by PostgreSQL — see From Oracle to PostgreSQL — the Banner SaaS Migration.

See it — the diagram

Three rows, each a translation pair. Row 1: WHERE a.x = b.x(+) → a LEFT JOIN b ON b.x = a.x. Row 2: WHERE a.x(+) = b.x → a RIGHT JOIN b ON a.x = b.x (with a small note: "prefer refactor to LEFT"). Row 3: WHERE a.x(+) = b.x(+) → a FULL OUTER JOIN b ON a.x = b.x. The visual is the gear-shift pattern diagram (three legacy positions mapped to their modern equivalents) rendered as a SQL reference card. Row 3 is conceptual only: Oracle rejects (+) on both sides of one predicate (ORA-01468), so the double-marker form never appears in working legacy code.

Show me the code

Case 1: simple LEFT JOIN:

-- Legacy Oracle (+): keep every student row; the description is NULL
-- when the major code has no match in STVMAJR.
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g, stvmajr m
WHERE  g.sgbstdn_majr_code_1 = m.stvmajr_code(+);

-- ANSI JOIN rewrite:
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g
LEFT JOIN stvmajr m ON m.stvmajr_code = g.sgbstdn_majr_code_1;

Case 2: filter on the outer side — must move INTO ON:

-- Legacy Oracle (+) with filter on outer side:
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g, stvmajr m
WHERE  g.sgbstdn_majr_code_1 = m.stvmajr_code(+)
  AND  m.stvmajr_valid_a_ind(+) = 'Y';     -- (+) preserves outer-join

-- ANSI rewrite: filter belongs INSIDE the ON clause
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g
LEFT JOIN stvmajr m
       ON  m.stvmajr_code         = g.sgbstdn_majr_code_1
       AND m.stvmajr_valid_a_ind  = 'Y';     -- filter in ON

Case 3: the WHERE-vs-ON bug, the same trap as The Phantom INNER JOIN:

-- BUG (legacy): filter without (+) on outer side
-- silently converts the outer join to an inner.
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g, stvmajr m
WHERE  g.sgbstdn_majr_code_1 = m.stvmajr_code(+)
  AND  m.stvmajr_valid_a_ind = 'Y';     -- NO (+): kills outer join

-- ANSI rewrite of the BUG (still wrong):
SELECT g.sgbstdn_pidm, m.stvmajr_desc
FROM   sgbstdn g
LEFT JOIN stvmajr m ON m.stvmajr_code = g.sgbstdn_majr_code_1
WHERE  m.stvmajr_valid_a_ind = 'Y';     -- WHERE rejects NULL rows

The bug exists in BOTH dialects; see The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN for the full discussion. The fix differs by dialect: in the legacy form, add (+) to the filter predicate (as in Case 2); in ANSI, move the filter into ON.
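The ANSI half of this behavior can be reproduced on any database that supports LEFT JOIN. The sketch below uses Python's built-in sqlite3 module with two tiny stand-in tables; the table and column names are simplified placeholders, not the real SGBSTDN and STVMAJR layouts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (pidm INTEGER, majr_code TEXT);
    CREATE TABLE major   (code TEXT, descr TEXT, valid_ind TEXT);
    INSERT INTO student VALUES (1, 'BIO'), (2, 'OLD'), (3, 'XXX');
    INSERT INTO major   VALUES ('BIO', 'Biology', 'Y'),
                               ('OLD', 'Retired Major', 'N');
""")

# Filter inside ON: every student survives; unmatched majors come back NULL.
filter_in_on = conn.execute("""
    SELECT s.pidm, m.descr
    FROM   student s
    LEFT JOIN major m ON m.code = s.majr_code AND m.valid_ind = 'Y'
    ORDER BY s.pidm
""").fetchall()

# Filter in WHERE: rows with a NULL valid_ind are rejected after the join,
# so the LEFT JOIN behaves like an INNER JOIN.
filter_in_where = conn.execute("""
    SELECT s.pidm, m.descr
    FROM   student s
    LEFT JOIN major m ON m.code = s.majr_code
    WHERE  m.valid_ind = 'Y'
    ORDER BY s.pidm
""").fetchall()

print(filter_in_on)     # three rows, two with a NULL description
print(filter_in_where)  # one row
```

Comparing row counts between the legacy query and its rewrite, as done here, is a cheap regression check for every translated DataBlock.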

Where intuition fails

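**You cannot mix (+) and ANSI JOIN in the same query.** Oracle rejects a query block that contains both. When refactoring an older query, you must rewrite ALL the (+) markers in that block in one pass; you cannot convert one join and leave the next. The migration as a whole can still proceed one query at a time.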

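**(+) on both sides of a predicate is an error.** Oracle rejects a.x(+) = b.x(+) with ORA-01468 (a predicate may reference only one outer-joined table); it is not shorthand for FULL OUTER JOIN. Legacy code that needs a full outer join builds it as a UNION of two one-sided (+) queries. In ANSI syntax, use FULL OUTER JOIN directly.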
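When a full outer join has to be expressed without FULL OUTER JOIN syntax, the usual workaround is two one-sided outer joins glued together. The sketch below shows the idea with sqlite3 and two hypothetical one-column tables: a left join from a, plus the rows of b that found no partner in a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

rows = conn.execute("""
    SELECT a.x AS a_x, b.x AS b_x
    FROM   a LEFT JOIN b ON b.x = a.x       -- all of a, matched where possible
    UNION ALL
    SELECT a.x, b.x
    FROM   b LEFT JOIN a ON a.x = b.x       -- all of b ...
    WHERE  a.x IS NULL                      -- ... keeping only the unmatched
""").fetchall()

print(rows)
```

The WHERE a.x IS NULL in the second branch is deliberate: it is the one place where filtering on the outer side is the point, because it isolates the rows the first branch could not produce.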

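**(+) does not support complex predicates.** The marker attaches to a column reference, not to an expression or function call: NVL(b.x, 0)(+) is a syntax error, although a (+)-marked column may appear inside an expression, as in NVL(b.x(+), 0). Oracle also rejects (+) in an operand of OR or IN (ORA-01719). ANSI JOIN has none of these limitations.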

**PostgreSQL does NOT support (+).** Every legacy Banner query using (+) must be rewritten to ANSI JOIN before the SaaS migration to PostgreSQL. Audit your DataBlocks early: search for (+) across every report. It is a well-defined task with no dependencies, so nothing stops you from starting it today.
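The search itself is easy to script once the SQL has been exported as text. The sketch below assumes a dictionary mapping DataBlock names to their SQL; how you export that text from Argos is outside its scope, and the names used are illustrative.

```python
import re

# Matches Oracle's outer-join marker, tolerating stray whitespace: (+), ( + )
PLUS_MARKER = re.compile(r"\(\s*\+\s*\)")


def find_plus_joins(datablocks: dict[str, str]) -> dict[str, list[int]]:
    """Map each DataBlock name to the line numbers that contain (+)."""
    hits = {}
    for name, sql in datablocks.items():
        lines = [number
                 for number, line in enumerate(sql.splitlines(), start=1)
                 if PLUS_MARKER.search(line)]
        if lines:
            hits[name] = lines
    return hits


exported = {
    "legacy_majors": "SELECT 1\nFROM a, b\nWHERE a.x = b.x(+)\n  AND b.y( + ) = 'X'",
    "modern_majors": "SELECT 1 FROM a LEFT JOIN b ON b.x = a.x",
}
print(find_plus_joins(exported))
```

The pattern also fires on (+) inside comments and string literals, so treat the output as a worklist to review rather than an exact count.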

The one-sentence takeaway

Oracle's (+) in WHERE marks the optional side of an outer join. a.x = b.x(+) → a LEFT JOIN b ON b.x = a.x. a.x(+) = b.x → a RIGHT JOIN b ON a.x = b.x. WHERE conditions on the outer side must move INTO the ON clause during translation, or the rewrite silently converts the outer join back to an inner. PostgreSQL does NOT support (+) — every legacy Banner query using it must be rewritten before SaaS migration.

Track D · The craft of Argos

Argos Parameters — `:main_`, `:lcl_`, `:dbn_`

Every Argos report is a building full of rooms, and every parameter is a microphone. The question is never 'does this parameter exist?' It is always 'can this room hear it?' The three prefixes — :main_, :lcl_, :dbn_ — are the three answers to that question.

9 min read · argos · parameters · scope · datablock · reports

The hook

Every Argos report is a building full of rooms, and every parameter is a microphone. When a report breaks with "parameter not bound," the problem is rarely that the parameter does not exist — it is that the room you are standing in cannot hear it. The three prefixes you see in every Argos SQL block — :main_, :lcl_, :dbn_ — are the three answers to the question "can this room hear this microphone?" Learn the reach of each one, and the parameter system goes from mysterious to obvious in about ten minutes.

The everyday analogy

Walk into a 1950s school building. Three sound systems are wired through the walls, and each one reaches a different audience.

First, the whole-school PA system. The loudspeaker grille is mounted high near the ceiling in every hallway and every classroom. When the principal hits the all-call button in the morning and says "good morning — today's lunch is pizza," every room in the building hears it simultaneously. The lunch menu is a top-level concern. Everyone needs it. Nobody is out of range.

Figure: a 1950s school hallway at golden hour: the round PA loudspeaker high on the wall (main), a classroom intercom panel beside a door (lcl), and a black wall-mounted telephone labeled 'Room 207' between two doorways (dbn). Three sound systems, three scopes, one building.

Second, the classroom intercom. Inside each room, the teacher has a small panel microphone mounted beside the chalkboard. When the teacher says "open your books to page 47," only the thirty students in that room hear it. The room next door is still reading page 102 of its own lesson, unaware. The page number is local — only the lesson happening in this room cares about it.

Third, the room-to-room wall phone. Mounted between two doorways in the hall, a black bakelite telephone with a handwritten label beneath it: "Room 207." A teacher who needs to check on a shared student with Mr. Smith picks up the receiver, dials the room number, and asks. The call goes to exactly one named destination — no broadcast, no default. You have to specify which room.

The mapping to Argos is exact:

**:main_*** is the school PA. Defined at the top of the report tree on the main DataBlock. Visible everywhere: every subreport, every banded child, every nested block in the entire report. The user-facing widgets at the top of a report are almost always :main_* because the user enters them once and expects the whole report to filter by them.

**:lcl_*** is the classroom intercom. Defined on a child or banded DataBlock. Visible only within that block. The parent DataBlock cannot hear it. The sibling DataBlock next to it cannot hear it. It is the per-iteration variable inside a banded subreport: a student PIDM passed down from the parent row, a course CRN that changes with each iteration of the band.

**:dbn_*** is the wall phone. A cross-reference to a parameter defined on another DataBlock that you name explicitly: :dbn_fiscal_calendar.year_value means "look in the DataBlock named fiscal_calendar and read its year_value parameter." Used when two sibling DataBlocks need to share a filter value but neither is the parent.

What it really is

An Argos report is a tree. The trunk is the main DataBlock — the top-level query whose result set defines the report's basic shape. Branches are banded subreports — child DataBlocks that fire once per row of the parent, producing detail sections that iterate. Each node in this tree can declare parameters. The scope rules govern which parameter is visible from which node.

**:main_* — top-level, visible everywhere.** Defined on the main DataBlock. The user-facing filter widgets at the top of the report — a dropdown for term code, an edit box for minimum salary, a date picker for hire date — are :main_* parameters. Example: :main_DD_term_code is a dropdown (DD) that lets the user pick a term. Every DataBlock in the report, at any depth, can reference it in its SQL WHERE clause.

**:lcl_* — local to one child DataBlock.** Defined inside a banded subreport. The main DataBlock cannot reference it — Argos will error with "parameter not bound." When a banded child iterates over parent rows, :lcl_* parameters hold per-iteration values passed from the parent. Example: :lcl_StudentPIDM is the student PIDM for the current iteration of a per-student detail block. The parent main query sees many students; the child detail block sees one at a time, and :lcl_StudentPIDM tells it which one.

**:dbn_* — cross-reference to a named DataBlock.** Points explicitly to a parameter on another DataBlock by name. Example: :dbn_fiscal_calendar.year_value reaches into a DataBlock named fiscal_calendar and reads its year_value parameter. The DataBlock name must match exactly — it is case-sensitive in Argos — and the target DataBlock must exist at runtime. Used sparingly, mostly in complex multi-block dashboards where two siblings share a dimension without wanting to promote it to :main_.
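The three scope rules can be written down as a toy resolver. The sketch below is a mental model only: the DataBlock class, the resolve function, and the dictionary of blocks are hypothetical and do not correspond to any Argos API. It assumes each block stores its own parameters and knows its parent.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataBlock:
    name: str
    parent: Optional["DataBlock"] = None
    params: dict = field(default_factory=dict)  # parameter name -> value


def resolve(reference: str, current: DataBlock, blocks: dict) -> object:
    """Resolve a reference such as ':main_DD_term_code' as seen from `current`."""
    token = reference.lstrip(":")
    if token.startswith("main_"):
        root = current
        while root.parent is not None:       # climb to the main DataBlock
            root = root.parent
        return root.params[token]
    if token.startswith("lcl_"):
        return current.params[token]         # KeyError models "parameter not bound"
    if token.startswith("dbn_"):
        block_name, param = token[len("dbn_"):].split(".", 1)
        return blocks[block_name].params[param]   # exact, case-sensitive name
    raise ValueError(f"unknown scope prefix in {reference!r}")


main = DataBlock("main", params={"main_DD_term_code": "202610"})
detail = DataBlock("detail", parent=main, params={"lcl_StudentPIDM": 12345})
calendar = DataBlock("fiscal_calendar", parent=main, params={"year_value": 2026})
blocks = {block.name: block for block in (main, detail, calendar)}

print(resolve(":main_DD_term_code", detail, blocks))
print(resolve(":dbn_fiscal_calendar.year_value", detail, blocks))
```

Asking the main block for :lcl_StudentPIDM raises KeyError in this model, which is the "room cannot hear the microphone" case.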

After the scope prefix, Argos developers conventionally add a two-letter widget type code. These are documentation, not enforcement — Argos does not validate that :main_DD_* is actually wired to a dropdown — but the convention is so universal that reading parameter names without it feels like reading a sentence without punctuation:

| Code | Widget | What the user sees |
| --- | --- | --- |
| `EB` | Edit Box | Free text or number entry |
| `DD` | Drop Down | Single-select from a predefined list |
| `DA` | Date | Single calendar date picker |
| `DR` | Date Range | Start-date and end-date pair |
| `CB` | Check Box | Single boolean on/off |
| `MC` | Multi-Checkbox | Multiple values selected from a list |
| `RB` | Radio Button | Single choice from a small set |

So :main_DD_term_code reads left to right as: "top-level dropdown for term code." :lcl_EB_min_gpa reads as: "local edit box for minimum GPA." The convention is a code review aid — a parameter named :main_DD_* wired to a date picker is visually wrong and a reviewer should flag it.
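Because the convention is mechanical, a reviewer can check it with a few lines of code. The sketch below is a hypothetical lint helper that splits a :main_ or :lcl_ parameter name into scope, widget, and subject using the table above; it deliberately ignores :dbn_ references, which have a different shape.

```python
import re

WIDGETS = {
    "EB": "Edit Box", "DD": "Drop Down", "DA": "Date", "DR": "Date Range",
    "CB": "Check Box", "MC": "Multi-Checkbox", "RB": "Radio Button",
}
PARAM = re.compile(r"^:(main|lcl)_([A-Z]{2})_(\w+)$")


def describe(param: str) -> str:
    """Read a parameter name left to right: scope, widget type, subject."""
    match = PARAM.match(param)
    if not match:
        return f"{param}: does not follow the scope_widget_name convention"
    scope, code, name = match.groups()
    widget = WIDGETS.get(code, f"unknown widget code {code}")
    return f"{scope} / {widget} / {name}"


for example in (":main_DD_term_code", ":lcl_EB_min_gpa", ":lcl_StudentPIDM"):
    print(describe(example))
```

A name such as :lcl_StudentPIDM is reported as off-convention, which is the kind of finding a code review would then discuss rather than reject automatically.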

Figure: the Argos report tree: main DataBlock at the top, child banded subreports below. :main_ arrows reach everywhere in the tree. :lcl_ arrows stay within their own block. :dbn_ arrows jump horizontally to a named sibling DataBlock.

At execution time, Argos substitutes parameter values into the SQL string before sending it to the database. The substitution handles quoting based on the parameter's declared type: text parameters get single-quoted, numeric parameters pass through bare, multi-value parameters expand into comma-separated lists. The substitution is string-level, not bind-variable level — Argos splices values into the SQL text.

See it — the diagram

The scope diagram shows the report tree. The main DataBlock at the top, two banded child DataBlocks below it, and a named sibling off to the side. Solid arrows from :main_* reach every block in the tree — the PA system. Dashed arrows from :lcl_* stay inside their own block — the classroom intercom. A dotted arrow from :dbn_* jumps horizontally to the named sibling — the wall phone. Once you have seen the tree, the scope rules are visual, not memorized.

Show me the code

A typical main DataBlock with a dropdown parameter — the user picks a term, the report filters to that term:

-- Main DataBlock SQL. :main_DD_term_code is a Drop Down populated
-- by an Argos-side query against STVTERM. The user picks one value.
SELECT sr.sfrstcr_pidm, sr.sfrstcr_crn, sr.sfrstcr_credit_hr
FROM   sfrstcr sr
WHERE  sr.sfrstcr_term_code = :main_DD_term_code;

A banded child DataBlock — runs once per student row from the parent, using both a :main_ parameter (the term) and a :lcl_ parameter (the current student's PIDM):

-- Child DataBlock SQL — per-student grade detail. Runs once per
-- parent row, with :lcl_StudentPIDM bound to that row's PIDM.
SELECT g.shrgrde_subj_code, g.shrgrde_crse_numb,
       g.shrgrde_grde_code_final
FROM   shrgrde g
WHERE  g.shrgrde_pidm      = :lcl_StudentPIDM
  AND  g.shrgrde_term_code = :main_DD_term_code;

A cross-DataBlock reference — a payroll block reads the fiscal year from a named sibling instead of redefining it:

-- This block reads its year filter from the named "fiscal_calendar"
-- DataBlock rather than duplicating the parameter definition.
SELECT SUM(p.phrhist_gross) AS gross_for_year
FROM   phrhist p
WHERE  p.phrhist_year = :dbn_fiscal_calendar.year_value;

A multi-checkbox parameter — Argos expands :main_MC_ecls into a comma-separated list of single-quoted values at substitution time:

-- Multi-checkbox: the user checks ECLS codes; Argos substitutes
-- them as a quoted CSV inside the IN (...) clause.
SELECT pe.pebempl_pidm, pe.pebempl_ecls_code
FROM   pebempl pe
WHERE  pe.pebempl_ecls_code IN (:main_MC_ecls);

Where intuition fails

Five traps that catch new Argos report writers, numbered by frequency:

1. **:lcl_ is invisible upward.** A child DataBlock can reach up into :main_* parameters (the child hears the PA). But the main DataBlock cannot reach down into a child's :lcl_* (the principal cannot hear every classroom intercom at once). New writers occasionally try to filter the main SELECT by a :lcl_ from a banded subreport. Argos errors with "parameter not bound," and the writer assumes they mistyped the parameter name. They did not. The scope rule forbids it.

2. **:dbn_* breaks silently on rename.** The DataBlock name in a :dbn_ reference must match the target DataBlock's name exactly, case-sensitive. Renaming a DataBlock in the Argos designer silently breaks every :dbn_* reference to it. There is no refactoring support and no warning at design time; the error only surfaces at runtime. When you rename a DataBlock, grep every SQL block in the report for the old name.

3. **The widget code is a promise your wiring may not keep.** A parameter named :main_DD_term_code can be wired to an Edit Box. Nothing in Argos prevents it. The user sees a text box, types a term code by hand, and wonders why they are not getting a dropdown. The widget code is a naming convention; treat it as a code review checklist item, not as a runtime guarantee.

4. **Type coercion is Oracle's, not Argos's.** A parameter declared as TEXT and compared against a numeric column relies on Oracle's implicit conversion. That is usually harmless on Oracle, but if the report ever migrates to PostgreSQL (a real possibility as colleges move toward SaaS Banner), stricter typing can turn the same comparison into a type error. Declare the parameter as a number, or cast explicitly on the parameter side with a form both databases accept: WHERE salary >= CAST(:main_EB_min_salary AS NUMERIC). Oracle's single-argument TO_NUMBER does not port as written, because PostgreSQL's to_number requires a format mask.

5. **Multi-value parameters expand as text, not as bind variables.** :main_MC_ecls inside IN (...) becomes IN ('E', 'F', 'S'), a literal CSV spliced into the SQL string. This means Argos performs string substitution, not bind-variable binding, for multi-value parameters. The SQL text changes with every distinct set of checked values, so each distinct set gets its own SQL ID and its own entry in Oracle's shared pool. That is fine at typical report volumes, but it explains why one report shows up as many separate statements.

The one-sentence takeaway

:main_ is the school PA — everyone hears it. :lcl_ is the classroom intercom — only this room hears it. :dbn_ is the wall phone to a specific room — you name the destination.

Track D · The craft of Argos

How Argos Assembles Your Query — Filters on the WHERE

9 min read · argos · parameters · substitution · mail-merge · bind-variables · where-clause

The hook

You type :main_DD_term_code in your DataBlock SQL, the user picks "Fall 2026" from a dropdown, and Oracle runs the query. What happens between the click and the execution is not parameter binding — it is string substitution, like a mail merge. The distinction explains every performance surprise, every silent breakage, and every "it worked yesterday" your Argos users have ever reported.

The everyday analogy

Open Microsoft Word. Open a template letter with placeholders: Dear {first_name}, your balance on {date} is {amount} for account {account_id}. Open a CSV with one row per recipient — the data source. Hit "Merge." Word reads each row of the CSV, finds the placeholders in the template, swaps in the row's values, and produces one personalized letter per row. The merge is string substitution at print time — Word does not bind variables; it rewrites the document.

Argos works exactly the same way. The template is the DataBlock's SQL. The placeholders are the parameters (:main_DD_term_code, :lcl_StudentPIDM). The data source is the user — what they typed in the filter widgets at the top of the report. At execution, Argos finds every placeholder in the SQL text, looks up the value the user supplied, and splices that value directly into the SQL string. The resulting concrete SQL is what gets sent to Oracle. Oracle never sees :main_DD_term_code — it sees '202610' already substituted in.

Figure: a Word document open on a desk with mail-merge placeholders visible, beside an open CSV showing recipient data, beside the printed personalized letter: three artifacts of one merge cycle.

The mail-merge model has consequences that ripple through everything:

Optional filters need handling. If a recipient's {middle_name} is blank in the CSV, the letter says "Dear John Smith" rather than "Dear John  Smith" (with a stray double space) only if the template anticipated the missing field. Argos templates have the same problem: if the user left a filter empty, the template needs an explicit pattern to skip that predicate.
Multi-value expansion happens at substitution time. A checkbox group sending three values expands into a literal comma-separated list inside the SQL — IN ('F', 'P', 'S'), not IN (:multi).
Quoting is the substituter's job. Word knows to splice text without adding punctuation. Argos knows the parameter's declared type and quotes strings, leaves numbers bare, formats dates. Get the type wrong and the SQL comes out malformed.
The SQL Oracle sees changes every run. Different parameter values produce different SQL text. In a mail merge, every recipient gets a different letter. In Argos, every execution sends a different concrete SQL string.

The mail-merge model is also what makes Argos approachable. The report writer reads the template and can predict exactly what Oracle will receive. The substitution is visible, not magical. The cost is that the report writer has to think about what happens when a placeholder is unfilled — and that is exactly what the "auto-WHERE" patterns in Seven Patterns Every Argos Report Needs exist to solve.

What it really is

At report run time, Argos executes a fixed sequence. Understanding it explains behavior that otherwise looks like bugs.

Step 1: Read the template. The DataBlock's SQL is a string with :scope_ParamName tokens embedded in it. Argos does not parse the SQL — it scans for parameter references.

Step 2: Read the widget values. The user's filter widgets (dropdowns, edit boxes, checkboxes, date pickers) each contain a value: a selected term code, a typed PIDM, three checked employee classes, an empty date range.

Step 3: Substitute. For each :scope_ParamName token, Argos looks up the value from the matching widget, formats it according to the parameter's declared type (text → quoted string, number → bare digits, date → per-DBMS format), and splices the formatted value into the SQL text. The original placeholder is gone. The value is now literal text in the SQL string.

Step 4: Send to Oracle. The concrete SQL — a plain string with no : parameters remaining — is sent to Oracle for parsing, planning, and execution. Oracle has no idea there was ever a template. It sees one static SQL statement.

Step 5: Return results. Oracle executes the query and returns rows. Argos formats them into the report layout.

Steps 1–3 are where everything in this article happens. Step 4 is where everything in the gotchas section bites.
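Steps 1 to 3 fit in a few lines of code. The sketch below is a toy model of the substitution as this article describes it, not Argos's actual implementation: the token pattern, the quoting rules, and the TO_DATE format are assumptions taken from the text, and Python value types stand in for the parameter's declared type.

```python
import re
from datetime import date

# Step 1: the template is scanned for :main_... and :lcl_... tokens.
TOKEN = re.compile(r":(?:main|lcl)_\w+")


def format_value(value) -> str:
    """Format one widget value as SQL literal text (assumed quoting rules)."""
    if isinstance(value, (list, tuple)):               # multi-value -> CSV
        return ", ".join(format_value(item) for item in value)
    if isinstance(value, (int, float)):                # number -> bare digits
        return str(value)
    if isinstance(value, date):                        # date -> TO_DATE(...)
        return f"TO_DATE('{value.isoformat()}', 'YYYY-MM-DD')"
    return "'" + str(value).replace("'", "''") + "'"   # text -> quoted, escaped


def substitute(template: str, values: dict) -> str:
    """Step 3: splice each formatted value over its placeholder."""
    return TOKEN.sub(lambda match: format_value(values[match.group(0)]), template)


template = ("WHERE r.sfrstcr_term_code = :main_DD_term_code "
            "AND pe.pebempl_ecls_code IN (:main_MC_ecls)")
values = {":main_DD_term_code": "202610",    # Step 2: what the widgets hold
          ":main_MC_ecls": ["F", "P"]}
rendered = substitute(template, values)
print(rendered)
```

Passing an empty list for :main_MC_ecls renders IN (), which is exactly the malformed SQL the optional-filter patterns are meant to prevent.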

String substitution vs. bind variables

Oracle has a native feature called bind variables: a SQL statement uses :param_name and the parameter value is sent separately from the SQL text. Oracle compiles the plan once and reuses it across every value of the bind. One plan, many executions, efficient cache usage.

Argos does not use this feature for parameter substitution. Argos splices values directly into the SQL text before sending it to Oracle. The : syntax visually resembles bind variables, but the mechanism is text-level — find-and-replace, not bind-and-send.
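The difference is easiest to see by counting distinct SQL texts. The sketch below uses Python's sqlite3 module, which happens to accept the same :name placeholder style, with a simplified stand-in table. It illustrates the two mechanisms in general; it is not a trace of what Argos or Oracle do internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sfrstcr (term_code TEXT, crn TEXT)")
conn.executemany("INSERT INTO sfrstcr VALUES (?, ?)",
                 [("202610", "10001"), ("202620", "20001")])

terms = ["202610", "202620"]

# Bind-and-send: one SQL text, values travel separately from it.
bound_sql = "SELECT crn FROM sfrstcr WHERE term_code = :term"
bound_texts = {bound_sql for _ in terms}
bound_rows = [conn.execute(bound_sql, {"term": term}).fetchall() for term in terms]

# Find-and-replace: each value becomes part of the SQL text itself.
spliced_texts = {bound_sql.replace(":term", f"'{term}'") for term in terms}
spliced_rows = [conn.execute(sql).fetchall() for sql in sorted(spliced_texts)]

print(len(bound_texts), "statement text with binds")
print(len(spliced_texts), "statement texts with splicing")
```

Both approaches return the same rows. What differs is how many distinct statements the database has to parse and cache.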

Figure: the assembly cycle: SQL template with :main_DD_term_code placeholder plus user-supplied value '202610' equals concrete SQL string sent to Oracle, with the substituted value highlighted in coral.

What the substitution engine does per type

Each Argos parameter has a declared type. The substitution respects it:

TEXT — wraps the value in single quotes: '202610'. Escapes embedded quotes.
NUMBER — leaves the value bare: 50000. No quotes.
DATE — formats per the database dialect, typically TO_DATE('2026-09-15', 'YYYY-MM-DD') or the dialect's date literal.

A mismatch between the declared type and how the SQL uses the parameter produces malformed SQL or implicit conversion at execution time.

Multi-value expansion

Checkbox groups (MC) and multi-select lists expand into literal comma-separated value lists at substitution time. :main_MC_ecls with three selected values becomes 'F', 'P', 'S' in the SQL string. The parentheses around IN (...) belong to the template — they are not added by the substitution. Each distinct combination of checked values produces a distinct SQL string — and a distinct entry in Oracle's SQL Plan cache.

The optional-filter problem

A widget the user can leave empty (a date range with no dates, a dropdown with "All" selected) needs the template to conditionally include or skip the WHERE predicate. The most common naive approach — WHERE x = :param OR :param IS NULL — works for single-value parameters but fails catastrophically for multi-value ones (gotcha #3). The safe recipes live in Seven Patterns Every Argos Report Needs.

See it — the diagram

The WHERE clause is where the substitution hits hardest.

Figure: three optional filters in a WHERE clause: term (required, substituted), employee class (multi-checkbox, expanded to CSV), date range (left empty by user, short-circuited). Each predicate either inlined with quoted values or omitted; the final WHERE shown at the bottom.

Three filters, three different behaviors at substitution time. The term-code filter is required: the dropdown always has a selection, the substitution inlines '202610', done. The employee-class filter is a multi-checkbox: the two checked values expand into a literal CSV inside the IN (...). The date-range filter was left empty by the user, so the template's short-circuit pattern detects the empty substitution and drops the predicate entirely. The final WHERE that Oracle receives has only two conditions, not three. The report writer never writes an IF statement. The template patterns carry the conditional logic, and it is resolved before Oracle ever sees the SQL.

Show me the code

One complete mail-merge cycle for a real Banner report.

The DataBlock's SQL template (what the report writer typed):

-- DataBlock template: registrations by faculty, optionally filtered
-- by employee class.
SELECT r.sfrstcr_term_code,
       r.sfrstcr_crn,
       r.sfrstcr_pidm,
       pe.pebempl_ecls_code
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN pebempl pe
       ON pe.pebempl_pidm = r.sfrstcr_pidm
WHERE  r.sfrstcr_term_code = :main_DD_term_code
  AND  (pe.pebempl_ecls_code IN (:main_MC_ecls)
        OR :main_MC_ecls IS NULL);

The user's input at the filter widgets:

:main_DD_term_code = 202610 (selected from a dropdown)
:main_MC_ecls = F, P (two boxes checked from the Multi-Checkbox)

The concrete SQL Oracle actually receives (after Argos splices):

SELECT r.sfrstcr_term_code,
       r.sfrstcr_crn,
       r.sfrstcr_pidm,
       pe.pebempl_ecls_code
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
LEFT JOIN pebempl pe
       ON pe.pebempl_pidm = r.sfrstcr_pidm
WHERE  r.sfrstcr_term_code = '202610'
  AND  (pe.pebempl_ecls_code IN ('F', 'P')
        OR ('F', 'P') IS NULL);

Three things to notice in the rendered SQL:

The term-code parameter substituted as a quoted string ('202610') because the parameter type is text.
The multi-checkbox expanded into a literal CSV inside the IN (...) — the parentheses are in the template, the values are spliced.
The "optional filter" tail OR :main_MC_ecls IS NULL did not short-circuit correctly — ('F', 'P') IS NULL is always false because a parenthesized value list is never NULL. This is a real failure mode of the naive pattern. When the user checks no boxes, the substitution produces IN () — which is a syntax error in Oracle. The safe alternatives are in Seven Patterns Every Argos Report Needs.
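The general shape of a safe version is to decide whether the predicate exists before any value is spliced. The sketch below shows that idea as an ordinary query-builder function; it is an illustration of the principle, not an Argos feature, and the function names are hypothetical.

```python
def in_predicate(column: str, values: list[str]) -> str:
    """Render an optional IN filter; an empty selection means 'no filter'."""
    if not values:
        return "1 = 1"                       # always true, keeps the AND chain valid
    quoted = ", ".join("'" + value.replace("'", "''") + "'" for value in values)
    return f"{column} IN ({quoted})"


def build_where(term_code: str, ecls_codes: list[str]) -> str:
    """Required term filter plus an optional employee-class filter."""
    term = "'" + term_code.replace("'", "''") + "'"
    return (f"WHERE  r.sfrstcr_term_code = {term}\n"
            f"  AND  {in_predicate('pe.pebempl_ecls_code', ecls_codes)}")


print(build_where("202610", ["F", "P"]))
print(build_where("202610", []))
```

With boxes checked the output matches the intended IN list; with none checked the predicate collapses to 1 = 1 instead of IN ().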

Where intuition fails

Five lessons that show up in production Argos reports:

Argos parameters are NOT Oracle bind variables. Tools that capture "the SQL Argos sent" — Oracle's V$SQL, AWR reports, trace files — show the substituted concrete string with literal values, not a parameterized statement. Every distinct combination of parameter values produces a distinct SQL_ID in Oracle. The SQL Plan cache fills up faster than with bind-variable code. This is helpful for debugging — every report execution is fully reproducible from the captured SQL — but it is a different beast from typical Java/PL/SQL bind-variable code. The : prefix in the template looks like a bind variable. It is not.

String substitution is SQL injection surface if user input reaches it raw. Argos handles quoting based on declared parameter type, so a TEXT parameter with the value '; DROP TABLE ...; -- would be quoted as '''; DROP TABLE ...; --' and neutralized. But if a parameter is ever spliced into a position where quoting cannot protect it — inside IN (...) parentheses, as a column name, into ORDER BY — the protection lapses. Treat parameter type declarations as a security boundary.
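The two positions call for two different defenses, sketched below under the assumption that you control the code doing the splicing. Values that land in a string-literal position are quoted and escaped; values that land in an identifier position, where quoting cannot help, are checked against an allow-list. The column names in the allow-list are examples only.

```python
ALLOWED_SORT_COLUMNS = {"spriden_last_name", "spriden_id", "sfrstcr_crn"}


def quote_text(value: str) -> str:
    """Quote a value for a string-literal position by doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def order_by_clause(requested: str) -> str:
    """Identifiers cannot be protected by quoting, so allow-list them."""
    if requested not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"sort column not allowed: {requested!r}")
    return f"ORDER BY {requested}"


print(quote_text("'; DROP TABLE x; --"))
print(order_by_clause("spriden_last_name"))
```

The hostile string ends up as inert text inside one literal, while anything outside the allow-list never reaches the SQL at all.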

The naive optional-filter pattern fails for multi-value parameters. WHERE x IN (:multi) OR :multi IS NULL works for single-value widgets because 'A' IS NULL evaluates correctly. For multi-value widgets the substituted ('A', 'B') IS NULL is always FALSE — the OR branch never short-circuits, and an empty multi-value selection producing IN () causes a parsing error. Use WHERE x IN (:multi) plus a separate template conditional that drops the entire predicate when the selection is empty. See Seven Patterns Every Argos Report Needs for the safe recipe.

Type coercion happens at Oracle, not at Argos. A parameter declared as TEXT and compared to a numeric column (WHERE salary >= :main_EB_min_salary) substitutes as WHERE salary >= '50000'. Oracle implicitly converts the string to a number, usually correctly, but a non-numeric value fails only at run time (ORA-01722), and stricter databases may reject the comparison. The reverse mismatch is the one that defeats indexes: a NUMBER parameter compared to a character column such as sfrstcr_term_code makes Oracle apply TO_NUMBER to the column, so an ordinary index on that column cannot be used. Declare the parameter with the type of the column it is compared to, or cast explicitly on the Oracle side: WHERE salary >= TO_NUMBER(:main_EB_min_salary).

**A renamed DataBlock breaks :dbn_* references silently.** The :dbn_named_block.param_name cross-reference from Argos Parameters — `:main_`, `:lcl_`, `:dbn_` is resolved at substitution time by name lookup. Renaming a DataBlock in the Argos designer does not update the references that point at it. The error surfaces at runtime as "DataBlock not found" or a blank substitution that produces malformed SQL — no compile-time warning, no designer validation. Grep every DataBlock's SQL for the old name before committing a rename.
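That search is also easy to script. The sketch below assumes the DataBlock SQL has been exported into a dictionary of name to SQL text; the export step and the block names are hypothetical. The match is case-sensitive on purpose, mirroring the exact-name lookup described above.

```python
import re


def find_dbn_references(datablocks: dict[str, str], old_name: str) -> dict[str, list[str]]:
    """Return, per DataBlock, the :dbn_ references that point at old_name."""
    pattern = re.compile(r":dbn_" + re.escape(old_name) + r"\.\w+")
    hits = {}
    for name, sql in datablocks.items():
        found = pattern.findall(sql)
        if found:
            hits[name] = found
    return hits


exported = {
    "payroll": "WHERE p.phrhist_year = :dbn_fiscal_calendar.year_value",
    "budget":  "WHERE b.year = :dbn_Fiscal_Calendar.year_value",
    "main":    "SELECT 1 FROM dual",
}
print(find_dbn_references(exported, "fiscal_calendar"))
```

In this example only the payroll block is reported; the budget block's differently cased reference already points at a different name and would need its own check.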

The one-sentence takeaway

Argos assembles queries by string substitution, not bind variables. Every parameter value is spliced into the SQL text before Oracle sees it. The template is a mail merge. The SQL Oracle receives changes every run. Plan cache, quoting, optional filters, and multi-value expansion all follow from this one fact.

Track D · The craft of Argos

Seven Patterns Every Argos Report Needs

9 min read · argos · parameters · where-clause · patterns · recipes · optional-filters · multi-value

The hook

The everyday analogy

A chef who has been running a kitchen for thirty years does not invent every recipe from scratch every Monday morning. In the back room there is a recipe binder — stained, tabbed, well-worn — containing about a hundred recipes organized into seven or eight base categories: stocks, sauces, braises, sautés, sears, roasts, soups. Every dish on every episode of every season is some variation on one of those base patterns. The chef knows the base recipe by heart, knows where to add salt, knows what to swap for what — but they did not reinvent "how to brown an onion" this morning. They opened the binder.

The Argos report writer is the same. The number of fundamentally different ways a parameter can appear in a WHERE clause is small — about seven. A required filter. An optional single-value filter. An optional multi-value checkbox filter. A date range. A partial-text search. A toggle. A cascading parent-child dependency. Every Argos report you will ever write uses one or more of these seven patterns. The variations are in what you filter on — a term code, an employee class, a student name, a fund. The recipes are stable.

A chef's stained recipe binder open on a wooden prep table, tabs visible for stocks, sauces, braises, sautés; one tab pulled forward labeled 'Argos WHERE patterns — 7 base recipes' in the same hand-lettered style.

Like recipes, each pattern has a canonical safe form and several wrong forms that look right but break in production. The wrong forms typically fail when the user leaves the filter empty (the cooking equivalent of "what if the diner skips the appetizer?"). The canonical form anticipates the empty case and produces clean SQL either way. The wrong form for a multi-value filter — WHERE x IN (:multi) OR :multi IS NULL — looks like the optional single-value recipe with IN swapped for =. It is not. The substituted ('A', 'B') IS NULL is always FALSE, and the empty case IN () is a syntax error. The pattern matters. The recipe matters.

By the end of this article you should have all seven recipes in your hands. Print this page. Tape it next to your monitor. When the next report needs a filter, pick the matching recipe and copy it in. That is what professional Argos report writing looks like.

What it really is

The seven patterns all share one principle: the template must handle the empty case before substitution produces malformed SQL. Read How Argos Assembles Your Query — Filters on the WHERE first, because these patterns are written for string substitution, not bind variables. The : tokens are replaced with literal values before Oracle sees the query. Every pattern below is engineered so that when the user leaves a widget empty, the resulting SQL is still syntactically valid and semantically correct.

The seven base patterns, in catalog order:

REQUIRED single-value — the user must pick something. No empty-case handling needed; the widget validation enforces it.
OPTIONAL single-value — the user may leave it empty. The NVL(:param, column) trick makes the predicate a no-op when empty.
OPTIONAL multi-value (checkbox group) — zero or more checked. Needs a conditional block to omit the predicate entirely when zero; IN () is a syntax error.
DATE RANGE — start + end, typically both optional. Sentinel dates (1900-01-01 / 9999-12-31) bound the open ends.
PARTIAL-TEXT SEARCH — LIKE '%' || :param || '%'. Naturally tolerant of empty input because '%%' matches everything.
TOGGLE filter — a radio or dropdown that switches between categories (Active/Inactive/All). Uses explicit OR branches with a sentinel value.
CASCADING parent-child — a child dropdown whose options depend on the parent's selection. The child's options query references the parent parameter.

You do not need all seven in every report. Most reports use two or three. But every report you will ever write fits into this catalog.

See it — the diagram

The recipe card is the unit of reuse.

A single recipe card from the binder: pattern name at top ('OPTIONAL MULTI-VALUE'), the safe SQL form in the middle, a one-line 'when to use' header, and a small amber warning at the bottom flagging the naive wrong form.

Each pattern gets one card. Pattern name across the top. A one-line "when to use this" header — the decision rule that tells you this is your pattern. The safe SQL form in the center, copy-paste ready. And at the bottom, in amber, the naive wrong form — the version that looks right and breaks when the user leaves the filter empty. The card is designed to be printed. Tape each one to your monitor as you need it. The binder has all seven; your current report needs two.

Show me the code

Pattern 1 — REQUIRED single-value. User must pick; no empty case.

-- Pattern 1: REQUIRED single-value. The widget has no "All" option;
-- form validation prevents execution with an empty selection.
WHERE r.sfrstcr_term_code = :main_DD_term_code

Simplest possible filter. No sentinel. No NULL handling. The dropdown forces a choice.

---

Pattern 2 — OPTIONAL single-value. The NVL trick makes the predicate a no-op when empty.

-- Pattern 2: OPTIONAL single-value. When the user picks nothing,
-- NVL returns the column's own value and the equality is always true.
WHERE NVL(pe.pebempl_ecls_code, 'X') = NVL(:main_DD_ecls, NVL(pe.pebempl_ecls_code, 'X'))

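Naive wrong form: WHERE pe.pebempl_ecls_code = NVL(:main_DD_ecls, pe.pebempl_ecls_code). This fails when the column itself is NULL: NULL = NULL is not TRUE, so the filter silently drops NULL rows instead of keeping them. The double-NVL form above handles NULLs on both sides. The 'X' sentinel must be a value that is never a real employee class; if 'X' were valid, a user selecting it would also get every NULL row.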
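The difference between the two forms is easy to check on three inline rows. This is a sketch, not Banner data: the pe rowset, its ecls values, and the '~~' sentinel are made up for illustration, and the empty string stands in for an empty widget.

```sql
WITH pe AS (
  SELECT 1 AS pidm, 'F1' AS ecls FROM dual UNION ALL
  SELECT 2,         'P1'         FROM dual UNION ALL
  SELECT 3, CAST(NULL AS VARCHAR2(2)) FROM dual
)
-- Single NVL: the row with a NULL ecls compares NULL = NULL and is dropped.
SELECT 'single NVL' AS variant, COUNT(*) AS rows_kept
FROM   pe
WHERE  ecls = NVL('', ecls)
UNION ALL
-- Double NVL: both sides fall back to the sentinel, so every row survives.
SELECT 'double NVL', COUNT(*)
FROM   pe
WHERE  NVL(ecls, '~~') = NVL('', NVL(ecls, '~~'));
-- single NVL -> 2, double NVL -> 3
```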

---

Pattern 3 — OPTIONAL multi-value (multi-checkbox). Use a conditional block to omit the predicate when zero boxes are checked.

-- Pattern 3: OPTIONAL multi-value. The conditional omits the entire
-- predicate when the user checks no boxes. Without this, IN () is a
-- syntax error and the naive OR :param IS NULL never fires for multi-value.
{{!IF :main_MC_ecls != ''}}
  AND pe.pebempl_ecls_code IN (:main_MC_ecls)
{{!ENDIF}}

Naive wrong form: WHERE pe.pebempl_ecls_code IN (:main_MC_ecls) OR :main_MC_ecls IS NULL. The substituted ('F', 'P') IS NULL is always FALSE, and IN () with zero values is a syntax error. The conditional block is the only safe form for multi-value.
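To see why the empty case has to be handled in the template and not in SQL, it can help to run the substituted text by hand. The sketch below assumes substitution behaves as described above and uses DUAL so it runs in any Oracle session; the literal values are illustrative.

```sql
-- Two boxes checked: the predicate Oracle receives is ordinary SQL.
SELECT 'kept' AS result
FROM   dual
WHERE  'F' IN ('F', 'P');

-- Zero boxes checked and no conditional block: the list is empty,
-- and the statement fails to parse before any row is read.
SELECT 'kept' AS result
FROM   dual
WHERE  'F' IN ();
```

No expression written inside the query can rescue the second statement, because Oracle rejects it at parse time. The predicate has to be removed before the text is sent.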

---

Pattern 4 — DATE RANGE. Sentinel dates bound the open ends. Both sides optional.

-- Pattern 4: DATE RANGE. Each side independently optional. The
-- sentinel dates are far enough outside real data to be safe.
WHERE st.stvterm_start_date BETWEEN
        NVL(:main_DA_from, DATE '1900-01-01')
    AND NVL(:main_DA_to,   DATE '9999-12-31')

Naive wrong form: constructing the BETWEEN with concatenated SQL strings or leaving one side NULL (BETWEEN with a NULL operand returns zero rows, not all rows). The sentinels guarantee the range is always bounded.
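The NULL-operand behavior can be confirmed on two inline dates. This is a minimal sketch with made-up start dates; CAST(NULL AS DATE) stands in for a date widget the user left empty.

```sql
WITH terms AS (
  SELECT DATE '2025-08-25' AS start_date FROM dual UNION ALL
  SELECT DATE '2026-01-12'               FROM dual
)
-- Open upper bound left as NULL: the comparison is UNKNOWN for every row.
SELECT 'NULL upper bound' AS variant, COUNT(*) AS rows_kept
FROM   terms
WHERE  start_date BETWEEN DATE '2025-01-01' AND CAST(NULL AS DATE)
UNION ALL
-- Open upper bound replaced by the sentinel: both rows qualify.
SELECT 'sentinel upper bound', COUNT(*)
FROM   terms
WHERE  start_date BETWEEN DATE '2025-01-01'
                      AND NVL(CAST(NULL AS DATE), DATE '9999-12-31');
-- NULL upper bound -> 0, sentinel upper bound -> 2
```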

---

Pattern 5 — PARTIAL-TEXT SEARCH. Wildcards in the template; empty input naturally matches everything.

-- Pattern 5: PARTIAL-TEXT search. UPPER on both sides makes the match
-- case-insensitive. Empty input produces '%%', which matches every
-- row whose last name is not NULL.
WHERE UPPER(s.spriden_last_name) LIKE '%' || UPPER(:main_EB_lastname) || '%'

No special-case handling needed. '%%' is a valid LIKE pattern that matches every non-NULL string. For large tables, pair this with a required filter (term, year) so the LIKE scans a constrained set.
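A small sketch of the empty-input case, using three made-up names, one of them NULL. The empty string stands in for an edit box the user left blank.

```sql
WITH people AS (
  SELECT 'Smith'  AS last_name FROM dual UNION ALL
  SELECT 'Cortez'              FROM dual UNION ALL
  SELECT CAST(NULL AS VARCHAR2(60)) FROM dual
)
-- '%' || UPPER('') || '%' collapses to '%%' because Oracle treats
-- the empty string as NULL and ignores it in concatenation.
SELECT COUNT(*) AS rows_kept
FROM   people
WHERE  UPPER(last_name) LIKE '%' || UPPER('') || '%';
-- rows_kept -> 2 (the NULL name is not matched)
```

If rows with a NULL in the searched column must appear when the box is empty, that needs an explicit OR branch; the LIKE alone will not return them.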

---

Pattern 6 — TOGGLE filter. Explicit OR branches with a sentinel "ALL" value.

-- Pattern 6: TOGGLE filter. A radio button or dropdown with options
-- "All", "Active", "Inactive". The 'ALL' sentinel is the no-op.
WHERE (:main_RB_active = 'ALL'
    OR (:main_RB_active = 'ACTIVE'   AND pe.pebempl_empl_status = 'A')
    OR (:main_RB_active = 'INACTIVE' AND pe.pebempl_empl_status = 'T'))

Naive wrong form: using a CASE expression or dynamic SQL to swap the predicate. The explicit OR branches are readable, predictable, and survive substitution without surprises. Add a new status by adding an OR branch — the pattern extends cleanly.

---

Pattern 7 — CASCADING parent-child. The child's options query references the parent parameter.

-- Pattern 7: CASCADING dropdowns. The child dropdown's options query
-- (defined in the Argos designer) is filtered by the parent's selection.

-- Child DataBlock options query (in Argos designer, not in report SQL):
SELECT ssbsect_crn || ' - ' || ssbsect_subj_code
       || ' ' || ssbsect_crse_numb AS label,
       ssbsect_crn                  AS value
FROM   ssbsect
WHERE  ssbsect_term_code = :main_DD_term_code   -- parent's selection
ORDER BY ssbsect_subj_code, ssbsect_crse_numb;

-- Child DataBlock's main SQL uses the child's selected value:
SELECT *
FROM   sfrstcr
WHERE  sfrstcr_term_code = :main_DD_term_code
  AND  sfrstcr_crn       = :main_DD_crn;         -- child's selection

Naive wrong form: hardcoding the child's options list without filtering by the parent. The dropdown shows every CRN from every term — thousands of rows — and the user picks a CRN that doesn't belong to the selected term. The report returns zero rows and nobody knows why. The parent filter on the child's options query prevents the mismatch.

Where intuition fails

Five lessons that catch report writers using these patterns:

**Don't OR :param IS NULL for multi-value.** The naive "make-it-optional" recipe is WHERE x = :p OR :p IS NULL. For single-value parameters this works. For multi-value, the substituted ('A', 'B') IS NULL is always FALSE — the OR branch never fires. The predicate becomes mandatory when you intended optional. Use Pattern 3 (the conditional block) instead.

NVL sentinels must outlast real data. Pattern 4 uses DATE '1900-01-01' and DATE '9999-12-31' as sentinels. If your real data contains those dates — legacy imports sometimes use 1900-01-01 as a "missing date" marker — the sentinel collides with real rows. Pick sentinels outside any plausible real range and document the choice in the DataBlock description.

LIKE with a leading wildcard cannot use an index. Pattern 5's LIKE '%...%' forces a full table scan — Oracle cannot seek into an index when the pattern starts with %. For small tables like SPRIDEN this is fine. For large tables like PHRHIST or GURFEED, constrain the query with a required filter (term, year range, status) before the LIKE so the scan lands on a manageable subset.

Toggle filters need the "ALL" option in the widget. Pattern 6 depends on the radio or dropdown including a literal 'ALL' option. If the widget's options query doesn't return it, the pattern cannot no-op. Add 'ALL' as the first row in the widget's static list or options query, and make it the default selection.

Cascading dropdowns share one page refresh. Pattern 7 relies on the child DataBlock re-running its options query when the parent changes. In some Argos versions this refresh is automatic; in others you must configure it in the designer (a "cascading parameter" checkbox or dependency setting). If the child's options don't refresh, the user picks from a stale list and the report silently returns no rows. Test by changing the parent dropdown twice and verifying the child's list updates both times.

The one-sentence takeaway

Seven canonical WHERE-clause patterns cover every Argos report you will ever write: required, optional single-value, optional multi-value, date range, LIKE search, toggle, and cascading. Each has a safe form that handles the empty-selection edge case correctly. Copy the recipe. Ship the report. Move on.

Track D · The craft of Argos

Shared DataBlocks — One SQL, Many Reports

6 min read · argos · datablock · union-all · discriminator · shared-sql · faid1084

The hook

The everyday analogy

Walk into the multipurpose room on a Monday morning at 8 AM and it is set up as a chapel — folding chairs in rows, a podium at the front. By 11:30 the chairs are stacked along the walls and long cafeteria tables roll out from the closet — it is the lunch room for 200 students. At 3 PM the tables are folded away and the floor markings become visible — it is the basketball court for after-school practice. At 7 PM the chairs come back out facing the other direction — it is the theater for the spring play.

Same room. Same dimensions. Same lighting rig. Same HVAC. Four uses, on one shared infrastructure, scheduled by a single calendar that says which configuration runs at which time. The calendar is the discriminator: "It is 8 AM Monday — therefore chapel mode."

A school multipurpose room photographed from four angles in one collage: chapel mode (chairs in rows), lunch mode (cafeteria tables out), gym mode (floor markings visible), theater mode (chairs facing the stage). Same room, four functions.

A shared Argos DataBlock works exactly the same way. The underlying SQL is the room. The various report layouts that consume it — the summary view, the detail view, the exception-only view — are the chapel, the lunch room, the basketball court, the theater. The discriminator column is the schedule that says "this row is for the summary layout, this row is for the detail layout, this row is for both." Each report layout filters by the discriminator and renders only its slice. One physical room, many functions, no duplicated furniture.

What it really is

The shared-DataBlock pattern has one SQL body that UNION ALLs multiple SELECT statements together, each producing rows in a consistent column shape, with a discriminator column that tags each row by its intended layout. The downstream report filters on that discriminator to get its slice.

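**Why UNION ALL, not UNION.** UNION removes duplicates, which costs a sort or hash pass over the whole combined result and is expensive on large datasets. UNION ALL simply concatenates the result sets. Because the discriminator column already separates the branches, there are no cross-branch duplicates to remove, and any duplicates inside a branch are real rows the report needs. UNION ALL is both the correct and the faster choice.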

Column shape must match. Every branch of the UNION ALL must produce the same columns in the same order with compatible types. If one branch is missing a column, use NULL AS missing_col to keep the shape. If types differ, cast explicitly — Oracle's implicit conversion will not save you.
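A minimal sketch of keeping the shape aligned, run against DUAL with made-up values. The column names and the VARCHAR2(10) and NUMBER(12,2) sizes are assumptions for illustration, not Banner definitions.

```sql
SELECT 'SUMMARY'                    AS layout,
       CAST(1500 AS NUMBER(12,2))   AS amount,
       CAST(NULL AS VARCHAR2(10))   AS award_status   -- typed filler
FROM   dual
UNION ALL
SELECT 'DETAIL',
       CAST(750 AS NUMBER(12,2)),
       CAST('ACPT' AS VARCHAR2(10))
FROM   dual;
```

Typing the NULL filler in the first branch documents what the column is meant to hold, so a later branch that puts something else in that position stands out in review.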

The discriminator column is a constant per branch — usually a short text code: 'SUMMARY', 'DETAIL', 'EXCEPTION'. Each branch hardcodes its label. No data-dependent logic. The consumer report filters by it in its own WHERE clause: WHERE layout = 'SUMMARY'.

One source of truth. The shared join graph, the shared WHERE filters, the shared business rules all live in the DataBlock SQL. If the underlying filter logic changes — a new fund code, a different effective date — you change ONE DataBlock, and all the consuming reports update. Each downstream layout is a presentation choice over the same dataset.

The SQL shape: three SELECT branches stacked vertically, each ending in 'AS layout' with a coral-highlighted literal ('SUMMARY', 'DETAIL', 'EXCEPTION'); the three branches joined by UNION ALL; below, three consumer reports each filtering by their layout value.

See it — the diagram

The UNION ALL shape is a stack of SELECT blocks, each ending in the same discriminator column with a different literal value. The blocks are visually identical in structure but differ in aggregation level, included columns, and the constant that goes into layout. Below the stack, the consumer reports each connect to exactly one discriminator value — one block per layout. The diagram makes the contract visible: the DataBlock produces tagged rows; the consumers filter by tag. Add a new layout by adding a new branch and a new consumer that filters on the new tag.

Show me the code

A financial-aid DataBlock that serves a summary view and a detail view from one SQL body:

-- Shared DataBlock: one SQL feeding multiple FA report layouts.
-- The discriminator column 'layout' tags each row's intended
-- consumer. Downstream reports filter on the layout value.

-- Branch 1: SUMMARY rows (one per student, aid year, and fund).
SELECT 'SUMMARY'                  AS layout,
       r.rprawrd_pidm             AS pidm,
       r.rprawrd_aidy_code        AS aid_year,
       r.rprawrd_fund_code        AS fund_code,
       SUM(r.rprawrd_offer_amt)   AS amount,
       NULL                       AS award_status,
       NULL                       AS detail_seq
FROM   rprawrd r
WHERE  r.rprawrd_aidy_code = :main_DD_aid_year
GROUP BY r.rprawrd_pidm, r.rprawrd_aidy_code, r.rprawrd_fund_code

UNION ALL

-- Branch 2: DETAIL rows (one per individual award action).
SELECT 'DETAIL'                   AS layout,
       r.rprawrd_pidm             AS pidm,
       r.rprawrd_aidy_code        AS aid_year,
       r.rprawrd_fund_code        AS fund_code,
       r.rprawrd_offer_amt        AS amount,
       r.rprawrd_award_status     AS award_status,
       r.rprawrd_seq_no           AS detail_seq
FROM   rprawrd r
WHERE  r.rprawrd_aidy_code = :main_DD_aid_year;

The summary branch aggregates; the detail branch shows individual rows. Both share the same aid_year filter, the same table, the same column shape. The layout column is the discriminator. Notice NULL fillers in the summary branch for columns that only the detail branch populates — that is what keeps the column shape consistent.

The consumer reports filter by it:

-- Summary report's layout WHERE clause:
WHERE layout = 'SUMMARY'

-- Detail report's layout WHERE clause:
WHERE layout = 'DETAIL'

The Waubonsee FAID1084 / FAID1006 family uses exactly this shape — one DataBlock, multiple report consumers.

Where intuition fails

Column shape mismatch is usually loud, but not always. If branch 1 has 7 columns and branch 2 has 6, Oracle errors at parse time: loud failure, easy fix. The same holds when a position is NUMBER in one branch and VARCHAR2 in another (ORA-01790). The silent case is two columns of compatible types in the wrong position, for example two VARCHAR2 codes swapped or two NUMBER columns of different precision: the query parses, runs, and puts the wrong data under the heading. Always declare explicit types via CAST(... AS NUMBER(12,2)) or TO_CHAR(...) and check each position against the first branch.

**UNION instead of UNION ALL is silently slow.** UNION forces a duplicate-elimination pass across the entire combined result set, which on multi-million-row data can add minutes. It can also silently drop legitimate rows: two identical detail rows inside one branch collapse into one. Always use UNION ALL for shared DataBlocks; the discriminator column already keeps the branches apart.
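Speed is not the only difference between the two operators: UNION also removes duplicates that sit inside a single branch. The sketch below uses a made-up awards rowset containing two identical rows to show the change in row count.

```sql
WITH awards AS (
  SELECT 1 AS pidm, 500 AS amount FROM dual UNION ALL
  SELECT 1,         500           FROM dual
)
SELECT 'UNION' AS operator_used, COUNT(*) AS row_count
FROM  (SELECT 'DETAIL' AS layout, pidm, amount FROM awards
       UNION
       SELECT 'SUMMARY', pidm, SUM(amount) FROM awards GROUP BY pidm)
UNION ALL
SELECT 'UNION ALL', COUNT(*)
FROM  (SELECT 'DETAIL' AS layout, pidm, amount FROM awards
       UNION ALL
       SELECT 'SUMMARY', pidm, SUM(amount) FROM awards GROUP BY pidm);
-- UNION -> 2 (the two identical detail rows collapsed), UNION ALL -> 3
```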

The discriminator must be a literal, not a column. Putting r.some_column AS layout produces rows where the layout value depends on data, and the consumer report's filter breaks unpredictably. The discriminator is 'SUMMARY', 'DETAIL', etc. — a hardcoded string, one per branch.

Each branch is its own query against the source. A 5-branch shared DataBlock executes 5 SELECTs against the same source table and concatenates. If the source is large, this is 5x the I/O. Add indexes on the shared filter columns, and consider materializing the source into a temp table that all branches read from if the DataBlock is part of a scheduled report chain.

Consumer reports must stay in sync with the DataBlock. Add a new branch, and its rows leak into any consumer that does not filter on layout or that filters by exclusion (layout <> 'DETAIL'). Remove a branch without updating the corresponding consumer, and that report comes back empty with no error. Treat the discriminator vocabulary as a contract: document every consumer and its layout filter in a leading SQL comment.

The one-sentence takeaway

A shared DataBlock uses UNION ALL plus a discriminator column to serve multiple report layouts from a single SQL body. One join graph, one set of business rules, one place to change when the logic updates. Each consumer report filters by the discriminator to get its slice.

Track E · Where intuition fails

The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN

A report told to list every student lists only some — and the LEFT JOIN that was supposed to keep them is spelled out, correct, and innocent.

6 min read · joins · left-join · where-clause · null · argos · three-valued-logic

The hook

A registration report has one job: list every student in the program, with their course status. You run it. It lists most of them. The students missing are exactly the ones who have not registered yet — and the JOIN that was supposed to keep them is a LEFT JOIN, spelled out, correct. The bug is real, the JOIN is innocent, and the culprit is hiding one line below.

The everyday analogy

Picture a private event with a guest list. Being on the list guarantees you get in — the doorman never turns away a name on the list. That guarantee is the whole point of the list.

Some guests bring a plus-one. Some come alone.

Now picture a second checkpoint, deeper inside the venue. The person there does not check the guest list at all — they inspect each plus-one's badge, and wave a pair through only if the badge reads "RE".

A guest who came alone has no plus-one, so there is no badge to inspect. The checker cannot confirm the badge says "RE" — but cannot confirm it does not, either. It is simply unverifiable. And the rule at that checkpoint is "only confirmed-RE may pass." So the solo guests — every one of them — are quietly turned back.

The list guarantees entry — but a second checkpoint quietly turns away whoever came alone.

The host promised the whole list gets in. The second checkpoint, without anyone noticing, un-invited everyone who arrived alone. That second checkpoint is a SQL WHERE clause — and this article is about how it un-invites your rows.

What it really is

Start with the promise. A LEFT JOIN guarantees that every row of the left table appears in the result — the guest list. If a left row finds a match in the right table, the right-hand columns fill with that match. If it finds no match, the row still appears; the right-hand columns fill with NULL. That NULL-padding is the definition of a LEFT JOIN. An INNER JOIN would do the opposite — no match, no row.

In our report the left table is the students; the right table is SFRSTCR, course registration. A student who has not registered simply has no SFRSTCR row. The LEFT JOIN keeps that student anyway, with the registration columns NULL. That is why the report was written with a LEFT JOIN in the first place — to include the not-yet-registered.

The LEFT JOIN keeps every student; the one with no registration is kept too, padded with NULL.

So the LEFT JOIN is doing its job perfectly. Hold on to one word: NULL. The LEFT JOIN does not fail — it produces NULLs. What happens to those NULLs one clause later is the whole story.

See it — the diagram

SQL does not run top to bottom the way you read it. The engine first assembles the entire FROM + LEFT JOIN result — and at that instant the promise is kept: every student is present, the unregistered ones carrying their NULL. Only then does the WHERE run, walking that finished set row by row.

So what does WHERE sfrstcr_rsts_code = 'RE' decide for a row whose sfrstcr_rsts_code is NULL? Not TRUE, not FALSE — SQL has a third answer: UNKNOWN. Any comparison against NULL yields UNKNOWN. NULL = 'RE' is not false, it is unknowable: NULL means "unknown value," so "is the unknown equal to 'RE'?" can only be answered "cannot tell."

And here is the kill. A WHERE keeps a row **only when its condition is TRUE**. FALSE rows are dropped. UNKNOWN rows are dropped too — the WHERE does not distinguish "definitely not" from "cannot tell." It cuts both.

The WHERE keeps only TRUE rows — and drops UNKNOWN exactly as it drops FALSE.

The LEFT JOIN kept the unregistered students by padding them with NULL. The WHERE, running afterward over the finished set, evaluated NULL = 'RE' → UNKNOWN → and dropped them. That, exactly, is a LEFT JOIN silently turned into an INNER JOIN — without anyone ever typing the word INNER.
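The third truth value can be made visible with a CASE expression. This is a minimal sketch against DUAL; CAST(NULL AS VARCHAR2(2)) stands in for the NULL-padded status column of an unregistered student.

```sql
SELECT CASE
         WHEN CAST(NULL AS VARCHAR2(2)) = 'RE'       THEN 'TRUE'
         WHEN NOT (CAST(NULL AS VARCHAR2(2)) = 'RE') THEN 'FALSE'
         ELSE 'UNKNOWN'
       END AS verdict
FROM   dual;
-- verdict -> UNKNOWN
```

Neither the comparison nor its negation is TRUE, so the CASE falls through to the ELSE. A WHERE clause treats that same result as a reason to drop the row.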

Show me the code

Here is the report's query, in real Banner objects:

SELECT s.spriden_id        AS student_id,
       s.spriden_last_name AS last_name,
       r.sfrstcr_crn       AS crn,
       r.sfrstcr_rsts_code AS status
FROM   spriden s
LEFT JOIN sfrstcr r
       ON  r.sfrstcr_pidm = s.spriden_pidm
WHERE  s.spriden_change_ind IS NULL          -- current name rows only
  AND  r.sfrstcr_rsts_code  = 'RE';          -- <-- the trap

Read it line by line and nothing looks wrong. But that last line filters r.sfrstcr_rsts_code — a column from r, the right table. The moment a right-table column appears in the WHERE under an equality, the LEFT JOIN is amputated. (The spriden_change_ind IS NULL line is a different, correct filter — it touches the left table; see SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap.)

The fix is one line moved, not added. Take the right-table condition out of the WHERE and put it in the ON:

FROM   spriden s
LEFT JOIN sfrstcr r
       ON  r.sfrstcr_pidm      = s.spriden_pidm
      AND  r.sfrstcr_rsts_code = 'RE'         -- now part of the match rule
WHERE  s.spriden_change_ind IS NULL;

The ON runs during the join — it is the matching rule. A registration row counts as a match only if its status is 'RE'; a student with no 'RE' row still survives, NULL-padded. Every student is back.

One nuance worth its own sentence: moving the condition to ON does not just "fix" the query — it changes its meaning. The WHERE version answered "which students are registered?"; the ON version answers "all students, with their 'RE' registration if they have one." Neither is wrong — they answer different questions. The real bug was never the WHERE; it was the mismatch between writing LEFT JOIN (intent: keep everyone) and then a WHERE that filters the right table. If you genuinely want only the matched rows, write INNER JOIN — so the code says what it means.
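The two placements can be compared side by side on inline data. This is a sketch with made-up rowsets (three students, two registrations), not real SPRIDEN or SFRSTCR rows.

```sql
WITH students AS (
  SELECT 1 AS pidm, 'Smith'  AS last_name FROM dual UNION ALL
  SELECT 2,         'Cortez'              FROM dual UNION ALL
  SELECT 3,         'Nguyen'              FROM dual
),
regs AS (
  SELECT 1 AS pidm, '10234' AS crn, 'RE' AS rsts FROM dual UNION ALL
  SELECT 2,         '10234',        'DD'         FROM dual
)
-- Condition in WHERE: only the student with an 'RE' row survives.
SELECT 'filter in WHERE' AS variant, COUNT(*) AS students_listed
FROM   students s
LEFT JOIN regs r ON r.pidm = s.pidm
WHERE  r.rsts = 'RE'
UNION ALL
-- Condition in ON: every student survives, NULL-padded where needed.
SELECT 'filter in ON', COUNT(*)
FROM   students s
LEFT JOIN regs r ON r.pidm = s.pidm AND r.rsts = 'RE';
-- filter in WHERE -> 1, filter in ON -> 3
```

Student 2 is the instructive row: a registration exists but its status is 'DD', so the ON version keeps the student with NULL registration columns while the WHERE version drops the student entirely.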

Where intuition fails

Everything above assumed you can see the whole query. In Evisions Argos on Banner, you often cannot — and that is what makes this bug especially cruel here.

A DataBlock holds the SQL you wrote. But Argos also lets report designers add filters and parameters from the interface, and Argos **appends those conditions to the query's WHERE automatically**, behind the scenes — your DataBlock SQL never changes. The query Oracle actually runs is your SQL plus whatever Argos pinned on top.

The query Oracle runs is your DataBlock SQL plus the filters Argos appends to the WHERE.

So the trap can be armed by **someone else, somewhere else, long after you wrote the JOIN**. You ship a clean LEFT JOIN. Months later a colleague adds a UI filter on registration status — and Argos quietly appends AND sfrstcr_rsts_code = 'RE' to your WHERE. When the report comes back wrong and you open the DataBlock to debug, you are staring at your original SQL, which is correct. The dangerous line is not even in the file in front of you. (How Argos assembles the final query is its own article — see How Argos Assembles Your Query — Filters on the WHERE.)

The fingerprint of this bug is easy to spot once you know it: **a LEFT JOIN (or RIGHT JOIN) living with a WHERE that references a column of the optional, NULL-padded table** (the right table of a LEFT JOIN, the left table of a RIGHT JOIN). When you see that pair in a review, stop and ask one question: was filtering those rows intentional? If yes, it should be an INNER JOIN; write it explicitly. If no, move the condition to the ON. The one legitimate exception is an IS NULL test on the optional table's join key, which deliberately selects the rows with no match. In Argos, remember to check the appended filters too, not just the DataBlock SQL.

The one-sentence takeaway

A LEFT JOIN promises to keep the row; a WHERE that filters the right table breaks that promise in silence.

Track E · Where intuition fails

SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap

7 min read · banner · spriden · change-ind · entity-ind · duplicates · gotcha

The hook

You join to SPRIDEN, run the query, and scan the output. The names look right. The grades match. The CRNs are correct. But the row count at the bottom of the page says 258, and you know there are 250 students in this term. You have just shipped a report with phantom duplicates — and the error is invisible because every column looks correct except the number at the bottom of the page. The bug is not in your logic. It is in one missing filter on a column most Banner SQL writers do not know exists: SPRIDEN_CHANGE_IND.

The everyday analogy

A regular teacher walks into first-period English and reads the class roster: "Maria Smith." A hand goes up. The teacher marks Maria present and moves on. There are 30 students; 30 names get called; 30 hands go up. The headcount works.

Now picture a substitute taking over first period. The substitute has been handed the roster — but nobody told her that Maria changed her name to Cortez when she got married last summer, and to Cortez-Robinson last month when the legal paperwork went through. The substitute's roster is the raw unfiltered list of every name every student has ever had.

A substitute teacher's attendance sheet showing the same student listed three times — Maria Smith, Maria Cortez, Maria Cortez-Robinson — each with its own checkmark. Three marks, one student, a headcount silently wrong by two.

The substitute reads: "Maria Smith." Maria raises her hand. Checkmark. The substitute reads the next name: "Maria Cortez." Maria raises her hand again, a little embarrassed. The substitute, not catching on, marks "Maria Cortez" present too — a second checkmark, a second participation grade, a second phantom seat in the room. The substitute reads: "Maria Cortez-Robinson." Maria raises her hand a third time. Now there are three "Marias" on the attendance sheet, three participation grades being tracked separately, and the class headcount says 32 instead of 30.

No name was misspelled. No student lied. The roster simply had three rows for one person, and nobody filtered to the current row. That is what happens when you join to SPRIDEN without SPRIDEN_CHANGE_IND IS NULL. The roster is SPRIDEN. The substitute's missing context is the filter. The phantom Marias are the duplicate rows in your report. The bug is silent because no error fires — the count just comes out wrong, and the extra rows look identical to the real ones except for the name column.

What it really is

SPRIDEN stores one row per name version per entity — not one row per person. A student who has never changed names has exactly one row. A student who got married has two rows: the maiden-name row and the married-name row. A student who married and then divorced and reverted has three. The historical rows are never deleted — Banner versions names the same way it versions curricula in SGBSTDN (see Effective Dating — Why Banner Never Forgets), by adding a new row on top of the old one.

The current row — the one that represents the person's name right now — is identified by SPRIDEN_CHANGE_IND IS NULL. Every historical row carries a value in this column instead:

**'N'**: a name change. The person's last name, first name, or middle name changed. The prior row was retired and this row was inserted.

**'I'**: an identification change. The person's visible Banner ID (SPRIDEN_ID) was corrected, re-issued after a merge, or changed during a legacy-system conversion.

The column is VARCHAR2(1). IS NULL is the ONLY correct test for the current row. = '' does not work: Oracle treats empty strings as NULL in VARCHAR2 columns, and NULL is never equal to anything, not even another NULL.

A second filter is equally load-bearing: SPRIDEN_ENTITY_IND. The PIDM number space is shared across people ('P') and corporations/companies ('C'). Vendor records in FTVVEND use 'C' for businesses that sell to the college. If your query joins SPRIDEN for people without AND spriden_entity_ind = 'P', a corporation named "Acme Office Supplies, Inc." can silently appear in your student roster because its PIDM happens to be referenced in a join chain.

One PIDM, three SPRIDEN rows over time: two historical rows with CHANGE_IND = 'N' (name changes recorded), one current row with CHANGE_IND IS NULL. The filter WHERE CHANGE_IND IS NULL selects exactly the top row — the current name.

The fix is two lines added to every SPRIDEN reference, every time: SPRIDEN_CHANGE_IND IS NULL and SPRIDEN_ENTITY_IND = 'P'. They are not optional. They are not "add them when the query looks wrong." They are part of the join contract. Treat them the way you treat the PIDM equality itself — non-negotiable.

See it — the diagram

The stack diagram shows one PIDM, three SPRIDEN rows, spanning a name-change timeline. The bottom row is the original name, CHANGE_IND = 'N'. The middle row is the first name change, CHANGE_IND = 'N'. The top row is the current name, CHANGE_IND IS NULL. The filter WHERE spriden_change_ind IS NULL selects exactly the top row — one row per PIDM. Without it, all three rows pass through, and every downstream aggregate multiplies by three.

Show me the code

The bug — no filter. A course roster that silently duplicates anyone who has ever changed names:

-- WRONG: missing CHANGE_IND. Every student who has ever changed
-- their name returns multiple times — once per historical version.
SELECT s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name,
       r.sfrstcr_crn
FROM   sfrstcr r
JOIN   spriden s ON s.spriden_pidm = r.sfrstcr_pidm
WHERE  r.sfrstcr_term_code = '202610';

If the term has 250 registrations and 8 of them belong to students with one prior name, you get 258 rows back instead of 250, and every additional name version adds more. The headcount in the next pivot is wrong. The credit-hour totals are inflated. Nothing errors. The bug ships.

The fix — two filters inside the ON clause:

-- RIGHT: the two SPRIDEN filters belong INSIDE the JOIN clause.
-- They are part of the join contract — every SPRIDEN reference
-- needs both, every time, no exceptions.
SELECT s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name,
       r.sfrstcr_crn
FROM   sfrstcr r
JOIN   spriden s
       ON  s.spriden_pidm        = r.sfrstcr_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  r.sfrstcr_term_code = '202610';

Same query. Two extra predicates inside the ON clause. Correct row count. The Banner Semantic Search SQL Explainer flags the missing CHANGE_IND filter as a warning the moment you paste a SPRIDEN query without it. The rule message is direct: "SPRIDEN used without CHANGE_IND IS NULL. Without this filter, results include historical name changes. Add SPRIDEN_CHANGE_IND IS NULL for current name only." That rule exists because this bug is universal.

Where intuition fails

Four gotchas — Track E is compact, the gotcha beat is the whole point:

The duplicates look identical. The duplicate rows differ only in the name columns. The sfrstcr_crn, the sfrstcr_credit_hr, and the term are identical across the three Marias. A reviewer scanning column values sees correct data. Only the row count betrays the bug. If nobody checks the expected row count against the query output, the inflated numbers go straight into the report that goes straight to the VP.
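One way to make the row count visible is to compare the rows the join returns against the number of distinct registrations behind them. This is a minimal sketch, assuming that (PIDM, CRN) identifies one registration within a term; the column aliases are illustrative.

```sql
-- Audit sketch: run against the unfiltered join to measure how many
-- extra rows the historical SPRIDEN versions contribute.
SELECT COUNT(*) AS rows_returned,
       COUNT(DISTINCT r.sfrstcr_pidm || '|' || r.sfrstcr_crn)
                AS distinct_registrations
FROM   sfrstcr r
JOIN   spriden s ON s.spriden_pidm = r.sfrstcr_pidm
WHERE  r.sfrstcr_term_code = '202610';
```

If the two numbers differ, the join is fanning out, and the difference is the number of phantom rows.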

**SPRIDEN_ENTITY_IND matters the moment a query touches non-people.** Most student or employee queries never encounter corporations, but the moment a join reaches FTVVEND or any table that holds a non-person PIDM, the filter spriden_entity_ind = 'P' becomes critical. An office-supply company does not belong in your student roster, and the only thing keeping it out is that one-character filter.
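To see how many non-person entities share the PIDM space on a given installation, a quick census of current SPRIDEN rows by entity indicator can help. This is a sketch that uses only the two columns discussed here.

```sql
-- Count current SPRIDEN rows per entity type ('P' = person,
-- 'C' = corporation/company).
SELECT s.spriden_entity_ind,
       COUNT(*) AS current_rows
FROM   spriden s
WHERE  s.spriden_change_ind IS NULL
GROUP BY s.spriden_entity_ind
ORDER BY s.spriden_entity_ind;
```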

**The filter belongs in the ON clause, not the WHERE.** Putting the SPRIDEN filters in the outer WHERE clause works for an INNER JOIN. But if anyone later changes the join to a LEFT JOIN, spriden_entity_ind = 'P' in the WHERE silently converts it back to an inner join by rejecting the NULL-extended rows, and spriden_change_ind IS NULL can no longer tell a current name from a missing match. The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN covers that trap in full. Filters that belong to the join go inside the ON.
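A minimal sketch of the LEFT JOIN shape with both SPRIDEN filters kept inside the ON clause. It reuses the roster query's tables; the choice to keep registrations whose PIDM has no person row is an assumption made for illustration.

```sql
-- Both SPRIDEN predicates stay in the ON clause, so a registration
-- with no matching person row still comes back, with NULL names.
SELECT r.sfrstcr_pidm,
       r.sfrstcr_crn,
       s.spriden_id,
       s.spriden_last_name
FROM   sfrstcr r
LEFT JOIN spriden s
       ON  s.spriden_pidm       = r.sfrstcr_pidm
       AND s.spriden_change_ind IS NULL
       AND s.spriden_entity_ind = 'P'
WHERE  r.sfrstcr_term_code = '202610';
```

Only the predicate on the anchor table (SFRSTCR) sits in the WHERE. Moving either SPRIDEN predicate down there changes which rows survive.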

**IS NULL, never = ''.** The column is VARCHAR2(1). Oracle treats an empty string as NULL in VARCHAR2 columns, so comparing to '' returns no rows because NULL equals nothing. IS NULL is the only correct test. Every time.
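The empty-string behavior can be checked without touching Banner at all. The two statements below run against Oracle's built-in DUAL table.

```sql
-- '' is stored as NULL, so the equality test is never true.
SELECT COUNT(*) AS matched
FROM   dual
WHERE  '' = '';        -- returns 0

-- The IS NULL test succeeds for the same value.
SELECT COUNT(*) AS matched
FROM   dual
WHERE  '' IS NULL;     -- returns 1
```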

The one-sentence takeaway

Every join to SPRIDEN needs SPRIDEN_CHANGE_IND IS NULL AND SPRIDEN_ENTITY_IND = 'P'. Inside the ON clause. Every time. No exceptions.

Track E · Where intuition fails

PHRHIST Without DISP — In-Progress vs Posted Payroll

You sum PHRHIST_GROSS for the fiscal year and the number looks right. It matches what you remember from the last payroll run. It is wrong. You have included rows from the payroll that is still being calculated — rows that look identical to posted rows in every column except one. The bank calls them 'pending.' Banner calls the column PHRHIST_DISP.

5 min read · banner · phrhist · disposition · payroll · gotcha · ptvpdis

The hook

The everyday analogy

You swipe your debit card at lunch for $12.40 and check your bank app on the walk back to the office. The transaction is already showing in your account — $12.40, today, the restaurant name. It looks like a real charge.

It is and it isn't. The merchant authorized the hold; the bank shows it to you so you know your card worked. But the money has not actually left your account yet. The merchant has not batched the day's charges to the processor; the processor has not settled with your bank; the bank has not posted the line to your ledger. The transaction is in "pending" state — visible, plausible, but not final.

Open the bank app the next morning. The $12.40 is still there but now it is in "posted" state. The hold became a real debit. The ledger updated. The pending list got shorter by one.

If you sum your account balance treating pending and posted transactions the same way, you double-count: the hold and the eventual posting each show as $12.40 even though they represent the same single charge. Most bank apps let you toggle "show pending" precisely because mixing the two confuses people.

A bank app screenshot showing a list of recent transactions, half tagged 'PENDING' in amber, half tagged 'POSTED' in coral; a confused user squinting at the running balance at the top.

PHRHIST is the same. Each row has a PHRHIST_DISP column that encodes whether the row is posted (final, real, in the GL) or some intermediate state (loaded, calculated, approved, but not yet posted). A naive SELECT SUM(phrhist_gross) FROM phrhist WHERE phrhist_year = 2026 sums posted and in-progress rows together — your "year-to-date gross" includes payroll events that have not actually paid anyone. The number looks right. It is wrong by exactly the in-progress total. The fix is one line: AND PHRHIST_DISP = 'P'. The bank's "show posted only" toggle, expressed as SQL.

What it really is

PHRHIST is the payroll history table — one row per (employee, pay event, earnings code). It is the foundation for HR and payroll reporting. Each row progresses through a disposition lifecycle: loaded → calculated → approved → posted. Each step advances the PHRHIST_DISP value. Posted ('P' at most installations) is the only state where the row represents a real, ledgered payment.

**PTVPDIS** is the validation table that defines the disposition codes. One row per code with PTVPDIS_CODE and PTVPDIS_DESC. Query it on your installation to see the local vocabulary. The codes are not universal — some Banner installations use single letters ('L', 'C', 'A', 'P'), others use numeric stages. Always confirm before hardcoding a filter value.

Without a disposition filter, payroll totals inflate during the active payroll window — preliminary calculations for the next pay period are visible alongside posted history. After posting, the total settles back to correct — but during the window (often 2–3 days per pay period), the numbers silently shift. Run the report Monday and get one number. Run it Thursday and get a different number. The data didn't change; the disposition of some rows did.

One employee's PHRHIST rows over a pay period: loaded → calculated → approved → posted. Each row has the same gross amount; only the posted row is the real ledgered payment. The filter WHERE phrhist_disp = 'P' selects exactly the posted row.

The BSS SQL Explainer flags PHRHIST queries that lack a disposition filter — the rule message names the trap and suggests AND PHRHIST_DISP = 'P' as the fix.

See it — the diagram

The stack diagram shows one employee's PHRHIST rows across a single pay period, stacked vertically in lifecycle order. Four rows, same employee, same gross amount, four different disposition codes: Loaded, Calculated, Approved, Posted. Only the bottom row — the posted row — is the real payment. The three rows above it are the payroll run in progress. The filter WHERE phrhist_disp = 'P' selects exactly the bottom row. Without it, all four rows pass through, and every downstream aggregate multiplies the same pay event by four.

Show me the code

The bug — no disposition filter:

-- WRONG: includes preliminary payroll runs that have not posted.
-- During the calc window before posting, this double-counts.
SELECT SUM(p.phrhist_gross) AS ytd_gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026;

The fix — filter to posted only:

-- RIGHT: only posted rows. Verify 'P' matches PTVPDIS at your
-- installation — some installations use different codes.
SELECT SUM(p.phrhist_gross) AS ytd_gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026
  AND  p.phrhist_disp = 'P';

Verify the local vocabulary before shipping:

-- Pull the disposition codes for YOUR installation.
SELECT ptvpdis_code, ptvpdis_desc
FROM   ptvpdis
ORDER BY ptvpdis_code;

Where intuition fails

The disposition codes are installation-specific. 'P' for posted is common but not universal. Always confirm against PTVPDIS before hardcoding the filter value. Document the chosen value in a SQL comment so the next maintainer knows where to look.

The bug is intermittent — it appears during calc windows. Run the unfiltered query right after a payroll posts and the numbers look correct because everything is in 'P'. Run it the day before the next payroll posts and the preliminary rows inflate the total. Reports that look right Monday but wrong Thursday are usually missing this filter.
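A breakdown by disposition shows how much of the unfiltered total is in progress at the moment the query runs. This is a sketch using the same columns as the queries above; the codes it returns depend on the installation.

```sql
-- One line per disposition code: how many rows, how much gross.
-- Anything that is not the posted code is the in-progress total.
SELECT p.phrhist_disp,
       COUNT(*)             AS row_count,
       SUM(p.phrhist_gross) AS gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026
GROUP BY p.phrhist_disp
ORDER BY p.phrhist_disp;
```

Run it during a calc window and again after posting. If the filter is missing from a report, the gap between the two runs is the amount that report was off by.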

**PHRHIST_PICT_CODE and PHRHIST_PAYNO identify the pay period.** Filtering by year is broad; filtering by (phrhist_year, phrhist_pict_code, phrhist_payno) gives a specific pay period. Combine disposition with these for any "what was paid in period X" report.
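A sketch of the single-pay-period shape. The pay ID 'BW' and pay number 12 are hypothetical values chosen for illustration, and 'P' must still be confirmed against PTVPDIS.

```sql
-- "What was paid in period X": year + pay ID + pay number + posted.
SELECT SUM(p.phrhist_gross) AS period_gross
FROM   phrhist p
WHERE  p.phrhist_year      = 2026
  AND  p.phrhist_pict_code = 'BW'   -- hypothetical pay ID
  AND  p.phrhist_payno     = 12     -- hypothetical pay number
  AND  p.phrhist_disp      = 'P';   -- confirm against PTVPDIS
```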

The disposition filter belongs in the WHERE, not the ON. Unlike SPRIDEN_CHANGE_IND from SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap, PHRHIST is usually the FROM anchor of the query — not joined as a lookup — so the disposition predicate is a top-level WHERE condition. If PHRHIST is in a LEFT JOIN tail of a larger query, see The Phantom INNER JOIN — When a WHERE Breaks Your LEFT JOIN for the join-vs-where trap.

The one-sentence takeaway

Every PHRHIST query needs AND PHRHIST_DISP = 'P'. In-progress payroll rows look identical to posted rows but represent amounts not yet paid. Verify the local posted code against PTVPDIS before shipping any payroll report.

Track E · Where intuition fails

LISTAGG Overflow — The List That Silently Truncates

6 min read · banner · listagg · oracle · varchar2 · overflow · ora-01489 · gotcha

The hook

The everyday analogy

The yellow school bus pulls up at the elementary school at 3 PM. The bus has 60 seats. There are 60 kids waiting at the curb — perfect. Everyone gets on, the driver pulls away, every parent picks up their kid at the next stop.

Now picture the same bus on the first day of the new school year. Enrollment was higher than the district planned. There are 75 kids at the curb. The driver does not have time to call dispatch. He waves the kids on. Sixty climb aboard. The driver closes the door and pulls away. The remaining 15 kids stand at the curb watching the bus leave. The driver does not announce "I left 15 kids behind." The bus goes about its route. The parents at the next stop see their kids and go home satisfied — until a phone call from the abandoned 15 at the original school's office an hour later.

A yellow school bus pulling away from a curb, full to capacity; a handful of children left standing at the curb watching the bus leave. A sign on the bus reads 'CAPACITY 60'; the kids on the curb count to 15.

LISTAGG is the bus. The 4000-byte VARCHAR2 limit is the 60 seats. The rows being concatenated are the kids. On a small dataset everyone fits and no problem is visible. On a big dataset, some rows cannot board. Left to its defaults, Oracle is the dispatcher who calls the driver and says "stop, there's a problem": the query fails with ORA-01489 and the bus stops moving until you fix it. Before 12.2 that was the only behavior available. The driver who waves people on and pulls away without counting is the workaround: a truncation with no indicator, which lets rows fall silently off the back of the result.

The fix is the same one the bus district would adopt: a sign saying "12 students could not board." That is LISTAGG(... ON OVERFLOW TRUNCATE WITH COUNT). The truncated output ends with ...(12), so at least you know data was dropped.

What it really is

LISTAGG(expr, separator) WITHIN GROUP (ORDER BY ...) is Oracle's aggregate that concatenates rows into a delimited string. It is used like SUM or COUNT but produces a string — a comma-separated list of courses per student, roles per user, fund codes per position, advisors per term. Security reports (GURACLS-based role listings) hit this limit constantly because a single user can belong to dozens of classes.

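The 4000-byte limit comes from VARCHAR2's maximum length in SQL. PL/SQL variables allow 32K, but SQL statements, including Argos DataBlock SQL and most reporting tools, cap at 4000 bytes unless the DBA has enabled MAX_STRING_SIZE = EXTENDED (Oracle 12c and later), which raises the SQL limit to 32767 bytes. A concatenated list that exceeds the limit triggers the overflow.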

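Default behavior, on every release: Oracle raises ORA-01489: result of string concatenation is too long and the query fails. Before 12.2 that is the only behavior LISTAGG has; from 12.2 on it is spelled ON OVERFLOW ERROR and remains the default. The failure depends on the data, so a report can pass testing and break months later when one list grows past the limit.

Silent truncation comes from the workarounds: truncating with an empty indicator and WITHOUT COUNT, or aggregating only the first N rows. The result string is whatever fit. No warning. No error. The report looks correct because every row has a value; the reader has no signal that the value is incomplete. The only way to detect the truncation is to count the items yourself. Loud failure beats silent failure, but neither is acceptable in production.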

The fix: ON OVERFLOW TRUNCATE [WITH COUNT | WITHOUT COUNT]. Gracefully truncates at the byte limit. WITH COUNT appends a (N) indicator showing how many rows were dropped. TRUNCATE is the keyword every report writer should use on any LISTAGG that could grow beyond a handful of rows.

A horizontal bar at 4000 bytes; below it, 75 small chips (roles) being concatenated left-to-right; the chips that fit are coral, the chips that overflow are amber and gray, with a small '(15 more)' tail appended where the ON OVERFLOW TRUNCATE WITH COUNT clause would emit.

Real Banner places where LISTAGG hits the limit: security reports listing roles per user, advisor lists per student over multiple terms, course lists per student (SHRGRDE aggregated), fund lists per position (NBRPLBD), historical name lists per PIDM (SPRIDEN without change_ind — see SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap).

See it — the diagram

The 4000-byte bar is the hard ceiling. Seventy-five chips, each a role code, are being concatenated left-to-right, comma-separated. The first 60 chips fit inside the bar, coral-colored, visible in the output. The 15 chips beyond the bar are amber and gray: they belong in the list, but they are dropped. With the ON OVERFLOW TRUNCATE WITH COUNT clause, a small tail is appended at the truncation point (drawn as (15 more) in the figure; Oracle's actual output is the indicator followed by the count, such as ...(15)). With WITHOUT COUNT and an empty indicator, the string just stops. With no ON OVERFLOW clause at all, there is no shorter string: the query fails with ORA-01489.

Show me the code

The bug — naive LISTAGG, no overflow handling:

-- WRONG: no overflow handling. If any user's role list exceeds
-- 4000 bytes, the whole query fails with ORA-01489, on every
-- Oracle release. Small test data never triggers it.
SELECT g.guracls_userid,
       LISTAGG(g.guracls_class_code, ', ')
         WITHIN GROUP (ORDER BY g.guracls_class_code) AS role_list
FROM   guracls g
GROUP BY g.guracls_userid;

The fix — ON OVERFLOW TRUNCATE WITH COUNT:

-- RIGHT: truncates gracefully and appends '...(N)' so the reader
-- knows N values were dropped. Available in Oracle 12.2 and later.
SELECT g.guracls_userid,
       LISTAGG(g.guracls_class_code, ', '
               ON OVERFLOW TRUNCATE '...' WITH COUNT)
         WITHIN GROUP (ORDER BY g.guracls_class_code) AS role_list
FROM   guracls g
GROUP BY g.guracls_userid;

**Check your Oracle version before assuming ON OVERFLOW works:**

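-- VERSION_FULL exists only from 18c on, so select VERSION alone;
-- this form also runs on the older releases you are checking for.
SELECT product, version FROM product_component_version
WHERE  product LIKE 'Oracle%';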

Where intuition fails

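The failure is data-dependent. A naive LISTAGG runs cleanly for years while every list stays under the limit, then raises ORA-01489 the day one user collects enough roles, and operations teams scramble. Reports that dodged the error with an indicator-free truncation, or by aggregating only the first N rows, hide the problem instead. Audit every LISTAGG in your codebase and add ON OVERFLOW TRUNCATE WITH COUNT defensively (Oracle 12.2 and later).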

**WITH COUNT vs WITHOUT COUNT matters for reconciliation.** WITH COUNT ends the truncated list with ... (12) — a visible signal that data was dropped. WITHOUT COUNT just stops, no indicator. Use WITH COUNT unless the consumer report's display logic cannot handle the trailing indicator.

The 4000-byte limit is bytes, not characters. Multi-byte characters (UTF-8 accented characters in Spanish names) count for more than one byte. A list of 600 names in plain ASCII might fit; the same 600 names with accents may overflow. Test with your actual data character set.
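Because the limit is measured in bytes, the projected size of each list can be estimated before LISTAGG ever runs. This is a sketch against the GURACLS example above, assuming the same ', ' separator (2 bytes) and the default 4000-byte limit.

```sql
-- Projected byte length of each user's role list:
-- bytes of every code plus one 2-byte separator between neighbors.
SELECT g.guracls_userid,
       COUNT(g.guracls_class_code) AS role_count,
       SUM(LENGTHB(g.guracls_class_code))
         + 2 * (COUNT(g.guracls_class_code) - 1) AS projected_bytes
FROM   guracls g
GROUP BY g.guracls_userid
HAVING SUM(LENGTHB(g.guracls_class_code))
         + 2 * (COUNT(g.guracls_class_code) - 1) > 4000
ORDER BY projected_bytes DESC;
```

LENGTHB counts bytes where LENGTH counts characters, which is the distinction that matters for accented names. An empty result means no list currently overflows; it says nothing about next term's data.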

Sorting affects which rows survive truncation. WITHIN GROUP (ORDER BY x) controls the concatenation order. The truncated output keeps the FIRST rows by that ordering and drops the LAST. If the business cares about specific items appearing (e.g. "always show the primary advisor"), sort to put those first — or filter to the most-important subset before aggregating.
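A sketch of the "sort the important items first" idea for the primary-advisor case. The SGRADVR column names used here (sgradvr_pidm, sgradvr_advr_pidm, sgradvr_prim_ind, sgradvr_term_code_eff) and the 'Y' flag value are assumptions to verify against your data dictionary, and the single-term filter is a simplification.

```sql
-- Primary advisor sorts first, so it survives any truncation.
SELECT a.sgradvr_pidm,
       LISTAGG(TO_CHAR(a.sgradvr_advr_pidm), ', '
               ON OVERFLOW TRUNCATE '...' WITH COUNT)
         WITHIN GROUP (
           ORDER BY CASE WHEN a.sgradvr_prim_ind = 'Y' THEN 0 ELSE 1 END,
                    a.sgradvr_advr_pidm) AS advisor_pidms
FROM   sgradvr a
WHERE  a.sgradvr_term_code_eff = '202610'
GROUP BY a.sgradvr_pidm;
```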

The one-sentence takeaway

LISTAGG overflows when the concatenated string exceeds 4000 bytes: by default the query fails with ORA-01489, and an indicator-free truncation drops rows silently. Use LISTAGG(... ON OVERFLOW TRUNCATE WITH COUNT) on every LISTAGG that could grow. The trailing count is the difference between a known incomplete list and an invisible one.

Track E · Where intuition fails

Soft Deletes — The Rows That Aren't Really Gone

6 min read · banner · soft-delete · sfrstcr · sgbstdn · audit · gotcha · stvrsts

The hook

You withdraw a student in Banner. The row in SGBSTDN does not disappear — it gets a status code. You drop a registration. SFRSTCR keeps the row with a drop flag. You delete a security role. The audit log keeps an entry with AUDIT_ACTION = 'D'. Banner does not hard-delete. The rows stay in the table forever. Every report that does not filter them out is silently counting ghosts.

The everyday analogy

You hit "Delete" on an email about last quarter's budget. The email disappears from your inbox. It feels gone. Conceptually it is gone — it is not cluttering your view, it is not in your unread count, you cannot accidentally reply to it from the inbox view.

Then you go to "Search all folders" for a related thread. The search returns 47 results — half from your active inbox, half from Deleted Items. The same email you "deleted" last week shows up in the search alongside this week's live correspondence. If you scan results and pick the most recent one, you might quote a budget number that was wrong, that you corrected the next day, that you thought you had retired.

The email is not gone. It is soft-deleted — moved to a different folder, flagged "deleted," kept for thirty days or forever. A search across "all folders" ignores the flag and treats deleted items as if they were live. To get a clean answer you have to scope the search: "Inbox only," not "All folders."

An email client showing a search across 'All Folders' returning results from both Inbox (coral, current) and Deleted Items (amber, with strikethrough); a confused user squinting at a stale quote pulled from a deleted thread.

Banner does the same with several tables. Withdraw a student and the row in SGBSTDN does not disappear — it gets a status code that means "withdrawn." Drop a registration and SFRSTCR keeps the row with SFRSTCR_RSTS_CODE = 'DD' ("drop, no grade") or 'DW' ("drop with W"). Delete a security role assignment and the audit log keeps an entry with AUDIT_ACTION = 'D'. Naive queries return everything — live and withdrawn and dropped and deleted — and the report's "active enrollment" count silently includes people who left last semester.

The fix is the email's "Inbox only" toggle, expressed as SQL: one WHERE filter per soft-delete pattern. Know which tables have them. Filter every time.

What it really is

Banner uses at least three distinct soft-delete patterns across its schema. No universal is_deleted flag exists. Every table needs its own filter pattern.

Registration status codes in SFRSTCR_RSTS_CODE. The validation table STVRSTS defines the codes. Typical drops: 'DD' (drop-no-grade), 'DW' (drop with a W). To count "currently registered," exclude these. Some codes mean "re-enrolled after drop" ('RE'); decide whether those count as live or not, and document the decision.

Student status codes in SGBSTDN_STST_CODE. STVSTST defines the vocabulary. Typical withdrawn students have status codes that mean "no longer active." Confirm your local codes — some installations distinguish "withdrawn – voluntary," "withdrawn – academic," and "dismissed" as separate statuses.

Audit-log action codes — <table>_AUDIT_ACTION = 'D' on log tables like GUBALOG. An insert is 'I', an update is 'U', a delete is 'D'. Counting all rows as "current state" inflates counts by every delete ever recorded.

The discipline: for every Banner table you query, ask "does this table do soft deletes?" If yes, identify the column and the values; filter them out. The BSS schema search and SQL Explainer rules flag the major ones. SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap is itself a specialized soft-delete pattern — historical name rows are kept but flagged.

Three Banner tables side by side, each with one live row (coral) and one soft-deleted row (amber, strikethrough): SFRSTCR with RSTS_CODE='DD', SGBSTDN with STST_CODE='WD', GUBALOG with AUDIT_ACTION='D'. The filter for each is shown beneath.

Soft deletes serve real purposes: audit trail, FERPA compliance, the ability to reinstate a withdrawn student, reconciliation against historical reports that included the now-deleted rows. They are a feature — but the feature requires filtering discipline in every consumer query.

See it — the diagram

Three tables side by side, three soft-delete patterns. SFRSTCR: one live row (coral, RSTS_CODE = 'RE') and one dropped row (amber strikethrough, RSTS_CODE = 'DD'), with the filter WHERE rsts_code NOT IN ('DD', 'DW') beneath. SGBSTDN: one active row (coral, STST_CODE = 'AS') and one withdrawn row (amber strikethrough, STST_CODE = 'WD'), with WHERE stst_code <> 'WD' beneath. GUBALOG: one grant row (coral, AUDIT_ACTION = 'I') and one removal row (amber strikethrough, AUDIT_ACTION = 'D'), with WHERE audit_action <> 'D' beneath. Three tables, three columns, three filters, one pattern: find the flag column, exclude the deleted value.

Show me the code

Bug #1 — Registration count includes drops:

-- WRONG: counts dropped registrations as if they were live.
SELECT r.sfrstcr_crn, COUNT(*) AS enrolled
FROM   sfrstcr r
WHERE  r.sfrstcr_term_code = '202610'
GROUP BY r.sfrstcr_crn;

Fix — exclude drop codes:

-- RIGHT: only currently-active registrations. Verify the exact
-- drop codes against STVRSTS at your installation.
SELECT r.sfrstcr_crn, COUNT(*) AS enrolled
FROM   sfrstcr r
WHERE  r.sfrstcr_term_code = '202610'
  AND  r.sfrstcr_rsts_code NOT IN ('DD', 'DW')
GROUP BY r.sfrstcr_crn;

Bug #2 — Student roster includes withdrawn students:

-- WRONG: withdrawn students still have rows in SGBSTDN.
SELECT s.sgbstdn_pidm, s.sgbstdn_majr_code_1
FROM   sgbstdn s
WHERE  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm);

Fix — add the status filter:

-- RIGHT: exclude withdrawn students. 'WD' and 'DD' are example
-- codes; confirm the real ones against STVSTST at your installation.
SELECT s.sgbstdn_pidm, s.sgbstdn_majr_code_1
FROM   sgbstdn s
WHERE  s.sgbstdn_term_code_eff = (
       SELECT MAX(s2.sgbstdn_term_code_eff)
       FROM   sgbstdn s2
       WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm)
  AND  s.sgbstdn_stst_code NOT IN ('WD', 'DD');

Find the soft-delete vocabulary on your installation:

-- Registration status codes
SELECT stvrsts_code, stvrsts_desc
FROM   stvrsts ORDER BY stvrsts_code;

-- Student status codes
SELECT stvstst_code, stvstst_desc
FROM   stvstst ORDER BY stvstst_code;

Where intuition fails

The "deleted" semantics vary by table. SFRSTCR_RSTS_CODE uses domain-specific codes ('DD', 'DW', 'RE'). *_AUDIT_ACTION uses single letters ('I', 'U', 'D'). Status columns like SGBSTDN_STST_CODE are their own vocabulary. No universal is_deleted flag exists. Every table needs its own filter pattern.

**The filter is NOT IN, not <>.** Most soft deletes use multiple codes. NOT IN ('DD', 'DW') is the correct shape — <> 'DD' misses 'DW' and any other drop variants. Always enumerate the full set of exclusion codes from the validation table.

Some status codes mean "in-between" — neither fully active nor fully deleted. Beyond the obvious drops, some statuses mean "in-progress," "incomplete," or "pending." Decide explicitly whether to include them. Document the choice in the SQL comment so the next developer knows it was intentional.
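Before deciding which codes to exclude, it helps to see which ones actually occur in the term being reported. This is a sketch that joins the registration rows to STVRSTS so that each code appears with its local description.

```sql
-- Census of registration status codes in one term, with descriptions.
-- Every code in this list needs an explicit include/exclude decision.
SELECT r.sfrstcr_rsts_code,
       v.stvrsts_desc,
       COUNT(*) AS registrations
FROM   sfrstcr r
LEFT JOIN stvrsts v
       ON v.stvrsts_code = r.sfrstcr_rsts_code
WHERE  r.sfrstcr_term_code = '202610'
GROUP BY r.sfrstcr_rsts_code, v.stvrsts_desc
ORDER BY registrations DESC;
```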

Hard deletes happen too — at unpredictable times. Some Banner installations periodically purge old audit log rows for storage reasons. A report that counted deleted history yesterday may return fewer rows today because the rows were finally hard-deleted. Reports that depend on soft-deleted rows should LEFT JOIN defensively rather than assuming the row will always be there.

The one-sentence takeaway

Banner soft-deletes rows instead of removing them. Registration drops, student withdrawals, and audit-log deletions all leave data in the table flagged with a status code. Identify the soft-delete column for every table you query. Filter it out. Every time.

Track E · Where intuition fails

The Effective-Date Trap — Joining to Yesterday's Row

6 min read · banner · effective-dating · sgbstdn · scbcrse · historical-reports · gotcha · max-eff

The hook

You run a report: "Fall 2022 enrollment by current major." The row count is right. The CRNs match. Every student has exactly one major. What nobody told you is that the major is from today — not from Fall 2022. You used the unbounded MAX-effective subquery from The MAX() Subquery — Getting the Row That's Current, and it silently tagged every historical registration with present-tense labels. The report is a history book whose author walked into the archive and swapped all the old placards for new ones.

The everyday analogy

Walk through a natural history museum and you read the placards: "This dinosaur is Triceratops horridus." Three years later the curators decide the specimen is actually Triceratops prorsus — different species, refined analysis. They print a new placard. The old one is recycled. The display case still shows the same skeleton; only the label changed.

Now imagine you are a researcher writing about the museum's 1990 holdings. You walk into the exhibit hall today, look at the placards, and write: "In 1990 the museum displayed a Triceratops prorsus." You quoted today's placard. But in 1990 the placard said Triceratops horridus — a visitor's notebook from 1990 confirms it. Your 1990 history book is now silently wrong — it describes the past using the present's labels.

A museum exhibit case with two placards visible: a faded 1990 placard partially behind the current one which reads a different species name; a researcher squinting at the new placard while leafing through a 1990 visitor catalog.

The fix is to consult a dated archive — the museum's acquisition records from 1990 — instead of walking into the current exhibit. The acquisition record from 1990 says Triceratops horridus; the placard today says Triceratops prorsus; both are correct for their respective dates.

Banner is the museum. The student's major in SGBSTDN is the placard. The MAX-effective subquery is "walk into the exhibit hall and read the current placard." For "what is Maria's major right now?" that is correct. For "what was Maria's major in Fall 2022?" it is silent revisionism — Maria changed her major in Fall 2024, and today's placard says her current one. Your Fall 2022 enrollment-by-major report tags Maria with the wrong major. The fix is the dated archive: bound the MAX by the report's term. AND s2.sgbstdn_term_code_eff <= '202210'. Now you are reading the placard that was on the case in Fall 2022.

What it really is

Banner's effective-dating system (see Effective Dating — Why Banner Never Forgets) stores attribute versions in stacked rows — each change inserts a new row. The MAX-effective subquery (see The MAX() Subquery — Getting the Row That's Current) resolves which row is "current" — but "current" is always relative to a date.

The unbounded MAX — WHERE same_pidm with no date bound — returns the row that is current right now, across all time. The topmost stratum. For operational reports ("what is each student's major today?") this is correct.

The bounded MAX — WHERE same_pidm AND eff <= target_date — returns the row that was current as of the target date. For historical reports ("Fall 2022 enrollment by major"), the target date is the report's period — the term the facts are about, not the date the report runs.

Same student SGBSTDN stack as A5/B3 figures, with two target dates marked: 'today' (the unbounded MAX hits the top Health Sciences row) and 'Fall 2022' (the bounded MAX hits the middle Nursing row). Same student, same query shape, different result depending on the bound.

The bug is silent because the report still returns one row per student, each with a major. The shape is right. The label is just from the wrong stratum. A student who was Nursing in Fall 2022 but changed to Health Sciences in Fall 2024 appears in the historical report as "Health Sciences" — four semesters before they declared it. Spot-checking unchanged students finds nothing wrong. The only way to catch it is to audit a student you know changed.

Tables most commonly affected: SGBSTDN (curriculum — major, minor, concentration), SCBCRSE (course catalog — retitled courses), NBRJOBS (job title and salary — promoted since), SGRADVR (advisor — reassigned since).

See it — the diagram

The same SGBSTDN stack from Effective Dating — Why Banner Never Forgets, with two different arrows drawn through it. The unbounded arrow — labeled "today" — cuts through the top row: Health Sciences, declared Fall 2024. The bounded arrow — labeled "Fall 2022" — stops at the middle row: Nursing, declared Fall 2021. Same student. Same stack. Same MAX subquery shape. The only difference is the <= bound. Without it, the arrow always hits the top. With it, the arrow stops at the stratum that was current when the report's period happened.

Show me the code

Bug — unbounded MAX in a historical report:

-- WRONG: "Fall 2022 enrollment by current major."
-- The MAX-effective subquery has no date bound.
-- Every student's major is TODAY's value, not what they had in 2022.
SELECT s.sgbstdn_majr_code_1 AS major, COUNT(*) AS enrolled
FROM   sfrstcr r
JOIN   sgbstdn s
       ON  s.sgbstdn_pidm = r.sfrstcr_pidm
       AND s.sgbstdn_term_code_eff = (
           SELECT MAX(s2.sgbstdn_term_code_eff)
           FROM   sgbstdn s2
           WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm)
WHERE  r.sfrstcr_term_code = '202210'
GROUP BY s.sgbstdn_majr_code_1;

Fix — bound the MAX by the report's term:

-- RIGHT: bound the MAX to <= the report's term.
-- Now each student's major is what they had in Fall 2022.
SELECT s.sgbstdn_majr_code_1 AS major, COUNT(*) AS enrolled
FROM   sfrstcr r
JOIN   sgbstdn s
       ON  s.sgbstdn_pidm = r.sfrstcr_pidm
       AND s.sgbstdn_term_code_eff = (
           SELECT MAX(s2.sgbstdn_term_code_eff)
           FROM   sgbstdn s2
           WHERE  s2.sgbstdn_pidm = s.sgbstdn_pidm
             AND  s2.sgbstdn_term_code_eff <= '202210')
WHERE  r.sfrstcr_term_code = '202210'
GROUP BY s.sgbstdn_majr_code_1;

One extra line inside the subquery — AND s2.sgbstdn_term_code_eff <= '202210' — and the report switches from silent revisionism to historical accuracy.

Where intuition fails

The bug only appears for students who changed. A student who never changed majors has the same major in every stratum; bounded and unbounded MAX return the same row. The bug is invisible in spot-checks of unchanged students. Audit by deliberately running the report for a term where you know a student changed since — verify their major matches their then-current declaration.

The bound is the report's PERIOD, not the report's RUN DATE. A report run today about Fall 2022 needs <= '202210', not <= SYSDATE. New writers sometimes substitute SYSDATE because "that's what 'current as of' means." It means current as of the period the report covers — which is not today.

Multi-term reports need a JOIN-time bound, not a literal. "Enrollment by current major, Fall 2020 through Fall 2024" needs the MAX bound to vary per row — <= r.sfrstcr_term_code, not <= '202410'. Otherwise every row of every term uses Fall 2024's major.
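The per-row bound can be sketched the same way. This toy example (Python sqlite3 again, with an invented student and invented term rows) runs one multi-term query twice: once with the literal bound and once with the bound taken from each registration row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sgbstdn (sgbstdn_pidm INTEGER, sgbstdn_term_code_eff TEXT,
                      sgbstdn_majr_code_1 TEXT);
CREATE TABLE sfrstcr (sfrstcr_pidm INTEGER, sfrstcr_term_code TEXT);
-- Hypothetical student 1001 changes major effective 202410.
INSERT INTO sgbstdn VALUES (1001, '202010', 'NURS'), (1001, '202410', 'HSCI');
INSERT INTO sfrstcr VALUES (1001, '202010'), (1001, '202210'), (1001, '202410');
""")

QUERY = """
SELECT r.sfrstcr_term_code, s.sgbstdn_majr_code_1
FROM   sfrstcr r
JOIN   sgbstdn s
       ON  s.sgbstdn_pidm = r.sfrstcr_pidm
       AND s.sgbstdn_term_code_eff = (
           SELECT MAX(s2.sgbstdn_term_code_eff)
           FROM   sgbstdn s2
           WHERE  s2.sgbstdn_pidm = r.sfrstcr_pidm
             AND  s2.sgbstdn_term_code_eff <= {bound})
WHERE  r.sfrstcr_term_code BETWEEN '202010' AND '202410'
ORDER BY r.sfrstcr_term_code
"""

def majors_by_term(bound: str) -> list:
    """bound is either a quoted literal or a column of the outer row."""
    return conn.execute(QUERY.format(bound=bound)).fetchall()

literal_bound = majors_by_term("'202410'")            # one bound for all terms
per_row_bound = majors_by_term("r.sfrstcr_term_code")  # bound follows each row
print(literal_bound)
print(per_row_bound)
```

With the literal, all three terms report the latest major. With the per-row bound, the two earlier terms keep the major that was in effect at the time.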

**Course catalog (SCBCRSE) has the same trap.** A historical transcript that lists "courses with their current titles" silently relabels courses that have been retitled since the student took them. The bound is the term in which the student registered. The article The MAX() Subquery — Getting the Row That's Current shows the SCBCRSE pattern with <= sr.sfrstcr_term_code: same fix, different table.

The one-sentence takeaway

The unbounded MAX-effective subquery returns today's version of every effective-dated attribute. For historical reports, bound it: AND term_code_eff <= :report_term. Otherwise every student's major, every course's title, and every employee's job appears as what it is NOW, not what it was THEN.

Track E · Where intuition fails

The `> 0` Trap — The Filter That Drops Reversals

You add AND phrhist_gross > 0 to your payroll report. The intent is defensive: exclude zero rows, count only real amounts. The effect is the opposite of defensive. You have silently dropped every payroll reversal — every void, every adjustment, every back-out. Your 'total gross earnings' now includes money that was keyed by mistake and reversed the next day. The filter that was supposed to protect the report broke it.

6 min read · banner · phrhist · tbraccd · reversal · gotcha · filter-trap

The hook

You add AND phrhist_gross > 0 to your payroll report. The intent is defensive: exclude zero rows, count only real amounts. The effect is the opposite of defensive. You have silently dropped every payroll reversal — every void, every adjustment, every back-out. Your "total gross earnings" now includes money that was keyed by mistake and reversed the next day. The filter that was supposed to protect the report broke it.

The everyday analogy

A waiter ends a long evening shift and counts the tip jar. $487 in cash. They go home happy.

But that is not what they actually netted. The restaurant has a tip-out policy: at end of shift, the waiter shares a percentage with the busser, the bartender, and the food runner. The tip-outs are recorded in the same accounting system as the tips coming in — but as negative entries. Tips in: $487. Tip-outs: -$95 to busser, -$60 to bartender, -$40 to food runner. Net to the waiter: $292.

Now imagine the waiter's manager runs a report at end of month titled "Total tips collected." The report writer, trying to be safe, adds WHERE amount > 0 to the SQL — "to exclude any zero entries." The report sums every positive tip and ignores every tip-out. The total reads $487 × the number of shifts. The waiters look like they're making more than they actually take home. Payroll reconciliation drifts. The bookkeeper cannot tie back to the bank deposit.

A waiter's end-of-shift tip jar full of cash beside a notebook showing positive entries (tips in, coral) and negative entries (tip-outs, amber); a confused calculation between gross and net at the bottom.

The > 0 filter felt protective. It was deletion in disguise. Removing it lets the positives and negatives net correctly, and the total matches reality.

Banner has the same shape across every transactional table. PHRHIST_GROSS includes negative rows for payroll reversals. TBRACCD_AMOUNT includes negative rows for AR adjustments and refunds. Filtering > 0 keeps the gross and silently drops the reversals. Your "Total Gross Earnings" or "Total Tuition Charged" report inflates by exactly the amount that was reversed — and the bug is invisible because every reversal happened to a real charge that is still in the report.

What it really is

Banner stores reversals as negative-amount rows. A payroll line entered for $2,150 and then voided the next day produces TWO rows in PHRHIST: the original +2,150.00 and the reversal -2,150.00. Both rows are real, both posted, both ledgered. The sum of the two is $0, which is exactly the net effect of the void. This is not a bug in Banner. It is standard ledger practice: a posted entry is never erased; it is offset by a reversing entry, so the original and its reversal net to zero as a pair.

The WHERE amount > 0 filter looks defensive — "only count positive amounts, exclude zeroes" — but it silently drops the reversal row while keeping the original. Result: every voided transaction appears in the total as if it were real.

Two PHRHIST rows for the same employee and pay event: the original +2,150.00 (coral) and the reversal -2,150.00 (amber). Below, SUM with and without the > 0 filter: filtered = $2,150 (wrong); unfiltered = $0 (right).

The fix is simpler than the filter: remove it. SUM(phrhist_gross) already nets positive against negative. The total that comes out is the actual cash that moved. If you specifically need to exclude amount-of-zero rows (rows that represent non-events with no financial impact), use <> 0 — which keeps negatives and drops only true zeroes.

The BSS SQL Explainer flags PHRHIST_GROSS > 0 and similar patterns. The rule message: this filter drops reversals; use <> 0 or omit the filter entirely.

The pattern generalizes beyond payroll. TBRACCD_AMOUNT (AR transactions) stores refunds as negatives. RPRAWRD (financial aid award activity) stores award decreases as negatives. Anywhere Banner records financial events that can be voided or adjusted, the > 0 filter silently drops the reversals.

See it — the diagram

Two PHRHIST rows side by side for the same employee, same pay event, same earnings code. The original row — coral, +2,150.00 — the payroll line as keyed. The reversal row — amber, -2,150.00 — the void posted the next day. Below them, two SUM results in large type: "SUM with > 0 filter = $2,150.00" in amber (wrong — the gross survives, the reversal is gone), and "SUM without filter = $0.00" in coral (right — the pair nets to zero as it should). The visual makes the arithmetic obvious: dropping one row of a zero-sum pair inflates the total by exactly the dropped row's magnitude.

Show me the code

**Bug — > 0 filter drops reversals:**

-- WRONG: drops every payroll reversal silently.
-- Total gross is inflated by every voided payroll line.
SELECT SUM(p.phrhist_gross) AS ytd_gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026
  AND  p.phrhist_disp = 'P'        -- posted only (good — see E3)
  AND  p.phrhist_gross > 0;        -- BUG: drops reversals

Fix — let positives and negatives net:

-- RIGHT: SUM nets positive payroll entries with their reversals.
-- The total matches the actual cash that left the bank.
SELECT SUM(p.phrhist_gross) AS ytd_gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026
  AND  p.phrhist_disp = 'P';

If you specifically want to exclude exact-zero rows:

-- The defensive intent of "exclude zeros" without dropping reversals.
SELECT SUM(p.phrhist_gross) AS ytd_gross
FROM   phrhist p
WHERE  p.phrhist_year = 2026
  AND  p.phrhist_disp = 'P'
  AND  p.phrhist_gross <> 0;       -- excludes zero, keeps negatives
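The three variants can be compared side by side on a toy table. This is a sketch only: it uses Python's sqlite3 in place of Oracle, a cut-down PHRHIST with just the columns the queries above touch plus a PIDM, and four invented rows (a voided line, its reversal, one genuine pay event, and one zero-amount row).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phrhist (phrhist_pidm INTEGER, phrhist_year INTEGER,
                      phrhist_disp TEXT, phrhist_gross REAL);
INSERT INTO phrhist VALUES
  (2001, 2026, 'P',  2150.00),   -- original line, keyed in error
  (2001, 2026, 'P', -2150.00),   -- reversal posted the next day
  (2001, 2026, 'P',  1800.00),   -- a genuine pay event
  (2001, 2026, 'P',     0.00);   -- zero-amount row
""")

def ytd_gross(extra_filter: str = "") -> float:
    """SUM posted 2026 gross, optionally with one more predicate appended."""
    sql = ("SELECT SUM(phrhist_gross) FROM phrhist "
           "WHERE phrhist_year = 2026 AND phrhist_disp = 'P' " + extra_filter)
    return conn.execute(sql).fetchone()[0]

with_gt_zero = ytd_gross("AND phrhist_gross > 0")    # drops the reversal row
no_filter = ytd_gross()                              # pair nets to zero
non_zero = ytd_gross("AND phrhist_gross <> 0")       # drops only the 0.00 row
print(with_gt_zero, no_filter, non_zero)
```

The filtered total is inflated by exactly the 2,150 that was reversed. The unfiltered and <> 0 totals agree, because a zero row never changes a SUM.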

Where intuition fails

The bug looks safe because every reported amount is real. No phantom rows appear; the inflation is from rows that actually existed and were validly reversed. The report isn't lying about what it shows — it is hiding what it is NOT showing. Reviewers see plausible numbers and approve. The reconciliation breakage surfaces weeks later when the Bursar or Payroll office cannot tie back to the bank.

**The same pattern affects TBRACCD (AR transactions), RPRAWRD (financial aid award activity), and any table where amounts can be reversed.** Filter > 0 on AR-related reports silently drops every refund, every waiver, every adjustment. The Bursar's reconciliation breaks for the same reason payroll's did.

**<> 0 is sometimes the right intent.** If you really do want to exclude amount-of-zero rows (because they represent non-events with no financial impact), <> 0 does that without dropping reversals. Use it when the business rule explicitly cares about "non-zero amounts only" — but understand you are keeping negatives and the SUM will net them.

**The opposite trap, a < 0-only report, is equally wrong.** A "show only reversals" report that filters phrhist_gross < 0 shows only the negative side of each pair, making the reversal look like the entire transaction. If the business needs "reversed transactions," join the table to itself on the matching pair criteria (PIDM + pay event + earnings code) and show both sides together.
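One way the self-join might look, as a sketch under stated assumptions: the table pay_line and its columns pidm, pay_event, and earn_code are hypothetical stand-ins for whatever identifies a pay event at your site, and the pairing rule (equal and opposite amounts) is an assumption that will not match partial adjustments. Python's sqlite3 stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table and column names, for illustration only.
CREATE TABLE pay_line (pidm INTEGER, pay_event TEXT, earn_code TEXT, gross REAL);
INSERT INTO pay_line VALUES
  (2001, '2026-BW-03', 'REG',  2150.00),   -- original
  (2001, '2026-BW-03', 'REG', -2150.00),   -- its reversal
  (2001, '2026-BW-04', 'REG',  1800.00);   -- never reversed
""")

PAIRS = """
SELECT o.pidm, o.pay_event, o.earn_code,
       o.gross AS original_gross,
       v.gross AS reversal_gross
FROM   pay_line o
JOIN   pay_line v
       ON  v.pidm      = o.pidm
       AND v.pay_event = o.pay_event
       AND v.earn_code = o.earn_code
       AND v.gross     = -o.gross      -- assumed pairing rule
WHERE  o.gross > 0                     -- pick the original side of each pair
"""

reversed_pairs = conn.execute(PAIRS).fetchall()
print(reversed_pairs)
```

Here > 0 is safe because it only chooses which side of the pair to call the original; the reversal still arrives through the join, so both amounts appear on one row.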

The one-sentence takeaway

Banner stores reversals as negative-amount rows. WHERE amount > 0 silently drops every reversal while keeping the original — inflating totals by exactly the amount that was reversed. Let positives and negatives net in the SUM. Use <> 0 only if you need to exclude amount-of-zero rows specifically.

Track F · From Banner to a warehouse

What Waubonsee Actually Reports Today — and Where the Warehouse Should Land First

Before you draw your first star, look at what the campus already prints every week. The Argos folder will tell you which warehouse to build first — and the answer is not the one you expected.

6 min read · argos · evidence · warehouse-strategy · prioritization

The hook

Every data-warehouse playbook in the Kimball canon starts the same way: pick the most important business process. Universities almost always answer "course registration." It is the textbook example, the photogenic star, the first chapter of every higher-ed BI deck.

That answer is wrong for Waubonsee. We have evidence — not opinion, not folklore. The Argos folder on the production server is a ledger of what the campus actually prints. We parsed it. The numbers tell a different story.

The everyday analogy

A restaurant kitchen has two reference texts. One is the cookbook on the shelf: every dish the kitchen could serve. The other is the wooden ticket spike at the pass: every dish the kitchen did serve, last night and the night before, going back years. The cookbook is bigger, prettier, and mostly aspirational. The spike is dog-eared, stained, and exactly accurate.

When you remodel the kitchen, you do not start from the cookbook. You start from the spike. The spike tells you which prep station gets a refrigerator and which gets a cutting board, because the spike tells you what the cooks actually do between four and ten every evening.

The cookbook lists every dish the kitchen *could* serve; the ticket spike says which ones the kitchen *actually* serves.

Banner is the cookbook. Every table that could be queried is in there, documented, joined, indexed. Argos is the spike. Every report the campus does run lives in .argosexport bundles, with its SQL, its parameters, and a record of who runs it. If you build the warehouse from the cookbook, you will end up with a beautiful kitchen that does not match the orders.

What it really is

The sibling project argos_tool parses every .argosexport bundle exported from Waubonsee's production Argos server. For each DataBlock — the unit of an Argos report — it pulls the SQL, the matched Banner objects, and the report variants that share it. The output is one JSON per DataBlock at argos_tool/ArgosDoc/ai_data/.

The script src/argos_ingest.py in this wiki rolls those JSONs into one catalog: data/argos_catalog.json. Three numbers come out of it:

670 DataBlocks — the discrete reports the school currently maintains.
272 unique Banner objects referenced across all of them.
5 functional domains (HR/Payroll, Position & Budget, Student Records, Finance/GL, and Person/Other) that bucket each report by its dominant table prefix.

The domain breakdown is the first headline:

Five functional domains, none dominant. The warehouse's first star will leave at least two thirds of the catalog still running against Banner.

The chart does not show one dominant slice. HR/Payroll is the largest named domain at roughly a fifth of the catalog, with Student Records and Finance/GL right behind it. Position & Budget — the part of the school that decides who gets paid out of which fund — is the smallest named slice. The largest slice of all, "Person / Other," is the catalog's most honest admission: many reports are dominated by SPRIDEN, SPBPERS, and the validation-code tables (STV*, GTV*) that sit underneath every business process. There is no single warehouse you can build that "replaces Argos" — the load is spread.
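The shares behind that description are simple arithmetic on the catalog's domain counts. A short Python sketch, using the five counts reported in the catalog's domain_summary rollup (the variable names are mine):

```python
# Domain counts as reported in the catalog's domain_summary rollup.
domain_summary = {
    "Person/Other": 286,
    "HR/Payroll": 146,
    "Student": 124,
    "Finance/GL": 70,
    "Position & Budget": 44,
}

total = sum(domain_summary.values())
shares = {d: round(100 * n / total, 1) for d, n in domain_summary.items()}

# "Named" domains are the four business areas, i.e. everything but Person/Other.
named = {d: n for d, n in domain_summary.items() if d != "Person/Other"}
largest_named = max(named, key=named.get)
left_on_banner = round(100 * (total - named[largest_named]) / total, 1)

for domain, pct in shares.items():
    print(f"{domain:18s} {pct:5.1f}%")
print(f"After a {largest_named} star: {left_on_banner}% of reports still on Banner")
```

Even the largest named domain covers a little over a fifth of the catalog, which is where the "at least two thirds still running against Banner" figure comes from.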

See it — the diagram

The same evidence, table by table:

The Banner objects Waubonsee actually queries today — ranked by how many Argos reports touch them.

spriden leads the catalog at 203 reports, about one in every three DataBlocks. That is the universal-identity table; every Argos report that names a person reaches through it. pwvempl (employee view) is second. sfrstcr (course registration) is third at 74, quietly confirming that the registrar's office does run real Argos reports, against the folklore that says SSB and the Banner forms handle all of it. The next three are ftvorgn (organization), pebempl (employee detail), and nbbposn (position master).

Read this list as a punch list for the warehouse, then notice what it tells you about strategy: the top six tables span four different functional domains. No single first star covers half of them. Whichever star ships first will leave the majority of the catalog still running against Banner.

Show me the code

The catalog is one command:

python src/argos_ingest.py

It reads ../../argos_tool/ArgosDoc/ai_data/*.json and writes data/argos_catalog.json. The figures on this page read that catalog at build time, so re-running src/figures.py after the next Argos export refreshes both the bar chart and the domain breakdown without touching the text. The numbers above will move with the production export — they are a snapshot, not a constant.

The rollup carries four fields worth knowing:

report_count        - 670
unique_table_count  - 272
table_frequency[]   - every Banner object, with the list of report names that
                      touch it. This is the lookup the warehouse priority list
                      is built from.
domain_summary[]    - {Person/Other: 286, HR/Payroll: 146, Student: 124,
                      Finance/GL: 70, Position & Budget: 44}
                      a quick way to count what each new dimension or fact
                      would replace.
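A priority list can be derived from table_frequency in a few lines. The sketch below is hedged: the exact JSON shape of data/argos_catalog.json is not shown here, so the per-entry key names "table" and "reports" are assumptions, and the three-report excerpt is invented so the example runs on its own.

```python
import json

# Hypothetical excerpt shaped like the catalog. The key names inside each
# table_frequency entry ("table", "reports") are assumed, not confirmed.
sample_catalog = json.loads("""
{
  "report_count": 3,
  "table_frequency": [
    {"table": "nbbposn", "reports": ["Position Roster"]},
    {"table": "spriden", "reports": ["Position Roster", "Class List", "Employee Directory"]},
    {"table": "sfrstcr", "reports": ["Class List"]}
  ]
}
""")

def rank_tables(catalog: dict) -> list[tuple[str, int]]:
    """Return (table, distinct report count), busiest first, ties by name."""
    counts = [(entry["table"], len(set(entry["reports"])))
              for entry in catalog["table_frequency"]]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))

ranking = rank_tables(sample_catalog)
for table, n in ranking:
    share = n / sample_catalog["report_count"]
    print(f"{table:10s} {n:3d} reports ({share:.0%} of catalog)")
```

Pointed at the real file with json.load, the same function would reproduce the ranked list that the bar chart draws, provided the key names match.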

Where intuition fails

Four things in the catalog will surprise the warehouse builder.

There is no dominant first star by raw frequency. The four named domains each hold between roughly 7% and 22% of the catalog, and the rest falls into Person/Other, which is shared context rather than a business process. Whichever star ships first will leave at least two thirds of Argos still running against Banner. "Pick the busiest area" gives no clear winner here, so the first-star decision has to be made on other criteria. See Pick a Process — Why Position-Budget Is the First Star for what those criteria are and why Position & Budget wins them at Waubonsee.

**SPRIDEN is everywhere.** It appears in roughly one of every three reports. The corollary: any warehouse worth building has a strong Person/Identity dimension on day one. Skip that, and the first star feels disconnected from everything around it. PIDM — The Number Behind Every Person is the article that explains why SPRIDEN's key, the integer PIDM, is the universal join column behind every person in Banner.

The registrar runs more Argos than folklore claims. SFRSTCR (course registration) sits at #3 in the catalog with 74 reports. The conventional wisdom ("the registrar uses SSB; only HR has heavy Argos use") comes from a smaller HR-only export we worked with earlier. The production directory tells a more even story. That matters for sequencing: the second and third stars in Track G should probably cover registration well before the catalog runs out of pain.

The Argos catalog is one BI consumer, not the only one. It counts what runs in Argos. It cannot count what runs in Power BI, Tableau, or Excel Power Query workbooks, and Waubonsee already uses all three. The institution's RISE 2030 Strategic Plan anchors data-informed decision-making, and the partnership with Achieving the Dream (ATD) drove a Data Empowerment Workshop in which faculty learned to read Power BI dashboards disaggregated by race, gender, and age during annual program review. The warehouse must feed Power BI semantic models AND Argos DataBlocks: both consumers, both first-class. The 670-DataBlock catalog tells you what to build for; the cultural layer it cannot see tells you how the output will be consumed. See The Semantic Layer — Where Argos, Power BI, and Dashboards Sit for how the two consumers share one set of definitions.

The catalog is a priority signal, not an instruction. It tells you what the campus prints. The strategy of which fact to model first is its own decision, and the next article picks it up.

The one-sentence takeaway

Before you build a dimension, count the reports it would replace.

Track F · From Banner to a warehouse

Why a Warehouse? — OLTP, OLAP, and the Cost of Asking Banner the Wrong Question

7 min read · warehouse · oltp · olap · kimball · performance

The hook

Banner registers a student in milliseconds; that is its job. It posts grades, charges fees, drops courses, prints transcripts. Every one of those actions is a transaction: short, surgical, touching a handful of rows at a time. Ask it a different kind of question ("how did enrollment shift by program and credit load over five years?") and the same engine that serves the front counter will scan millions of rows, build hash tables in memory, and compete for the same CPU, I/O, and buffer cache the registrar's transactions are waiting on. One database cannot be optimal for both jobs. A data warehouse is the second database that handles the second job.

The everyday analogy

A woodworking shop builds furniture. Every tool is sharp, every bench is tight, every motion is optimized for the next strike — measure, cut, join, finish. The shop is noisy, sawdust on the floor, clamps within arm's reach. A cabinetmaker can build a chair in an afternoon because the space is arranged for making, not for looking. If you stop the shop to give a tour, the work stops with it. The saw goes quiet. Nothing gets built while the visitors browse.

Across the hall, behind a glass wall, is the exhibit floor. The same furniture — same wood, same finish, same craft — sits on low platforms under soft gallery lights. Each piece has a wall placard: the year, the wood, the maker, the story. The chairs are arranged so you can compare armrest shapes across decades. Nothing moves. No one is measuring or cutting. The only questions are "how does this one compare to that one?"

Both spaces use the same raw material. Both are necessary. But the shop cannot serve both jobs. You cannot run a table saw in a gallery. You cannot study a thirty-year retrospective of joint techniques from a workbench cluttered with clamps and half-finished armrests. Each space is optimized for its purpose — and the very optimization that makes one fast makes the other impossible.

The workshop makes furniture — fast, sharp, one piece at a time. The exhibit floor displays it — still, lit, built for browsing. Same wood, same craft, two different optimizations.

Banner is the workshop. Every table is normalized, every index is narrow, every transaction is designed to finish fast and release its locks. The data warehouse is the exhibit floor — a separate copy of the same data, reshaped for browsing, comparison, and the slow, wide questions the workshop was never meant to answer.

What it really is

OLTP — Online Transaction Processing — is the workshop. It is optimized for write efficiency: many small, concurrent transactions, each touching a handful of rows. Banner's schema is normalized to third normal form for a reason — normalization eliminates redundancy, which means an update writes to exactly one place. Narrow indexes speed up single-row lookups. Row-level locks are held for milliseconds and released. Every design choice serves the next transaction.

OLAP — Online Analytical Processing — is the exhibit floor. It is optimized for read efficiency: few large queries, each scanning millions of rows, grouping, aggregating, comparing. An analytic query does not care about updating one row; it cares about summarizing ten million of them. The optimizations are inverted: wide covering indexes, pre-computed aggregates, denormalized tables that repeat data to avoid joins at query time.

OLTP vs OLAP — every design choice that makes Banner fast at transactions makes it slow at analysis, and vice versa.

These two sets of optimizations are not just different; they are incompatible in the same database. The normalization that makes writes fast forces analytic queries to join eight tables just to count enrollments by term. The buffer cache and I/O bandwidth that keep transactions crisp become contention points when an analytic query scans the same table for thirty seconds (in Oracle the query takes no row locks, but it still competes for those shared resources). And the narrow indexes that serve single-row lookups are useless for a GROUP BY that spans five academic years.

The warehouse is loaded by an ETL process — extract, transform, load — that runs on a schedule, typically overnight. It reads from Banner during quiet hours, reshapes the data into dimensional form (the star schemas that later articles in this track cover), and writes the result into a separate database. The warehouse's data is recent, not real-time. Yesterday's transactions are there. Today's are not — and that is by design.

See it — the diagram

The ETL pipeline is the bridge between the two spaces. It is not a live mirror — it is a scheduled freight train. Every night it extracts the day's changes from Banner, transforms them into dimensional tables, and loads them into the warehouse. The next morning, analysts query yesterday's data with zero impact on the production system.

The ETL pipeline runs on a schedule — extract from Banner overnight, transform into dimensional form, load into the warehouse. The warehouse is recent, not real-time.

This nightly rhythm is the key architectural contract of a warehouse. The warehouse does not compete with Banner for resources because it never touches Banner during business hours. The trade — a one-day lag — buys complete isolation. The front counter never freezes because of a dean's dashboard.
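The shape of that nightly job can be sketched in a few dozen lines. Everything here is a simplification under stated assumptions: two in-memory SQLite databases stand in for Banner and the warehouse, the load is a full refresh rather than the incremental "day's changes" extract described above, the fact table carries plain codes instead of surrogate keys, and the students and programs are invented.

```python
import sqlite3

banner = sqlite3.connect(":memory:")      # stands in for the Banner database
warehouse = sqlite3.connect(":memory:")   # stands in for the warehouse

banner.executescript("""
CREATE TABLE sfrstcr (sfrstcr_pidm INTEGER, sfrstcr_term_code TEXT);
CREATE TABLE sgbstdn (sgbstdn_pidm INTEGER, sgbstdn_term_code_eff TEXT,
                      sgbstdn_program_1 TEXT);
INSERT INTO sfrstcr VALUES (1001, '202210'), (1001, '202410'), (1002, '202410');
INSERT INTO sgbstdn VALUES (1001, '202110', 'AS-NURS'), (1001, '202410', 'AS-HSCI'),
                           (1002, '202410', 'AA-GEN');
""")
warehouse.execute("CREATE TABLE fct_registration "
                  "(student_key INTEGER, term_code TEXT, program_code TEXT)")

EXTRACT = """
SELECT r.sfrstcr_pidm, r.sfrstcr_term_code, s.sgbstdn_program_1
FROM   sfrstcr r
JOIN   sgbstdn s
       ON  s.sgbstdn_pidm = r.sfrstcr_pidm
       AND s.sgbstdn_term_code_eff = (
           SELECT MAX(s2.sgbstdn_term_code_eff)
           FROM   sgbstdn s2
           WHERE  s2.sgbstdn_pidm = r.sfrstcr_pidm
             AND  s2.sgbstdn_term_code_eff <= r.sfrstcr_term_code)
"""

def nightly_load() -> int:
    """Extract from Banner, resolve the effective-dated program once, load."""
    rows = banner.execute(EXTRACT).fetchall()          # extract + transform
    with warehouse:                                    # one transaction
        warehouse.execute("DELETE FROM fct_registration")
        warehouse.executemany(
            "INSERT INTO fct_registration VALUES (?, ?, ?)", rows)
    return len(rows)

loaded = nightly_load()
headcount = warehouse.execute("""
    SELECT term_code, program_code, COUNT(DISTINCT student_key)
    FROM   fct_registration
    GROUP BY term_code, program_code
    ORDER BY term_code, program_code
""").fetchall()
print(loaded, headcount)
```

The design point is that the expensive effective-dating lookup runs once per night inside the load, so the analyst's morning query never has to repeat it and never touches Banner.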

Show me the code

Here is a real Banner analytic query — enrollment counts by term and program, across five years. This is the kind of question a dean asks every September:

-- Enrollment by term + program, last 5 years — against Banner OLTP.
-- Three tables, plus a correlated MAX subquery to land on the right
-- SGBSTDN row per (pidm, term). Oracle readers take no locks, but the scan
-- competes with live registration for I/O, CPU, and buffer cache for its
-- whole duration. The MAX subquery is the canonical Banner pain;
-- see [[B3_effective_max]] for why it has to be there.
SELECT t.stvterm_code,
       s.sgbstdn_program_1            AS program,
       COUNT(DISTINCT r.sfrstcr_pidm) AS headcount
FROM   sfrstcr r
JOIN   stvterm t  ON t.stvterm_code = r.sfrstcr_term_code
JOIN   sgbstdn s  ON s.sgbstdn_pidm = r.sfrstcr_pidm
               AND  s.sgbstdn_term_code_eff = (
                       SELECT MAX(s2.sgbstdn_term_code_eff)
                       FROM   sgbstdn s2
                       WHERE  s2.sgbstdn_pidm = r.sfrstcr_pidm
                         AND  s2.sgbstdn_term_code_eff <= r.sfrstcr_term_code)
WHERE  t.stvterm_start_date >= ADD_MONTHS(SYSDATE, -60)
  AND  r.sfrstcr_rsts_code IN ('RE', 'RW')
GROUP BY t.stvterm_code, s.sgbstdn_program_1
ORDER BY t.stvterm_code, s.sgbstdn_program_1;

Run it at 10am on a registration day and the front counter feels it. Oracle readers do not block writers, so the scan takes no locks on sfrstcr, but it competes with the registration system for CPU, I/O, and buffer cache, and it has to read undo to see a consistent snapshot of rows that change while it runs. The correlated subquery runs once per row of the outer scan; the optimizer can sometimes flatten it, but on Banner it often does not. The hash table for the GROUP BY takes memory on the same server that holds the transaction buffer cache. The database can do it, but it cannot do it and serve students at the same speed.

Now the warehouse version:

-- The same question — against the warehouse.
-- Two joins. No correlated subquery. No date arithmetic. No load on
-- Banner. Runs in seconds against a read-only copy of yesterday's data.
SELECT d.term_label,
       p.program_name,
       COUNT(DISTINCT f.student_key) AS headcount
FROM   fct_registration f
JOIN   dim_term    d ON d.term_key    = f.term_key
JOIN   dim_program p ON p.program_key = f.program_key
WHERE  d.term_start >= DATE '2021-01-01'
GROUP BY d.term_label, p.program_name
ORDER BY d.term_label, p.program_name;

The dimensional names — dim_term, dim_program, fct_registration — tell you what each table is before you read a single column. The query is shorter, the intent is clearer, and the only database it touches is the one built specifically for this kind of question.

Where intuition fails

Three intuitions that steer people wrong about warehouses:

"The warehouse is real-time." It is not. The warehouse runs on a schedule (overnight is typical), so its data is recent, not live. If you need up-to-the-second numbers (an open-registration dashboard showing seat counts as they fill), you do not need a warehouse. You need a caching layer or a read replica. The warehouse answers "how did we do this term." It does not answer "what is happening right now."

"We can just point Power BI at Banner." You can, for ten users. For a hundred, you cannot. A single Power BI dashboard can issue a dozen queries on refresh. A hundred open dashboards become a denial-of-service attack on the registrar's office. Worse, Power BI has no idea which tables are safe to scan: it will happily run a full table scan on SFRSTCR at 10am on the first day of registration. The warehouse absorbs that traffic harmlessly because it is a separate copy, on a separate server, with no live transactions to slow down.

"The warehouse replaces Banner." It does not. The warehouse reads from Banner. Banner remains the system of record, the single place where data is created and updated. The warehouse is a downstream copy, reshaped for a different kind of question. You still register students in Banner. You still post grades in Banner. The warehouse reads the result and makes it browsable. If the warehouse goes down, Banner keeps running. If Banner goes down, the warehouse is frozen at last night's state: useful for reports, useless for registering the next student.

The one-sentence takeaway

Banner is built to run the college. The warehouse is built to understand it.

Track F · From Banner to a warehouse

Facts, Dimensions, Measures — The Multidimensional View

Every report you have ever written follows the same hidden grammar: a number, sliced by context. You have been thinking in facts and dimensions your whole career. You just never called them that.

7 min read · warehouse · kimball · facts · dimensions · measures · grain

The hook

Every report you have ever written follows the same hidden grammar: a number, sliced by context. Headcount by department. Revenue by fund by fiscal year. GPA by program by term. You have been thinking in facts and dimensions your whole career — you just never called them that. The dimensional model does not invent a new way of asking questions. It names the pattern you already use, so the database can be shaped to match it.

The everyday analogy

Take a photograph. Press the shutter, and the camera records a single irreducible moment: light fell on the sensor at this instant, at this exposure, and here is what it captured. That photograph is a fact — a measurement taken at a point in time, under specific conditions, never to be repeated in exactly the same way.

The camera also writes a block of EXIF metadata into the file. The date and time the shutter fired. The GPS coordinates of where you stood. The camera body, the lens, the focal length, the aperture, the ISO. You did not ask for this metadata — the camera writes it automatically, and it stays glued to the image forever.

The photograph is the fact — the light that fell on the sensor. The EXIF tags around it are the dimensions — the when, where, how, and by whom.

Now ask a question about your photo library: "Show me every photo I took with the old Nikon during the summer of 2019." You just sliced your collection along two dimensions — camera and date — and the answer is the subset of facts that match. Ask "how many photos did I take in Italy last year?" and you sliced along location and year, then aggregated — COUNT, SUM, the simplest measures. You did not create new photographs. You rearranged the ones you have along the axes that matter for this question.

That is dimensional modeling. The photograph is the fact — the row of measurement. The EXIF tags are the dimensions — the context columns. The number of photographs is the measure. The question answers itself by filtering dimensions and aggregating facts. Every analytic question you will ever ask a warehouse fits this shape.

What it really is

A fact is a row in a fact table — a granular measurement captured at a specific moment in a specific context. In Banner terms: one position's budget snapshot for one month from one fund (the Position-Budget fact), one employee's payroll deduction for one pay period (the Payroll fact), one student's tuition charge for one term (the Student Finance fact). Facts are mostly numeric and mostly additive — you can sum budgeted dollars across organizations, months, and funds and the total is meaningful.

A dimension is a descriptive table that provides the context for facts: the who, what, when, where, and why. Each dimension has a primary key (the _key column) and a set of attributes. dim_organization has org_name, org_code, org_level, parent_org. dim_date has full_date, fiscal_year, fiscal_quarter, academic_term, is_holiday. Dimensions are wide and relatively short: a few thousand rows for a daily date dimension spanning a decade, a few thousand for organization. Facts are narrow and very long: millions of rows over time.

A measure is the value you aggregate: SUM(budgeted_amt), AVG(actual_amt), COUNT(DISTINCT position_key). Most measures are fully additive across all dimensions (dollars, hours, units). Some are semi-additive: headcount cannot be summed across dates because the same person appears in every month — you sum it across departments but take the last value across time. Averages are non-additive: you cannot average averages. An average is really a ratio of two additive measures, and the warehouse stores the numerator and denominator separately so you can recompute the ratio at any level of aggregation.
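The three kinds of additivity are easiest to see with numbers. A small worked example in plain Python, using four invented monthly rows for two departments (the department names and amounts are illustrative only):

```python
# Hypothetical monthly snapshot rows: (department, month, headcount, total_salary)
rows = [
    ("Biology", "2024-01", 10,  50000.0),
    ("Biology", "2024-02", 10,  50000.0),
    ("Nursing", "2024-01", 30, 210000.0),
    ("Nursing", "2024-02", 32, 224000.0),
]

# Fully additive: dollars can be summed across departments AND months.
total_salary = sum(r[3] for r in rows)

# Semi-additive: headcount sums across departments, but across time you
# take the last snapshot instead of adding months together.
last_month = max(r[1] for r in rows)
headcount_latest = sum(r[2] for r in rows if r[1] == last_month)
headcount_naive = sum(r[2] for r in rows)      # counts the same people twice

# Non-additive: average salary. Recompute it from numerator and denominator.
january = [r for r in rows if r[1] == "2024-01"]
avg_of_avgs = sum(r[3] / r[2] for r in january) / len(january)
ratio_of_sums = sum(r[3] for r in january) / sum(r[2] for r in january)

print(total_salary, headcount_latest, headcount_naive)
print(avg_of_avgs, ratio_of_sums)
```

Averaging the two department averages treats ten biologists and thirty nurses as equal weights; dividing total salary by total headcount gives the figure a dean would actually recognize.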

One fact row at the center — budgeted and actual dollars for a position in a given month. Every column that is not a measure is a foreign key to a dimension.

The fact row at the center of the diagram carries two kinds of columns: the foreign keys that point to dimensions (the context) and the numeric measures (the thing being measured). Every column in a fact table is either a dimension key or a measure. There is nothing else. That is the discipline.

See it — the diagram

Once your data is organized into facts and dimensions, every analytic question becomes a single operation: pick a measure, choose the dimensions to slice by, apply filters on those dimensions, and aggregate.

Slice and dice — the same facts, rearranged along three different dimensions. You do not create new facts; you pivot the ones you have.

You do not write a new query for each question. You pivot the same facts along different axes. "Budget by org" is the same fact table as "budget by fund by month" — the only difference is which dimension keys appear in the GROUP BY. The facts stay fixed. The dimensions rotate around them.

Show me the code

Here is the Position-Budget fact — the first star of Track G, and the example that will carry through every article in this track. A row means: position 10042, held by employee 38201, in the Academic Affairs org, against fund 11001, during January 2024, budgeted at $58,400 and actually costing $61,200.

-- One row = one position, one month, one fund.
-- Every column is either a dimension key or a measure.
CREATE TABLE fct_position_budget (
    position_key   INTEGER NOT NULL,  -- FK to dim_position
    employee_key   INTEGER NOT NULL,  -- FK to dim_employee
    org_key        INTEGER NOT NULL,  -- FK to dim_organization
    fund_key       INTEGER NOT NULL,  -- FK to dim_fund
    date_key       INTEGER NOT NULL,  -- FK to dim_date
    budgeted_amt   NUMERIC(12,2),     -- measure: additive
    actual_amt     NUMERIC(12,2)      -- measure: additive
);

Ask it a question — total budgeted dollars by organization for fiscal 2024:

-- Slice the Position-Budget fact by organization and fiscal year.
-- Two joins, a WHERE, a GROUP BY, and a SUM. That is the whole pattern.
SELECT o.org_name,
       SUM(f.budgeted_amt) AS total_budgeted
FROM   fct_position_budget f
JOIN   dim_organization o ON o.org_key = f.org_key
JOIN   dim_date        d ON d.date_key = f.date_key
WHERE  d.fiscal_year = 2024
GROUP BY o.org_name
ORDER BY total_budgeted DESC;

Compare this to the Banner query in Why a Warehouse? — OLTP, OLAP, and the Cost of Asking Banner the Wrong Question — three tables, a correlated MAX() subquery, ADD_MONTHS(SYSDATE, -60), and shared locks on live transaction tables. The warehouse version has two joins, no subquery, a literal date filter, and zero impact on Banner. The dimensional names tell you the shape of the answer before the query runs.

Where intuition fails

Four lessons that take most people a year of warehouse work to learn:

The grain is what you join on, not what you SELECT. Two facts with different grains cannot be joined naively. A fact at the position-month grain and a fact at the employee-pay-period grain live in different fact tables because their rows represent different events. Joining them produces a row that means neither. When two facts have different grains, you query them separately and combine the results in the reporting layer — not in the SQL.

An average is a ratio, not a measure. Never store an average in a fact table. Store the numerator and denominator as separate measures — SUM(hours) / COUNT(employees) works at any level of aggregation, but AVG(stored_avg) is mathematically wrong the moment you group by anything. The warehouse rule: additive measures only. Ratios are computed at query time.

"Headcount = 1" is a real measure. To count anything in a warehouse, you add a column that is literally the integer 1 — one per fact row. SELECT SUM(headcount) FROM fct_position_budget WHERE ... gives you the number of fact rows matching the filter. At this grain that is a count of position-fund-month rows, not of distinct positions; for distinct positions, use COUNT(DISTINCT position_key). It is the simplest measure in the entire dimensional toolkit, and the one beginners overlook because it feels too trivial to write down.

NULL foreign keys do not belong in a fact table. Every dimension key in a fact row must point to exactly one row in the dimension. If the source system allows a NULL — a position with no assigned employee, a charge with no fund — the dimension table needs a dedicated "Unknown" row (usually with key = -1 or 0) and the fact row points there. NULL keys break joins silently; an "Unknown" row makes the absence visible. See The Star Schema — One Fact, Many Dimensions, and the Grain for how the star shape enforces this at the schema level.
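The NULL-key lesson above can be sketched with two simplified fact tables, one that allows NULL keys and one that uses the sentinel row. This is an illustration under assumptions: the table names fct_with_nulls and fct_with_sentinel, the dollar amount for the vacant position, and the placeholder employee name are all hypothetical, and Python's sqlite3 module is used only to run the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (
        employee_key  INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL
    );
    -- Sentinel row: facts point here when the source has no employee.
    INSERT INTO dim_employee VALUES (-1, 'Unknown'), (38201, 'Employee 38201');

    -- Two simplified facts: one allows NULL keys, one uses the sentinel.
    CREATE TABLE fct_with_nulls    (employee_key INTEGER,          budgeted_amt REAL);
    CREATE TABLE fct_with_sentinel (employee_key INTEGER NOT NULL, budgeted_amt REAL);

    INSERT INTO fct_with_nulls    VALUES (38201, 58400), (NULL, 41000);
    INSERT INTO fct_with_sentinel VALUES (38201, 58400), (-1,   41000);
""")

def total_via_join(fact_table):
    """Sum budgeted dollars after an inner join to dim_employee."""
    # fact_table is one of the two fixed names above, never user input.
    sql = f"""
        SELECT SUM(f.budgeted_amt)
        FROM   {fact_table} f
        JOIN   dim_employee e ON e.employee_key = f.employee_key
    """
    return conn.execute(sql).fetchone()[0]

total_nulls = total_via_join("fct_with_nulls")        # vacant position silently dropped
total_sentinel = total_via_join("fct_with_sentinel")  # vacant position still counted
```

The first total comes back as 58400.0 and the second as 99400.0. Nothing errors in the NULL version; the vacant position's dollars simply vanish from the report, which is the silent failure the lesson warns about.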

The one-sentence takeaway

A fact is what you measure. A dimension is what you measure it by.

Track F · From Banner to a warehouse

The Star Schema — One Fact, Many Dimensions, and the Grain

A star schema is not a diagramming convention. It is a mechanical guarantee: every dimension is exactly one JOIN away from the fact. No exceptions, no shortcuts, no climbing branches.

7 min read · warehouse · kimball · star-schema · grain · surrogate-keys

The hook

A star schema is not a diagramming convention. It is not the shape you draw because it looks tidy on a whiteboard. It is a mechanical guarantee: every dimension is exactly one JOIN away from the fact. No exceptions, no shortcuts, no climbing branches. When an analyst writes SELECT ... FROM fact JOIN dim ON ..., the query planner does exactly one seek into exactly one dimension index per join. That one-hop guarantee is the reason warehouse queries are fast. It is also the reason the grain matters so much — because the grain decides which joins are even possible.

The everyday analogy

A bicycle wheel. The hub at the center is the dense thing that holds everything together — every spoke terminates there, every force passes through it. Spin the wheel and every point on the rim stays exactly one spoke-length from the center. That is a star schema: one fact table at the hub, dimension tables arrayed around the rim, and a single foreign-key column connecting each dimension directly to the fact.

When a cyclist rides over rough pavement, the hub absorbs forces from every direction at once and distributes them evenly to the rim through the spokes. The wheel works because no spoke ever passes through another spoke. Every spoke is a direct, unbroken line from hub to rim.

The hub is the fact — the dense center that holds everything. The spokes are the dimensions — one straight hop from hub to rim. The wheel turns because every point on the rim is one spoke away from the center.

When you query a star schema — "total budgeted dollars by organization for fiscal 2024" — you walk exactly two spokes: one from the fact to dim_organization, one from the fact to dim_date. You never walk dim_organization → dim_parent_org → dim_region in a chain, because that would be a second spoke attached to the end of the first one. That is a snowflake schema, and Kimball's advice is blunt: avoid it unless the dimension is both enormous and volatile. A star works because every dimension is flat — one table, one spoke, one hop.

What it really is

A star schema has exactly two kinds of tables:

The fact table at the center. Every column is either a foreign key to a dimension or a numeric measure. No descriptive text, no flags that say "this row is a correction," no varchar(200) notes from the source system. Keys and measures. That is the discipline. The fact table's primary key is usually a subset of its foreign keys — for fct_position_budget, the combination (position_key, fund_key, date_key) uniquely identifies every row.

The dimension tables on the spokes. Each dimension has a surrogate primary key — an integer _key column generated during ETL, not copied from Banner. The natural key from the source system (spriden_id, nbbposn_posn, stvterm_code) lives in the dimension as an attribute, but the surrogate key is what the fact references. The reason: natural keys change. A position might be re-coded; a fund number might be retired and reissued. The surrogate key stays stable — the fact row's position_key = 1042 keeps meaning "the position that was Director of IT during January 2024," even after the position code changes. And when a dimension attribute changes and you need to track the history, you create a new surrogate key for the new version — that is [[F4_slowly_changing]], the next article.

The grain is the one-sentence contract that governs the fact table. It was declared in [[G2_declare_the_grain]]: one row of fct_position_budget means exactly one position, in one fund, during one month. Every column you add to the fact must be consistent with that grain. If you want to add a measure that is per-employee-per-pay-period, it belongs in a different fact table — a different star. See [[G8_second_star]] for that discussion.

The star schema — five dimension tables on spokes around one central fact. Every analytic query walks from the hub through exactly one spoke; no multi-hop joins.

Every analytic query against this schema has the same shape: start at the hub, walk one spoke to filter a dimension, walk another spoke to group by another dimension, aggregate the measures. The query planner knows this shape and optimizes for it — dimension seeks are index lookups on tiny tables; the bulk of the work is a sequential scan of the fact.

See it — the diagram

Here is one row of fct_position_budget, pulled apart:

One row of fct_position_budget, dissected — five foreign keys give the context, two measures give the value. The grain: one (position, fund, month) snapshot.

Position 100123, held by employee 47281, reporting into INFO_OPS, paid from fund 11-A, snapshot for August 2025, budgeted at $8,420 and actually costing $8,420. The five foreign keys on the left are the dimensional context — the who, what, where, and when. The two numeric columns on the right are the measures — the how much. Every row in a star-schema fact table dissects the same way: context columns and measure columns, nothing else.

Show me the code

Here is the Position-Budget star as DDL — the dimension that carries the calendar, and the fact that carries the snapshot:

-- One dimension: the calendar. Surrogate key, natural date,
-- plus every analytic attribute the campus needs.
CREATE TABLE dim_date (
    date_key        INTEGER PRIMARY KEY,   -- surrogate: 20250801
    full_date       DATE NOT NULL,         -- natural: 2025-08-01
    fiscal_year     INTEGER NOT NULL,      -- 2025
    fiscal_quarter  VARCHAR(6) NOT NULL,   -- 'FY25Q1'
    academic_term   VARCHAR(20),           -- 'Fall 2025'
    calendar_month  VARCHAR(10) NOT NULL,  -- 'August'
    is_holiday      BOOLEAN DEFAULT FALSE
);

-- The fact: five foreign keys, two measures, one composite PK.
-- Every FK references exactly one dimension row.
CREATE TABLE fct_position_budget (
    position_key   INTEGER NOT NULL REFERENCES dim_position(position_key),
    employee_key   INTEGER NOT NULL REFERENCES dim_employee(employee_key),
    org_key        INTEGER NOT NULL REFERENCES dim_organization(org_key),
    fund_key       INTEGER NOT NULL REFERENCES dim_fund(fund_key),
    date_key       INTEGER NOT NULL REFERENCES dim_date(date_key),
    budgeted_amt   NUMERIC(12,2) NOT NULL,
    actual_amt     NUMERIC(12,2) NOT NULL,
    PRIMARY KEY (position_key, fund_key, date_key)
);

The REFERENCES constraints are not decorative — they guarantee that no fact row points to a missing dimension row. The ETL must load dimensions before facts, and the foreign keys enforce that order at the schema level.
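A minimal sketch of that load-order guarantee, run in SQLite through Python's sqlite3 module. The tables are cut down to the columns needed for the demonstration, the helper load_fact is hypothetical, and the PRAGMA is SQLite-specific: SQLite only enforces foreign keys when asked, whereas other databases typically enforce declared constraints by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: FK checks are off by default
conn.executescript("""
    CREATE TABLE dim_date (
        date_key    INTEGER PRIMARY KEY,
        fiscal_year INTEGER NOT NULL
    );
    CREATE TABLE fct_position_budget (
        position_key INTEGER NOT NULL,
        date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
        budgeted_amt NUMERIC NOT NULL
    );
""")

def load_fact(date_key):
    """Try to insert one fact row; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fct_position_budget VALUES (?, ?, ?)",
                (100123, date_key, 8420),
            )
        return True
    except sqlite3.IntegrityError:
        return False

fact_before_dim = load_fact(20250801)  # dimension row missing: rejected
with conn:
    conn.execute("INSERT INTO dim_date VALUES (20250801, 2025)")
fact_after_dim = load_fact(20250801)   # dimension loaded first: accepted
```

The first attempt returns False and leaves the fact table empty; the second returns True. An ETL that loads facts ahead of dimensions fails loudly at the constraint instead of producing orphaned rows.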

A query against this star — total budgeted dollars by organization for fiscal 2024, the same question F2 asked:

-- Walk two spokes: dim_organization for the GROUP BY,
-- dim_date for the WHERE filter. Two joins, one SUM.
SELECT o.org_name,
       SUM(f.budgeted_amt) AS total_budgeted
FROM   fct_position_budget f
JOIN   dim_organization o ON o.org_key = f.org_key
JOIN   dim_date        d ON d.date_key = f.date_key
WHERE  d.fiscal_year = 2024
GROUP BY o.org_name
ORDER BY total_budgeted DESC;

Compare the join count here — two — to the Banner query in [[F1_why_a_warehouse]]. Three tables, a correlated MAX() subquery, and shared locks on live transaction rows. The star query has two joins, both against tiny dimension tables indexed on their primary keys. The only table that gets scanned is the fact — and the fact is columnar-friendly, read-only, and never touched by a live transaction.

Where intuition fails

Four lessons the star teaches — usually the hard way:

"Bigger fact = better star." No. A larger fact table is slower to load, slower to scan, and harder to back up. The smallest fact that answers the question is the best fact. If your fact has columns that are only populated for 10% of rows, those columns belong in a separate fact table. The star's strength is its density — every column in every row matters.

"Snowflake everywhere — dimensions should be normalized too." A snowflake schema chains dimensions together: dim_organization has a parent_org_key that points to another row in dim_organization, or worse, a separate dim_org_hierarchy bridge table. Kimball's rule: flatten. Dimensions are small — a few hundred to a few thousand rows. The storage saved by normalizing them is measured in kilobytes. The query complexity added by snowflaking is measured in confused analysts and missed deadlines. Snowflake only when a dimension is both huge (millions of rows) and volatile — and in higher ed, that almost never happens.

"Two facts, one star — I'll just UNION them." Two facts with different grains belong in two different fact tables. A position-month-budget fact and an employee-pay-period-payroll fact live at different grains — their rows represent different events. Combining them at query time is fine, provided each fact is first aggregated to a shared level of a conformed dimension (that is what conformed dimensions are for — see [[G8_second_star]]). Shoehorning both into one table produces rows that mean neither.

"Dimensions in the WHERE, measures in the SELECT." This is the terse heuristic that tells you whether your fact-dimension split is correct. You should filter on dimension attributes (WHERE d.fiscal_year = 2024), not on measure values (WHERE budgeted_amt > 10000 — that is a different kind of question, valid but secondary). You should aggregate measures (SUM, COUNT, AVG), not dimension keys (SUM(position_key) is nonsense). If you find yourself filtering on a measure regularly, the measure probably needs a companion dimension. If you find yourself summing a key, something is structurally wrong.
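The "two facts" lesson above is easier to trust once the numbers are visible. The sketch below is an illustration, not the G8 design: fct_payroll, paid_amt, and all amounts are hypothetical, both facts are reduced to a single org_key, and SQLite (via Python's sqlite3 module) stands in for the warehouse. Each fact is aggregated to the shared organization level first, and only the two aggregates are joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_organization (org_key INTEGER PRIMARY KEY, org_name TEXT);
    INSERT INTO dim_organization VALUES (1, 'Academic Affairs'), (2, 'Facilities');

    -- Grain: one position, one fund, one month (simplified to org_key only).
    CREATE TABLE fct_position_budget (org_key INTEGER, budgeted_amt REAL);
    INSERT INTO fct_position_budget VALUES (1, 5000), (1, 4000), (2, 3000);

    -- Grain: one employee, one pay period (hypothetical second star).
    CREATE TABLE fct_payroll (org_key INTEGER, paid_amt REAL);
    INSERT INTO fct_payroll VALUES
        (1, 2400), (1, 2400), (1, 2100), (1, 2100), (2, 1500), (2, 1500);
""")

# Aggregate each fact to the shared org level, then join the two results.
drill_across = conn.execute("""
    WITH budget AS (
        SELECT org_key, SUM(budgeted_amt) AS total_budgeted
        FROM   fct_position_budget GROUP BY org_key
    ),
    payroll AS (
        SELECT org_key, SUM(paid_amt) AS total_paid
        FROM   fct_payroll GROUP BY org_key
    )
    SELECT o.org_name, b.total_budgeted, p.total_paid
    FROM   dim_organization o
    JOIN   budget  b ON b.org_key = o.org_key
    JOIN   payroll p ON p.org_key = o.org_key
    ORDER BY o.org_name
""").fetchall()

# For contrast: joining raw fact rows repeats each budget row once per payroll row.
naive_budget_total = conn.execute("""
    SELECT SUM(b.budgeted_amt)
    FROM   fct_position_budget b
    JOIN   fct_payroll p ON p.org_key = b.org_key
""").fetchone()[0]
```

The aggregated version reports 9000 budgeted for Academic Affairs and 3000 for Facilities, 12000 in total. The naive row-level join reports 42000 for the same budget, because every budget row is multiplied by the number of payroll rows in its organization.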

The one-sentence takeaway

The grain is a one-sentence contract. The star is the shape that enforces it.

Track F · From Banner to a warehouse

Slowly Changing Dimensions — Keeping History When Attributes Change

A dimension says what something is. But things change. If you overwrite the old value, you rewrite history. If you keep every version, you need a way to tell them apart. The three choices are the difference between a warehouse you trust and one you quietly stop using.

10 min read · warehouse · kimball · scd · scd-type-2 · surrogate-keys · history

The hook

A dimension says what something is. Position 100123 is titled "Director of Information Technology," sits in the Academic Affairs org, and reports to the VP of Academic Affairs. But things change. HR re-titles the position. The university reorganizes and the position now reports to the Provost. If you overwrite the old values, you rewrite history — and last year's budget report now shows an org chart that did not exist last year. If you keep every version, you need a way to tell them apart when a fact row joins to the dimension. The three choices — overwrite, version, or annotate — are the difference between a warehouse you trust and one you quietly stop using.

The everyday analogy

A house has one stable identity: a parcel number, a street address, a legal description. But ownership of that house changes. When it sells, the county recorder does not erase the prior owner's name from the records. The recorder adds a new deed — dated, signed, witnessed — to the chain. The old deed stays in the book. A title attorney researching the property walks the chain backwards: who owns it today, who owned it before that, who owned it before that, all the way to the original land grant.

A county recorder's title-chain book: the same parcel, three dated deeds stacked chronologically. Only the top deed is active; the older two bear a SUPERSEDED stamp. Identity is stable; ownership is versioned.

That backwards-walkable chain is exactly SCD Type 2. The parcel number is the natural key (position_code). Each deed is a version — one row in dim_position. Each deed has a date it took effect (effective_start_date) and a date the next deed superseded it (effective_end_date). Exactly one deed is current (current_flag = TRUE). To answer "who owned this house in 1997?" a title attorney finds the deed whose effective range covers 1997. To answer "what was position 100123's title in August 2024?" a warehouse query finds the dim_position row whose effective range covers August 2024. The operation is identical.

The analogy has two other branches, and they map to the other two SCD types:

SCD Type 1 = forge the deed. Cross out the prior owner's name and write the new one on top. One document instead of two — cheaper to file, faster to read. But the history is gone. The prior owner's name is illegible under the scribble. Type 1 is appropriate for corrections (the clerk typed "Smyth" instead of "Smith" — fix the spelling error, nobody needs the typo preserved) but catastrophic for changes (the property actually sold — that needs a new deed). Confusing a correction with a change is the most common Type-1 mistake, and it is irreversible.

SCD Type 3 = add a "previous owner" line to the current deed. The deed reads: "Currently owned by Smith; previously owned by Jones." You can answer "who owned it before Smith?" but you cannot answer "who owned it before Jones?" Only the last transition is preserved. Type 3 is rare in practice — useful when exactly one prior value matters (a department that was renamed once in a reorganization, a fund code that changed in a one-time merger) and a full chain of versions would be overkill.

What it really is

A slowly changing dimension is a dimension table where descriptive attributes — title, category, status, parent org — can change over time, and the warehouse must decide how to record each change. The three choices come from Ralph Kimball and are numbered by the order he wrote them down, not by any hierarchy of quality:

Type 1 — Overwrite. The dimension row is updated in place. UPDATE dim_position SET position_title = 'Director of Digital Transformation' WHERE position_code = '100123'. One row, one key, the new value. Every fact row that points to that position — past, present, future — now sees the new title. History is lost. Use Type 1 only for attributes that nobody reports on: typo fixes, internal codes, a misspelled name. Never for anything that appears in a report whose historical version might ever be re-run.

Type 2 — Add a new row. The old row is retired (effective_end_date set, current_flag set to FALSE). A new row is inserted with a new surrogate position_key, the new attribute values, a new effective_start_date, and current_flag = TRUE. Facts from before the change still carry the old surrogate key and report the old title. Facts from after the change carry the new surrogate key and report the new title. Type 2 is the default for any attribute where history matters — and at a college, history almost always matters: fiscal-year reporting, multi-year cohort analysis, audit trails, accreditation evidence.

Type 3 — Add a column. The dimension row stays one row. Extra columns (previous_title, previous_org) hold the immediately prior value. An UPDATE sets previous_title = position_title and then position_title = new_value. Only one step back is preserved. Type 3 is a niche tool, but when the business only ever asks "what did this used to be?" and never "what was it three changes ago?", it avoids the row multiplication of Type 2.

The same source change — position 100123 gets a new title — handled three different ways. Type 1 overwrites and destroys history. Type 2 adds a row and preserves it. Type 3 adds a column and remembers only the last change.

The surrogate key is the piece that makes Type 2 possible. Without a surrogate, the natural key (position_code = '100123') is identical across all versions of the position, and a fact row has no way to specify which version it belongs to. The surrogate position_key is a pure warehouse-generated integer — 1001 for the 2012 version, 1042 for the 2019 version, 1087 for the 2024 version — and it is what the fact table stores. The natural key stays on the dimension for lookups from the source system. The surrogate key is the join column. See The Star Schema — One Fact, Many Dimensions, and the Grain for why surrogate keys are a star-schema requirement, not a style preference.

See it — the diagram

One position, three titles over twelve years, three rows in the dimension. The facts do not multiply — each fact row stores the surrogate key that was current when that fact was recorded, and the join resolves to exactly one version.

One position over time as three rows in dim_position. Each fact row points to the version that was current on its date_key. The surrogate position_key is the load-bearing piece: without it, a fact cannot tell September's row from August's.

A fact row for August 2024 carries position_key = 1042 and joins to "Director of Information Technology." A fact row for October 2024 carries position_key = 1087 and joins to "Director of Digital Transformation." The query does not need a BETWEEN on dates or a MAX() subquery — the surrogate key encodes the "as-of" relationship directly. This is the payoff for the extra rows: joins stay simple, history stays intact, and the SQL that Banner users already know works unchanged.

Show me the code

Position 100123 is titled "Director of Information Technology" through August 2024. In September, HR changes the title to "Director of Digital Transformation." Here is what each SCD type does to the dimension.

Type 1 — one UPDATE, history destroyed:

UPDATE dim_position
SET    position_title = 'Director of Digital Transformation'
WHERE  position_code = '100123';

One row. One key. Every fact — January through December — now reports the new title. The August budget report, re-run in October, silently changes its label.

Type 2 — retire, then insert, history preserved:

-- Step 1: retire the current version.
UPDATE dim_position
SET    current_flag = FALSE,
       effective_end_date = DATE '2024-08-31'
WHERE  position_code = '100123' AND current_flag = TRUE;

-- Step 2: insert the new version with a fresh surrogate key.
INSERT INTO dim_position (position_key, position_code, position_title,
                          effective_start_date, effective_end_date,
                          current_flag)
VALUES (nextval('dim_position_key_seq'), '100123',
        'Director of Digital Transformation',
        DATE '2024-09-01', NULL, TRUE);

Two rows now. Two different surrogate keys. Facts before September point to the old key; facts from September onward point to the new key. Re-run the August report in October, and the title still says "Director of Information Technology." The history is locked.

Type 3 — one UPDATE with a previous_title column:

UPDATE dim_position
SET    previous_title = position_title,
       position_title = 'Director of Digital Transformation'
WHERE  position_code = '100123';

One row. The old title survives in previous_title. The title before that — if there was one — is gone.

The proof that Type 2 was worth it is the join. When every fact row carries the surrogate key that was current at its date_key, the dimension resolves the correct historical title with a plain equi-join:

-- Budgeted dollars by position title — as the title was AT THE TIME.
SELECT p.position_title,
       SUM(f.budgeted_amt) AS total_budgeted
FROM   fct_position_budget f
JOIN   dim_position p ON p.position_key = f.position_key
WHERE  f.date_key BETWEEN 20240101 AND 20241231
GROUP BY p.position_title;

No BETWEEN on effective dates in the join. No correlated subquery. No window function to find the "max effective date before the fact date." The surrogate key does that work at ETL time — once, correctly — and every query afterward is a simple join. Compare this to the Banner pattern in The MAX() Subquery — Getting the Row That's Current, where every query must compute the "current as of" version at read time. The warehouse shifts that cost from query time to load time, where it belongs.
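For readers who want to run the Type 2 pattern end to end, here is a hedged sketch in SQLite through Python's sqlite3 module. It differs from the snippets above in labeled ways: AUTOINCREMENT stands in for the nextval sequence, current_flag is stored as 0/1, the 2019-01-01 start date is an assumed placeholder, and the helper change_title is hypothetical. The new surrogate key is whatever the database generates, so it will not match the 1087 used in the narrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_position (
        position_key         INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for a sequence
        position_code        TEXT    NOT NULL,
        position_title       TEXT    NOT NULL,
        effective_start_date TEXT    NOT NULL,
        effective_end_date   TEXT,
        current_flag         INTEGER NOT NULL                    -- 1 = current, 0 = retired
    );
    CREATE TABLE fct_position_budget (
        position_key INTEGER NOT NULL,
        date_key     INTEGER NOT NULL,
        budgeted_amt REAL    NOT NULL
    );
    INSERT INTO dim_position VALUES
        (1042, '100123', 'Director of Information Technology', '2019-01-01', NULL, 1);
    INSERT INTO fct_position_budget VALUES (1042, 20240801, 8420);
""")

def change_title(code, new_title, start_date, end_date):
    """Retire the current version and insert a new one in a single transaction."""
    with conn:
        conn.execute(
            "UPDATE dim_position SET current_flag = 0, effective_end_date = ? "
            "WHERE position_code = ? AND current_flag = 1",
            (end_date, code),
        )
        cur = conn.execute(
            "INSERT INTO dim_position (position_code, position_title, "
            "effective_start_date, effective_end_date, current_flag) "
            "VALUES (?, ?, ?, NULL, 1)",
            (code, new_title, start_date),
        )
    return cur.lastrowid  # the fresh surrogate key

new_key = change_title("100123", "Director of Digital Transformation",
                       "2024-09-01", "2024-08-31")

# A fact loaded after the change carries the new surrogate key.
with conn:
    conn.execute("INSERT INTO fct_position_budget VALUES (?, 20241001, 8420)", (new_key,))

# Plain equi-join: each fact resolves to the title that was current at its date.
titles_by_month = conn.execute("""
    SELECT f.date_key, p.position_title
    FROM   fct_position_budget f
    JOIN   dim_position p ON p.position_key = f.position_key
    ORDER BY f.date_key
""").fetchall()
```

The August fact still joins to "Director of Information Technology" and the October fact joins to "Director of Digital Transformation", with no date logic in the query itself.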

Where intuition fails

Five lessons that cost most teams at least one painful rebuild:

Pick the type per-column, not per-table. A single dimension can mix all three types. position_title gets Type 2 because fiscal-year reports depend on it. position_code gets Type 1 because correcting a Banner data-entry typo should not create a new surrogate key. reports_to might get Type 3, with a previous_reports_to column, if leadership only ever asks "who did this position report to before the reorg?" and a full chain is wasted complexity. Most teams pick Type 2 as the default and downgrade individual columns when there is a specific, documented reason.

Type 1 on reported attributes is silent revisionism. If you overwrite an org name and someone re-runs last year's accreditation report, the numbers stay the same but the labels change — and nobody can tell. The report looks correct. It is not correct. Type 1 + historical reports = a warehouse that lies about the past, and the lies are invisible because the old values are gone. Reserve Type 1 for attributes that nobody ever filters or groups by: internal codes, typo corrections, comment fields.

Late-arriving source changes break naive Type 2. HR marks a title change "effective September 1" but does not enter it into Banner until September 12. A naive ETL that reads effective_start_date from the source would insert the new Type 2 row with effective_start_date = 2024-09-01, but the old row was still flagged current (effective_end_date IS NULL, current_flag = TRUE) — and now two versions overlap on September 1–11. The partial unique index on current_flag (see Build the Position Dimension — SCD Type 2 and the Discipline of History) refuses the write, which is the good failure mode. The pragmatic fix is to detect nbbposn_activity_date < CURRENT_DATE, log it, and either back-date the old row's effective_end_date or accept "we learned about the change today" as the policy. See G4's watch-out section for what Waubonsee actually does.

Type 2 inflates the dimension, not the facts. A position with five title changes across ten years contributes five rows to dim_position — not 5x the fact rows. The fact table stores the surrogate key that was current at the time; it does not store all versions. Beginners see the row count grow in dim_position and panic, reaching for Type 1 to "keep things small." This instinct is wrong. A dimension with a few thousand entity rows handles decades of SCD versions trivially. A fact table with 10 million rows stays at 10 million rows regardless of how many dimension versions exist.

Reusing a natural key for a different entity corrupts the chain. If Banner recycles NBBPOSN_POSN = '100123' for a brand-new, unrelated position after the original is abolished, Type 2 will append the new entity to the same chain — and now position 100123 appears to have transformed from an IT Director into a Biology Lab Coordinator. The title chain makes no sense, and every historical report that joins to dim_position silently mixes two different positions. Catch this with HR before you build the dimension: agree on a recoding policy so natural keys are never reused for unrelated entities. Banner's own NBBPOSN table is not guaranteed to do this for you — the policy is organizational, not technical.
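The late-arriving-change lesson mentions a partial unique index on current_flag that refuses an overlapping write. The real definition lives in the G4 article; what follows is only a guess at its general shape, run in SQLite through Python's sqlite3 module. The index name ux_dim_position_current and the helper insert_version are hypothetical, and current_flag is stored as 0/1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_position (
        position_key   INTEGER PRIMARY KEY,
        position_code  TEXT    NOT NULL,
        position_title TEXT    NOT NULL,
        current_flag   INTEGER NOT NULL   -- 1 = current, 0 = retired
    );
    -- At most one current row per natural key; retired rows are unconstrained.
    CREATE UNIQUE INDEX ux_dim_position_current
        ON dim_position (position_code) WHERE current_flag = 1;
    INSERT INTO dim_position VALUES
        (1042, '100123', 'Director of Information Technology', 1);
""")

def insert_version(key, code, title):
    """Insert a new current version; return False if the index refuses it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO dim_position VALUES (?, ?, ?, 1)", (key, code, title)
            )
        return True
    except sqlite3.IntegrityError:
        return False

# A second current row without retiring the first is refused.
accepted_without_retire = insert_version(
    1087, "100123", "Director of Digital Transformation")

# Retire the old version, then the same insert succeeds.
with conn:
    conn.execute("UPDATE dim_position SET current_flag = 0 WHERE position_key = 1042")
accepted_after_retire = insert_version(
    1087, "100123", "Director of Digital Transformation")
```

The first insert returns False and the second returns True. The index cannot decide which effective dates are right, but it does turn an overlapping pair of "current" rows from a silent data problem into a failed write that someone has to look at.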

The one-sentence takeaway

Type 1 overwrites the past. Type 2 versions it. Type 3 keeps one ghost of it. Pick consciously, because the choice is permanent.

Track F · From Banner to a warehouse

ETL from Banner — Moving Data on a Schedule, with Windmill

9 min read · warehouse · etl · windmill · schedule · idempotency · watermarks

The hook

A warehouse that is not fed fresh data every night is not a warehouse. It is a museum — the exhibits are accurate, frozen in time, and increasingly irrelevant with each passing day. The difference between the two is a scheduled, repeatable, monitored ETL pipeline, and that pipeline is the only part of the system Banner users ever actually feel. When the budget report runs Monday morning and the numbers look stale, nobody opens dim_position to check the SCD logic. They ask: did the load run last night? ETL is the part of the warehouse that touches production Banner, the part that can fail silently, and the part that, when it works, nobody notices. That invisibility is the highest compliment an ETL can receive.

The everyday analogy

Every night at 02:00, a freight train departs its home terminal on a fixed route. Its schedule is published and immutable — the railroad does not decide each evening whether to run; it always runs. The train stops at a series of Banner stations: NBBPOSN, NBRPLBD, SPRIDEN, FTVORGN, GOVSDAV. At each station it loads the cargo that has accumulated since its last visit — new positions created, titles changed, budget lines frozen, employees hired. It does not load the whole station. It loads only what is new.

A freight train at a small industrial station at 02:00, loading cargo under amber sodium lamps for the warehouse terminal. The schedule noticeboard is visible on the platform. Scheduled, repeatable, industrial-scale data movement.

At each station the train consults a small logbook bolted to the platform: the watermark — a timestamp recording when the train last stopped here successfully. "Last pickup from NBBPOSN: 2024-09-14 02:00," the logbook says. The train only loads rows with an activity_date later than that timestamp. When the loading is complete and verified, the conductor updates the logbook with the new watermark and the train moves to the next station.

Between the stations and the warehouse terminal sits a staging yard — a set of holding tracks where the cargo is inspected, reformatted into warehouse-shape, and only released to the terminal once every car is clean. If a single car fails inspection — a row with a fund code that does not exist, a position that references a deleted org — the whole consist holds at staging. The train retries the load (the retry policy) and either recovers automatically or pages the conductor (the monitoring).

The train is idempotent: if last night's run failed halfway through and the dispatcher orders a rerun, the warehouse already knows which cargo it received. Running the same load twice does not duplicate the shipment — the warehouse terminal checks for duplicates and replaces or skips them. Running three times is the same as running once correctly. Without idempotency, every retry is a cleanup emergency and every operator is afraid of the "run" button.

The pieces map cleanly: the freight train is the ETL flow, a Windmill script. The schedule is the crontab expression that fires the flow. The stations are the source Banner tables. The cargo is the rows being moved — facts and dimension changes. The staging yard is the staging schema where transforms run before the warehouse is touched. The watermark is the etl_watermark table — one row per source, recording where the last successful load stopped. The conductor is the on-call alert.

What it really is

ETL is three operations chained together and wrapped in a schedule:

Extract. A read-only SELECT against Banner's Oracle database. The query is parameterized by a watermark — the highest activity_date or audit column value seen in the last successful run. Banner tables carry _activity_date columns (NBBPOSN_ACTIVITY_DATE, SPRIDEN_ACTIVITY_DATE) that Banner's own forms update whenever a row changes; these are the natural anchor points for the extract watermark. The extract never writes to Banner. Banner does not know it is being read.

Transform. The raw Banner rows are reshaped into dimensional form inside the staging schema. For each source row, the transform resolves surrogate keys — it looks up the existing dim_position to find the position_key for the current version, or if the attributes changed, executes the SCD Type 2 retire-and-insert pattern from Slowly Changing Dimensions — Keeping History When Attributes Change. NULL foreign keys in the source are mapped to the dimension's sentinel "Unknown" row (key = -1). The transform is pure SQL, run inside the warehouse — no external tools, no file transfers, no Python required.

Load. The transformed rows are inserted into the fact table or upserted into dimensions. For facts, the load uses an UPSERT (INSERT ... ON CONFLICT) keyed on the composite natural grain — (position_key, fund_key, date_key) for Position-Budget. For dimensions, Type 2 changes create new rows; Type 1 corrections update in place. The entire load for a batch runs inside a single database transaction. If any row fails, the transaction rolls back and the warehouse is unchanged.

The ETL pipeline in three stages: Extract from Banner with a watermark, Transform in staging (resolve surrogate keys, handle SCD Type 2), Load to warehouse with an UPSERT. On success, the watermark advances; on failure, it does not.

Schedule. The whole pipeline fires on a crontab. Most ETL runs nightly between 02:00 and 04:00 — Banner's quietest window, and early enough that the warehouse is ready when the first Argos report fires at 06:30. Some loads run more often: during open enrollment, registration fact loads might fire every hour. The schedule lives in Windmill, not in Banner — Banner has no awareness that it is being extracted. Windmill also holds the Banner connection credentials, stored as secrets (see the Variables vs Secrets article in the WindmillExplainer wiki for the pattern), so no password appears in the ETL code.

Watermark. After a successful load, one row in etl_watermark is updated:

UPDATE etl_watermark
SET    last_loaded_at = :max_activity_date_in_batch
WHERE  source_table = 'NBBPOSN';

If the load fails, the transaction does not commit, the watermark does not advance, and the next run reprocesses the same window. No data is lost. The watermark is the ETL's memory.

Idempotency. The UPSERT pattern on facts, combined with the transaction wrapper, guarantees that running the same load twice produces the same warehouse state. A retry is safe. An emergency re-run is safe. The operator does not need to manually delete rows before retrying. The ETL earns its own trust.

See it — the diagram

A week of nightly runs tells the whole story in one picture.

A seven-day timeline of nightly runs. Six successes, one failure and retry. The watermark only advances on success — the failed night's window is reprocessed by the retry, and zero rows are lost.

Monday through Thursday, the train runs at 02:00, extracts new rows from each station, transforms them in staging, loads them into the warehouse, advances the watermark, and completes. Friday night the GOVSDAV extract fails because the Banner session pool was exhausted, and the transaction rolls back. The watermark does not advance. Saturday night's run picks up Friday's unprocessed window plus Saturday's new rows, processes them together, and succeeds. Zero rows are lost, zero rows are duplicated. The operator was paged Friday night, but the next scheduled run recovered the missed window without manual cleanup. The warehouse opened Monday morning complete.

Show me the code

Here is the Position-Budget ETL — the same through-line as F2, F3, and F4. A row means one position, one fund, one month of budget and actual dollars.

Step 1 — Extract from Banner. Pull NBBPOSN rows changed since the watermark. Banner's NBBPOSN_ACTIVITY_DATE is the anchor:

-- Extract: pull position changes since last successful run.
SELECT nbbposn_posn, nbbposn_title, nbbposn_pcls_code,
       nbbposn_status, nbbposn_activity_date
FROM   banner.nbbposn
WHERE  nbbposn_activity_date > :watermark_position
ORDER BY nbbposn_activity_date;

Step 2 — Transform. For each extracted row, resolve the surrogate position_key. If the title, status, or class code changed from the current dim_position version, execute the Type 2 retire-and-insert from Slowly Changing Dimensions — Keeping History When Attributes Change. If nothing changed, reuse the existing key. This is the step that makes the warehouse query simple.
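Step 2 has no SQL of its own above, so here is a minimal sketch of the key-resolution logic it describes. It runs against an in-memory SQLite database through Python's sqlite3 module purely so it can be executed anywhere; the dim_position columns (posn, valid_from, valid_to, is_current) and the function names are assumptions made for illustration, not the layout from the Slowly Changing Dimensions article.

```python
import sqlite3

UNKNOWN_KEY = -1  # sentinel "Unknown" dimension key described in the text


def make_demo_db():
    """Create a hypothetical dim_position table (column names are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE dim_position (
            position_key INTEGER PRIMARY KEY,
            posn         TEXT,
            title        TEXT,
            status       TEXT,
            pcls_code    TEXT,
            valid_from   TEXT,
            valid_to     TEXT,
            is_current   INTEGER)""")
    return conn


def resolve_position_key(conn, posn, title, status, pcls_code, as_of):
    """Return the surrogate key for one extracted row, versioning if needed."""
    if posn is None:
        return UNKNOWN_KEY                  # NULL natural key maps to the sentinel
    row = conn.execute(
        "SELECT position_key, title, status, pcls_code FROM dim_position "
        "WHERE posn = ? AND is_current = 1", (posn,)).fetchone()
    if row is not None and row[1:] == (title, status, pcls_code):
        return row[0]                       # nothing changed: reuse the key
    if row is not None:                     # attributes changed: retire old version
        conn.execute(
            "UPDATE dim_position SET is_current = 0, valid_to = ? "
            "WHERE position_key = ?", (as_of, row[0]))
    cur = conn.execute(
        "INSERT INTO dim_position (posn, title, status, pcls_code, "
        "valid_from, valid_to, is_current) VALUES (?, ?, ?, ?, ?, NULL, 1)",
        (posn, title, status, pcls_code, as_of))
    return cur.lastrowid                    # key of the new current version
```

Calling resolve_position_key twice with identical attributes returns the same key; calling it with a changed title retires the old row and returns a new key, which is the retire-and-insert behavior the step describes.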

Step 3 — Load the fact with an UPSERT. The natural grain is (position_key, fund_key, date_key). An UPSERT guarantees that reprocessing the same window does not duplicate rows:

-- Load: UPSERT into the fact — idempotent by design.
INSERT INTO fct_position_budget (position_key, fund_key, date_key,
                                  budgeted_amt, actual_amt)
VALUES (:pk, :fk, :dk, :budget, :actual)
ON CONFLICT (position_key, fund_key, date_key) DO UPDATE
SET budgeted_amt = EXCLUDED.budgeted_amt,
    actual_amt   = EXCLUDED.actual_amt;

Step 4 — Advance the watermark. Only runs if the transaction commits:

UPDATE etl_watermark
SET    last_loaded_at = :max_activity_date_in_batch
WHERE  source_table = 'NBBPOSN';

Four SQL statements, wrapped in a transaction, fired by a Windmill schedule at 02:00. That is the whole ETL. The complexity is not in the code — it is in the discipline: idempotent load, transactional boundary, watermark guard, retry policy. Get those four right and the pipeline runs unattended for months.
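To see the discipline in action, the sketch below wires the UPSERT and the watermark update into one transaction and then abuses it: a simulated mid-batch failure, followed by the same batch loaded twice. SQLite (via Python's sqlite3) stands in for the PostgreSQL warehouse only so the example is runnable; the helper names make_warehouse and load_batch and the fail_midway switch are assumptions for the demonstration.

```python
import sqlite3


def make_warehouse():
    """In-memory stand-in for the warehouse tables used in the steps above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE fct_position_budget (
            position_key INTEGER NOT NULL,
            fund_key     INTEGER NOT NULL,
            date_key     INTEGER NOT NULL,
            budgeted_amt NUMERIC,
            actual_amt   NUMERIC,
            PRIMARY KEY (position_key, fund_key, date_key));
        CREATE TABLE etl_watermark (
            source_table   TEXT PRIMARY KEY,
            last_loaded_at TEXT);
        INSERT INTO etl_watermark VALUES ('NBBPOSN', '2026-01-01 00:00:00');
    """)
    return conn


def load_batch(conn, rows, max_activity_date, fail_midway=False):
    """UPSERT the batch and advance the watermark in ONE transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for i, row in enumerate(rows):
            if fail_midway and i == 1:
                raise RuntimeError("simulated mid-batch failure")
            conn.execute("""
                INSERT INTO fct_position_budget
                    (position_key, fund_key, date_key, budgeted_amt, actual_amt)
                VALUES (?, ?, ?, ?, ?)
                ON CONFLICT (position_key, fund_key, date_key) DO UPDATE
                SET budgeted_amt = excluded.budgeted_amt,
                    actual_amt   = excluded.actual_amt""", row)
        # Last statement of the transaction: only persists if the load does.
        conn.execute(
            "UPDATE etl_watermark SET last_loaded_at = ? "
            "WHERE source_table = 'NBBPOSN'", (max_activity_date,))
```

After the failed call the fact table is still empty and the watermark is unchanged; after two successful calls with the same batch the table holds exactly one copy of each row. That is the whole idempotency argument in executable form.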

Where intuition fails

Five gotchas that separate ETL you trust from ETL you babysit:

Banner's activity_date is not a wall clock. Banner application code sets _ACTIVITY_DATE to SYSDATE in most cases, but batch corrections, back-dated entries, and HR late filings can set it to any date, including dates weeks in the past. If your extract uses activity_date > watermark and the watermark has already advanced past that back-dated value, the row is silently dropped. Mitigate by combining activity_date with a separate _audit_timestamp if available, or by widening the extract window to include a safety margin on every run.

Schema drift breaks the extract. An Ellucian patch adds a column to NBBPOSN, or renames one, and your SELECT * extract breaks because the downstream transform references a column that no longer exists, or worse, silently shifts ordinal positions. Always enumerate every column in the extract SELECT. Never use SELECT * against a source you do not control. Treat the Banner schema as a contract that changes on Ellucian's schedule.

The initial load is a separate operation. On day one there is no watermark, so the "incremental" run extracts everything: potentially millions of rows, hours of runtime, and Banner session contention during business hours. Do the initial load as a separate, one-time Windmill flow, scheduled for a weekend window, with a full-table extract. Enable the incremental watermark flow only after the initial load has been verified clean.

Retries without idempotency double the data. A Windmill flow configured to retry on failure (see the Retry & Failure article in the WindmillExplainer wiki) will blindly re-execute the load. If the load is a plain INSERT with no conflict detection, every retry inserts another copy of the batch. Either make every load an UPSERT or wrap the batch in a transaction with a pre-check that deletes the window's rows before reinserting. The operator should be able to press "retry" without fear.

Time zones corrupt watermarks. Banner's Oracle instance runs in America/Chicago. Your warehouse PostgreSQL may run in UTC. If the ETL watermark stores last_loaded_at in one time zone and the extract WHERE compares it against activity_date in another, the five- or six-hour offset between them (Central time is UTC-5 in summer and UTC-6 in winter) silently drops or duplicates several hours of data. Pin every timestamp in the ETL layer (watermarks, extract comparisons, fact date_key conversions) to UTC. Convert at the edge.
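Two of the gotchas above, back-dated activity dates and time-zone mismatches, meet at the same place: the value bound to the extract's WHERE clause. A minimal sketch of "store UTC, convert at the edge, widen the window" follows. The 24-hour SAFETY_MARGIN and both function names are assumptions chosen for illustration; the right margin depends on how far back your site's corrections reach, and the overlap is only safe because the load is an UPSERT.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

BANNER_TZ = ZoneInfo("America/Chicago")  # source time zone named in the text
SAFETY_MARGIN = timedelta(hours=24)      # assumed overlap; tune per site


def banner_local_to_utc(naive_local):
    """Interpret a naive Banner timestamp as Chicago time and convert to UTC."""
    return naive_local.replace(tzinfo=BANNER_TZ).astimezone(timezone.utc)


def extract_lower_bound(watermark_utc):
    """Bind value for the extract WHERE clause, expressed in Banner local time.

    The watermark is stored in UTC; the comparison against activity_date
    happens in Banner's own clock, widened by the safety margin.
    """
    widened = watermark_utc - SAFETY_MARGIN
    return widened.astimezone(BANNER_TZ).replace(tzinfo=None)
```

A 02:00 Chicago timestamp in January becomes 08:00 UTC, while the same wall-clock time in July becomes 07:00 UTC, which is why a hard-coded offset is not a substitute for a real time-zone conversion.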

The one-sentence takeaway

ETL is the freight train that moves data from Banner to the warehouse every night. The schedule is the contract. The watermark is the odometer. Idempotency is the undo button. Get those three right, and the pipeline runs unattended for months.

Track F · From Banner to a warehouse

The Semantic Layer — Where Argos, Power BI, and Dashboards Sit

8 min read · warehouse · semantic-layer · power-bi · argos · datablock · measures · business-vocabulary

The hook

You have built the warehouse. The facts are clean, the dimensions are conformed, the ETL runs every night at 02:00. You hand the keys to the report writers and they stare at fct_position_budget and ask: "What is a position_key? How do I get 'Department' onto this report? Which of these five tables do I join to get 'Fiscal Year'?" The warehouse is not the product. The warehouse is the kitchen. The product is the menu — the single curated view where database columns become business labels, measures are defined once, and every report and dashboard consumes the same vocabulary. That menu is the semantic layer, and if you skip it, every consumer rebuilds it from scratch in their own head — with different names, different definitions, and different answers to the same question.

The everyday analogy

Walk into a good restaurant. You are handed a menu: a clean page with dish names, brief descriptions, and prices. "Grilled Salmon — wild Pacific, lemon caper butter — $28." You order. Twenty minutes later your dish arrives, plated, garnished, served.

Behind the swinging door is the kitchen: stainless prep tables, hot fires, sous-chefs, walk-in refrigerators, sauce stocks reducing for hours. The kitchen is where the real work happens — the salmon is portioned, the butter emulsified, the plate composed. None of that complexity reaches you. You read the menu. You ordered "salmon." You got salmon.

A clean restaurant menu on a wooden host stand in the foreground; through a swinging door behind it, a glimpse of a busy stainless-steel kitchen. The menu is the semantic layer — calm, curated, business-readable. The kitchen is the warehouse — where the real work happens.

The menu does three things for you as a diner that the semantic layer does for the BI consumer:

Hides complexity. You do not need to know the supplier, the cut, the cooking temperature, the seasoning ratio. The menu gives you the smallest amount of information you need to make a decision. The semantic layer gives the Argos report writer "Department" and "Fiscal Year" and "Budgeted $", not dim_organization.org_name, dim_date.fiscal_year, and SUM(fct_position_budget.budgeted_amt) across five JOINs.

Enforces vocabulary. "Wild Pacific" means something specific. "Grilled" means something specific. The menu uses words the diner understands, not the kitchen's shorthand. The semantic layer uses "Variance $", a single definition of SUM(actual_amt) - SUM(budgeted_amt), across every report and every dashboard. Three reports later, "variance" has not drifted into three different meanings.

Stays stable across kitchen changes. The kitchen can swap suppliers, retrain a cook, reconfigure the prep line, and the menu does not change. The diner orders "grilled salmon" today and gets exactly the same thing as six months ago. The semantic layer insulates reports from warehouse refactors. Rename dim_organization to dim_org, add a new SCD Type 2 column, or repartition fct_position_budget: the menu stays the same, and the BI consumer never knows the kitchen changed.

What it really is

A semantic model sits between the physical warehouse tables and the BI tool. It exposes business-named objects — entities, attributes, measures — that map down to warehouse columns and computations. The JOIN graph is declared once. The measures are defined once. Every consumer above reads from the model, not from the raw warehouse.

Entities are the business-facing equivalents of warehouse tables. The fct_position_budget star — the fact plus its conformed dimensions — becomes a single entity called "Position Budget." The Argos report writer drags it onto a canvas. The Power BI analyst adds it to a report. Neither one writes a JOIN.

Attributes are dimension columns exposed with friendly names. dim_organization.org_name becomes "Department," sorted to respect the org-chart hierarchy, grouped in a folder called "Organization." dim_date.fiscal_year becomes "Fiscal Year," with a defined hierarchy: Fiscal Year → Fiscal Quarter → Fiscal Month → Full Date. The hierarchy lives in the semantic model, not in the warehouse — the warehouse stores the date values; the model encodes the roll-up relationships.

Measures are aggregations defined once and reused everywhere. "Budgeted $" maps to SUM(fct_position_budget.budgeted_amt). "Variance $" maps to SUM(actual_amt) - SUM(budgeted_amt), computed correctly at any level of aggregation because the subtraction happens after the aggregation. "Average Budget per Position" is SUM(budgeted_amt) / COUNT(DISTINCT position_key): defined once, correct whether sliced by department, by fund, or by fiscal year. Recall from Facts, Dimensions, Measures — The Multidimensional View that an average is a ratio, not a measure; the semantic layer encodes that ratio so no report writer has to remember it.
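A small worked example shows why the division has to happen after the aggregation. The department names and dollar figures below are invented for the illustration; the formula is the one quoted above, SUM(budgeted_amt) / COUNT(DISTINCT position_key).

```python
# Hypothetical fact rows: (department, position_key, budgeted_amt)
rows = [
    ("Biology", 1, 60000.0),
    ("Biology", 2, 40000.0),
    ("Biology", 3, 50000.0),
    ("Math", 4, 90000.0),
]


def avg_budget_per_position(fact_rows):
    """SUM(budgeted_amt) / COUNT(DISTINCT position_key), division last."""
    total = sum(amt for _, _, amt in fact_rows)
    positions = {key for _, key, _ in fact_rows}
    return total / len(positions)


def by_department(fact_rows):
    """Apply the same measure definition to each department's slice."""
    slices = {}
    for row in fact_rows:
        slices.setdefault(row[0], []).append(row)
    return {dept: avg_budget_per_position(rs) for dept, rs in slices.items()}


per_dept = by_department(rows)                    # measure sliced by department
overall = avg_budget_per_position(rows)           # measure at the grand total
naive = sum(per_dept.values()) / len(per_dept)    # average of averages: wrong
```

Here per_dept is 50,000 for Biology and 90,000 for Math, the correct overall figure is 60,000 (240,000 across four positions), and the naive average of the two department averages is 70,000. Encoding the ratio in the model is what keeps a report writer from producing the 70,000.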

Security lives in the model. Row-level filters defined once apply to every consumer: the HR director sees all departments; a department chair sees only their own org_key's rows. The filter is in the semantic model, not duplicated across thirty individual reports. Change the filter once, and every consumer's view updates.

Three-layer cake: the warehouse at the bottom (facts and dimensions, surrogate keys, SCD Type 2), the semantic model in the middle (business names, measures, hierarchies, security), and the BI consumers on top (Argos, Power BI, dashboards, scorecards).

The BI tool — Argos, Power BI, Tableau, Looker, dbt-metrics — reads from the semantic model. The user drags entities, attributes, and measures onto a canvas. The semantic layer translates those drags into warehouse SQL. The user never sees the JOINs. They are in the kitchen.

At Waubonsee, the semantic layer lives in two consumers in parallel. Power BI semantic models drive the strategic dashboards consumed at annual program review — disaggregated by race, gender, and age, aligned with the institution's RISE 2030 Strategic Plan and the Achieving the Dream partnership. The Argos DataBlock catalog — 670 reports and counting — remains the operational reporting layer the campus has run for years. Both consumers must share the same warehouse-side measure definitions. A "headcount" computed three ways across Argos, Power BI, and a Tableau workbook is exactly the failure mode the semantic layer exists to prevent. See What Waubonsee Actually Reports Today — and Where the Warehouse Should Land First for the catalog and the cultural layer behind it.

See it — the diagram

One measure, defined once, flowing to many consumers.

One measure definition — 'Variance $' — flowing out to multiple BI consumers: an Argos DataBlock, a Power BI dashboard, an executive scorecard. All three render the same number because they share the same semantic-layer definition.

"Variance $" is defined in the semantic model as SUM(actual_amt) - SUM (budgeted_amt). An Argos DataBlock references it. A Power BI dashboard references it. An executive scorecard — a PDF emailed every Monday at 07:00 — references it. All three render the same number because they share the same definition. If the CFO asks why the Argos report and the Power BI dashboard disagree, the answer is not "different definitions." The answer is "same definition, different filters — let's check which dimension values are selected." The semantic layer eliminates the definition-drift class of bugs entirely.

Show me the code

Here is the warehouse query an analyst would write directly — the kitchen, raw:

-- Warehouse SQL: explicit star joins, surrogate keys, terse names.
-- The analyst must know the grain, the join graph, and the measures.
SELECT o.org_name             AS department,
       SUM(f.budgeted_amt)    AS total_budgeted
FROM   fct_position_budget f
JOIN   dim_organization    o ON o.org_key  = f.org_key
JOIN   dim_date            d ON d.date_key = f.date_key
WHERE  d.fiscal_year = 2026
GROUP BY o.org_name
ORDER BY total_budgeted DESC;

Here is the equivalent in a semantic model definition — the menu the BI consumer reads from. In Power BI DAX it looks like:

-- Semantic model measures (Power BI DAX):
Total Budgeted = SUM ( 'Position Budget'[Budgeted $] )
Total Actual   = SUM ( 'Position Budget'[Actual $] )
Variance       = [Total Actual] - [Total Budgeted]
Variance %     = DIVIDE ( [Variance], [Total Budgeted] )

The Argos equivalent — a DataBlock that queries a pre-built semantic view instead of raw warehouse tables:

-- Argos DataBlock consuming the semantic layer, not the raw warehouse.
-- The analyst writes one line; the view resolves the joins.
SELECT department, total_budgeted
FROM   semantic.position_budget_by_dept
WHERE  fiscal_year = :main_DD_fiscal_year;

In Power BI, the analyst does not write SQL at all. They drag "Department" onto Rows, "Fiscal Year" onto Filter (set = 2026), and "Total Budgeted" onto Values. No JOINs. No GROUP BY. No SUM(). The semantic model resolves it all.
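For readers wondering what sits behind a view like the one the DataBlock queries, here is one possible shape, sketched in SQLite through Python's sqlite3 so it can be run as-is. Everything about it is an assumption for illustration: the view body, the sample rows, the org_key column on the fact (taken from the raw warehouse query above), and the unqualified view name, since SQLite has no semantic schema to put it in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_organization (org_key INTEGER PRIMARY KEY, org_name TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, fiscal_year INTEGER);
    CREATE TABLE fct_position_budget (
        position_key INTEGER, fund_key INTEGER, date_key INTEGER,
        org_key INTEGER, budgeted_amt NUMERIC, actual_amt NUMERIC);

    -- The "menu": joins and aggregation declared once, friendly names out.
    CREATE VIEW position_budget_by_dept AS
    SELECT o.org_name                              AS department,
           d.fiscal_year                           AS fiscal_year,
           SUM(f.budgeted_amt)                     AS total_budgeted,
           SUM(f.actual_amt)                       AS total_actual,
           SUM(f.actual_amt) - SUM(f.budgeted_amt) AS variance
    FROM   fct_position_budget f
    JOIN   dim_organization o ON o.org_key  = f.org_key
    JOIN   dim_date         d ON d.date_key = f.date_key
    GROUP BY o.org_name, d.fiscal_year;

    -- Invented sample data.
    INSERT INTO dim_organization VALUES (1, 'Biology'), (2, 'Mathematics');
    INSERT INTO dim_date VALUES (20260131, 2026), (20260228, 2026),
                                (20250131, 2025);
    INSERT INTO fct_position_budget VALUES
        (1, 10, 20260131, 1, 5000, 4800),
        (1, 10, 20260228, 1, 5000, 5300),
        (2, 10, 20260131, 2, 7000, 7000),
        (2, 10, 20250131, 2, 6500, 6400);
""")


def budget_by_department(fiscal_year):
    """What a consumer runs: no joins, no GROUP BY, one parameter."""
    return conn.execute(
        "SELECT department, total_budgeted FROM position_budget_by_dept "
        "WHERE fiscal_year = ? ORDER BY total_budgeted DESC",
        (fiscal_year,)).fetchall()
```

The consumer's query stays one line long no matter how the joins behind the view are later refactored, which is the stability property the menu analogy describes.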

Where intuition fails

Five lessons that separate a useful semantic layer from a confusing one:

The semantic layer is not optional, even a thin one. Skipping it means every report writer composes their own JOIN graph, defines their own "variance" measure, and picks their own column labels. Three reports later, "variance" means three different things and reconciliation between them is impossible. Build the semantic layer. Even if it starts as a single SQL view (CREATE VIEW position_budget AS SELECT ... with friendly column aliases), it centralizes meaning and prevents drift.

One measure, one definition. "Headcount" is a single measure with a single definition in the semantic model. Not three COUNT(DISTINCT) calls across three Argos DataBlocks. Not COUNT(pidm) in one report and COUNT(position_key) in another. The semantic layer is the single source of truth for measure definitions. Every consumer reads the same number for the same question because the measure is defined in exactly one place.

Dimensions belong to the warehouse; hierarchies belong to the semantic layer. The warehouse stores dim_date with fiscal_year, fiscal_quarter, fiscal_month, full_date. The hierarchical relationship (fiscal_month rolls up into fiscal_quarter, which rolls up into fiscal_year) is declared in the semantic model. The warehouse has the values; the model has the relationships. If you encode the hierarchy in the warehouse (a parent_key column on dim_organization), that is fine, but the semantic layer still declares the drill path explicitly so the BI tool can navigate it.

Argos DataBlocks ARE a thin semantic layer, so treat them that way. A well-organized library of DataBlocks with pre-baked SQL, declared parameters, and typed result columns is a semantic layer in everything but name. Once the warehouse exists, the discipline is: shape every DataBlock to query the warehouse (not Banner directly), wire parameters to warehouse dimension keys, name output columns consistently across DataBlocks, and treat the DataBlock catalog as the semantic catalog. See Argos Parameters — `:main_`, `:lcl_`, `:dbn_` for the DataBlock scoping rules that make this work.

The semantic layer amplifies warehouse quality; it cannot create it. If dim_date has incorrect fiscal-year labels, no amount of semantic-model polish hides it. If fct_position_budget mixes grains, no measure definition reconciles them. The menu cannot fix a kitchen that is sending out the wrong dishes. Get the warehouse right first (Tracks F1–F5, G1–G7), then invest in polishing the semantic layer. A correct warehouse with a thin semantic layer beats a broken warehouse with a beautiful one.

The one-sentence takeaway

The semantic layer translates warehouse column names into business language, defines each measure once, and enforces security in one place. The BI consumer reads from the menu. The warehouse is the kitchen.

Track F · From Banner to a warehouse

The Three Fact-Table Patterns — Transaction, Periodic, Accumulating

10 min read · warehouse · kimball · fact-patterns · transaction · periodic-snapshot · accumulating-snapshot

The hook

A fact table holds measurements. That much you know from Facts, Dimensions, Measures — The Multidimensional View. But not all measurements behave the same way. A payroll transaction happens once and is never touched again. A monthly budget snapshot is taken on schedule whether anything changed or not. An admissions applicant progresses through milestones — inquiry, interview, decision, enrollment — and the same row is revisited and updated at each step. These are not three styles of the same thing. They are three fundamentally different fact-table patterns, each optimized for a different kind of business process. The first design decision when you model a new star is not which columns to include. It is which of these three patterns the fact table follows. Pick wrong, and the star cannot answer the questions the business hired you to answer.

The everyday analogy

Walk into a school office and you will see three filing systems running in parallel, each answering a different kind of question.

The hall-pass log. A stack of carbon-copy slips near the attendance desk, each one filled out the moment a student leaves a classroom: name, time, destination. A new slip for every event. Nobody goes back and edits old slips. To ask "how many hall passes were issued this week?" you count the slips. To ask "who left first period most often this month?" you filter by time and name and count. Each slip is a row. New slips arrive constantly. Old slips are never revisited.

The weekly enrollment census. Every Friday afternoon the registrar walks through the schedule and fills out the same form: 23 students in Bio 101, 18 in Calc 201, 41 in Comp Sci 100. Same courses, same form, every Friday. Most weeks most numbers do not change — but the snapshot is taken regardless, because the point is not to record changes. The point is to have a predictable, evenly spaced series of state records. To ask "what was Calc 201's enrollment in week 7?" you go to the week-7 form. There is exactly one form per week.

The student transcript folder. One manila folder per student, started when they enroll, pulled from the rack and updated every time they finish a course, declare a major, complete gen-eds, apply to graduate, or receive a degree. The same folder is revisited and appended to across the student's entire college career. To ask "where is each currently enrolled student in their degree journey?" you open each folder and read the latest entries.

A school office desk showing three filing systems side by side: a tall stack of hall-pass slips (transaction), a clipboard of weekly enrollment census forms (periodic snapshot), and a row of manila transcript folders in a wooden rack (accumulating snapshot). Three record systems, three kinds of questions, one office.

Three kinds of records. Three kinds of questions. Each system is optimized for what it captures and useless for what the other two do well. The hall-pass log cannot tell you a student's GPA. The transcript folder cannot tell you how many hall passes were issued Tuesday. The census clipboard cannot tell you about a single hall-pass incident. The choice of record IS the choice of question. And that is the choice you make when you pick a fact-table pattern.

What it really is

Ralph Kimball identified three canonical fact-table patterns. The single question that tells them apart: is the row revisited after insert?

Transaction fact — one row per event, never revisited. The grain is one row per individual business event. Each row has a date_key (when the event happened), the participating dimension keys, and one or a few measures. Rows are inserted and never updated. The table grows with every event, potentially to billions of rows. Perfect for "what happened when and how much" questions. Banner examples: PHRHIST (every payroll line is a row), TBRACCD (every AR charge or payment), SFRSTCR (every course registration — though this one is borderline factless, see Factless Fact Tables — Events and Coverage). The strength of a transaction fact is granularity — you can drill to individual events. The weakness is that "current state" questions require aggregating millions of rows every time.

Periodic snapshot fact — one row per (entity × time bucket), taken on schedule. The grain is one row per entity per regular period — daily, weekly, monthly, by term. The row is inserted on a calendar schedule, not triggered by an event. The same entity gets a new row every period whether or not anything changed. That predictability is the pattern's superpower: as-of queries are a simple WHERE date_key = target, with no aggregation over time. The Position-Budget star this wiki builds in Build the Position-Budget Fact — The Center of the First Star is a periodic snapshot — one row per (position × fund × month), ~36,000 rows per year for ~600 positions. Banner examples: a monthly headcount snapshot, a term-end enrollment census, a fiscal-year-end fund balance. The strength is predictable size and trivial as-of queries. The weakness is redundancy — most rows are identical to the previous period's — but the predictability is worth the storage.

Accumulating snapshot fact — one row per entity, revisited at milestones. The grain is one row per entity across its entire (short) lifecycle. Multiple _date_key columns — most NULL at insert — fill in as the entity progresses through known milestones. The same row is UPDATED, not a new row inserted. Lag measures (days from milestone A to milestone B) are natural and easy. Perfect for processes with a defined start, a defined end, and 3–15 known intermediate steps. The canonical higher-ed example is admissions tracking: applicant → inquiry → application submitted → interview → decision → accepted → enrolled, each milestone filling in a date_key on the same row. Banner source: SARADAP and related admissions tables. The strength is that "where is each entity in the pipeline?" is a single SELECT with no aggregation. The weakness is that the pattern only works for short, bounded lifecycles — a student's entire degree (4–6 years, dozens of possible milestones, no single clean end) does not fit.

Three fact-table cards side by side: TRANSACTION (one row per event, append-only, never revisited), PERIODIC SNAPSHOT (one row per entity per period, inserted on schedule), ACCUMULATING SNAPSHOT (one row per entity, revisited and updated at each milestone).

The decision tree is straightforward: does the business process have a defined end? If no, the fact is a transaction or a periodic snapshot. If yes, and the process is short with known milestones, it is an accumulating snapshot. If yes but the process is long and open-ended, it is a transaction fact (events along the way) optionally paired with a periodic snapshot (state at regular intervals).
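The decision tree is small enough to write down as a function. This is only a sketch of the paragraph above: the function and parameter names are invented, and the third flag (whether rows arrive on a calendar schedule rather than per event) is borrowed from the pattern definitions to separate the two "no defined end" outcomes.

```python
def choose_fact_pattern(has_defined_end,
                        short_with_known_milestones=False,
                        rows_arrive_on_schedule=False):
    """Map the three questions from the text onto a fact-table pattern."""
    if has_defined_end and short_with_known_milestones:
        return "accumulating snapshot"
    if has_defined_end:
        # Long or open-ended lifecycle: record events, snapshot state if useful.
        return "transaction + optional periodic snapshot"
    # No defined end: the trigger (schedule vs. event) picks the pattern.
    return "periodic snapshot" if rows_arrive_on_schedule else "transaction"
```

Admissions tracking answers yes/yes and lands on the accumulating snapshot; payroll answers no/event-driven and lands on the transaction fact; the monthly Position-Budget load answers no/scheduled and lands on the periodic snapshot.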

See it — the diagram

The accumulating snapshot row tells the pattern's story in one sequence.

The same applicant's row in fct_admissions_pipeline shown over time: most date_keys NULL at insert, filling in one by one as milestones occur. One row, revisited four times across six months.

An applicant row is inserted at inquiry. inquiry_date_key is filled; the other five date_key columns are NULL because the applicant has not yet applied, interviewed, received a decision, been accepted, or enrolled. Over the next six months, the ETL revisits the same row four times. At application: application_date_key fills in, inquiry_to_app_lag calculates. At interview: interview_date_key fills in. At decision: decision_date_key fills in. At enrollment: enrolled_date_key fills in, enrolled_count flips from 0 to 1. One row. Six months. Six milestone columns. The row is a living record, the opposite of a transaction fact's immutable event.
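The revisit-and-update cycle can be sketched in a few lines. The table below is a trimmed, hypothetical subset of fct_admissions_pipeline, the three record_* helpers are invented names, and SQLite (through Python's sqlite3) stands in for the warehouse so the example runs anywhere. Date keys are assumed to be YYYYMMDD integers, as in the queries elsewhere in this article.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fct_admissions_pipeline (
        applicant_key        INTEGER PRIMARY KEY,
        inquiry_date_key     INTEGER,
        application_date_key INTEGER,
        enrolled_date_key    INTEGER,
        inquiry_to_app_lag   INTEGER,
        enrolled_count       INTEGER DEFAULT 0)""")


def to_date(date_key):
    """Turn a YYYYMMDD integer key into a date."""
    return date(date_key // 10000, date_key // 100 % 100, date_key % 100)


def record_inquiry(applicant_key, date_key):
    """The only INSERT in the row's life."""
    conn.execute(
        "INSERT INTO fct_admissions_pipeline (applicant_key, inquiry_date_key) "
        "VALUES (?, ?)", (applicant_key, date_key))


def record_application(applicant_key, date_key):
    """Revisit the row: fill the milestone and compute the lag in days."""
    (inquiry_key,) = conn.execute(
        "SELECT inquiry_date_key FROM fct_admissions_pipeline "
        "WHERE applicant_key = ?", (applicant_key,)).fetchone()
    lag = (to_date(date_key) - to_date(inquiry_key)).days
    conn.execute(
        "UPDATE fct_admissions_pipeline "
        "SET application_date_key = ?, inquiry_to_app_lag = ? "
        "WHERE applicant_key = ?", (date_key, lag, applicant_key))


def record_enrollment(applicant_key, date_key):
    """Final milestone: fill the date key and flip the counter."""
    conn.execute(
        "UPDATE fct_admissions_pipeline "
        "SET enrolled_date_key = ?, enrolled_count = 1 "
        "WHERE applicant_key = ?", (date_key, applicant_key))
```

However many milestones are recorded, the table still holds one row per applicant; only its columns change.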

Show me the code

One DDL per pattern, using real Banner-mappable examples.

Transaction — payroll lines from PHRHIST:

-- One row per payroll event: append-only, never revisited.
-- Source: PHRHIST. A row is a single earnings line on a paycheck.
CREATE TABLE fct_payroll_transaction (
    payroll_txn_id    BIGINT       PRIMARY KEY,
    employee_key      INTEGER      NOT NULL,
    earnings_code_key INTEGER      NOT NULL,
    date_key          INTEGER      NOT NULL,   -- when paid
    hours             NUMERIC(8,2),
    gross_amount      NUMERIC(12,2)
);
-- "Sum of gross by earnings code for FY2024" = SUM(gross_amount)
-- across millions of rows. Fast for aggregates, slow for "current."

Periodic snapshot — the Position-Budget star, this wiki's through-line:

-- One row per (position × fund × month), taken every month-end
-- whether or not the position's budget changed.
CREATE TABLE fct_position_budget (
    position_key   INTEGER NOT NULL,
    fund_key       INTEGER NOT NULL,
    date_key       INTEGER NOT NULL,       -- month-end snapshot date
    budgeted_amt   NUMERIC(12,2),
    actual_amt     NUMERIC(12,2),
    PRIMARY KEY (position_key, fund_key, date_key)
);
-- See [[G2_declare_the_grain]] for the grain decision and
-- [[G5_position_budget_fact]] for the full build.

Accumulating snapshot — admissions pipeline (the G8 target):

-- One row per applicant across their whole admissions lifecycle.
-- Date keys are NULL at insert and fill in as milestones occur.
CREATE TABLE fct_admissions_pipeline (
    applicant_key            INTEGER  NOT NULL,
    inquiry_date_key         INTEGER,           -- filled at inquiry
    application_date_key     INTEGER,           -- filled at submit
    interview_date_key       INTEGER,           -- filled when interviewed
    decision_date_key        INTEGER,           -- filled at decision
    accepted_date_key        INTEGER,           -- filled if accepted
    enrolled_date_key        INTEGER,           -- filled if enrolled
    admissions_decision_key  INTEGER,
    inquiry_to_app_lag       INTEGER,           -- days
    app_to_decision_lag      INTEGER,           -- days
    decision_to_enroll_lag   INTEGER,           -- days
    application_count        INTEGER  DEFAULT 1,
    accepted_count           INTEGER  DEFAULT 0,
    enrolled_count           INTEGER  DEFAULT 0,
    PRIMARY KEY (applicant_key)
);
-- One row per applicant, revisited and UPDATED as milestones happen.
-- Source: SARADAP and related Banner admissions tables.
-- See [[G8_second_star]] for the full build.

Now ask the same business question — "how many applicants enrolled in Fall 2026?" — answered three ways depending on which pattern stores the data:

-- Transaction: count enrollment events in the window.
SELECT COUNT(*)
FROM   fct_enrollment_transaction
WHERE  date_key BETWEEN 20260801 AND 20260831;

-- Periodic snapshot: read the as-of value on the snapshot date.
SELECT SUM(enrolled_headcount)
FROM   fct_enrollment_snapshot
WHERE  date_key = 20260831;

-- Accumulating snapshot: count rows whose enrollment milestone
-- fell within the window.
SELECT COUNT(*)
FROM   fct_admissions_pipeline
WHERE  enrolled_date_key BETWEEN 20260801 AND 20260831;

Same number, three different SQL shapes, three different underlying patterns. The business question is the same. The pattern you chose at design time determines the query you write at run time.

Where intuition fails

Five lessons that take most people a year of warehouse work to learn:

Mixing two patterns in one table is the cardinal design sin. If half your rows are append-only events and the other half are revisited snapshots, the SQL becomes unreadable and the joins produce wrong answers. Two patterns = two fact tables. The grain decision in Declare the Grain — One Row Equals One What? forces you to pick one; respect it.

Periodic snapshots look wasteful; they are not. The Position-Budget star stores ~600 positions × 12 months × 5 funds = 36,000 rows per year. Most of those rows differ from the previous month only in date_key. New designers instinctively reach for "only insert when something changes", but that breaks the snapshot's entire purpose. As-of queries become conditional (WHERE date_key <= target with a correlated subquery, exactly the Banner pattern from The MAX() Subquery — Getting the Row That's Current that the warehouse exists to escape). Trend lines develop gaps when nothing changed for three months. Store the snapshot even when it is identical to last month's. The storage cost is trivial; the query complexity you avoid is not.

Accumulating snapshots are for short, bounded lifecycles. Admissions

(inquiry to enrolled, ~6 months, ~10 milestones) fits perfectly. A student's entire degree (4–6 years, dozens of possible milestones, no single clean end) does not. For long-running processes, model a transaction fact for the events (course completions, term registrations) and optionally a periodic snapshot for the regular state checks (term-end enrollment status). Do not force an accumulating snapshot onto a process that has no finish line.

Transaction facts cannot answer "current state" cheaply. Summing

PHRHIST across a fiscal year is correct but scans millions of rows every time. The standard warehouse pattern is to derive a periodic snapshot (monthly payroll-by-org totals) from the transaction fact for the common dashboard queries — and keep the transaction fact for the deep "drill into one employee's specific paycheck" investigation. Both patterns coexist as sibling fact tables fed from the same source.

NULL date_keys in accumulating snapshots need a sentinel, not a NULL.

At insert time, most milestone date_keys are unknown — the applicant has not yet interviewed, been decided on, or enrolled. NULL foreign keys break joins silently (see Facts, Dimensions, Measures — The Multidimensional View, gotcha #4). The standard pattern: dim_date has a row with date_key = -1 (or 19000101) labeled "Unknown / Not Yet Occurred," and every nullable date_key defaults to that sentinel. As milestones occur, the ETL updates the date_key to the real value. Every join works from day one.
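The sentinel lesson is easy to demonstrate. The sketch below, run against an in-memory SQLite database, builds the same two-applicant pipeline twice: once with a NULL milestone key and once with the -1 sentinel. The table names pipeline_null and pipeline_sentinel and the sample rows are hypothetical; only dim_date, date_key, and the sentinel value come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, label TEXT);
INSERT INTO dim_date VALUES (-1, 'Unknown / Not Yet Occurred'),
                            (20260803, '2026-08-03');

-- Variant A: the unknown milestone is stored as NULL.
CREATE TABLE pipeline_null (applicant_key INTEGER, enrolled_date_key INTEGER);
INSERT INTO pipeline_null VALUES (1, 20260803), (2, NULL);

-- Variant B: the unknown milestone defaults to the sentinel row.
CREATE TABLE pipeline_sentinel (
    applicant_key     INTEGER,
    enrolled_date_key INTEGER NOT NULL DEFAULT -1
);
INSERT INTO pipeline_sentinel (applicant_key, enrolled_date_key) VALUES (1, 20260803);
INSERT INTO pipeline_sentinel (applicant_key) VALUES (2);
""")

def joined_rows(table):
    """Count the fact rows that survive an inner join to dim_date."""
    sql = (f"SELECT COUNT(*) FROM {table} p "
           "JOIN dim_date d ON d.date_key = p.enrolled_date_key")
    return conn.execute(sql).fetchone()[0]

null_rows = joined_rows("pipeline_null")          # applicant 2 silently disappears
sentinel_rows = joined_rows("pipeline_sentinel")  # both applicants survive the join
print(null_rows, sentinel_rows)
```

With the NULL variant the inner join drops the not-yet-enrolled applicant without any error; with the sentinel every row joins, and the "Unknown / Not Yet Occurred" label is available for reporting.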

The one-sentence takeaway

Transaction facts record events. Periodic snapshots record state at regular intervals. Accumulating snapshots track an entity through milestones. The pattern IS the grain decision — pick it first.

Track F · From Banner to a warehouse

Factless Fact Tables — Events and Coverage

9 min read · warehouse · kimball · factless · coverage · sfrstcr · shrattr · registration · attendance

The hook

Some of the most valuable questions a warehouse can answer have no numbers in them. Which students registered for this course? Which classrooms sat empty this term? Which admitted applicants never enrolled? A fact table with no measures — no dollars, no hours, no quantities — sounds like a contradiction. Every article so far in this track has treated measures as the point of the fact table. But the collision of dimension keys at a moment in time is information even when there is nothing to sum. Kimball calls these factless fact tables, and they are the cleanest answer to the "what happened" and "what did not happen" questions that dollars-and-hours fact tables cannot touch.

The everyday analogy

Walk into any classroom on the first day of the term and the instructor has two pieces of paper on the desk. Both are keyed on student identity. Both cover the same days. But they answer fundamentally different questions — and the difference between them is the difference between the two flavors of factless fact table.

The attendance sheet is the event log. The instructor only marks students who showed up. One checkmark per (student × day) when the student walked in the door. Empty rows mean nothing was recorded — those days are simply absent from the log. To ask "how many class meetings did Maria attend this month?" the instructor counts Maria's checkmarks. The attendance sheet is sparse, light, and only knows what happened.

The class roster is the coverage table. Every enrolled student appears in every cell, every day of the term, with a status letter: P for present, A for absent, E for excused. Most cells on most days say A — let's be honest about 8 a.m. lectures — but every cell has a value, because every (student × day) combination is in scope. To ask "who skipped Tuesday's class?" the instructor scans the Tuesday column for A's. The absentees are explicitly recorded, with a mark, in the same table as the attendees.

A teacher's desk with two papers side by side: a sparse attendance sheet (only attendees marked with checks) and a full class roster (every student per day, every cell filled: P for present, A for absent). The same data subject, two different record schemes, two different sets of answerable questions.

Now try to answer "who skipped Tuesday?" using only the attendance sheet. You cannot. The sheet has no record of who was supposed to be there — only who showed up. You would need a second piece of paper (the roster) to define the scope, then a NOT IN dance to find the students on the roster who are absent from the sheet. The coverage roster answers the question directly: filter Tuesday, filter status = 'A', done.

These two pieces of paper are the two flavors of factless fact table:

Event factless (the attendance sheet): a row is the record of an event that happened. Sparse. Only events appear. To find non-events, you compare against something else.

Coverage factless (the roster): a row is created for every (entity × period) in scope, regardless of whether an event occurred. Dense. Non-events are visible because they are explicitly recorded as "absent" / "available" / "not utilized."

Both tables store only dimension keys. Neither stores dollars or hours. The difference is what the row means: a record that something occurred, or a record that something was in scope and here is what happened.
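The two pieces of paper can be sketched in a few lines of plain Python. The student names, the two class days, and the variable names below are made up for illustration; the point is only that the event log needs the roster to find absentees, while the coverage table answers with a single filter.

```python
# Hypothetical scope: who is enrolled, and which days the class meets.
roster = {"Maria", "Deshawn", "Priya"}
days = ["Tue", "Thu"]

# Event factless (attendance sheet): only rows for what happened.
attendance_events = {("Maria", "Tue"), ("Priya", "Tue"), ("Maria", "Thu")}

# Coverage factless (class roster): one row per (student, day) in scope,
# with a status naming the outcome.
coverage = [
    (student, day, "P" if (student, day) in attendance_events else "A")
    for student in sorted(roster)
    for day in days
]

# "Who skipped Tuesday?" from the event log needs the roster as a second input.
present_tue = {student for student, day in attendance_events if day == "Tue"}
absent_from_events = roster - present_tue

# The same question from the coverage table is one filter.
absent_from_coverage = {
    student for student, day, status in coverage
    if day == "Tue" and status == "A"
}

print(len(attendance_events), "event rows;", len(coverage), "coverage rows")
print(absent_from_events, absent_from_coverage)
```

The coverage list has one row for every cell of the scope matrix (3 students × 2 days), which is the density the roster analogy describes.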

What it really is

A factless fact table is a fact table whose only "facts" are the collision of dimension keys at a moment in time. The table has no continuously-valued measures — no amounts, no quantities, no rates. It has foreign keys to every dimension involved in the event, and optionally a single *_count column always equal to 1. That count = 1 is what Kimball calls a useful artifact — it adds no information to the row, but it makes the SQL self-documenting: SUM(registration_count) reads as "total registrations" in a way that COUNT(faculty_key) never will.

Event factless — one row per event that happened. The grain is the moment of the event. Only events that occurred generate rows. The classic Kimball higher-ed example (Chapter 12) is Student Registration Events: one row per (term, student, course, faculty, declared major) combination. Banner source: SFRSTCR. Another is Student Attendance Events: one row per (student, date, course, faculty, facility) when the student attended. Banner source: SHRATTR. Event factless tables are naturally sparse — millions of possible combinations, only a fraction realized.

Coverage factless — one row per (entity × period) in scope. Every combination of dimensions in the scope generates a row, regardless of whether an event occurred. Includes a status dimension that names the outcome: dim_utilization_status with rows Utilized / Available; dim_attendance_status with rows Present / Absent / Excused. The Kimball Chapter 12 example is Facilities Utilization Coverage: one row per (facility, day-of-week, hour-block, term), with a utilization status. Banner source: SSRMEET plus room scheduling data. Coverage tables are dense — every cell in the scope matrix gets a row, and "nothing happened" is a named outcome, not a missing row.

Side by side: an event-factless table (sparse — only rows where an event occurred) versus a coverage-factless table (dense — one row per scope cell, with a status column naming the outcome: Present / Absent, Utilized / Available).

The "what didn't happen" question is the coverage table's superpower. Which classrooms sat empty last term? Filter utilization_status = 'Available'. Which admitted students never enrolled? Filter enrollment_status = 'Declined' on the admissions coverage table. Which products on promotion sold nothing? Filter promotion_status = 'Unsold'. Every "didn't happen" is a row with a status value, not a missing row you have to deduce.

See it — the diagram

The anatomy of a factless fact row is deceptively simple.

Anatomy of a factless fact row: foreign keys to every dimension involved in the event, plus a single column registration_count = 1 — the useful artifact that makes SUM(registration_count) read as English instead of COUNT(faculty_key).

Five foreign keys and one column that always equals 1. That is the whole row. The registration_count = 1 artifact is the only non-key column in the table, and it exists for one reason: downstream readability. Ten aggregate queries that say SUM(registration_count) are ten moments of instant comprehension for the next developer. The artifact has no semantic content — every row is 1, always — but it buys clarity forever.

Show me the code

**Event factless — student registrations from SFRSTCR:**

-- One row per (term, student, course, faculty) registration event.
-- The count = 1 artifact makes SUM self-documenting.
CREATE TABLE fct_registration_event (
    term_key           INTEGER NOT NULL,
    student_key        INTEGER NOT NULL,
    course_key         INTEGER NOT NULL,
    faculty_key        INTEGER NOT NULL,
    declared_major_key INTEGER NOT NULL,
    registration_count INTEGER DEFAULT 1 NOT NULL,  -- the artifact
    PRIMARY KEY (term_key, student_key, course_key)
);
-- Source: SFRSTCR. "How many registrations per faculty in Fall 2026?"
SELECT f.faculty_name,
       SUM(r.registration_count) AS total_registrations
FROM   fct_registration_event r
JOIN   dim_faculty f ON f.faculty_key = r.faculty_key
JOIN   dim_term    t ON t.term_key    = r.term_key
WHERE  t.term_code = '202610'
GROUP BY f.faculty_name
ORDER BY total_registrations DESC;

**Coverage factless — facility utilization from SSRMEET + room scheduling:**

-- One row per (facility, day, hour-block, term) IN SCOPE.
-- Utilization status names the outcome: Available or Utilized.
CREATE TABLE fct_facility_coverage (
    facility_key            INTEGER NOT NULL,
    term_key                INTEGER NOT NULL,
    day_of_week_key         INTEGER NOT NULL,
    hour_block_key          INTEGER NOT NULL,
    owner_org_key           INTEGER NOT NULL,
    assigned_org_key        INTEGER NOT NULL,
    utilization_status_key  INTEGER NOT NULL,  -- Available / Utilized
    coverage_count          INTEGER DEFAULT 1 NOT NULL,
    PRIMARY KEY (facility_key, term_key, day_of_week_key, hour_block_key)
);
-- "Which classrooms are most underutilized?"
SELECT f.facility_room,
       SUM(CASE WHEN s.status_desc = 'Available'
                THEN c.coverage_count ELSE 0 END) AS idle_blocks,
       SUM(c.coverage_count) AS total_blocks
FROM   fct_facility_coverage c
JOIN   dim_facility           f ON f.facility_key = c.facility_key
JOIN   dim_utilization_status s ON s.status_key   = c.utilization_status_key
JOIN   dim_term               t ON t.term_key     = c.term_key
WHERE  t.term_code = '202610'
GROUP BY f.facility_room
ORDER BY idle_blocks DESC;

The payoff — "what didn't happen" is one filter, not a NOT EXISTS dance:

-- Which students were absent on a specific class meeting?
-- The coverage roster makes this a single WHERE clause.
SELECT s.student_name
FROM   fct_attendance_coverage c
JOIN   dim_student           s ON s.student_key = c.student_key
JOIN   dim_date              d ON d.date_key    = c.date_key
JOIN   dim_attendance_status a ON a.status_key  = c.attendance_status_key
WHERE  d.full_date       = DATE '2026-09-15'
  AND  c.course_key      = 4287
  AND  a.status_desc     = 'Absent';

Compare this to the alternative: a NOT EXISTS subquery against the event table, joined to a scope-defining dimension (which students are even enrolled in this course?) — three tables, two subqueries, and a developer scratching their head six months later. The coverage table collapses all of that into one filter on one status column. That is what the extra rows buy you.
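For comparison, here is a rough sketch of both routes side by side on an in-memory SQLite database. The sample students and rows are invented, fct_attendance_event is a hypothetical name for the event table, and the status is stored as plain text on the coverage fact purely to keep the sketch short (the article's design uses a status dimension).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE fct_registration_event (student_key INTEGER, course_key INTEGER);
CREATE TABLE fct_attendance_event (student_key INTEGER, course_key INTEGER,
                                   date_key INTEGER);
CREATE TABLE fct_attendance_coverage (student_key INTEGER, course_key INTEGER,
                                      date_key INTEGER, status_desc TEXT);

INSERT INTO dim_student VALUES (1, 'Maria'), (2, 'Deshawn'), (3, 'Priya');
INSERT INTO fct_registration_event VALUES (1, 4287), (2, 4287), (3, 4287);
INSERT INTO fct_attendance_event VALUES (1, 4287, 20260915), (3, 4287, 20260915);
INSERT INTO fct_attendance_coverage VALUES
    (1, 4287, 20260915, 'Present'),
    (2, 4287, 20260915, 'Absent'),
    (3, 4287, 20260915, 'Present');
""")

# Route 1: event table only. The registration fact defines who was in scope,
# and NOT EXISTS removes everyone who has an attendance row.
not_exists_sql = """
SELECT s.student_name
FROM   fct_registration_event r
JOIN   dim_student s ON s.student_key = r.student_key
WHERE  r.course_key = 4287
  AND  NOT EXISTS (SELECT 1
                   FROM   fct_attendance_event e
                   WHERE  e.student_key = r.student_key
                     AND  e.course_key  = r.course_key
                     AND  e.date_key    = 20260915)
"""

# Route 2: coverage table. Absence is a recorded status, so it is one filter.
coverage_sql = """
SELECT s.student_name
FROM   fct_attendance_coverage c
JOIN   dim_student s ON s.student_key = c.student_key
WHERE  c.course_key  = 4287
  AND  c.date_key    = 20260915
  AND  c.status_desc = 'Absent'
"""

absent_via_events = [row[0] for row in conn.execute(not_exists_sql)]
absent_via_coverage = [row[0] for row in conn.execute(coverage_sql)]
print(absent_via_events, absent_via_coverage)
```

Both routes return the same student; the difference is how much of the scope logic the query author has to rebuild by hand.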

Where intuition fails

Five lessons that catch teams off guard:

**COUNT(any_key) works, but use the artifact anyway.** SQL lets you put any FK column inside COUNT() and get the same row count — they all count the rows that pass the filter. But reading COUNT(faculty_key) six months later, you pause: "why faculty_key specifically? Is there a NULL handling subtlety here?" The registration_count = 1 artifact is one column in the schema and a hundred moments of instant comprehension downstream. SUM(registration_count) cannot be misinterpreted.

Coverage tables can explode in row count — be deliberate about the grain. Facility coverage at (facility × day × hour-block × term) for 200 facilities × 90 days × 16 hours × 4 terms = 1,152,000 rows, about 1.15 million per year. The table is dense by design. That is predictable and manageable, but only if the grain is the coarsest level that still answers the business question. Hour-block, not minute. Day, not timestamp. If the question is "which classrooms sit empty on Friday afternoons?", you need hour-blocks. If the question is "which buildings are underused?", maybe the grain is (facility × day), and the row count drops 16x.

**Event-factless + "what didn't happen" = the coverage table you should have built.** If you only build the event table and try to answer "who was absent?" via NOT EXISTS, you first have to define WHO COULD HAVE BEEN PRESENT — which students are enrolled, which rooms exist, which products are on promotion. That definition IS a coverage query against another table. The standard warehouse pattern is to build the event-factless for "what happened" questions and a separate coverage-factless for "what was in scope." Two tables, each good at its question.

**Adding real measures later changes the table's identity — and that's fine.** A registration event fact may start factless (just the collision of student × course × term) and later acquire real measures: credit hours earned, tuition charged, final grade points. The table is no longer factless. It has graduated to a regular fact table at the same grain — and the registration_count = 1 artifact stays as a useful inheritance for downstream readability. The schema evolves. The grain stays.

The status dimension on coverage tables should be tiny. dim_attendance_status has three rows (Present, Absent, Excused). dim_utilization_status has two (Available, Utilized). Resist the urge to skip the dimension and put status as a text column on the fact table — that forces string comparisons in every WHERE clause and bloats the fact row. Resist also the urge to overengineer the dimension with sub-statuses and categories. Two or three rows is exactly right. The tiny dimension is a feature, not a shortcut.
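The row-count arithmetic from the coverage-grain lesson above is worth checking before committing to a grain. This is a small sizing sketch using the same inputs as that lesson (200 facilities, 90 days, 16 hour-blocks, 4 terms); the variable names are illustrative.

```python
# Sizing inputs taken from the coverage-grain lesson.
facilities = 200
days_per_term = 90
hour_blocks_per_day = 16
terms_per_year = 4

# Grain: facility x day x hour-block x term.
hourly_rows = facilities * days_per_term * hour_blocks_per_day * terms_per_year

# Coarser grain: facility x day x term (hour-block dropped).
daily_rows = facilities * days_per_term * terms_per_year

reduction = hourly_rows // daily_rows

print(f"hour-block grain: {hourly_rows:,} rows per year")
print(f"day grain:        {daily_rows:,} rows per year")
print(f"reduction factor: {reduction}x")
```

Dropping the hour-block from the grain divides the row count by exactly the number of hour-blocks per day, which is where the 16x figure comes from.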

The one-sentence takeaway

Factless facts capture the collision of dimension keys — what happened, and what was supposed to happen. Add a count = 1 artifact so your SUM reads like English. Use event tables for what occurred; use coverage tables for what was in scope, whether it occurred or not.

Track G · Step 1 of 8 · Building the Waubonsee warehouse

Pick a Process — Why Position-Budget Is the First Star

The first star is the choice that decides whether the warehouse gets adopted or shelved. For Waubonsee, the evidence picks it for you.

7 min read · warehouse · kimball · first-star · position-budget · prioritization

Goal

By the end of this step you will have a one-page brief that names:

the business process the first star will model,
the audience that will use the reports it powers, and
the list of existing Argos reports it would replace on day one.

You will not have written a line of SQL. You will not have drawn a star diagram. You will have picked the thing, and you will be able to defend the pick.

For Waubonsee, the answer this guide arrives at is Position-Budget — the monthly snapshot of every position, its assignment(s), and its budgeted-versus-actual cost by organization and fund. If you skip ahead to G2 already convinced, that is fine. If you do not yet see why Position-Budget beats Registration here, read on — the call is contrarian and the evidence is what carries it.

Before you start

You should have:

Read What Waubonsee Actually Reports Today — and Where the Warehouse Should Land First. That article is the evidence base for everything this step decides. The bar chart there is the punch list this step turns into a priority.

Skimmed Kimball's "Four-Step Dimensional Design Process" (Kimball & Ross, The Data Warehouse Toolkit, 2nd ed., ch. 2). The four steps are: pick a process, declare the grain, pick the dimensions, pick the facts. This guide is step 1; the next four guides in Track G (G2 through G5) cover steps 2 through 4.

Two clean afternoons. This is a brief, not a build. Resist the urge to open SQL Developer.

You do not need to know the warehouse tools you will eventually use (Windmill, the load target, the BI layer). All of that lives downstream of the choice you make in this step.

Build it

Kimball gives four criteria for choosing the first process:

Most pain. Which process is currently served by the largest pile of brittle, repetitive Argos reports? Whichever it is, replacing it produces the most visible win.

Most data. Which process has the richest underlying data — many dimensions, many measures, real history — so the star is non-trivial and the audience cannot get the same answer from a single source table?

Clearest grain. Which process has a single, indisputable atomic row? A muddy grain in star #1 poisons everything downstream.

Highest political will. Which process has an executive sponsor who will defend the warehouse when the first quarterly close gets messy?

Kimball's four criteria for the first star, scored against three candidates from Waubonsee's reality.

F0 already showed that no domain has overwhelming raw frequency at Waubonsee — the catalog is spread across five buckets, none dominant. So "most pain" alone does not pick the first star. The decision has to weigh all four criteria together, and lean hardest on the two that gate first delivery: grain and political will. Apply them to the three obvious candidates:

Registration is Kimball's textbook first star and almost every university BI deck starts there. At Waubonsee it has real Argos load — SFRSTCR is the third-most-touched table in the catalog (74 reports) — and the grain is famously clean (one student × one section × one term). But it scores lower than Position-Budget on political will: the CFO is pushing for budget visibility; no comparable champion is pushing for a registration warehouse first. Star #2 or #3, not star #1.

HR / Payroll detail scores highest on pain (146 reports — the largest single named domain in the catalog) but the grain is muddy. A single payroll run produces earnings, deductions, taxes, benefits, leave, retro adjustments — each at a different grain. Modeled correctly it is two or three stars. Modeled hastily it is a swamp. Save it for after Position-Budget proves the engine.

Position-Budget scores well on three of the four, including the two that gate a successful first delivery: grain and political will. Pain: smaller than HR or Registration in raw count (44 reports directly today), but recurring — every department head re-asks "what does this cost?" monthly, and that question is currently answered by stitching together five Argos reports. Data: positions, jobs, organizations, funds, accounts, calendar — six clean dimensions and one obvious additive measure (dollars). Grain: a position-month-fund row, unambiguous and atomic. Political will: the CFO and the budget office want this yesterday, with ongoing pressure from the board to produce it. And the institution's RISE 2030 Strategic Plan formally commits to data-informed decision-making — a written sponsor for any data initiative that delivers, and a defensible argument for a CFO who needs to justify the spend internally.

The choice is Position-Budget — not because it dominates the catalog, but because it is the most deliverable first star: clean grain, defined scope, and a sponsor who will defend the bet.

The Position-Budget star's boundary — what is IN scope for the first delivery, what is explicitly OUT.

The brief, then, has three lines:

Process: the monthly snapshot of every active position, its current assignment(s), and its budgeted-versus-actual cost — by organization, by fund, by account.

Audience: Budget Office, CFO, deans, department chairs, HR Business Partners.

Replaces (day one): the Argos reports built on NBBPOSN, NBRJOBS, FTVORGN, GOVSDAV, plus their PWVEMPL/PEBEMPL joins. The exact list comes out of argos_catalog.json — open it, filter table_frequency to those tables, take the union of report names. That list is your day-one win-condition.

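# from src/argos_ingest.py output - the day-one replace-list query
import json

# Load the catalog; the context manager closes the file handle.
with open("data/argos_catalog.json") as fh:
    cat = json.load(fh)

seeds = {"nbbposn", "nbrjobs", "ftvorgn", "govsdav"}
hits = set()  # union of report names across the seed tables
for row in cat["table_frequency"]:
    # Compare case-insensitively in case the catalog stores upper-case names.
    if row["table"].lower() in seeds:
        hits.update(row["reports"])
print(len(hits), "Argos reports this star would replace")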

Save that list in the brief. It is the only number that matters when the warehouse gets to its first quarterly review.

Verify against Banner

There is nothing to verify technically in this step — you have written no code. The verification here is sociological: walk the one-page brief past the people whose reports it claims to replace.

Concretely:

The Budget Office. Show them the replace-list. Ask: "Is there a report on this list whose loss would hurt? Is there a report missing that should be on it?" Their answers tighten the scope.

The CFO sponsor. Confirm the executive sponsor with one sentence: "We are starting the warehouse with a Position-Budget star. The first reports it produces will land at the end of fiscal Q3. Are you good with that?" Get the nod (or the redirect) before G2.

HR. Position is not Person. Tell HR you are not building their payroll detail in star #1 — that comes later. Avoid the assumption that "data warehouse = my report next week."

If any of the three blocks, stop and redo this step. A first star without a defended audience is a wasted star.

Watch out

Three traps:

"More tables = bigger star." No. The first star is the smallest sufficient slice that can stand on its own. Resist scope creep. If a table is not in {NBBPOSN, NBRJOBS, FTVORGN, GOVSDAV, PWVEMPL, PEBEMPL, FUND/ACCT lookup, calendar}, it does not belong in star #1. It can join in star #2 or star #3.

Position is not Job is not Employee. Banner is precise here: a NBBPOSN row is a slot ("Director of IT, position 100123"), a NBRJOBS row is an assignment ("Pedro is in position 100123 from 2024-01-15"), and a PWVEMPL row is the person who fills it. The star you build needs all three, but they are three dimensions or roles, not one. Confusing them is the most common mistake on this path; the PIDM — The Number Behind Every Person and The MAX() Subquery — Getting the Row That's Current articles will sort it out when you get to G4.

"While we're at it, let's also..." The voice that wants to add one more thing to the brief is the same voice that ships nothing. Write the brief, get the three sign-offs, declare step 1 done. The urge to add belongs in a backlog of future stars (see The Second Star — Admissions as an Accumulating Snapshot for where).

The one-sentence takeaway

The first star is a choice grounded in evidence, not ambition; for Waubonsee that evidence picks Position-Budget.

Track G · Step 2 of 8 · Building the Waubonsee warehouse

Declare the Grain — One Row Equals One What?

The grain is the single most consequential sentence you will write about your warehouse. Get it right and every dimension follows; get it wrong and every report lies in subtle ways for years.

8 min read · warehouse · kimball · grain · position-budget · fact-table

Goal

By the end of this step you will have:

A grain sentence — one line that says what a single row of fct_position_budget represents. It will be specific enough that anyone who reads it can answer "could there be two rows for the same X?" without ambiguity.

An in-scope / out-of-scope list — which Banner facts the grain can answer, and which it cannot.

A back-of-envelope row count — so you know up front whether the fact is going to be 10 thousand rows or 10 million. (For Waubonsee's Position-Budget at the recommended grain: it is a small table by warehouse standards.)

You will not have written CREATE TABLE DDL. You will not have picked surrogate-key types. You will have declared the grain, and that sentence will be the contract every subsequent step is held to.

For Waubonsee, the grain this guide arrives at is:

> *One row in fct_position_budget = one (position, fund, month) combination.*

Before you start

You should have:

Completed Pick a Process — Why Position-Budget Is the First Star with sign-off — the process is Position-Budget, the audience is the Budget Office + CFO + deans, and the replace-list of Argos reports is documented.

**Spent 30 minutes with a real NBBPOSN row open** in your SQL tool of choice. Pull a position by posn_code and look at its budget amount, its organization code, and how it's funded (which means joining to GOVSDAV and the FOAPAL machinery). Get a feel for the shape of the data before you try to model it.

A clear mental model of Position vs Job vs Employee. Banner distinguishes them precisely: a NBBPOSN row is a slot ("Director of IT, position 100123"), a NBRJOBS row is a job assignment (a person assigned to that slot from a date), and PWVEMPL is the person who fills it. Confusing these three is the most common trip-up on this path — see PIDM — The Number Behind Every Person and (when written) A_position_job_employee for the explicit walkthrough.

You do not need to know the warehouse DDL syntax yet. You do not need to have picked an ETL tool. Those are downstream of the grain.

Build it

Kimball's "Four-Step Dimensional Design Process" (chapter 2 of The Data Warehouse Toolkit) is unambiguous about the order:

Pick the business process. (done in G1.)
Declare the grain. (this step.)
Identify the dimensions. (G3 and G4.)
Identify the facts. (G5.)

Steps 3 and 4 are consequences of step 2. Pick a different grain and you get different dimensions and different facts — and the same analytic question gets a different answer, or no answer at all. The grain is the contract. Pick it carefully; pick it once.

For Position-Budget at Waubonsee, there are five obvious candidate grains. Walk them in order from coarsest to finest:

Five candidate grains for the Position-Budget fact, scored on row count and what each can answer.

**position × month** — one row per active position per month. Simple. Cannot answer "this position is 60% State, 40% Federal" — the fund split is invisible. Reject.

**position × fund × month** — one row per active position per fund per month. Exposes the split. Monthly cadence matches the recurring budget question. Row count is tiny. Recommended.

**position × pay-period × fund** — one row per position per fund for each of the 26 pay periods in a year. Finer than monthly, matches payroll cadence. Useful, but the audience (Budget Office, CFO) thinks in months, not pay periods. Save for a second fact later if needed.

**position × day × fund** — daily snapshot. 30× the row count of monthly. Cannot point at a single business question that needs daily resolution. Reject.

**position × employee × fund × month** — adds the assigned employee to the grain. Doubles or triples the row count (turnover, splits) and creates ambiguity for vacant positions (no employee at all, but the position still has a budget). The employee belongs as a slowly-changing attribute of the position-month row, not as part of the grain. Reject — but pull employee in as a dimension reference in G4.

The winner is position × fund × month. The grain sentence:

> *One row in fct_position_budget = one combination of one position, one funding source (FOAPAL fund), and one month. The row carries the budgeted amount and the actual amount for that combination, plus dimension keys for the position, the fund, the month, the position's current employee (or "vacant"), and the position's current organization.*

That sentence is the contract. Print it out, put it on the wall, and hold every subsequent decision to it.

One row of fct_position_budget — its three grain keys, its two measures, and the dimensions that hang off it.

Row count, back of the envelope: Waubonsee has on the order of 600 active positions; the average position is split across roughly 1.4 funds; the warehouse will hold 60 months of history at launch. That is 600 × 1.4 × 60 ≈ 50,000 rows. A very small fact by warehouse standards — which is good, because every analytic query will fly.
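The back-of-envelope estimate is a three-factor product, and it is worth keeping as a tiny script so the inputs can be revised when real counts come in. The three figures below are the ones quoted in the paragraph above; the variable names are illustrative.

```python
# Sizing inputs quoted in the row-count paragraph.
active_positions = 600
avg_funds_per_position = 1.4
months_of_history = 60

# One row per (position, fund, month).
rows_at_launch = round(active_positions * avg_funds_per_position * months_of_history)
rows_added_per_year = round(active_positions * avg_funds_per_position * 12)

print(f"rows at launch:      {rows_at_launch:,}")
print(f"rows added per year: {rows_added_per_year:,}")
```

If the measured average funds per position turns out higher than 1.4, the estimate scales linearly, so even a doubling keeps the fact small.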

What the grain CAN answer, day one:

Total budgeted dollars by organization, by fund, by month — any rollup along those three dimensions.

Budget vs actual variance for any (position, fund, month) cut.
Trend lines: organization X's payroll budget, last 24 months.
"Who fills this position?" — via the employee dimension reference.

What the grain CANNOT answer, by design:

Daily cash-flow questions (need pay-period grain, second fact).
Per-deduction or per-earnings breakdowns (need a payroll-detail fact, future star).

Course-section enrollment by funded position (cross-domain — a conformed-dimension question for The Second Star — Admissions as an Accumulating Snapshot).

The "cannot" list is just as important as the "can". Anything on it is not a failure of this star — it is the boundary you agreed to. Future stars cover what this one deliberately leaves out.

Verify against Banner

The grain is verified two ways: numerically and sociologically. Both matter.

Numerically — does the data split the way you said it does? Pick three real positions across different funding patterns (one single-fund, one 60/40 split, one with three or more funds). For each, pull the FOAPAL detail straight from Banner:

-- Current funding split for a single position, straight from Banner.
-- This is what your warehouse row(s) for this (position, month) must
-- match. The MAX(eff_date) is the canonical Banner pattern — see
-- [[B3_effective_max]] for why it's there.
SELECT  l.nbrplbd_posn       AS position,
        l.nbrplbd_coas_code  AS chart,
        l.nbrplbd_fund_code  AS fund,
        l.nbrplbd_orgn_code  AS orgn,
        l.nbrplbd_acct_code  AS acct,
        l.nbrplbd_percent    AS pct,
        l.nbrplbd_budget     AS budgeted_amt
FROM    nbrplbd l
WHERE   l.nbrplbd_posn = '100123'           -- your test position
  AND   l.nbrplbd_effective_date = (
            SELECT MAX(l2.nbrplbd_effective_date)
            FROM   nbrplbd l2
            WHERE  l2.nbrplbd_posn = l.nbrplbd_posn
              AND  l2.nbrplbd_effective_date <= TRUNC(SYSDATE, 'MM'))
ORDER BY l.nbrplbd_fund_code;

If position 100123 has three funding lines today, the warehouse will have three rows for (100123, this-month) — one per fund. The sum of budgeted_amt across those rows must equal the position's total budget. If it does not, the grain is wrong, or the source is being read wrong, or both. Stop and re-derive before continuing to G3.
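The numeric check described above can be sketched as a small Python helper that takes the funding lines returned for one position and compares them with the position's total budget. The function name check_fund_split, the tuple layout, the one-cent tolerance, and the sample amounts are all assumptions for illustration; the expectation that percentages sum to 100 is also an assumption to confirm against your own data, not something this guide guarantees.

```python
from decimal import Decimal

def check_fund_split(position_total, fund_rows, tolerance=Decimal("0.01")):
    """Compare one position's funding lines with its total budget.

    fund_rows is a list of (fund_code, pct, budgeted_amt) tuples, one per
    funding line, i.e. one per warehouse row for that (position, month).
    """
    amount_sum = sum((amt for _fund, _pct, amt in fund_rows), Decimal("0"))
    percent_sum = sum((pct for _fund, pct, _amt in fund_rows), Decimal("0"))
    return {
        "row_count": len(fund_rows),
        "amount_matches": abs(amount_sum - position_total) <= tolerance,
        # Assumption: a complete split adds up to 100 percent.
        "percent_is_100": percent_sum == Decimal("100"),
    }

# Hypothetical 60/40 split for a position budgeted at 90,000.
rows = [
    ("1100", Decimal("60"), Decimal("54000.00")),
    ("2200", Decimal("40"), Decimal("36000.00")),
]
result = check_fund_split(Decimal("90000.00"), rows)

# Dropping a funding line should make the check fail.
broken = check_fund_split(Decimal("90000.00"), rows[:1])
print(result)
print(broken)
```

Decimal is used instead of float so that dollar amounts compare exactly; a failed check is the signal to stop and re-derive before G3.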

Sociologically — does the grain match how the audience thinks? Take the grain sentence to the Budget Office. Ask: "When you say 'how much is this position costing us,' is it one number, or is it a breakdown by fund? When you say 'this month,' is that calendar month or fiscal month?" Their answers will either confirm the grain or send you back to step 2. Better to find out now than after G5 ships.

Watch out

Four traps:

Grain creep. "Let's just add employee to the grain, it's useful." That instinct is what produces facts that are 5× too big and ambiguous for vacant positions. The grain stays at (position, fund, month); employee is a dimension attribute on the row (the current employee as of that month), not a grain component. The single best discipline for keeping the fact lean: if a candidate column varies within the declared grain, it belongs in a dimension; if it does not vary, it can be either.

Fund explosion. A small number of positions at Waubonsee are technically split across many micro-funds (some grant-funded positions touch 8–10 funds). At one row per fund per month, those positions produce a fan-out you should size for. Pull a count of distinct funds per position from NBRPLBD before locking the grain; if there is a long tail, decide explicitly whether to keep all funds or to roll the tail into "Other" at ETL time. Decide once, document the decision, and stick to it.

Position is not Job is not Employee. Easy to confuse. The grain is on POSITION (the slot), not job (the assignment) and not employee (the person). A position with no current job (a vacant position) still gets a row, with the employee dimension reference set to the "vacant" sentinel. A position turned over mid-month produces one row for the month, with the latest employee in the employee dimension reference. Resist the urge to model the turnover; that goes in a future job-history fact.

The month boundary. A position that starts mid-month or ends mid-month gets one row for that month. Do not pro-rate. The budget question is "what was budgeted for this position-fund during month X?", and the answer is a single number. If the audience ever asks for daily pro-rated cost, you will know the second star you need to build.
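The fund count suggested under "Fund explosion" can be sketched as a small histogram. It uses only the NBRPLBD columns that appear in the query earlier in this step, and it counts funds across all effective dates, so treat the result as an upper bound on the fan-out rather than today's exact split:

```sql
-- How many positions are split across 1, 2, 3, ... distinct funds?
-- A long tail at the top of this list is the fan-out to size for.
SELECT   funds_per_position,
         COUNT(*) AS positions
FROM    (SELECT   nbrplbd_posn,
                  COUNT(DISTINCT nbrplbd_fund_code) AS funds_per_position
         FROM     nbrplbd
         GROUP BY nbrplbd_posn) per_position
GROUP BY funds_per_position
ORDER BY funds_per_position DESC;
```

If only a handful of positions sit above five or six funds, keeping every fund is probably cheap; if the tail is long, that is the moment to decide on the "Other" roll-up.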

The one-sentence takeaway

Declare the grain in one sentence — and let every dimension and measure follow from it.

Track G · Step 3 of 8 · Building the Waubonsee warehouse

Build the Date Dimension — One Row Per Day, Three Calendars in One Table

Every star in your warehouse will join to this one dimension. Build it once, get the three calendars right, and never touch it again — except to add holidays.

9 min read · warehouse · kimball · dim-date · calendar · fiscal-year · academic-term

Goal

By the end of this step you will have:

A populated **dim_date table with daily grain**, covering 10 years of history and 5 years forward (roughly 5,500 rows). Tiny by warehouse standards.

A surrogate key in the canonical Kimball form: an integer formatted YYYYMMDD (e.g. 20250815). It sorts naturally, joins fast, and is readable by humans when you debug: three properties at once.

Calendar, academic, and fiscal attributes on every row, so any star joining dim_date can slice by calendar month, by academic term, or by fiscal quarter without computing anything at query time.

A Windmill flow that regenerates the table from scratch on a schedule (annually is enough), so the future-date rows always stay ahead of the system clock.

The first star (Declare the Grain — One Row Equals One What?) is monthly. For that grain, every fact row's date_key points to the first-of-month date (20250801, 20250901, …). The daily grain in dim_date is built once and reused: monthly facts use first-of-month, pay-period facts use period-end dates, daily facts use every date. One dimension serves all of them.

Before you start

You should have:

Completed Declare the Grain — One Row Equals One What? You should know that fct_position_budget is at (position, fund, month) grain, and that monthly fact rows will use first-of-month dates as their date_key.

The Waubonsee academic calendar in hand. You need term start and end dates for the past 10 years and forward 2–5 years. The registrar's office maintains it; STVTERM already has the codes.

The fiscal-year start date confirmed with Finance. Illinois community colleges typically run on a July 1 – June 30 fiscal year, but confirm before encoding. Encode it wrong now and every finance query is off-by-one for years.

A Windmill PostgreSQL resource pointing at the warehouse target. See resources plugs in wiki #1 for how to wire that up.

You do not need any source data from Banner for this dimension. The date dimension is generated, not extracted. That makes it the easiest dimension to build and the natural place to start the warehouse load pipeline.

Build it

Three design decisions, then the build.

Decision 1 — the grain. Daily. Even though the first fact at Waubonsee will be monthly, building dim_date at monthly grain would force you to rebuild it the day someone wants a daily fact. Kimball's guidance: pick the finest grain you will ever need; coarser facts join to the same dim by pointing at the appropriate row (first-of-month, period-end, end-of-quarter). Cost is trivial — 5,500 rows for 15 years is a rounding error.

Decision 2 — the surrogate key. An integer in YYYYMMDD form. 20250815 for August 15, 2025. Three properties matter:

Sorts chronologically as an integer. No ORDER BY date_key gotchas.
Joins on a 4-byte integer, not a 10-byte string or an 8-byte timestamp.
Readable by humans when you SELECT date_key FROM fct_… (you instantly know what date it represents without a join).

Resist the temptation to use a plain DATE type as the key. It works, but you lose the human-readable debugging property. Resist also the temptation to use a sequential integer (1, 2, 3, …) — you gain nothing and you lose all three properties above.
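To make the key format concrete, here is a small PostgreSQL sketch of the round trip between a calendar date and its YYYYMMDD key, plus the first-of-month key that monthly facts point at. The literal dates are just examples:

```sql
-- Round trip between a calendar date and the YYYYMMDD surrogate key.
SELECT  TO_CHAR(DATE '2025-08-15', 'YYYYMMDD')::INTEGER  AS date_key,            -- 20250815
        TO_DATE(20250815::TEXT, 'YYYYMMDD')              AS full_date,           -- 2025-08-15
        TO_CHAR(date_trunc('month', DATE '2025-08-15'),
                'YYYYMMDD')::INTEGER                     AS first_of_month_key;  -- 20250801
```

The third column is the expression an ETL step would use to stamp a monthly fact row, so the fact never needs to look the key up in dim_date.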

Decision 3 — the attributes. The dimension is wide on purpose. Three attribute groups live side by side on every row:

Calendar: full_date, day_of_week, day_name, day_of_month, day_of_year, week_of_year, month_number, month_name, calendar_quarter, calendar_year, is_weekend.

Academic: term_code (matches STVTERM), term_name, term_type (FALL/SPRING/SUMMER), is_in_term, is_registration_open, weeks_into_term.

Fiscal: fiscal_year (the FY this date falls in; with a July 1 start, July 2025 is FY2026), fiscal_quarter (1–4), fiscal_period_number (1–12, where period 1 is July), fiscal_period_name, is_fiscal_year_end.

Plus a small handful of derived flags: is_holiday (campus closed), is_business_day, is_first_of_month, is_last_of_month. These let finance queries filter cleanly without recomputing.

One row of dim_date — the calendar, academic, and fiscal attribute groups all live side by side in the same row.

The dimension is roughly 25 to 35 columns wide, depending on how many of the attributes above you materialize (the CREATE TABLE below carries a core set of 23). That sounds wide for a table of 5,500 rows; it is exactly right for a dimension. Width on dimensions is cheap and read-friendly; width on facts is expensive and slow. The discipline of "fat dimensions, thin facts" is what makes the star fast.

Three calendars overlaid on the same year. A single date belongs to all three contexts — the dimension stores all three on one row.

The three calendars all live in one row. Any analytic question phrased as "show me X for fall semester" or "show me X for fiscal Q2" becomes a single attribute filter on dim_date — no math, no JOINs to other tables, no risk of off-by-one between modules.
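As an illustration of that single attribute filter, the sketch below answers "show me budgeted dollars for fiscal Q2 of FY2026" against the fct_position_budget table this track builds. The fiscal arithmetic never appears in the query; it was done once when dim_date was generated:

```sql
-- Budgeted dollars for fiscal Q2 of FY2026 (October through December 2025
-- under a July 1 fiscal year), sliced with dim_date attributes only.
SELECT   d.fiscal_year,
         d.fiscal_quarter,
         SUM(f.budgeted_amt) AS budgeted_total
FROM     fct_position_budget f
JOIN     dim_date d ON d.date_key = f.date_key
WHERE    d.fiscal_year    = 2026
  AND    d.fiscal_quarter = 2
GROUP BY d.fiscal_year, d.fiscal_quarter;
```

Swapping the two WHERE predicates for a term_code filter turns the same query into a "fall semester" report with no other change.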

Build the table. The CREATE TABLE for the dimension:

-- dim_date - one row per calendar day, with three calendars in one row.
-- Loaded once by generation, refreshed annually to extend the future range.
CREATE TABLE dim_date (
    date_key              INTEGER     PRIMARY KEY,   -- YYYYMMDD
    full_date             DATE        NOT NULL,
    day_of_week           INTEGER     NOT NULL,      -- 1=Sun .. 7=Sat
    day_name              VARCHAR(10) NOT NULL,
    day_of_month          INTEGER     NOT NULL,
    month_number          INTEGER     NOT NULL,
    month_name            VARCHAR(10) NOT NULL,
    calendar_quarter      INTEGER     NOT NULL,
    calendar_year         INTEGER     NOT NULL,
    is_weekend            BOOLEAN     NOT NULL,
    -- academic
    term_code             VARCHAR(10),               -- nullable: gaps between terms
    term_name             VARCHAR(40),
    term_type             VARCHAR(10),               -- FALL/SPRING/SUMMER
    is_in_term            BOOLEAN     NOT NULL,
    is_registration_open  BOOLEAN     NOT NULL,
    -- fiscal (July 1 - June 30)
    fiscal_year           INTEGER     NOT NULL,
    fiscal_quarter        INTEGER     NOT NULL,
    fiscal_period_number  INTEGER     NOT NULL,      -- 1=Jul .. 12=Jun
    is_fiscal_year_end    BOOLEAN     NOT NULL,
    -- flags
    is_holiday            BOOLEAN     NOT NULL DEFAULT FALSE,
    is_business_day       BOOLEAN     NOT NULL,
    is_first_of_month     BOOLEAN     NOT NULL,
    is_last_of_month      BOOLEAN     NOT NULL
);

Generate the rows. A recursive CTE handles the calendar arithmetic in one pass, with the fiscal attributes computed inline; the term mappings are applied in an UPDATE afterward. The whole thing is ~80 lines of SQL, short enough to live inside a Windmill script:

-- One-shot population: 10 years past + 5 future, all calendars in one go.
INSERT INTO dim_date (date_key, full_date, day_of_week, day_name,
                      day_of_month, month_number, month_name,
                      calendar_quarter, calendar_year, is_weekend,
                      fiscal_year, fiscal_quarter, fiscal_period_number,
                      is_fiscal_year_end, is_business_day,
                      is_first_of_month, is_last_of_month,
                      is_in_term, is_registration_open)
WITH RECURSIVE d AS (
    SELECT DATE '2015-07-01' AS dt
    UNION ALL
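    -- date + integer stays a DATE; dt + INTERVAL '1 day' would return a
    -- TIMESTAMP, which PostgreSQL rejects as a type mismatch with the
    -- non-recursive term
    SELECT dt + 1 FROM d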
    WHERE  dt < DATE '2030-12-31'
)
SELECT  TO_CHAR(dt, 'YYYYMMDD')::INTEGER                  AS date_key,
        dt                                                AS full_date,
        EXTRACT(DOW FROM dt)::INTEGER + 1                 AS day_of_week,   -- 1=Sun .. 7=Sat
        TO_CHAR(dt, 'FMDay')                              AS day_name,      -- FM strips blank padding
        EXTRACT(DAY FROM dt)::INTEGER                     AS day_of_month,
        EXTRACT(MONTH FROM dt)::INTEGER                   AS month_number,
        TO_CHAR(dt, 'FMMonth')                            AS month_name,    -- FM strips blank padding
        EXTRACT(QUARTER FROM dt)::INTEGER                 AS calendar_quarter,
        EXTRACT(YEAR FROM dt)::INTEGER                    AS calendar_year,
        EXTRACT(DOW FROM dt) IN (0, 6)                    AS is_weekend,
        -- fiscal year starts July 1: July 2025 belongs to FY2026
        CASE WHEN EXTRACT(MONTH FROM dt) >= 7
             THEN EXTRACT(YEAR FROM dt) + 1
             ELSE EXTRACT(YEAR FROM dt) END::INTEGER      AS fiscal_year,
        -- fiscal quarter: Jul-Sep=1, Oct-Dec=2, Jan-Mar=3, Apr-Jun=4
        ((EXTRACT(MONTH FROM dt)::INTEGER + 5) % 12) / 3 + 1 AS fiscal_quarter,
        -- fiscal period: Jul=1 .. Jun=12
        ((EXTRACT(MONTH FROM dt)::INTEGER + 5) % 12) + 1     AS fiscal_period_number,
        (EXTRACT(MONTH FROM dt) = 6 AND EXTRACT(DAY FROM dt) = 30)
                                                          AS is_fiscal_year_end,
        EXTRACT(DOW FROM dt) NOT IN (0, 6)                AS is_business_day,
        EXTRACT(DAY FROM dt) = 1                          AS is_first_of_month,
        dt = (date_trunc('month', dt) + INTERVAL '1 month - 1 day')::DATE
                                                          AS is_last_of_month,
        FALSE                                             AS is_in_term,
        FALSE                                             AS is_registration_open
FROM    d;

The is_in_term, is_registration_open, and term_code columns are left at default values; a second UPDATE pass joins STVTERM and the registrar's term-window table to fill them. Holidays come from a small seed table (dim_date_holidays) maintained by hand — usually 20–25 dates per year.
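A minimal sketch of that second pass follows. It assumes a copy of STVTERM is reachable from the warehouse under the name stvterm with its usual code, description, and start/end date columns, and that the holiday seed table has a column called holiday_date (an assumed name). term_type and is_registration_open are left out because their rules depend on the registrar's term-window table, which is not shown here:

```sql
-- Pass 2a: stamp academic attributes onto every date inside a term window.
-- Assumes term windows do not overlap; if they do, pick a tie-break rule
-- first, because UPDATE ... FROM keeps an arbitrary match.
UPDATE  dim_date d
SET     term_code  = t.stvterm_code,
        term_name  = t.stvterm_desc,
        is_in_term = TRUE
FROM    stvterm t
WHERE   d.full_date BETWEEN t.stvterm_start_date AND t.stvterm_end_date;

-- Pass 2b: apply the hand-maintained holiday seed. A campus holiday is
-- never a business day, so both flags move together.
UPDATE  dim_date d
SET     is_holiday      = TRUE,
        is_business_day = FALSE
FROM    dim_date_holidays h
WHERE   h.holiday_date = d.full_date;
```

Both statements are safe to rerun: they set the same values each time, which keeps the annual refresh idempotent.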

Wrap both passes in a Windmill flow with a single resource binding for the warehouse PostgreSQL connection. Schedule it once a year, every June, to extend the future range another year. See crontab to schedules in wiki #1 for the scheduling pattern.

Verify against Banner

Two checks make the dimension trustworthy.

**Cross-check the academic attributes against STVTERM.** Pick three real terms — last fall, current spring, next fall — and confirm:

-- Every date inside a term window should carry that term's code and
-- is_in_term = TRUE. The join is on the date range, not on term_code,
-- so dates with a NULL or wrong term_code are caught too.
SELECT  d.date_key, d.full_date, d.term_code, d.is_in_term,
        t.stvterm_code, t.stvterm_start_date, t.stvterm_end_date
FROM    dim_date d
JOIN    stvterm  t ON d.full_date BETWEEN t.stvterm_start_date
                                      AND t.stvterm_end_date
WHERE   d.term_code IS DISTINCT FROM t.stvterm_code
   OR   d.is_in_term = FALSE
LIMIT 10;

The query should return zero rows. Any hit is a gap between dim_date and STVTERM — fix the seed before continuing.

Cross-check fiscal year boundaries against a known FY-aware Argos report. Pick a payroll or budget report that prints fiscal-year totals (most NBR_ and FOR_ reports do). Sum the same dollars in the warehouse, filtering on dim_date.fiscal_year = 2024, and confirm the totals match the Argos output. They should agree to the cent. If they do not, either the FY start date is wrong, or the Argos report has its own FY logic that disagrees; find out which.

Watch out

Five traps:

The "today" trap. Never reference CURRENT_DATE in dim_date itself. The dimension is built once and reused; if it contains a "today" flag, the flag is wrong every day after it is generated. Computing "is this row in the past?" is a query-time job, not a dimension-build job.

Time zones. dim_date is a naive date dimension: calendar dates only, no times, no timezone. If a future star needs timestamp-grain data (an event log), build a separate dim_time for the hour-of-day attributes. Mixing them here turns the dimension into a million-row monster for no analytic gain.

The fiscal-year off-by-one. July 2025 belongs to fiscal year 2026, not 2025. Every Illinois community college finance person knows this; every developer new to Waubonsee gets it wrong the first time. Encode it once, name the column fiscal_year clearly, and add a row-level test: MIN(full_date) for FY2026 should be 2025-07-01, not 2026-01-01. The wrong answer is silent; the right answer takes ten seconds to verify and saves a year of wrong reports.

Holiday list maintenance. The campus holiday list changes every year (new closures, calendar shifts). Add a calendar reminder for June to refresh dim_date_holidays for the next FY. A stale holiday flag makes is_business_day lie quietly for years.

Monthly-fact date_key convention drift. Every monthly fact in the warehouse uses first-of-month as its date_key. Document this once, in this article and in every fact-table comment. Inconsistency (some facts using end-of-month, some using first) breaks every cross-fact comparison.
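The row-level test from the fiscal-year trap can be written once and kept next to the load script. This sketch assumes the July 1 fiscal start used throughout this step and a generated range that itself begins on a July 1 (as the INSERT above does); otherwise the first, partial fiscal year would be flagged:

```sql
-- Row-level test: every fiscal year in dim_date must begin on July 1 of
-- the prior calendar year. Zero rows means the fiscal_year mapping holds.
SELECT   fiscal_year,
         MIN(full_date) AS first_day
FROM     dim_date
GROUP BY fiscal_year
HAVING   MIN(full_date) <> make_date(fiscal_year - 1, 7, 1);
```

A row such as (2026, 2026-01-01) is the classic symptom of a calendar-year mapping that was mistaken for a fiscal one.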

The one-sentence takeaway

Build the date dimension once, get it right, and every star afterward points to it without effort.

Track G · Step 4 of 8 · Building the Waubonsee warehouse

Build the Position Dimension — SCD Type 2 and the Discipline of History

A position's title changes — and your warehouse must remember both versions, so a query about last year reports last year's title, not today's. That is Slowly Changing Dimension Type 2, and getting it right once is the difference between a warehouse you trust and one you have to apologize for.

8 min read · warehouse · kimball · dim-position · scd-type-2 · banner · nbbposn

Goal

By the end of this step you will have:

A populated **dim_position table with one row per version of every position** ever active at Waubonsee. For ~600 currently-active positions plus historical versions, expect ~1,500–2,500 rows total, which is still tiny.

A working SCD Type 2 pattern: every change to a position (title, classification, grade, status) creates a new row with a new surrogate key, and the old row gets retired. The fact table will always point to the version that was current at the fact's date.

A Windmill flow with two scripts: one for the initial load (run once), one for the incremental load (run on a schedule, daily is plenty).

A working understanding of the Position ≠ Job ≠ Employee discipline: this dimension stores the slot, not the person filling it.

The grain of dim_position is one row per (position-code, version). Banner thinks of position 100123 as one entity that changes over time; the warehouse thinks of it as a stream of versioned snapshots, each with its own surrogate key.

Before you start

You should have:

Completed Build the Date Dimension — One Row Per Day, Three Calendars in One Table: dim_date is loaded and reachable from the warehouse target.

Read Slowly Changing Dimensions — Keeping History When Attributes Change if it exists, or the Kimball-canonical SCD chapter. Type 2 is the version this dimension uses; Type 1 (overwrite) and Type 3 (extra column) come up in the Watch-Out section.

**Familiarity with NBBPOSN**, the position master table. Open one row in your SQL tool and look at the columns: nbbposn_posn, nbbposn_title, nbbposn_pcls_code, nbbposn_pcat_code, nbbposn_status, nbbposn_eclass_code, plus the effective-date pattern Banner uses for almost every master table (see Effective Dating — Why Banner Never Forgets and The MAX() Subquery — Getting the Row That's Current).

**Confirmation that your warehouse target supports SCD-friendly patterns**: specifically, that effective_end_date can be NULL (or a far-future sentinel like 9999-12-31) without breaking the query patterns the analytic layer uses.

You do not need NBRJOBS (job assignments — that powers dim_employee and the fact's employee_key) or PWVEMPL (employee attributes) for this step. Those come in G5 when the fact is built.

Build it

A Type 2 dimension has three jobs at once:

Record the current version of every entity (the row with current_flag = TRUE is what the live system shows).

Preserve every prior version, with the dates each was valid, so historical facts still resolve to the historical truth.

Hand a stable surrogate key to every version so the fact table can point at the right one without ambiguity.

The shape of the table looks like this. Every column falls into one of four groups — identity (which position), classification (what it is), status (what state it's in), and SCD control (when this version is valid).

One row of dim_position — identity columns, classification attributes, status flags, and the four SCD-Type-2 control columns that make versioning work.

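-- dim_position - Type 2 SCD. One row per (position, version).
-- The sequence feeds position_key; the load scripts call nextval() on it.
CREATE SEQUENCE IF NOT EXISTS dim_position_key_seq;
CREATE TABLE dim_position (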
    -- SCD control
    position_key            INTEGER     PRIMARY KEY,    -- surrogate
    effective_start_date    DATE        NOT NULL,       -- this version starts
    effective_end_date      DATE,                       -- NULL = still current
    current_flag            BOOLEAN     NOT NULL,       -- TRUE for one row per code
    -- identity (the natural key + descriptive labels)
    position_code           VARCHAR(8)  NOT NULL,       -- Banner's NBBPOSN_POSN
    position_title          VARCHAR(60) NOT NULL,
    -- classification
    pclass_code             VARCHAR(8),                 -- NBBPOSN_PCLS_CODE
    pclass_desc             VARCHAR(60),
    pcat_code               VARCHAR(8),                 -- pay category
    eclass_code             VARCHAR(4),                 -- employee class (F/P)
    salary_grade            VARCHAR(8),
    -- status
    position_status         VARCHAR(2)  NOT NULL,       -- A/F/C/Q (Active/Frozen/Cancelled/Quasi)
    is_active               BOOLEAN     NOT NULL,
    -- provenance
    source_updated_at       TIMESTAMP   NOT NULL        -- when ETL last touched
);
CREATE UNIQUE INDEX ux_dim_position_current
    ON dim_position (position_code) WHERE current_flag = TRUE;

The partial unique index is the most important line in the file. It enforces, at the schema level and not by convention, that **at most one row per position code has current_flag = TRUE**. If your ETL ever tries to mark two versions current at the same time, the database refuses the write. The index cannot guarantee that a current row exists at all; the parity check in the verification section covers that side. That single line catches a class of bugs that otherwise hides for months.

Initial load. Run once, from a snapshot of NBBPOSN:

-- Seed dim_position from current NBBPOSN. Each position gets ONE row
-- with effective_start = its last activity date (a proxy for when the
-- current attributes took effect; a default historical date is used if
-- it is missing), current_flag = TRUE, end = NULL.
INSERT INTO dim_position (
    position_key, effective_start_date, effective_end_date, current_flag,
    position_code, position_title, pclass_code, pcat_code, eclass_code,
    position_status, is_active, source_updated_at)
SELECT  nextval('dim_position_key_seq')          AS position_key,
        COALESCE(nbbposn_activity_date,
                 DATE '2010-01-01')              AS effective_start_date,
        NULL                                     AS effective_end_date,
        TRUE                                     AS current_flag,
        nbbposn_posn                             AS position_code,
        nbbposn_title                            AS position_title,
        nbbposn_pcls_code                        AS pclass_code,
        nbbposn_pcat_code                        AS pcat_code,
        nbbposn_eclass_code                      AS eclass_code,
        nbbposn_status                           AS position_status,
        (nbbposn_status = 'A')                   AS is_active,
        CURRENT_TIMESTAMP                        AS source_updated_at
FROM    banner.nbbposn;

Incremental load. Run daily. For each position in Banner, compare the source attributes to the current dim_position row. If they match, do nothing. If they differ, retire the old row and insert a new one — that is the Type 2 dance:

-- Find positions whose source row differs from our current version.
-- A change in any of (title, pclass, pcat, eclass, status) triggers
-- a new SCD version.
-- The change set is materialized first: a CTE is visible only to the one
-- statement it is attached to, and once Step 1 retires the old rows the
-- comparison would no longer find them.
-- If your runner already opens a transaction, drop the BEGIN/COMMIT pair.
BEGIN;

CREATE TEMP TABLE changed ON COMMIT DROP AS
WITH banner_now AS (
    SELECT nbbposn_posn         AS position_code,
           nbbposn_title        AS title,
           nbbposn_pcls_code    AS pclass,
           nbbposn_pcat_code    AS pcat,
           nbbposn_eclass_code  AS eclass,
           nbbposn_status       AS status
    FROM   banner.nbbposn)
SELECT b.*
FROM   banner_now b
JOIN   dim_position d
       ON d.position_code = b.position_code
      AND d.current_flag  = TRUE
-- IS DISTINCT FROM treats NULL as a comparable value, so a change to or
-- from NULL (pclass, pcat, eclass are nullable) is not silently skipped.
WHERE  (d.position_title, d.pclass_code, d.pcat_code,
        d.eclass_code, d.position_status)
    IS DISTINCT FROM (b.title, b.pclass, b.pcat, b.eclass, b.status);

-- Step 1: retire the old version.
UPDATE dim_position
SET    current_flag = FALSE,
       effective_end_date = CURRENT_DATE
WHERE  current_flag = TRUE
  AND  position_code IN (SELECT position_code FROM changed);

-- Step 2: insert the new version.
-- Brand-new position codes (no dim row yet) are not covered here; the
-- inner join above only versions positions the dimension already knows.
INSERT INTO dim_position (
    position_key, effective_start_date, effective_end_date, current_flag,
    position_code, position_title, pclass_code, pcat_code, eclass_code,
    position_status, is_active, source_updated_at)
SELECT  nextval('dim_position_key_seq'), CURRENT_DATE, NULL, TRUE,
        c.position_code, c.title, c.pclass, c.pcat, c.eclass,
        c.status, (c.status = 'A'), CURRENT_TIMESTAMP
FROM    changed c;

COMMIT;

Wrap both steps in a single Windmill flow with a transaction so the retire+insert is atomic — see results conveyor for the pattern of passing results between steps and retry failure for the retry posture.

One position (code 100123) over time — three versions, three surrogate keys, three rows in dim_position. The fact for each month points to whichever version was current that month.

The figure shows the payoff: position 100123 has three versions over two years. The August 2024 fact row points to version position_key = 1042. The November 2024 fact row points to position_key = 1187 because a title change happened in October. The April 2025 fact row points to position_key = 1304 because the classification changed again in March. Each fact tells the historical truth, because each fact points to the version that was true at its date.
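The lookup that makes this work is a point-in-time query against dim_position. The sketch below assumes half-open validity intervals, which is what the incremental load produces when it stamps the retired row's effective_end_date and the new row's effective_start_date with the same date:

```sql
-- Which version of position 100123 was in force on 1 November 2024?
-- A version is valid from effective_start_date up to, but not including,
-- effective_end_date; the current version has effective_end_date = NULL.
SELECT  position_key, position_title
FROM    dim_position
WHERE   position_code = '100123'
  AND   effective_start_date <= DATE '2024-11-01'
  AND   (effective_end_date IS NULL
         OR effective_end_date > DATE '2024-11-01');
```

Against the timeline in the figure, this should resolve to the version created by the October title change. The fact load runs the same predicate with the fact row's first-of-month date in place of the literal.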

Verify against Banner

Two checks make the dimension trustworthy:

Row-count parity for current rows. The set of dim_position rows with current_flag = TRUE must match the set of positions in NBBPOSN, whatever their status (the dimension loads every position and carries status as an attribute). Exactly. No off-by-one:

-- The count must match - and the codes must match.
WITH dim_current AS (
    SELECT position_code
    FROM   dim_position
    WHERE  current_flag = TRUE),
banner_active AS (
    SELECT nbbposn_posn AS position_code
    FROM   banner.nbbposn)
SELECT 'in dim but not banner' AS issue, position_code
FROM   dim_current
WHERE  position_code NOT IN (SELECT position_code FROM banner_active)
UNION ALL
SELECT 'in banner but not dim', position_code
FROM   banner_active
WHERE  position_code NOT IN (SELECT position_code FROM dim_current);

Zero rows or stop. Any row here is a load gap or a deletion the ETL missed.

Audit a known historical change. Pick a position that you know changed title or classification last year — the Budget Office can name one. Confirm dim_position has multiple rows for that position_code, that exactly one has current_flag = TRUE, that the effective_end_date of the prior version equals the next version's effective_start_date (no gaps), and that the most-recent version's attributes match NBBPOSN today. If any of those breaks, the SCD logic has a bug.
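Most of that audit can be automated across every position at once. A minimal sketch using a window function, under the same convention as the load scripts (a retired version ends on the date its successor starts):

```sql
-- Version-chain audit for dim_position. Returns one row per broken link.
SELECT  position_code, position_key,
        effective_start_date, effective_end_date, next_start
FROM   (SELECT position_code, position_key, current_flag,
               effective_start_date, effective_end_date,
               LEAD(effective_start_date) OVER (
                   PARTITION BY position_code
                   ORDER BY effective_start_date, position_key) AS next_start
        FROM   dim_position) v
WHERE  -- a retired version must end exactly where its successor starts
       (next_start IS NOT NULL
        AND effective_end_date IS DISTINCT FROM next_start)
   OR  -- the latest version must be open-ended and flagged current
       (next_start IS NULL
        AND (effective_end_date IS NOT NULL OR current_flag = FALSE));
```

Zero rows means every chain is contiguous and ends in an open, current version. Comparing the latest version's attributes with NBBPOSN still needs the parity query above or a manual spot check.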

Watch out

Five traps — at least three of them will bite you the first time:

The "current" flag must be exactly one row per code. The partial unique index in the DDL enforces this, but you can still write ETL code that violates it temporarily inside a transaction that swaps versions. Always retire the old row before inserting the new one, in the same transaction. If you reverse the order, you get a constraint violation that takes hours to debug.

Vacant positions still get a row. A position with no NBRJOBS assignment today is still a position; it has a budget, a title, a classification, and an is_active status. It belongs in dim_position like every other position. The "no employee" fact is handled by dim_employee's "vacant" sentinel row, not by omitting the position.

Late-arriving changes. Banner occasionally back-dates a position attribute change: somebody marks a title effective last month, not today. Pure Type 2 does not naturally handle this: inserting a new version today loses last month's correct title. The pragmatic policy at Waubonsee: detect any nbbposn_activity_date < CURRENT_DATE in the incremental load, log a warning, and either (a) accept it as "we learned about the change today" or (b) hand-correct prior versions. Most teams pick (a) because (b) means re-loading every fact row from that date forward.

Position recodes lose the trail. When Banner reissues an nbbposn_posn (e.g. "100123" is retired and replaced with "100124" for the same actual role), pure Type 2 treats it as two unrelated positions. The fact rows split, the historical trail breaks. If recodes are common, add a previous_position_code attribute and a lineage_id that survives the recode. Document the recode policy with HR before you build the column.

**Do not store employee data on dim_position.** Tempting: "I know who fills this position right now, let me put current_pidm on the row." Resist. Employee data belongs in dim_employee. The moment you store the employee on the position dim, you either (a) create a Type 2 cascade where every employee-change versions the position dim, or (b) silently let the position dim lie about history. Both are bad. Position and employee are separate dimensions for a reason; keep them separate.

The one-sentence takeaway

Type 2 is the discipline of remembering — every change becomes a new row, every fact points to the version that was true at its moment.

Track G · Step 5 of 8 · Building the Waubonsee warehouse

Build the Position-Budget Fact — The Center of the First Star

Everything in the warehouse exists to support one thing: a fact table you can query without thinking about Banner. This is the step that builds it. After this, an analyst can answer 'budgeted vs actual by department by month' with three joins and no MAX subquery — a five-second query against a star that did not exist yesterday.

11 min read · warehouse · kimball · fact-table · position-budget · periodic-snapshot · nbbposn · nbrplbd · upsert

Goal

By the end of this step you will have:

A populated **fct_position_budget** table with one row per (position × fund × month), spanning the warehouse's effective history. For Waubonsee's ~600 active positions across ~5 funds per position over 12 months per fiscal year, expect roughly 36,000 new rows per year: small for a fact table, large enough to make trend analysis meaningful.

The periodic snapshot pattern working end-to-end (see The Three Fact-Table Patterns — Transaction, Periodic, Accumulating for where this pattern fits in the taxonomy). Every month, the same set of positions gets a fresh set of fact rows regardless of whether the budget changed; predictability is the point.

A Windmill flow that runs monthly, extracts changes from Banner, resolves surrogate keys, upserts the fact, and advances the watermark. Idempotent: rerunning the flow does not duplicate data.

A reconciliation query that ties the warehouse's totals back to the Banner-side Position Control reports the Budget Office is already running. The two must agree to the dollar.

The grain of fct_position_budget is one row per (position-code, fund-code, snapshot-month). The grain was declared in Declare the Grain — One Row Equals One What? and nothing in this step changes it.

Before you start

You should have:

Build the Date Dimension — One Row Per Day, Three Calendars in One Table complete. dim_date is loaded, the fiscal-year hierarchy is populated, and each month has a first-of-month date_key you can look up by (year, month, day = 1), matching the first-of-month convention for monthly facts.

Build the Position Dimension — SCD Type 2 and the Discipline of History complete. dim_position is loaded with SCD Type 2 discipline. Every active position has a current row (current_flag = TRUE), and each version's surrogate position_key stays fixed once assigned.

**dim_organization, dim_fund, and dim_employee loaded**: the three other conformed dimensions this fact joins to. dim_organization from FTVORGN, dim_fund from GOVSDAV, dim_employee from NBRJOBS + PWVEMPL (the latter is the Banner view that materializes employee attributes from PEBEMPL + SPRIDEN). For each, an "Unknown" sentinel row with _key = -1 exists, ready to absorb NULL source FKs.

**Familiarity with NBRPLBD**, the position labor distribution table. One row per (position, suffix, fund, account, effective date) declaring what percentage of the position's labor is charged against each funding source. The amount per fund per month is derived: (annual_salary × percent / 12) / 100. Verify the exact formula with your Banner Finance team before shipping the load.

An ETL watermark row for NBRPLBD and one for NBBPOSN in the etl_watermark table (see ETL from Banner — Moving Data on a Schedule, with Windmill for the watermark pattern). On day one both can be a far-past sentinel like '1900-01-01'.
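To see the derived-amount formula from the NBRPLBD item in action, here is a self-contained PostgreSQL sketch for one position split across three funds. Every figure and fund code is made up for illustration, and the formula itself still needs the sign-off from Banner Finance mentioned above:

```sql
-- Worked example of (annual_salary × percent / 12) / 100 for a position
-- with a 60,000.00 annual salary split 50/30/20 across three funds.
SELECT   fund_code,
         pct,
         ROUND(60000.00 * pct / 12 / 100, 2) AS monthly_budgeted_amt
FROM    (VALUES ('FUND_A', 50),
                ('FUND_B', 30),
                ('FUND_C', 20)) AS funding(fund_code, pct)
ORDER BY fund_code;
-- FUND_A 2500.00, FUND_B 1500.00, FUND_C 1000.00; together 5000.00,
-- which is 60000 / 12.
```

Splits that do not divide evenly (three funds at 33.33/33.33/33.34, say) can leave the rounded rows a cent away from the monthly total, so decide where that cent goes before reconciling to the dollar.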

You do not need PHRHIST (payroll transaction history, which feeds a separate transaction fact in a later star), TBRACCD (student AR, a different domain), or actual GL ledger postings (those are a different periodic snapshot). The Position-Budget fact is BUDGET data: what the position SHOULD have cost, not what the payroll actually paid.

Build it

A periodic snapshot fact has three jobs at once:

Record the budgeted dollars per position per fund per month.
Record the actual dollars per position per fund per month (encumbrances or paid, depending on the snapshot date).

Carry foreign keys to the conformed dimensions so analysts can slice by department, employee, fund, time, and position attributes without ever touching Banner.

The shape of the table is deliberately tiny. Every column has a job; nothing extra is allowed.

One row of fct_position_budget at full detail — five foreign keys to the conformed dimensions, two additive measures, the composite primary key that enforces the grain. Every column has a job; nothing extra is allowed in.

-- fct_position_budget — periodic snapshot, monthly grain.
-- One row per (position, fund, month). Every column is either an
-- FK to a conformed dimension or an additive measure.
CREATE TABLE fct_position_budget (
    -- composite primary key enforces the grain at the DB level
    position_key   INTEGER  NOT NULL REFERENCES dim_position    (position_key),
    fund_key       INTEGER  NOT NULL REFERENCES dim_fund        (fund_key),
    date_key       INTEGER  NOT NULL REFERENCES dim_date        (date_key),
    -- non-grain conformed FKs (informational, not part of grain)
    employee_key   INTEGER  NOT NULL REFERENCES dim_employee    (employee_key),
    org_key        INTEGER  NOT NULL REFERENCES dim_organization(org_key),
    -- additive measures
    budgeted_amt   NUMERIC(12,2)    NOT NULL DEFAULT 0,
    actual_amt     NUMERIC(12,2)    NOT NULL DEFAULT 0,
    -- provenance
    source_loaded_at TIMESTAMP      NOT NULL,
    PRIMARY KEY (position_key, fund_key, date_key)
);
CREATE INDEX ix_fct_pb_date  ON fct_position_budget (date_key);
CREATE INDEX ix_fct_pb_org   ON fct_position_budget (org_key, date_key);
CREATE INDEX ix_fct_pb_emp   ON fct_position_budget (employee_key, date_key);

The composite primary key (position_key, fund_key, date_key) is the most important line in the file. It enforces the grain at the database level — the database itself refuses to accept two rows for the same position-fund-month combination. If the ETL ever tries to insert a duplicate, the constraint fires and the transaction rolls back. That single line catches a class of bugs the load logic otherwise hides.
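
To see the grain constraint fire, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in for the warehouse database. The table is a trimmed-down, hypothetical copy of fct_position_budget (grain columns plus one measure), and the helper name try_insert and the sample values are illustrative only.

```python
import sqlite3

# Hypothetical, trimmed-down copy of the fact table: grain columns plus one measure.
DDL = """
CREATE TABLE fct_position_budget (
    position_key INTEGER NOT NULL,
    fund_key     INTEGER NOT NULL,
    date_key     INTEGER NOT NULL,
    budgeted_amt NUMERIC NOT NULL DEFAULT 0,
    PRIMARY KEY (position_key, fund_key, date_key)
)
"""

def try_insert(conn, row):
    """Insert one fact row; return True if accepted, False if the grain is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fct_position_budget VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

first = try_insert(conn, (101, 7, 20260131, 4166.67))
duplicate = try_insert(conn, (101, 7, 20260131, 9999.99))   # same position-fund-month
other_fund = try_insert(conn, (101, 8, 20260131, 4166.67))  # different fund, new grain
row_count = conn.execute(
    "SELECT COUNT(*) FROM fct_position_budget").fetchone()[0]
```

The second insert is rejected and rolled back, so only two rows survive: one per distinct (position, fund, month).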

employee_key and org_key are FKs but NOT in the primary key — they are informational dimensions, derived from the position at snapshot time. The same position-fund-month is one fact row even if the assigned employee changed mid-month. The position's current employee is captured as a snapshot attribute, not as part of the grain. If a position changed hands during the month, the fact row holds whichever employee was current at the snapshot date — the SCD Type 2 row of dim_employee resolves correctly because we look it up using the snapshot date.

Extract. The source is NBBPOSN × NBRPLBD. NBBPOSN is the position master (already feeding dim_position per Build the Position Dimension — SCD Type 2 and the Discipline of History); NBRPLBD is the labor distribution table — one row per (position, suffix, fund, account, effective date) declaring how the position's labor is split. The extract pulls only rows changed since the last successful load:

-- Extract: pull NBRPLBD distribution rows changed since the watermark.
-- Each row tells us a percent allocation of a position's labor to a fund.
SELECT  pld.nbrplbd_posn         AS position_code,
        pld.nbrplbd_suff         AS suffix,
        pld.nbrplbd_fund_code    AS fund_code,
        pld.nbrplbd_orgn_code    AS org_code,
        pld.nbrplbd_acct_code    AS account_code,
        pld.nbrplbd_percent      AS distribution_percent,
        pld.nbrplbd_effective_date AS effective_date,
        pld.nbrplbd_activity_date  AS activity_date,
        -- position attributes that drive the budget calculation
        pos.nbbposn_salary_table AS salary_table,
        pos.nbbposn_salary_grade AS salary_grade,
        pos.nbbposn_step         AS salary_step,
        pos.nbbposn_budget       AS annual_budget,
        pos.nbbposn_eclass_code  AS eclass_code
FROM    nbrplbd  pld
JOIN    nbbposn  pos
        ON  pos.nbbposn_posn = pld.nbrplbd_posn
WHERE   pld.nbrplbd_activity_date > :watermark_nbrplbd
   OR   pos.nbbposn_activity_date > :watermark_nbbposn
ORDER BY pld.nbrplbd_posn, pld.nbrplbd_suff, pld.nbrplbd_fund_code;

Notice the OR in the WHERE — a change to EITHER the labor distribution OR the position itself triggers a re-snapshot of the affected position. Two watermarks advance independently.
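
The effect of that OR can be sketched in plain Python. The function name changed_positions, the dictionary keys, and the sample rows are assumptions made for illustration; the real filter runs inside Oracle.

```python
from datetime import datetime

def changed_positions(rows, wm_nbrplbd, wm_nbbposn):
    """Position codes whose distribution OR position row is newer than its watermark."""
    return sorted({
        r["position_code"]
        for r in rows
        if r["nbrplbd_activity"] > wm_nbrplbd
        or r["nbbposn_activity"] > wm_nbbposn
    })

# Made-up rows: P001 has a newer labor distribution, P002 a newer position
# record, and P003 has not changed since the last load.
rows = [
    {"position_code": "P001",
     "nbrplbd_activity": datetime(2026, 1, 20),
     "nbbposn_activity": datetime(2025, 6, 1)},
    {"position_code": "P002",
     "nbrplbd_activity": datetime(2025, 3, 1),
     "nbbposn_activity": datetime(2026, 1, 15)},
    {"position_code": "P003",
     "nbrplbd_activity": datetime(2025, 3, 1),
     "nbbposn_activity": datetime(2025, 6, 1)},
]
wm = datetime(2026, 1, 1)
picked = changed_positions(rows, wm, wm)
```

With an AND in place of the OR, neither P001 nor P002 would be picked, because each changed on only one side.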

Transform. For each extracted row, resolve the surrogate keys and compute the monthly amount. The position-key lookup uses the SCD Type 2 row that was current at the snapshot date — exactly the join from Build the Position Dimension — SCD Type 2 and the Discipline of History's "Verify against Banner" section, but anchored to the snapshot date rather than today:

-- Transform: for each (position, fund, month) combination,
-- resolve the surrogate keys against the dim tables AS OF the
-- snapshot date, and compute the monthly amount.
WITH snapshot_rows AS (
    SELECT  e.position_code,
            e.fund_code,
            e.org_code,
            -- per-position-fund monthly budget = annual × pct ÷ 100 ÷ 12
            ROUND(e.annual_budget * e.distribution_percent / 100.0 / 12.0, 2)
              AS budgeted_amt,
            :snapshot_date AS snapshot_date
    FROM    extract_staging e
)
SELECT  dp.position_key,
        df.fund_key,
        dd.date_key,
        de.employee_key,
        do_.org_key,
        s.budgeted_amt,
        -- actual is loaded from a separate query against the
        -- actuals snapshot table - placeholder here.
        0.00 AS actual_amt,
        CURRENT_TIMESTAMP AS source_loaded_at
FROM    snapshot_rows s
JOIN    dim_position    dp ON dp.position_code = s.position_code
                          AND dp.effective_start_date <= s.snapshot_date
                          AND (dp.effective_end_date >= s.snapshot_date
                               OR dp.effective_end_date IS NULL)
JOIN    dim_fund        df ON df.fund_code     = s.fund_code
JOIN    dim_organization do_ ON do_.org_code   = s.org_code
JOIN    dim_date        dd ON dd.full_date     = s.snapshot_date
LEFT JOIN dim_employee  de ON de.position_code = s.position_code
                          AND de.effective_start_date <= s.snapshot_date
                          AND (de.effective_end_date >= s.snapshot_date
                               OR de.effective_end_date IS NULL);

The LEFT JOIN to dim_employee is deliberate: a vacant position has no employee, and we want the fact row anyway (with employee_key = -1, the "Unknown" sentinel). The transform leaves employee_key NULL for a vacant position; the COALESCE(employee_key, -1) in the load's SELECT turns that NULL into the sentinel.
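
The as-of lookup and the vacant-position fallback can be sketched in a few lines of Python. The function key_as_of and the dictionary keys start and end are hypothetical stand-ins for the SCD Type 2 columns effective_start_date and effective_end_date; the sample rows are invented.

```python
from datetime import date

VACANT_KEY = -1  # the "Unknown" sentinel named in the text

def key_as_of(dim_rows, position_code, snapshot_date):
    """Surrogate key of the row in effect on snapshot_date, else the sentinel."""
    for row in dim_rows:
        started = row["start"] <= snapshot_date
        still_open = row["end"] is None or row["end"] >= snapshot_date
        if row["position_code"] == position_code and started and still_open:
            return row["employee_key"]
    return VACANT_KEY  # same outcome as LEFT JOIN + COALESCE(..., -1)

# The position changed hands on 2026-01-15 (made-up data).
dim_employee = [
    {"employee_key": 11, "position_code": "P001",
     "start": date(2024, 7, 1), "end": date(2026, 1, 14)},
    {"employee_key": 12, "position_code": "P001",
     "start": date(2026, 1, 15), "end": None},
]

december_key = key_as_of(dim_employee, "P001", date(2025, 12, 31))
january_key = key_as_of(dim_employee, "P001", date(2026, 1, 31))
vacant_key = key_as_of(dim_employee, "P999", date(2026, 1, 31))
```

The December snapshot resolves to the earlier employee, the January snapshot to the later one, and a position with no matching row falls back to -1.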

Load. The fact is loaded with an UPSERT, keyed on the composite grain. Rerunning the load for the same month is a no-op or an in-place update — never a duplicate:

-- Load: UPSERT into fct_position_budget.
-- Idempotent by design — running twice = running once.
INSERT INTO fct_position_budget (
    position_key, fund_key, date_key,
    employee_key, org_key,
    budgeted_amt, actual_amt, source_loaded_at)
SELECT  position_key, fund_key, date_key,
        COALESCE(employee_key, -1), org_key,
        budgeted_amt, actual_amt, source_loaded_at
FROM    transform_staging
ON CONFLICT (position_key, fund_key, date_key) DO UPDATE
SET     employee_key     = EXCLUDED.employee_key,
        org_key          = EXCLUDED.org_key,
        budgeted_amt     = EXCLUDED.budgeted_amt,
        actual_amt       = EXCLUDED.actual_amt,
        source_loaded_at = EXCLUDED.source_loaded_at;
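
The "running twice = running once" property can be checked with a small sketch. SQLite (3.24 or later, which supports ON CONFLICT ... DO UPDATE) stands in for the warehouse here, the table is a trimmed-down hypothetical copy of the fact, and the helper name load and the amounts are illustrative.

```python
import sqlite3

UPSERT = """
INSERT INTO fct_position_budget (position_key, fund_key, date_key, budgeted_amt)
VALUES (?, ?, ?, ?)
ON CONFLICT (position_key, fund_key, date_key) DO UPDATE
SET budgeted_amt = excluded.budgeted_amt
"""

def load(conn, batch):
    """Apply one batch in a single transaction; return the table contents."""
    with conn:
        conn.executemany(UPSERT, batch)
    return conn.execute(
        "SELECT position_key, fund_key, date_key, budgeted_amt "
        "FROM fct_position_budget ORDER BY 1, 2, 3").fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fct_position_budget (
    position_key INTEGER NOT NULL, fund_key INTEGER NOT NULL,
    date_key INTEGER NOT NULL, budgeted_amt REAL NOT NULL,
    PRIMARY KEY (position_key, fund_key, date_key))""")

batch = [(101, 7, 20260131, 2500.0), (101, 8, 20260131, 1666.67)]
after_first = load(conn, batch)
after_second = load(conn, batch)                       # rerun of the same month
after_fix = load(conn, [(101, 7, 20260131, 2600.0)])   # corrected amount, same grain
```

The rerun leaves the table identical, and the corrected amount overwrites the existing row in place instead of adding a third one.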

Advance the watermark (only after the transaction commits):

UPDATE etl_watermark
SET    last_loaded_at = :max_activity_date_in_batch
WHERE  source_table IN ('NBRPLBD', 'NBBPOSN');

Wrap it in a Windmill flow. The full sequence — extract, transform, load, watermark — runs as four steps in a single Windmill flow, scheduled monthly (typically the 2nd or 3rd of each month, after Banner's month-end close is complete). See ETL from Banner — Moving Data on a Schedule, with Windmill for the schedule pattern and The ETL Flow — Wiring the Load into Windmill for the actual Windmill flow definition.

The monthly load — extract NBBPOSN × NBRPLBD changes since the watermark, resolve surrogate keys against the dim tables, UPSERT into the fact keyed on (position_key, fund_key, date_key), advance the watermark on success.

The monthly flow takes minutes — the row counts are small and the warehouse is on dedicated hardware. The bigger time investment is the initial load (one-time backfill of historical months from Banner snapshots), which can take an hour for a few years of history.

Verify against Banner

Two reconciliation queries make the fact trustworthy. Each produces totals to compare against a Banner-side figure; if the numbers do not agree, stop and investigate.

Total budgeted dollars per fiscal year must match Banner. The Banner-side report — typically a Position Control summary from the Budget Office — has a total budgeted-dollars number per fiscal year. Your warehouse query against fct_position_budget summing budgeted_amt for the same fiscal year must produce the same number. To the dollar:

-- Total budgeted dollars per fiscal year.
SELECT  d.fiscal_year,
        SUM(f.budgeted_amt) AS warehouse_budgeted_total
FROM    fct_position_budget f
JOIN    dim_date d ON d.date_key = f.date_key
WHERE   d.fiscal_year IN (2024, 2025, 2026)
GROUP BY d.fiscal_year
ORDER BY d.fiscal_year;

Compare to the Banner Finance Office's fiscal-year budget summary. If the numbers differ by more than rounding, the load has a bug — usually a mis-mapped fund code, a percent calculation error, or a position that was active in Banner but missed by the extract's WHERE clause.
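
The comparison itself is easy to automate once both sets of totals are in hand. The sketch below assumes the warehouse totals and the Banner-side totals have already been collected into dictionaries keyed by fiscal year. The function name reconcile and all figures are invented for illustration; Decimal is used so cents compare exactly.

```python
from decimal import Decimal

def reconcile(warehouse_totals, banner_totals, tolerance=Decimal("0.00")):
    """Return {fiscal_year: warehouse minus banner} for every year outside tolerance."""
    mismatches = {}
    for year in sorted(set(warehouse_totals) | set(banner_totals)):
        diff = (warehouse_totals.get(year, Decimal("0"))
                - banner_totals.get(year, Decimal("0")))
        if abs(diff) > tolerance:
            mismatches[year] = diff
    return mismatches

# Illustrative figures only, not real Banner data.
warehouse = {2024: Decimal("1250000.00"), 2025: Decimal("1310400.12")}
banner = {2024: Decimal("1250000.00"), 2025: Decimal("1310450.12")}

problems = reconcile(warehouse, banner)
within_tolerance = reconcile(warehouse, banner, tolerance=Decimal("100.00"))
```

A year present on only one side shows up as a mismatch for its full amount, which is usually the signature of a load gap.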

Per-org totals for the most recent closed month. A finer-grain check catches the same bugs at a smaller blast radius:

-- Budgeted by org for the most recent fully-loaded month.
SELECT  o.org_name,
        SUM(f.budgeted_amt) AS budgeted_this_month,
        COUNT(DISTINCT f.position_key) AS positions_in_org
FROM    fct_position_budget f
JOIN    dim_date            d ON d.date_key = f.date_key
JOIN    dim_organization    o ON o.org_key  = f.org_key
WHERE   d.year_month = (SELECT MAX(d2.year_month)
                        FROM   dim_date d2
                        JOIN   fct_position_budget f2
                               ON f2.date_key = d2.date_key)
GROUP BY o.org_name
ORDER BY budgeted_this_month DESC;

This list — top departments by budgeted dollars for last month — should be instantly recognizable to anyone in the Budget Office. Show it to them. If a department is missing or a number looks wildly off, you have a load gap, a dim_organization mapping mistake, or a position whose org code changed without the SCD catching it. Fix before declaring the star "open for queries."

Watch out

Five traps that catch teams loading their first fact:

The grain is the primary key. Defend it. If your load is producing rows that violate PRIMARY KEY (position_key, fund_key, date_key), the database tells you immediately. Do NOT add a surrogate fct_id PK that masks duplicates; let the composite key fail loudly. The constraint violation IS the feature: it catches every grain bug at load time instead of in a confused VP's email three months later.

NBRPLBD is effective-dated. There are multiple rows per (position, fund) with different nbrplbd_effective_date values. The extract must use the MAX-effective pattern from The MAX() Subquery — Getting the Row That's Current anchored at the snapshot date, not just the latest row. A reorganization mid-year that re-allocates a position's funding split is real; the snapshot for January should use January's split, not today's. The transform's WHERE needs nbrplbd_effective_date = (SELECT MAX(...) WHERE effective_date <= snapshot_date AND same position/fund).

Percent × salary ÷ 12 has rounding error: anchor the reconciliation. Each monthly row rounds the cents. Summing 12 monthly rows for a position-fund may not exactly equal the annual budget × percent ÷ 100. The discrepancy is small (each monthly row is off by at most half a cent, so at most about six cents per position-fund per year) but real. If your Budget Office expects to-the-cent agreement against an annual figure, store the annual amount as a separate measure or have the monthly load round-trip against the annual to absorb the cents.

Vacant positions DO appear in the fact. A position with no assigned employee still has a budget allocation. The fact row carries employee_key = -1 (the "Unknown / Vacant" sentinel in dim_employee). Filtering vacant positions out of analytic queries is a reporting choice; the fact stores the truth. New teams sometimes drop vacant positions during the load ("there's no employee, the row is meaningless"), which silently under-states budgeted totals by every vacancy. Wrong. Vacancies are budget.

Do not add credit_hours, gpa, or any cross-grain measure. The temptation will arise: "while we're loading this fact, let me also add the position's headcount, the FTE, the credit hours the employee teaches." Stop. Those measures live at different grains (employee-term for credit hours, position-month for FTE if you want it, etc.). Cross-grain measures in one fact table produce silent multiplication when joined. If a measure is at a different grain than (position × fund × month), it belongs in its own fact table; see The Second Star — Admissions as an Accumulating Snapshot for the second star and The Three Fact-Table Patterns — Transaction, Periodic, Accumulating for the pattern.
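
Two of these traps, the effective-dated split and the rounding drift, can be shown with a short sketch. The function names split_in_effect and monthly_amount, the dictionary keys, and the sample values are assumptions for illustration. ROUND_HALF_UP is chosen to mimic rounding a NUMERIC to two places, which is itself an assumption about the warehouse database.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

def split_in_effect(dist_rows, snapshot_date):
    """Latest row with effective_date <= snapshot_date, or None if none applies yet."""
    eligible = [r for r in dist_rows if r["effective_date"] <= snapshot_date]
    return max(eligible, key=lambda r: r["effective_date"]) if eligible else None

def monthly_amount(annual_budget, percent):
    """annual x percent / 100 / 12, rounded to cents as the fact stores it."""
    raw = annual_budget * percent / Decimal(100) / Decimal(12)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# One position-fund whose split changed in a mid-year reorganization (made-up values).
rows = [
    {"effective_date": date(2025, 7, 1), "percent": Decimal("100")},
    {"effective_date": date(2026, 3, 1), "percent": Decimal("60")},
]
january = split_in_effect(rows, date(2026, 1, 31))   # the 100% row
april = split_in_effect(rows, date(2026, 4, 30))     # the 60% row

annual = Decimal("50000.00")
month = monthly_amount(annual, january["percent"])
drift = month * 12 - annual   # cents gained by rounding each month
```

Here twelve monthly rows of 4166.67 add up to four cents more than the 50,000.00 annual figure, which is the kind of gap the reconciliation has to account for.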

The one-sentence takeaway

The fact table is one DDL, one extract, one transform, one UPSERT, one watermark — wrapped in a transaction and fired by a Windmill schedule. The discipline is in the grain, the surrogate-key lookups, and the reconciliation against Banner.

Track G · Step 6 of 8 · Building the Waubonsee warehouse

The ETL Flow — Wiring the Load into Windmill

The fact table is built. The load query works. Now the harder question: how does it RUN every month, unattended, recoverable, monitored — at 02:00 while you are asleep? Windmill is the stage manager. The flow is the cue sheet. This step turns the load you wrote in G5 into a piece of infrastructure that just works.

17 min read · warehouse · etl · windmill · flow · schedule · watermark · retry · position-budget

Goal

By the end of this step you will have:

A working Windmill flow named load_fct_position_budget that runs the G5 load end-to-end: read watermarks → extract from Banner → transform → UPSERT into the fact → advance watermarks → notify.

A monthly schedule firing the flow on the 2nd of each month at 02:00 (after Banner's month-end close window).

Secrets holding the Banner read-only credentials and the warehouse connection, separated from the flow code.

A retry policy that handles transient Banner-session exhaustion automatically, plus an on-failure handler that pages on-call when retries exhaust.

The ability to trigger the flow manually for backfills or emergency reloads, without changing the scheduled run.

The flow does not invent any new SQL — every query was already written in Build the Position-Budget Fact — The Center of the First Star. G6 wires it into the infrastructure that runs it on schedule, recovers from failure, and tells you when it finishes.

Before you start

You should have:

Build the Position-Budget Fact — The Center of the First Star complete: the fct_position_budget table exists, the extract/transform/load SQL is tested in a manual run, and the etl_watermark table has rows for NBBPOSN and NBRPLBD.

Conceptual familiarity with ETL pipelines: see ETL from Banner — Moving Data on a Schedule, with Windmill for the extract/transform/load model and the watermark pattern.

A Windmill workspace with admin access: you will need to create resources (database connections), variables/secrets (credentials), scripts, and a flow. If you have never touched Windmill, work through the WindmillExplainer wiki's introductory tracks first; this article assumes you know what a script and a flow are.

The Banner database account with read-only permissions on NBBPOSN, NBRPLBD, NBRBJOB, PWVEMPL, FTVORGN, GOVSDAV, and the SPRIDEN/SGBSTDN ecosystem (per Schemas — Which Drawer the Table Lives In grants). The account should NOT have write privileges on Banner.

The warehouse database account with write permissions on fct_position_budget, the dim_* tables, and etl_watermark.

A monitoring channel (Slack webhook, email distribution, or PagerDuty integration) for the on-failure handler.

You do not need Validate Against Banner — Agree to the Cent or Stop complete yet (it is the next step). You do not need any of G8/G9/G10 — those are future stars and will get their own flows.

Build it

A Windmill flow is a sequence of script steps connected by data dependencies — the output of step N becomes input to step N+1. Behind the flow sits a schedule that fires it on a cron expression, resources that supply database connections, and secrets that hold the credentials neither the flow code nor the schedule should ever see.

The choreography analogy is exact. A stage manager running a theatrical production has a cue sheet — an ordered list of events: "Cue 1: house lights down. Cue 2: stage lights up on position 4. Cue 3: actor enters from stage left. Cue 4: spotlight follows actor to center." Each cue is a discrete action, called in order, by the same stage manager, every performance. When a cue fails — a sound effect doesn't fire, a prop isn't where it should be — the stage manager has a backup plan: skip the cue, recover, signal the next cue. The audience sees a polished show; the stage manager sees the cue sheet.

The Windmill flow is the cue sheet. The scripts are the cues. The audience is the analyst opening Power BI Monday morning expecting fresh numbers. The stage manager is Windmill, invisible, calling the cues at 02:00 on the 2nd of each month while everyone else is asleep.

The six-step Position-Budget load flow as Windmill sees it — read watermarks, extract from Banner, transform with surrogate-key lookups, UPSERT into the fact, advance the watermark, notify on completion. Each step is its own Windmill script; the flow connects them.

Step 1 — Set up the resources

Before the flow can run, Windmill needs to know how to connect to Banner and to the warehouse. Both connections are resources — Windmill's typed wrappers around credentials and connection strings.

Create two database resources:

banner_oracle_readonly: Oracle connection to the Banner production database, account scoped to read-only on the source tables. The password lives in a Windmill secret, not in the resource definition; the resource references the secret by name.

warehouse_postgres_writer: PostgreSQL (or your warehouse database) connection with write privileges on dim_*, fct_*, and etl_watermark. Same secret pattern.

The secrets/variables distinction matters — see the WindmillExplainer wiki's article on variables vs secrets. Anything you would not write on a sticky note (passwords, API keys, database URLs containing credentials) goes in a secret. Anything you would (schedule cron expressions, database hostnames without credentials, slack channel names) goes in a variable. Get this wrong and either you leak secrets to logs or you cannot rotate credentials without re-deploying code.

Step 2 — Write the script steps

Each of the six load steps from Build the Position-Budget Fact — The Center of the First Star becomes its own Windmill script. Scripts are small, single-purpose, individually testable. The discipline: each script takes typed inputs and returns typed outputs. The flow wires the outputs of one script to the inputs of the next.

The six scripts:

**read_watermarks**. Input: nothing. Output: the current watermark timestamps for NBBPOSN and NBRPLBD, plus the snapshot date for this run (computed from today's date: the month-end of the prior month).

**extract_from_banner**. Input: the two watermarks and the snapshot date from step 1. Output: a staging table populated with the changed rows from NBBPOSN × NBRPLBD.

**transform_resolve_keys**. Input: the staging table reference. Output: a second staging table with surrogate keys resolved against dim_position, dim_fund, dim_date, dim_employee, dim_organization.

**load_fact**. Input: the transformed staging table. Output: row count loaded into fct_position_budget.

**advance_watermarks**. Input: the maximum activity_date from the extracted batch (passed down from step 2). Output: the new watermark values written to etl_watermark.

**notify_completion**. Input: the row count from step 4 and the batch's maximum activity_date from step 2 (the value the watermark advances to). Output: nothing (sends a Slack message and returns).

Each script is small — typically 20-60 lines. The full SQL for steps 2, 3, and 4 lives in Build the Position-Budget Fact — The Center of the First Star; the script just wraps the SQL in the Windmill script signature.

For example, the load_fact script (Python or PL/SQL — Windmill supports both):

# load_fact - wraps the UPSERT from G5 in a Windmill script.
# Returns the count of rows loaded for downstream steps and
# for the completion notification.
import psycopg2
from psycopg2 import sql

def main(transform_staging_table: str,
         warehouse_pg: dict) -> dict:
    # Compose the staging-table name as a quoted identifier instead of
    # pasting it into the SQL text with an f-string.
    upsert = sql.SQL("""
        INSERT INTO fct_position_budget (
            position_key, fund_key, date_key,
            employee_key, org_key,
            budgeted_amt, actual_amt, source_loaded_at)
        SELECT position_key, fund_key, date_key,
               COALESCE(employee_key, -1), org_key,
               budgeted_amt, actual_amt, source_loaded_at
        FROM   {staging}
        ON CONFLICT (position_key, fund_key, date_key)
          DO UPDATE
          SET employee_key = EXCLUDED.employee_key,
              org_key      = EXCLUDED.org_key,
              budgeted_amt = EXCLUDED.budgeted_amt,
              actual_amt   = EXCLUDED.actual_amt,
              source_loaded_at = EXCLUDED.source_loaded_at;
    """).format(staging=sql.Identifier(transform_staging_table))

    conn = psycopg2.connect(**warehouse_pg)
    try:
        # "with conn" commits on success and rolls back on error.
        with conn, conn.cursor() as cur:
            cur.execute(upsert)
            rows_loaded = cur.rowcount
    finally:
        # Close the connection even when the UPSERT raises.
        conn.close()
    return {"rows_loaded": rows_loaded}

The script does ONE thing — execute the UPSERT — and reports the row count. It does not own the schedule, the retry policy, the credentials, or the dependencies on other steps. Those live in the flow definition and the resource configuration.

Here are the other five scripts in the same shape — each tight, each typed, each focused on a single responsibility.

**read_watermarks** — opens the flow, reads the watermark state and computes the snapshot date for this run:

# read_watermarks - the flow's first step. No inputs (the schedule
# fires it), produces the values every downstream step needs.
import psycopg2
from datetime import date
from calendar import monthrange

def main(warehouse_pg: dict) -> dict:
    # snapshot date = last day of the PRIOR month
    today = date.today()
    y, m = (today.year, today.month - 1) if today.month > 1 \
           else (today.year - 1, 12)
    snapshot_date = date(y, m, monthrange(y, m)[1])

    conn = psycopg2.connect(**warehouse_pg)
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT source_table, last_loaded_at
            FROM   etl_watermark
            WHERE  source_table IN ('NBBPOSN', 'NBRPLBD')
        """)
        wm = {row[0]: row[1].isoformat() for row in cur.fetchall()}
    conn.close()

    return {
        "snapshot_date":       snapshot_date.isoformat(),
        "watermark_nbbposn":   wm.get('NBBPOSN', '1900-01-01'),
        "watermark_nbrplbd":   wm.get('NBRPLBD', '1900-01-01'),
    }
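
The snapshot-date arithmetic is the part of read_watermarks most worth testing on its own, because it has to survive the January rollover and leap years. One way to do that, sketched here with a hypothetical helper name prior_month_end, is to express it as a pure function of the run date.

```python
from datetime import date, timedelta

def prior_month_end(today: date) -> date:
    """Last calendar day of the month before `today`."""
    # Step back to the 1st of the current month, then one more day.
    return today.replace(day=1) - timedelta(days=1)

january_run = prior_month_end(date(2026, 1, 2))    # rolls back into December
march_run = prior_month_end(date(2026, 3, 2))      # ordinary February
leap_run = prior_month_end(date(2024, 3, 2))       # leap-year February
```

This gives the same dates as the monthrange computation in the script, and it takes the run date as an argument, so a manual backfill could pass any date it needs.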

**extract_from_banner** — reads Banner with the watermarks, writes a staging table the next step will transform:

# extract_from_banner - read-only against Banner, writes to staging.
# Returns the staging table name + the max activity_date seen
# (needed for advancing the watermark in step 5).
from datetime import datetime
import cx_Oracle, psycopg2

def main(watermark_nbbposn: str, watermark_nbrplbd: str,
         snapshot_date: str,
         banner_oracle: dict, warehouse_pg: dict) -> dict:
    staging = f"staging_pb_extract_{snapshot_date.replace('-', '')}"
    # read_watermarks returns ISO-8601 strings that can carry a time part,
    # which TO_DATE(..., 'YYYY-MM-DD') would reject. Parse them here and
    # bind real datetimes instead.
    wm_nbrplbd = datetime.fromisoformat(watermark_nbrplbd)
    wm_nbbposn = datetime.fromisoformat(watermark_nbbposn)

    bcon = cx_Oracle.connect(**banner_oracle)
    pcon = psycopg2.connect(**warehouse_pg)
    max_activity = None
    with bcon.cursor() as bcur, pcon, pcon.cursor() as pcur:
        pcur.execute(f"DROP TABLE IF EXISTS {staging}; "
                     f"CREATE TABLE {staging} (LIKE staging_pb_template "
                     f"INCLUDING ALL);")
        bcur.execute("""
            SELECT pld.nbrplbd_posn, pld.nbrplbd_suff,
                   pld.nbrplbd_fund_code, pld.nbrplbd_orgn_code,
                   pld.nbrplbd_percent, pld.nbrplbd_activity_date,
                   pos.nbbposn_budget, pos.nbbposn_status
            FROM   nbrplbd pld
            JOIN   nbbposn pos ON pos.nbbposn_posn = pld.nbrplbd_posn
            WHERE  pld.nbrplbd_activity_date > :wm1
               OR  pos.nbbposn_activity_date > :wm2
        """, wm1=wm_nbrplbd, wm2=wm_nbbposn)
        for row in bcur:
            # One placeholder per selected column. This assumes
            # staging_pb_template declares its columns in the same
            # order as the SELECT above.
            placeholders = ", ".join(["%s"] * len(row))
            pcur.execute(
                f"INSERT INTO {staging} VALUES ({placeholders})", row)
            # row[5] is nbrplbd_activity_date
            if max_activity is None or row[5] > max_activity:
                max_activity = row[5]
    bcon.close(); pcon.close()

    return {
        "staging_table":  staging,
        "max_activity":   max_activity.isoformat() if max_activity else None,
        "snapshot_date":  snapshot_date,
    }

**transform_resolve_keys** — looks up surrogate keys against each conformed dimension, writes a second staging table ready for the load:

# transform_resolve_keys - SCD-aware lookups, produces fact-shaped staging.
import psycopg2

def main(staging_table: str, snapshot_date: str,
         warehouse_pg: dict) -> dict:
    transformed = f"{staging_table}_transformed"
    conn = psycopg2.connect(**warehouse_pg)
    with conn, conn.cursor() as cur:
        cur.execute(f"""
            DROP TABLE IF EXISTS {transformed};
            CREATE TABLE {transformed} AS
            SELECT  dp.position_key, df.fund_key, dd.date_key,
                    COALESCE(de.employee_key, -1)   AS employee_key,
                    do_.org_key,
                    ROUND(s.nbbposn_budget * s.nbrplbd_percent
                          / 100.0 / 12.0, 2)         AS budgeted_amt,
                    0.00                              AS actual_amt,
                    CURRENT_TIMESTAMP                 AS source_loaded_at
            FROM    {staging_table} s
            JOIN    dim_position    dp ON dp.position_code = s.nbrplbd_posn
                                       AND dp.effective_start_date <= %s
                                       AND (dp.effective_end_date >= %s
                                            OR dp.effective_end_date IS NULL)
            JOIN    dim_fund        df ON df.fund_code = s.nbrplbd_fund_code
            JOIN    dim_organization do_ ON do_.org_code = s.nbrplbd_orgn_code
            JOIN    dim_date        dd ON dd.full_date = %s
            LEFT JOIN dim_employee  de ON de.position_code = s.nbrplbd_posn
                                       AND de.effective_start_date <= %s
                                       AND (de.effective_end_date >= %s
                                            OR de.effective_end_date IS NULL);
        """, (snapshot_date,) * 5)
    conn.close()
    return {"transformed_table": transformed}

**advance_watermarks** — only runs after load_fact commits; moves the watermark forward so the next run picks up where this one left off:

# advance_watermarks - the last fact-side step. Pure UPDATE.
import psycopg2

def main(max_activity: str, warehouse_pg: dict) -> dict:
    if max_activity is None:
        return {"advanced": False, "reason": "no rows in batch"}
    conn = psycopg2.connect(**warehouse_pg)
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE etl_watermark
            SET    last_loaded_at = GREATEST(last_loaded_at, %s::timestamp)
            WHERE  source_table IN ('NBBPOSN', 'NBRPLBD')
        """, (max_activity,))
    conn.close()
    return {"advanced": True, "new_watermark": max_activity}

**notify_completion** — parallel to the watermark step, posts a status to Slack so the team sees the result without polling:

# notify_completion - the operator-facing summary. Idempotent by
# including the flow run ID in the message to enable dedupe.
import requests

def main(rows_loaded: int, new_watermark: str,
         flow_run_id: str, slack_webhook: str) -> dict:
    text = (f":bar_chart: *Position-Budget load complete*\n"
            f"> {rows_loaded:,} rows upserted\n"
            f"> watermark advanced to `{new_watermark}`\n"
            f"> run: `{flow_run_id}`")
    r = requests.post(slack_webhook, json={"text": text}, timeout=10)
    return {"notified": r.ok, "status": r.status_code}

Six scripts total. Each is tight, typed, individually testable in the Windmill UI's "Run preview" pane. The flow YAML in the next step wires them together.

Step 3 — The complete OpenFlow YAML

Windmill flows can be defined visually (drag and drop in the designer) AND exported as OpenFlow YAML, a portable, version-controllable representation. The YAML for the full Position-Budget load looks like this:

The complete Position-Budget Windmill flow at full detail — six script steps with typed data flowing between them, the Banner Oracle and warehouse PostgreSQL resources bound, the retry/failure handler decoration, and the monthly schedule firing the whole orchestration at 02:00 on day 2 of each month. The YAML below is what this picture renders as.

# flows/load_fct_position_budget.flow.yaml
summary: Load fct_position_budget (monthly periodic snapshot)
description: |
  Monthly load of the Position-Budget warehouse fact from
  Banner NBBPOSN x NBRPLBD. Idempotent (UPSERT); recoverable
  (watermark advances only on commit).
schedule:
  cron: "0 2 2 * *"           # 02:00 on the 2nd of each month
  timezone: America/Chicago
  catch_up: false
value:
  modules:
    - id: a
      summary: Read watermarks
      value:
        type: rawscript
        language: python3
        input_transforms:
          warehouse_pg:
            type: javascript
            expr: "resource('f/banner_dw/warehouse_postgres_writer')"
        content_path: f/banner_dw/read_watermarks.py

    - id: b
      summary: Extract from Banner
      value:
        type: rawscript
        language: python3
        input_transforms:
          watermark_nbbposn:
            type: javascript
            expr: results.a.watermark_nbbposn
          watermark_nbrplbd:
            type: javascript
            expr: results.a.watermark_nbrplbd
          snapshot_date:
            type: javascript
            expr: results.a.snapshot_date
          banner_oracle:
            type: javascript
            expr: "resource('f/banner_dw/banner_oracle_readonly')"
          warehouse_pg:
            type: javascript
            expr: "resource('f/banner_dw/warehouse_postgres_writer')"
        content_path: f/banner_dw/extract_from_banner.py
      retry:
        constant:
          attempts: 3
          seconds: 60

    - id: c
      summary: Transform - resolve surrogate keys
      value:
        type: rawscript
        language: python3
        input_transforms:
          staging_table:
            type: javascript
            expr: results.b.staging_table
          snapshot_date:
            type: javascript
            expr: results.b.snapshot_date
          warehouse_pg:
            type: javascript
            expr: "resource('f/banner_dw/warehouse_postgres_writer')"
        content_path: f/banner_dw/transform_resolve_keys.py

    - id: d
      summary: Load fact (UPSERT)
      value:
        type: rawscript
        language: python3
        input_transforms:
          transform_staging_table:
            type: javascript
            expr: results.c.transformed_table
          warehouse_pg:
            type: javascript
            expr: "resource('f/banner_dw/warehouse_postgres_writer')"
        content_path: f/banner_dw/load_fact.py

    - id: parallel_e_f
      summary: Watermark advance and notification (parallel)
      value:
        type: branchall
        parallel: true
        branches:
          - summary: Advance watermarks
            modules:
              - id: e
                value:
                  type: rawscript
                  language: python3
                  input_transforms:
                    max_activity:
                      type: javascript
                      expr: results.b.max_activity
                    warehouse_pg:
                      type: javascript
                      expr: "resource('f/banner_dw/warehouse_postgres_writer')"
                  content_path: f/banner_dw/advance_watermarks.py
          - summary: Notify completion
            modules:
              - id: f
                value:
                  type: rawscript
                  language: python3
                  input_transforms:
                    rows_loaded:
                      type: javascript
                      expr: results.d.rows_loaded
                    new_watermark:
                      type: javascript
                      expr: results.b.max_activity
                    flow_run_id:
                      type: javascript
                      expr: flow_input.iam.flow_id
                    slack_webhook:
                      type: javascript
                      expr: "variable('f/banner_dw/slack_etl_alerts')"
                  content_path: f/banner_dw/notify_completion.py

  failure_module:
    summary: On-failure handler
    value:
      type: rawscript
      language: python3
      input_transforms:
        flow_run_id:
          type: javascript
          expr: flow_input.iam.flow_id
        failed_module:
          type: javascript
          expr: flow_input.iam.previous_id
        slack_webhook:
          type: javascript
          expr: "variable('f/banner_dw/slack_oncall_pager')"
      content_path: f/banner_dw/notify_failure.py

Six modules (a through f), wired sequentially with one parallel branch at the end, a retry policy on the Banner-extract step (the most failure-prone), and a failure_module that pages on-call if the flow exhausts its retries. The YAML is version-controlled alongside the script files — every change to the flow is a commit, reviewable like any other code.

Step 4 — Wire the scripts into a flow visually

The flow is the ordered composition. In Windmill's flow designer, drag the six scripts onto the canvas and connect their outputs to the next step's inputs. The shape:

read_watermarks
    ↓ (passes watermarks + snapshot_date)
extract_from_banner
    ↓ (passes staging_table_name)
transform_resolve_keys
    ↓ (passes transformed_staging_name)
load_fact
    ↓ (passes rows_loaded)
advance_watermarks      [parallel branch]
    ↓ (passes new_watermarks)         ↓
                       notify_completion

Step 5 and step 6 can run in parallel after step 4, since they have no dependency on each other — the watermark advance and the notification are independent. Windmill's flow designer makes parallel branches explicit (see the WindmillExplainer wiki's Parallel Steps article).

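Treat the flow as one all-or-nothing unit: if ANY step fails, the watermark does not advance and the next run reprocesses the same window. Windmill runs each step as its own job, so do not count on a single database transaction spanning the whole flow. The guarantee comes from the design instead: the load step commits its UPSERT in one transaction, re-running it is idempotent, and the watermark moves only after the load has succeeded.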

Step 5 — Set the schedule

Create a Windmill schedule that fires load_fct_position_budget on the 2nd of every month at 02:00:

0 2 2 * *

Cron format: minute=0, hour=2, day=2, month=any, day-of-week=any. 2 AM on the 2nd of each month gives Banner's month-end close process a full day to settle before the warehouse pulls.
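To make the five fields concrete, here is a minimal Python sketch of how such an expression selects a moment in time. It is an illustration under stated assumptions, not Windmill's scheduler: `cron_matches` is a hypothetical helper that understands only literal numbers and `*` (no ranges, lists, or steps).

```python
from datetime import datetime

def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` satisfies a 5-field cron expression.

    Illustrative only: handles literal integers and '*', nothing else.
    Real cron also ORs day-of-month and day-of-week when both are
    restricted; this sketch simply requires every field to match.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day month day-of-week")
    # cron counts Sunday as 0; Python's isoweekday() counts Sunday as 7.
    actual = [when.minute, when.hour, when.day, when.month,
              when.isoweekday() % 7]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))

MONTHLY_LOAD = "0 2 2 * *"

# The 2nd of the month at 02:00 matches; the 1st at 02:00 does not.
print(cron_matches(MONTHLY_LOAD, datetime(2026, 3, 2, 2, 0)))
print(cron_matches(MONTHLY_LOAD, datetime(2026, 3, 1, 2, 0)))
```

The `datetime` values here are naive, which is exactly the ambiguity the time-zone setting below removes.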

The schedule should also have:

Time zone explicitly set to America/Chicago (or your warehouse's TZ) — see ETL from Banner — Moving Data on a Schedule, with Windmill gotcha #5 on why time-zone clarity matters.

Catch-up disabled — if the Windmill instance was down on the 2nd, do NOT auto-fire on the 3rd. A skipped monthly load is a thing the on-call should know about and trigger manually, not something the scheduler should silently recover from.

Step 6 — Configure retries and the failure handler

Transient failures are normal at 02:00 — Banner session pool exhaustion, network blips, a temporary lock on a source row. Configure the flow's retry policy:

Retry up to 3 times, with exponential backoff (1 min, 5 min, 15 min). Most transient issues resolve within an hour; persistent ones need a human.

On final failure, fire the on-failure handler — a separate Windmill script that posts to Slack and pages the on-call. The handler should include the flow run ID, the failed step name, and the last successful watermark so the on-call has the context to triage.

See the WindmillExplainer wiki's Retry & Failure article for the deeper recipes — the article explains why exponential backoff beats fixed-interval, and when to use circuit breakers instead.
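The retry policy above can be sketched in plain Python to show the control flow it implies. This is not Windmill's retry engine (in Windmill the policy is configuration, not code); `run_with_retries`, `step`, and `on_final_failure` are hypothetical names, and the `sleep` parameter exists only so the example runs instantly.

```python
import time

BACKOFF_SECONDS = (60, 300, 900)  # 1 min, 5 min, 15 min

def run_with_retries(step, on_final_failure,
                     delays=BACKOFF_SECONDS, sleep=time.sleep):
    """Run `step`; after a failure, wait and retry once per entry in `delays`.

    One initial attempt plus len(delays) retries. If every attempt fails,
    call `on_final_failure(error, attempts)` and re-raise the last error.
    """
    attempts = 0
    last_error = None
    for delay in (0, *delays):
        if delay:
            sleep(delay)
        attempts += 1
        try:
            return step()
        except Exception as exc:  # broad on purpose: any failure is retried
            last_error = exc
    on_final_failure(last_error, attempts)
    raise last_error

# Demo: a step that fails twice (say, session pool exhaustion), then succeeds.
state = {"calls": 0}

def flaky_extract():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("Banner session pool exhausted")
    return "extract ok"

waits = []
result = run_with_retries(flaky_extract, lambda err, n: None, sleep=waits.append)
print(result, "after waits of", waits, "seconds")
```

The handler runs exactly once, after the last retry, which keeps the paging logic outside the retry loop.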

Step 7 — Test, then enable

Before flipping the schedule on:

Run the flow manually from the Windmill UI. Confirm each step's output matches expectations.

Force a retry by killing a Banner connection mid-load. The retry policy should kick in; the flow should recover and complete.

Force a failure by revoking the warehouse write grant temporarily. The on-failure handler should fire; the Slack message should arrive; the watermark should NOT advance.

Restore the grant, manually fire the flow again. The load should succeed and the watermark should advance.

Only after all four manual tests pass should you enable the monthly schedule. The first scheduled run is then a non-event — you have already proven the flow handles success, retry, and failure cleanly.

Verify against Banner

A successful flow run produces three observable signals. Check all three after the first scheduled fire.

1. The fact table grew by the expected row count. For a typical month, expect ~600 active positions × ~5 funds = ~3,000 new rows (plus any rows updated from prior months):

-- Rows added or updated by the most recent flow run.
SELECT COUNT(*)
FROM   fct_position_budget
WHERE  source_loaded_at >= CURRENT_TIMESTAMP - INTERVAL '1 day';

2. The watermark advanced exactly to the batch's max activity date. If the watermark did not move, the run silently failed to commit, OR the on-failure handler ran but the developer missed the alert:

SELECT source_table, last_loaded_at
FROM   etl_watermark
WHERE  source_table IN ('NBBPOSN', 'NBRPLBD');

3. The completion notification arrived on the monitoring channel. No notification = the notification step failed, which is itself a problem (the team is now flying blind on load status). Investigate immediately.

A year of monthly runs as the Windmill scheduler sees them — twelve scheduled fires, two retries after transient failures, one manual run for a backfill. The watermark advances only on successful completion.

Track the run history over the first three months. Successful runs produce: completion notification → watermark advance → fact table grew. Failed runs produce: on-failure handler fires → Slack alert → watermark unchanged → next scheduled run picks up the same window. The pattern is observable from the warehouse side without ever opening the Windmill UI.

Watch out

Five gotchas that show up in production Windmill loads:

Credentials belong in secrets, not in scripts. A common rookie error: hard-coding a database password into the script for "quick testing." The password gets committed to Windmill's git mirror, gets indexed by every code-search tool, and becomes a credential rotation emergency three months later when somebody notices. Use Windmill secrets from day one. The WindmillExplainer wiki's Variables vs Secrets article is mandatory reading before any production flow.

The schedule's time zone vs the database's time zone. 0 2 2 * * is "02:00 on the 2nd in the Windmill instance's local time zone" by default. If Windmill runs in UTC and the warehouse runs in America/Chicago, the job fires 5-6 hours off from what the operator expected — sometimes during Banner's business-day load window, defeating the off-hours intent. Pin the schedule's TZ explicitly.

Retries without idempotency are dangerous. The flow's load step uses UPSERT (see Build the Position-Budget Fact — The Center of the First Star), which is idempotent. The watermark advance is also idempotent (writing the same value twice is a no-op). But the notification step is NOT idempotent — running it three times sends three Slack messages, and the on-call gets spammed. Either make the notification idempotent (dedupe by flow run ID) or place it OUTSIDE the retry boundary.

The "catch-up on missed runs" setting is a footgun. Many schedulers offer a "catch up missed runs" option: if the scheduler was down across several scheduled fire times, it runs every missed fire in sequence once it is back. For a monthly load that was paused for a quarter, this means firing three loads back-to-back when the system comes back up. The first two will operate on overlapping windows and either fail (if non-idempotent) or do redundant work (if idempotent). Disable catch-up. Let the on-call decide whether to manually fire missed runs.

The on-failure handler must itself never fail. If the Slack webhook is down when the load fails, the handler errors and the operator gets NO signal at all — silent failure on top of silent failure. The handler should have a fallback (email if Slack fails; log to a known table if email fails) and should be tested separately from the main flow. A failing failure handler is the worst possible state for an on-call rotation.
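The fallback chain in the last gotcha can be sketched as a loop over alert channels. This is a minimal illustration, assuming each channel is a plain callable that raises on failure; `alert_with_fallback` and the three sender functions are hypothetical stand-ins, not Windmill or Slack APIs.

```python
from typing import Callable, Iterable, Optional, Tuple

Channel = Tuple[str, Callable[[str], None]]

def alert_with_fallback(message: str,
                        channels: Iterable[Channel]) -> Optional[str]:
    """Try each channel in order; return the name of the first that works.

    Returns None only if every channel raised, which is the case the
    caller must treat as a last-resort emergency.
    """
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            continue  # this channel is down; fall through to the next one
    return None

# Hypothetical senders for the demo: Slack is down, email works.
def send_slack(msg: str) -> None:
    raise ConnectionError("webhook unreachable")

sent_emails = []

def send_email(msg: str) -> None:
    sent_emails.append(msg)

audit_rows = []

def log_to_table(msg: str) -> None:
    audit_rows.append(msg)

used = alert_with_fallback(
    "load_fct_position_budget failed at step d",
    [("slack", send_slack), ("email", send_email), ("table", log_to_table)],
)
print("alert delivered via:", used)
```

Testing this function on its own, with each channel forced to fail in turn, is the "tested separately from the main flow" part of the advice.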

The one-sentence takeaway

A Windmill flow is the choreography around your load SQL — the schedule that fires it, the secrets that connect it to Banner, the retries that recover it, and the watermarks that make it idempotent. Get the flow right once and the warehouse runs unattended for months.

Track G · Step 7 of 8 · Building the Waubonsee warehouse

Validate Against Banner — Agree to the Cent or Stop

The warehouse is loaded. The flow runs every month. The dashboards render. None of it matters if the numbers do not match Banner. The first time a CFO sees a difference between the warehouse's number and the Banner Position Control report, the warehouse loses — every time, every institution, no exceptions. Reconciliation is the discipline that prevents that conversation from happening.

9 min read · warehouse · reconciliation · validation · banner · audit · position-budget

Goal

By the end of this step you will have:

A reconciliation query that totals fct_position_budget at three grains (fiscal year, per-org, per-fund) and compares each to the Banner-side Position Control summary the Budget Office already produces.

A reconciliation report — built as a Windmill flow step that runs immediately after the G6 load — that produces a pass/fail signal per grain.

A threshold policy that distinguishes acceptable rounding drift (sub-dollar) from real load failures (dollars or more), with documented action paths for each.

A dashboard or notification that surfaces the reconciliation status to the Budget Office BEFORE they notice on their own. The warehouse should be the first to raise its hand when its numbers do not match Banner — never the second.

The discipline of **fix-or-document every discrepancy on the day it appears**. Unresolved drift compounds. A 5-cent drift in January becomes a 30-dollar drift by July if nobody traces it.

The reconciliation is not a "verification we do once at launch." It is a permanent feature of the load. Every monthly run produces a reconciliation result. Every result is inspected. Most are green. The red ones are events.

Before you start

You should have:

Build the Position-Budget Fact — The Center of the First Star complete — fct_position_budget is populated end-to-end, all conformed dimensions are loaded, the load runs cleanly in manual mode.

The ETL Flow — Wiring the Load into Windmill complete — the monthly flow fires on schedule, completes, and advances watermarks. The reconciliation step you build here will become step 7 of that flow.

Access to the Banner-side reference report — the Budget Office runs a Position Control budget summary monthly, typically the NHRDIST or NBPBUDR Argos report family. Get a copy of the report's output for the most recent closed month, and get the SQL that produces it from the Argos DataBlock. You will reconcile against THAT SQL's output, not against a screenshot.

An agreed-upon threshold with the Budget Office for what counts as "agree." Strict reading: zero variance. Real world: sub-dollar rounding drift is normal and acceptable; dollars-or-more is investigated. Get the number in writing before you start.

You do not need The Second Star — Admissions as an Accumulating Snapshot (next step) or any of G9/G10 — reconciliation is per-star and lives with G5/G6. Each future star gets its own G7-shaped reconciliation against its own Banner source-of-truth report.

Build it

A reconciliation is three queries run as one. The first asks Banner. The second asks the warehouse. The third compares them and emits a verdict.

Query 1 — the Banner-side reference

Lift the SQL from the Budget Office's existing Position Control report. For Waubonsee, this typically looks like:

-- Banner reference: total budgeted dollars per fund per fiscal year,
-- aggregated from the same source tables fct_position_budget
-- reads from. Run via the read-only Banner connection.
SELECT  pld.nbrplbd_fund_code     AS fund_code,
        ftvfyr_fyear              AS fiscal_year,
        SUM(pos.nbbposn_budget *
            pld.nbrplbd_percent / 100.0)  AS banner_annual_budget
FROM    nbrplbd pld
JOIN    nbbposn pos
        ON pos.nbbposn_posn = pld.nbrplbd_posn
JOIN    ftvfyr
        ON ftvfyr.ftvfyr_fyear = :fiscal_year
WHERE   pld.nbrplbd_effective_date <= ftvfyr.ftvfyr_end_date
  AND   pld.nbrplbd_effective_date >= ftvfyr.ftvfyr_start_date
  AND   pos.nbbposn_status = 'A'
GROUP BY pld.nbrplbd_fund_code, ftvfyr_fyear;

This query is owned by Banner, not by the warehouse. If the Budget Office's report changes (a new fund excluded, a status filter added), this query changes to match. The reconciliation must use the SAME filters and aggregations the Banner report uses, or you are comparing two legitimately different numbers and calling it a discrepancy.

Query 2 — the warehouse-side aggregate

The warehouse equivalent — same fiscal year, same grouping, summed from fct_position_budget:

-- Warehouse aggregate: same shape as the Banner reference,
-- summed from the fact table for the same fiscal year.
-- Run via the warehouse PostgreSQL connection.
SELECT  df.fund_code,
        dd.fiscal_year,
        SUM(f.budgeted_amt * 12)  AS warehouse_annual_budget
FROM    fct_position_budget f
JOIN    dim_fund df ON df.fund_key = f.fund_key
JOIN    dim_date dd ON dd.date_key = f.date_key
WHERE   dd.fiscal_year = :fiscal_year
GROUP BY df.fund_code, dd.fiscal_year;

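The × 12 converts a monthly snapshot amount into the annual figure the Banner report shows, so it is only correct when the query sums a single month's rows. If the WHERE clause spans every monthly snapshot in the fiscal year, as the fiscal_year filter above does once all twelve months are loaded, the monthly rows already add up to the annual figure; drop the × 12 or restrict the query to one date_key. Likewise, if the Banner reference shows monthly amounts, drop the × 12. Make sure both queries produce numbers at the SAME granularity before comparing.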

Query 3 — the comparison

Join the two result sets on the matching keys and compute the variance per row:

-- Comparison: per (fund, fiscal_year), Banner vs warehouse,
-- with absolute and percentage variance.
WITH banner AS (
    -- result of Query 1
), warehouse AS (
    -- result of Query 2
)
SELECT  COALESCE(b.fund_code, w.fund_code)  AS fund_code,
        COALESCE(b.fiscal_year, w.fiscal_year) AS fiscal_year,
        b.banner_annual_budget,
        w.warehouse_annual_budget,
        (w.warehouse_annual_budget -
         b.banner_annual_budget)            AS variance_amt,
        CASE WHEN b.banner_annual_budget = 0 THEN NULL
             ELSE (w.warehouse_annual_budget -
                   b.banner_annual_budget) /
                  b.banner_annual_budget * 100
        END                                  AS variance_pct
FROM    banner b
FULL OUTER JOIN warehouse w
        ON  w.fund_code   = b.fund_code
        AND w.fiscal_year = b.fiscal_year
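-- Repeat the expression: PostgreSQL does not accept an output alias
-- (variance_amt) inside an ORDER BY expression.
ORDER BY ABS(COALESCE(w.warehouse_annual_budget -
                      b.banner_annual_budget, 999999999)) DESC;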

The FULL OUTER JOIN is deliberate. Rows that appear in ONLY Banner (warehouse missing the fund) or ONLY warehouse (warehouse holding a fund Banner does not) have a NULL variance_amt; the COALESCE in the ORDER BY swaps that NULL for a huge sentinel so they sort to the top of the result, because those are bigger problems than rounding drift. Every other row follows in descending order of absolute variance, biggest discrepancies first.
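Because Query 1 runs on Banner and Query 2 on the warehouse, the comparison often ends up in the flow step's Python rather than in one SQL statement. The sketch below mirrors the FULL OUTER JOIN and the sort order on two in-memory result sets. It assumes each query's rows have been collected into a dict keyed by (fund_code, fiscal_year); `compare` and the sample amounts are made up for illustration.

```python
from decimal import Decimal

def compare(banner: dict, warehouse: dict) -> list:
    """FULL OUTER JOIN of two {(fund_code, fiscal_year): amount} mappings.

    A key present on only one side yields variance_amt = None,
    just as the SQL yields NULL.
    """
    rows = []
    for key in banner.keys() | warehouse.keys():
        b = banner.get(key)
        w = warehouse.get(key)
        variance = None if b is None or w is None else w - b
        rows.append({"key": key, "banner": b, "warehouse": w,
                     "variance_amt": variance})
    # One-sided rows first, then largest absolute variance, then by key
    # so the order is deterministic.
    rows.sort(key=lambda r: (r["variance_amt"] is not None,
                             -abs(r["variance_amt"] or 0),
                             r["key"]))
    return rows

# Made-up sample data: 12001 is missing from the warehouse,
# 12003 is missing from Banner, 12002 differs by 40 cents.
banner_rows = {("12001", 2026): Decimal("1000.00"),
               ("12002", 2026): Decimal("500.00")}
warehouse_rows = {("12002", 2026): Decimal("500.40"),
                  ("12003", 2026): Decimal("75.00")}

for row in compare(banner_rows, warehouse_rows):
    print(row["key"], row["banner"], row["warehouse"], row["variance_amt"])
```

Keeping the one-sided rows as `None` rather than zero forces the caller to handle them explicitly instead of letting them pass a numeric threshold.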

The monthly reconciliation cycle — Banner's Position Control report and the warehouse's fct_position_budget aggregate run side by side; their totals are compared at three grains (fiscal-year, per-org, per-fund); the result is a green (agree) or red (variance > threshold) signal that fires automatically before the dashboards refresh.

Run the comparison as part of the flow

Add the comparison as step 7 of the Windmill flow built in G6 — right after the watermark advance, right before the completion notification. The step:

Runs Query 1 against Banner via the banner_oracle_readonly resource.

Runs Query 2 against the warehouse.

Runs Query 3 to compare and produces a result set.

Checks each row's variance_amt against the agreed threshold (e.g., ABS(variance_amt) >= 1.00).

If any row exceeds threshold, marks the flow as reconciliation-failed and fires the on-failure handler from G6.

Either way, writes the reconciliation result to a reconciliation_history table so the Budget Office (and you) can see the trend.

The completion notification (step 8) reports the reconciliation status alongside the rows-loaded count: "Load completed: 3,142 rows loaded, reconciliation: ✅ all funds within $0.50 of Banner."

The threshold policy

Strict reading: warehouse and Banner must agree to the cent. Reality: monthly rounding (per-position-per-fund: annual × percent / 100 / 12, rounded to cents) accumulates harmlessly across a year. A 600-position × 5-fund warehouse will drift ~$0.50-$2.00 per fund per year from the Banner annual calculation, depending on how funds split.
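The drift is easy to reproduce with a small worked example. The sketch below applies the per-row formula (annual × percent / 100 / 12, rounded to cents) and compares twelve rounded months against the exact annual share. The half-up rounding mode and the sample budgets are assumptions; the text above says only "rounded to cents".

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def monthly_amount(annual_budget: Decimal, percent: Decimal) -> Decimal:
    """One monthly fact row: annual x percent / 100 / 12, rounded to cents."""
    return (annual_budget * percent / 100 / 12).quantize(
        CENT, rounding=ROUND_HALF_UP)

def annual_drift(annual_budget: Decimal, percent: Decimal) -> Decimal:
    """Twelve rounded monthly rows minus the exact annual share."""
    exact_annual = annual_budget * percent / 100
    return monthly_amount(annual_budget, percent) * 12 - exact_annual

# Made-up positions charged to one fund.
for budget, pct in [("50000", "100"), ("40000", "100"), ("60000", "50")]:
    drift = annual_drift(Decimal(budget), Decimal(pct))
    print(f"annual={budget} percent={pct} drift={drift}")
```

A $50,000 position drifts by +$0.04 and a $40,000 one by -$0.04, so per-fund drift is the sum of many small errors of both signs, which is why it stays small rather than growing with head count.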

Agree with the Budget Office on tiered thresholds:

**< $1.00 per fund per year**: green. Logged but no alert. Acceptable rounding drift.

**$1.00 – $50.00 per fund per year**: yellow. Notify but do not block the load. The discrepancy MAY be rounding amplification or MAY be a small load gap. Investigate during the next business day.

**>= $50.00 per fund per year, OR any missing/extra fund**: red. Block the dashboards from refreshing until resolved. Page the on-call. This is a load defect or a Banner reference change, not drift.

The numbers above are illustrative. Set them based on what the Budget Office considers material at your institution's budget scale.
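Expressed as code, the tiers reduce to one small function. The sketch below uses the illustrative cut-offs from this section ($1.00 and $50.00) and treats a fund present on only one side as red regardless of amount; `classify` is a hypothetical helper name, and the boundaries should be replaced with whatever the Budget Office signs off on.

```python
from decimal import Decimal
from typing import Optional

GREEN_BELOW = Decimal("1.00")   # illustrative threshold, not a standard
RED_FROM = Decimal("50.00")     # illustrative threshold, not a standard

def classify(banner: Optional[Decimal],
             warehouse: Optional[Decimal]) -> str:
    """Map one (fund, fiscal_year) comparison row to green/yellow/red."""
    if banner is None or warehouse is None:
        return "red"            # missing or extra fund: structural failure
    variance = abs(warehouse - banner)
    if variance < GREEN_BELOW:
        return "green"          # acceptable rounding drift
    if variance < RED_FROM:
        return "yellow"         # notify, investigate next business day
    return "red"                # block dashboards, page the on-call

print(classify(Decimal("1000.00"), Decimal("1000.40")))
print(classify(Decimal("1000.00"), Decimal("1012.00")))
print(classify(Decimal("1000.00"), None))
```

Checking for `None` before doing any arithmetic is the design point: a one-sided row never reaches the numeric comparison, so it cannot slip through as "no variance".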

Verify against Banner

This step IS the verification. The way to verify the reconciliation itself works is to deliberately introduce a discrepancy and confirm the reconciliation catches it.

Run these three tests before declaring G7 complete:

Test 1 — fabricate a missing fund. Delete one fund's rows from fct_position_budget temporarily (in a test environment):

DELETE FROM fct_position_budget
WHERE fund_key = (SELECT fund_key FROM dim_fund
                  WHERE fund_code = '12001');

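Re-run the reconciliation. The result set should show a row for fund '12001' with warehouse_annual_budget = NULL and, because any arithmetic on NULL yields NULL, variance_amt = NULL as well; the COALESCE sentinel in the ORDER BY sorts it to the top. Make sure the threshold check treats a NULL variance as a failure: a bare ABS(variance_amt) >= 1.00 evaluates to NULL, not true, and lets the missing fund through. If the reconciliation silently passes, your comparison logic is wrong; fix it before restoring the rows.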

Test 2 — fabricate a rounding shift. Adjust one fund's amounts by exactly $5.00:

UPDATE fct_position_budget
SET    budgeted_amt = budgeted_amt + 5.00
WHERE  fund_key = (SELECT fund_key FROM dim_fund
                   WHERE fund_code = '12002')
AND    date_key = (SELECT MAX(date_key) FROM dim_date
                   WHERE fiscal_year = 2026);

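Re-run the reconciliation. The UPDATE adds $5.00 to every row for that fund in the latest month, so the fund should appear with variance_amt ≈ +60.00 per updated row (+$5.00 × 12 under Query 2's annualization; +5.00 per row if your grain does not need the × 12). A fund with a single position shows +60.00. This tests the threshold logic: the discrepancy should land in the yellow or red tier per your policy.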

Test 3 — fabricate a Banner change. Have someone in the Budget Office add a $1.00 adjustment to one fund in Banner in a test environment. Run the next monthly load + the reconciliation. The discrepancy should appear and the on-failure handler should fire.

After all three pass, restore the fabricated changes, re-run the load, and confirm reconciliation goes green.

Watch out

Five gotchas that distinguish a reconciliation that protects the warehouse from one that creates noise:

**The Banner-side query must match what the Budget Office actually uses.** New writers sometimes write a "clean" Banner reference query that uses the right tables but with different filters than the Budget Office's actual report. The reconciliation then fails forever because the warehouse is compared against a query nobody recognizes. Always use the Budget Office's exact SQL. If they revise the report, you revise the reconciliation in lock-step.

Reconciliation runs ALONGSIDE the load, not after. A common rookie pattern: load runs at 02:00, reconciliation is a Monday-morning manual check. By Monday, users have already opened the dashboards. If the load was wrong, the wrong numbers are already in slide decks. Reconciliation must be a flow step that runs before the dashboards refresh — see The ETL Flow — Wiring the Load into Windmill for the flow integration.

**Sub-dollar drift is normal — set a threshold, don't chase zero.** A 5-cent variance per fund per year is the sum of legitimate per-month rounding decisions. Chasing it to zero requires the warehouse to NOT round per monthly row, which then makes monthly slicing incorrect. Pick a threshold, document why, and ignore drift below it. The Budget Office will respect a well-defined threshold; they will lose patience with weekly "reconciled to the cent" theater.

**A missing OR extra fund is always red, even at $0 variance.** A fund that exists in Banner but not in the warehouse means the load skipped it. A fund that exists in the warehouse but not in Banner means a fund was deactivated and the warehouse did not catch the change. Both are structural failures, not amount drift. The reconciliation must check fund-set membership separately from amount variance.

**The reconciliation can hide the bug it should find if you reconcile against ANOTHER warehouse query.** Some teams build a "second warehouse query" as the reference and compare two warehouse queries. This catches arithmetic bugs in your own code but NEVER catches "the warehouse disagrees with Banner" — because the reference is from the same data the warehouse loaded. Always reconcile against the SOURCE (Banner), not against another warehouse derivation.

The one-sentence takeaway

Reconcile every load against Banner's source-of-truth report. Run the reconciliation automatically alongside the load, surface mismatches before users see them, and fix or document every discrepancy on the day it appears. The warehouse's credibility is built one reconciled load at a time.

Track G · Step 8 of 8 · Building the Waubonsee warehouse

The Second Star — Admissions as an Accumulating Snapshot

The second star is not 'another star like the first.' If G5 was a periodic snapshot, the second star should teach a DIFFERENT fact pattern — otherwise you have learned half the dimensional vocabulary at twice the cost. For Waubonsee, the deliberate second star is Admissions as an accumulating snapshot — one row per applicant, multiple date_keys filling in as milestones happen. And the moment you build it, the warehouse's bus matrix appears: dim_date is shared across both stars, and the foundation for every star that follows is laid.

12 min read · warehouse · kimball · accumulating-snapshot · admissions · second-star · bus-matrix · conformed-dimensions · saradap

Goal

By the end of this step you will have:

A populated **fct_admissions_pipeline** table — an accumulating-snapshot fact at the grain of one row per prospective applicant per admit term, with date_key FKs to dim_date for each milestone (inquiry, application submitted, interview, decision, accepted, enrolled, plus any local milestones Waubonsee tracks).

A **dim_applicant** dimension built from SARADAP and related admissions tables, designed to evolve into dim_student once the applicant enrolls — see the "expanded applicant dimension" pattern in Slowly Changing Dimensions — Keeping History When Attributes Change.

The first concrete bus matrix at Waubonsee: a table showing which dimensions are shared (conformed) across Position-Budget and Admissions, and which are unique to each. **dim_date is the conformed seam.** Future stars will reuse it without re-building.

A clear understanding of what an accumulating snapshot is good for (short-lived processes with known milestones, 6-15 of them) — and what it is BAD for (long-running processes with open lifecycles like the full degree journey).

A reconciliation pattern (see Validate Against Banner — Agree to the Cent or Stop) adapted to the accumulating-snapshot grain: per-pipeline-stage counts, agreeing to Banner's Admissions Office reports.

The second star is deliberately a different fact pattern than the first. The catalog already has a periodic-snapshot worked example (G5); we do not need another one. We need a worked example of an accumulating snapshot, another of Kimball's three fact types (The Three Fact-Table Patterns — Transaction, Periodic, Accumulating), and Admissions is the canonical higher-ed case (Kimball, The Data Warehouse Toolkit, 2nd ed., Ch. 12).

Before you start

You should have:

G1 through G7 complete or in flight. The first star is loaded, reconciled, and producing trustworthy numbers. Second-star work is a multiplier on first-star credibility — if the first star is not trusted, the second one inherits the distrust.

Read The Three Fact-Table Patterns — Transaction, Periodic, Accumulating. The vocabulary of "transaction / periodic snapshot / accumulating snapshot" is the spine of the next paragraphs. If you have not internalized which pattern is which, the rest of this article will read as accident.

Read Slowly Changing Dimensions — Keeping History When Attributes Change. The "expanded applicant dimension" pattern — the same surrogate continuing as the applicant becomes a student and (eventually) an alumnus — is what makes the Admissions star a foundation for multiple future stars, not a dead end.

**Confirmation from the Admissions Office that the milestone data exists at sufficient granularity.** This is the single biggest risk. Kimball's textbook example has 15 milestone date columns. Real-world SARADAP may have only 4-6 reliably populated (inquiry date, application submitted, decision date, enrollment date). If your installation only captures "submitted" and "decided," the accumulating snapshot collapses to two dates and loses its teaching power, and possibly its analytic value. **Profile SARADAP before committing to this star.** If the granularity is poor, switch to a simpler 4-milestone version, or substitute a different second-star candidate (Registration as a factless fact table; see Factless Fact Tables — Events and Coverage).

You do not need dim_employee or the HR stack — Admissions operates on a different person population (applicants, not employees) and uses a separate Banner module.

Build it

An accumulating snapshot is structurally different from a periodic snapshot in three ways:

One row per entity for its entire (short) lifecycle. Position-Budget had one row per (position × fund × month); the same position generated 12 rows per year for each fund. Admissions has ONE row per applicant (per admit term), period. That row is revisited and updated as the applicant moves through milestones.

Multiple date_key FKs per row, most unpopulated at insert. Position-Budget had one date_key per row (the snapshot month). Admissions has six in the DDL below (inquiry_date_key, application_date_key, interview_date_key, decision_date_key, accepted_date_key, enrolled_date_key) and reaches eight or more once institution-specific ones are added (campus visit, FAFSA received, deposit paid).

The row is UPDATED in place, not re-inserted. Position-Budget inserts every month and never updates a prior row. Admissions inserts ONCE per applicant and updates that row repeatedly until the applicant either enrolls or formally declines.

These three structural differences cascade into different DDL, different ETL, different reconciliation patterns.

One row of fct_admissions_pipeline at full detail — the applicant_key, eight milestone date_keys (most NULL at insert), three lag measures, the admissions decision FK. Compare to G5's fct_position_budget anatomy — same star shape, fundamentally different row lifecycle.

The fact table

The DDL — pared down to the milestones a Waubonsee installation realistically populates. The exact column set will need confirmation with the Admissions Office:

-- fct_admissions_pipeline — accumulating snapshot.
-- One row per (applicant, admit_term). Inserted once, then
-- UPDATED in place as milestones happen.
CREATE TABLE fct_admissions_pipeline (
    -- grain
    applicant_key             INTEGER NOT NULL
                              REFERENCES dim_applicant(applicant_key),
    admit_term_key            INTEGER NOT NULL
                              REFERENCES dim_term(term_key),

    -- milestone date_keys: most have not occurred at insert,
    -- so each starts at dim_date.date_key = -1
    -- (the "Unknown / Not Yet Occurred" sentinel), never NULL,
    -- and is filled in as the applicant progresses.
    inquiry_date_key          INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),
    application_date_key      INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),
    interview_date_key        INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),
    decision_date_key         INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),
    accepted_date_key         INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),
    enrolled_date_key         INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_date(date_key),

    -- decision dimension (Accept / Reject / Defer / Waitlist)
    admissions_decision_key   INTEGER NOT NULL DEFAULT -1
                              REFERENCES dim_admissions_decision(decision_key),

    -- lag measures — days between key milestones.
    -- Computed each time the row is updated.
    inquiry_to_application_lag    INTEGER,
    application_to_decision_lag   INTEGER,
    decision_to_enrolled_lag      INTEGER,

    -- artifact counts for self-documenting SUMs
    -- (see F8 factless fact pattern)
    application_count          INTEGER NOT NULL DEFAULT 1,
    accepted_count             INTEGER NOT NULL DEFAULT 0,
    enrolled_count             INTEGER NOT NULL DEFAULT 0,

    -- provenance
    source_loaded_at           TIMESTAMP NOT NULL,

    PRIMARY KEY (applicant_key, admit_term_key)
);

Notice the "Unknown / Not Yet Occurred" sentinel on every date_key. NULL FKs break the join graph for downstream reporting. Defaulting to date_key = -1 (a real row in dim_date labeled "Not Yet Occurred") makes the row safe to join from day one. As each milestone happens, the load updates the date_key to the actual occurrence date.
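The effect of the sentinel on joins can be demonstrated with Python's built-in sqlite3 module. This is a toy sketch, not the warehouse schema: the two fact tables, the date_key value 4567, and the `joined_rows` helper are invented for the comparison, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO dim_date VALUES (-1, 'Not Yet Occurred'),
                                (4567, '2026-01-15');

    -- Variant A: milestone left NULL until it happens.
    CREATE TABLE fact_null (applicant_key INTEGER,
                            decision_date_key INTEGER);
    -- Variant B: milestone defaults to the -1 sentinel row.
    CREATE TABLE fact_sentinel (applicant_key INTEGER,
                                decision_date_key INTEGER NOT NULL DEFAULT -1);

    INSERT INTO fact_null VALUES (1, 4567), (2, NULL);
    INSERT INTO fact_sentinel (applicant_key, decision_date_key)
        VALUES (1, 4567);
    INSERT INTO fact_sentinel (applicant_key) VALUES (2);
""")

def joined_rows(table: str) -> list:
    """Inner-join a fact variant to dim_date, as a report query would."""
    # The table name comes from the hard-coded strings below, never user input.
    sql = (f"SELECT f.applicant_key, d.label FROM {table} f "
           "JOIN dim_date d ON d.date_key = f.decision_date_key "
           "ORDER BY f.applicant_key")
    return conn.execute(sql).fetchall()

print("NULL FK:    ", joined_rows("fact_null"))
print("sentinel FK:", joined_rows("fact_sentinel"))
```

With the NULL variant, applicant 2 vanishes from the inner join; with the sentinel variant, the applicant stays in the result labeled "Not Yet Occurred".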

The *_count columns are the factless-fact "useful artifact" from Factless Fact Tables — Events and Coverage — they make SUM(accepted_count) read as "applicants accepted" without a comma-counting hack.

The dimension table

dim_applicant is a Type 2 SCD on a subset of applicant attributes that matter analytically (declared major, intended program, residency status, high school, recruitment source). For demographics that change rarely (date of birth, gender as captured at application, ethnicity) Type 1 is fine — those are stable identity facts, not slowly-changing attributes.

Crucially: **dim_applicant should evolve into dim_student when the applicant enrolls.** The same surrogate key continues. The applicant who became a student gets additional attributes filled in (matriculation date, declared major as a student, academic standing) but the surrogate is unchanged. This is the "expanded applicant dimension" pattern from Slowly Changing Dimensions — Keeping History When Attributes Change — the dimension's identity is the person, not their role.

The load pattern

Unlike G5 (which UPSERTs rows on every flow run), the accumulating snapshot has TWO distinct load operations:

Insert — when a new applicant appears in SARADAP with an inquiry_date or application_date that has no matching fct_admissions_pipeline row yet. The insert fills in the milestone date_keys that are already known (typically inquiry and application_date if both exist) and leaves the rest at the -1 sentinel.

Update — when an existing applicant's row has a new milestone populated in SARADAP (a decision was made; an acceptance was recorded; the applicant enrolled). The update sets the corresponding date_key and recomputes any affected lag measures.

A typical ETL implementation runs a single MERGE statement per load batch:

MERGE INTO fct_admissions_pipeline tgt
USING staging_admissions_changes src
  ON  tgt.applicant_key  = src.applicant_key
  AND tgt.admit_term_key = src.admit_term_key
WHEN MATCHED THEN UPDATE
  SET tgt.inquiry_date_key     = COALESCE(src.inquiry_date_key,
                                          tgt.inquiry_date_key),
      tgt.application_date_key = COALESCE(src.application_date_key,
                                          tgt.application_date_key),
      tgt.interview_date_key   = COALESCE(src.interview_date_key,
                                          tgt.interview_date_key),
      tgt.decision_date_key    = COALESCE(src.decision_date_key,
                                          tgt.decision_date_key),
      tgt.accepted_date_key    = COALESCE(src.accepted_date_key,
                                          tgt.accepted_date_key),
      tgt.enrolled_date_key    = COALESCE(src.enrolled_date_key,
                                          tgt.enrolled_date_key),
      tgt.admissions_decision_key = COALESCE(src.admissions_decision_key,
                                             tgt.admissions_decision_key),
      tgt.accepted_count       = CASE WHEN src.accepted_date_key > -1
                                      THEN 1 ELSE tgt.accepted_count END,
      tgt.enrolled_count       = CASE WHEN src.enrolled_date_key > -1
                                      THEN 1 ELSE tgt.enrolled_count END,
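      -- Subtracting keys yields days only if date_key is a sequential day
      -- number; with YYYYMMDD-style keys, subtract the dim_date dates instead.
      tgt.application_to_decision_lag =
          CASE WHEN tgt.application_date_key > -1
                AND src.decision_date_key  > -1
               THEN src.decision_date_key - tgt.application_date_key
               ELSE tgt.application_to_decision_lag END,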
      -- ... other lag recomputations
      tgt.source_loaded_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN INSERT (...)
  VALUES (...);

The COALESCE pattern is the key idiom for accumulating snapshots: never overwrite a populated milestone with a NULL. If the source row's interview_date_key is NULL but the target already has a non-sentinel value, the COALESCE keeps the target's value. This protects against load batches that pull partial information and would otherwise erase prior milestones.
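The same rule can be stated outside SQL. The following is a minimal Python sketch, not the load itself: `merge_milestones` is a hypothetical helper, the column names are borrowed from the MERGE above, and the date-key values are made up for illustration. It mirrors COALESCE(src, tgt) and also shows the one case COALESCE does not cover, a source row that carries the -1 sentinel instead of NULL.

```python
SENTINEL = -1  # the "Not Yet Occurred" row in dim_date


def merge_milestones(target, source, treat_sentinel_as_missing=False):
    """Return target updated from source without erasing known milestones.

    Mirrors COALESCE(src.col, tgt.col): a None in source keeps the target
    value. With treat_sentinel_as_missing=True (an assumption of this sketch,
    not part of the MERGE above) a -1 in source is ignored as well.
    """
    merged = dict(target)
    for column, new_value in source.items():
        if new_value is None:
            continue  # NULL in the batch: keep what the fact row already has
        if treat_sentinel_as_missing and new_value == SENTINEL:
            continue  # sentinel in the batch: also keep the existing value
        merged[column] = new_value
    return merged


# Illustrative values only.
target = {"application_date_key": 20250110, "decision_date_key": 20250301}
partial = {"application_date_key": None,
           "decision_date_key": None,
           "accepted_date_key": 20250315}
result = merge_milestones(target, partial)

# A batch that sends the sentinel rather than NULL overwrites the decision.
stale = {"decision_date_key": SENTINEL}
overwritten = merge_milestones(target, stale)
protected = merge_milestones(target, stale, treat_sentinel_as_missing=True)
```

The practical consequence is that the staging table should carry NULL, not -1, for milestones it does not know about, or the load needs an explicit guard like the one sketched here.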

The bus matrix

The moment the second star ships, you have a bus matrix — a table showing which dimensions are conformed across stars and which are unique to each. For Waubonsee with two stars, the matrix is small but the principle is established:

The Waubonsee warehouse's two-star bus matrix — rows are stars (Position-Budget, Admissions), columns are dimensions. dim_date is conformed across both. dim_position and dim_fund are unique to Position-Budget; dim_applicant and dim_admissions_decision are unique to Admissions. Conformed cells highlighted in coral — they are the seams that future stars will share.

dim_date is the conformed dimension. Position-Budget joins to it via the month-end snapshot date; Admissions joins to it via every milestone date_key. The same dim_date.fiscal_year attribute, the same dim_date.academic_term rollup: both stars use the same dimension table. A future Registration star (a factless registration fact) would join to it too, as would a future Tuition star. The dimension was built once in G3 and pays off in every subsequent star.

The bus matrix is the warehouse's growth contract: every future star is either reusing an existing dimension (zero build cost beyond the join) or contributing a new dimension (future stars can then reuse it). The matrix gets denser as the warehouse matures. Conformed dimensions are how the warehouse stops being a set of disconnected reports and becomes a system that supports cross-domain questions ("are applicants from high schools with higher SAT scores more likely to enroll AND stay employed by the institution after graduation?" — a query that crosses the Admissions star, the Registration star, and a future Alumni-Employment star, all joined through conformed dimensions).
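A bus matrix is small enough to keep as data. The sketch below is one hedged way to do that in Python: the two stars and their dimensions are taken from the matrix described above, while the helper names and the proposed Registration star (including `dim_course`) are hypothetical.

```python
from collections import defaultdict

# Stars and dimensions as listed in the two-star matrix above.
BUS_MATRIX = {
    "Position-Budget": {"dim_date", "dim_position", "dim_fund"},
    "Admissions": {"dim_date", "dim_applicant", "dim_admissions_decision"},
}


def conformed_dimensions(matrix):
    """Return {dimension: stars} for every dimension used by 2+ stars."""
    stars_by_dim = defaultdict(set)
    for star, dims in matrix.items():
        for dim in dims:
            stars_by_dim[dim].add(star)
    return {dim: stars for dim, stars in stars_by_dim.items() if len(stars) > 1}


def dimensions_to_build(matrix, proposed_dims):
    """Return the proposed dimensions that no existing star already provides."""
    existing = set().union(*matrix.values())
    return set(proposed_dims) - existing


conformed = conformed_dimensions(BUS_MATRIX)

# Hypothetical third star: only the genuinely new dimensions need building.
registration_dims = {"dim_date", "dim_student", "dim_course"}
to_build = dimensions_to_build(BUS_MATRIX, registration_dims)
```

Checking a proposed star against the matrix before any DDL is written is what makes the reuse visible: dim_date drops out of the build list because it already exists.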

Run the load through Windmill

The Admissions load is its own Windmill flow (load_fct_admissions_pipeline), structurally similar to G6 but with the MERGE-based load step instead of UPSERT. The schedule is typically weekly or daily during admissions season (when milestones are firing frequently) and monthly during off-season.

Reconciliation (per Validate Against Banner — Agree to the Cent or Stop) compares per-stage counts to Banner's Admissions Office reports: applications_received_this_term, applications_decided, applications_accepted, students_enrolled. Each count should agree between the warehouse and Banner; the Admissions Office runs these counts already and will notice instantly if they drift.
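The comparison itself is simple enough to automate. The following is a minimal Python sketch under stated assumptions: the metric names come from the paragraph above, the counts are invented for illustration, and `reconcile` is a hypothetical helper rather than part of the Windmill flow.

```python
def reconcile(warehouse, banner):
    """Return {metric: (warehouse_value, banner_value)} for every mismatch.

    A metric present on only one side is reported with None for the other.
    """
    metrics = sorted(set(warehouse) | set(banner))
    return {
        metric: (warehouse.get(metric), banner.get(metric))
        for metric in metrics
        if warehouse.get(metric) != banner.get(metric)
    }


# Illustrative counts only.
warehouse_counts = {
    "applications_received_this_term": 1200,
    "applications_decided": 950,
    "applications_accepted": 700,
    "students_enrolled": 410,
}
banner_counts = dict(warehouse_counts, students_enrolled=412)

mismatches = reconcile(warehouse_counts, banner_counts)
load_is_trustworthy = not mismatches
```

An empty result means every stage agrees; anything else names the stage to investigate first.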

Verify against Banner

Three reconciliation checks, ordered by what is most likely to catch a defect:

1. Per-stage applicant counts agree. For the current admit term:

-- Warehouse-side counts per milestone.
SELECT  dt.term_code,
        SUM(application_count) AS total_applications,
        SUM(CASE WHEN accepted_date_key > -1
                 THEN 1 ELSE 0 END) AS total_accepted,
        SUM(enrolled_count) AS total_enrolled
FROM    fct_admissions_pipeline f
JOIN    dim_term dt ON dt.term_key = f.admit_term_key
WHERE   dt.term_code = '202610'
GROUP BY dt.term_code;

Compare to the Admissions Office's current-term dashboard. Numbers should match to the applicant. If they do not, either the warehouse is missing applicants (extract gap) or counting applicants the Admissions Office does not count (stale or test-data rows in SARADAP).

2. The yield rate (enrolled / accepted) should match what the Admissions Office reports for this term:

SELECT  SUM(enrolled_count)::FLOAT /
        NULLIF(SUM(CASE WHEN accepted_date_key > -1
                        THEN 1 ELSE 0 END), 0) AS yield_rate
FROM    fct_admissions_pipeline f
JOIN    dim_term dt ON dt.term_key = f.admit_term_key
WHERE   dt.term_code = '202610';

Yield is the Admissions Office's headline number. If the warehouse disagrees, the Admissions Office will not trust the warehouse for anything else.

3. Average application-to-decision lag should match the operational service-level the Admissions Office tracks:

SELECT  AVG(application_to_decision_lag) AS avg_decision_lag_days
FROM    fct_admissions_pipeline f
JOIN    dim_term dt ON dt.term_key = f.admit_term_key
WHERE   dt.term_code = '202610'
  AND   application_to_decision_lag IS NOT NULL;

The Admissions Office likely already targets a specific turnaround (e.g., "decisions within 14 days of complete application"). The warehouse should report the same number.

Watch out

Five gotchas specific to accumulating snapshots:

**The milestone count depends on what your installation actually captures.** Kimball's 15-milestone example is aspirational. A real Banner installation may populate only 4-8 dates reliably. Audit SARADAP BEFORE building the DDL: a query of the form SELECT COUNT(*), COUNT(date_column) FROM saradap, with one COUNT(...) per candidate date column, reveals which dates are populated for what fraction of applicants. Build the fact for the milestones you actually have; do not over-engineer for the ones Kimball lists.

NEVER overwrite a populated milestone with NULL. The COALESCE pattern in the MERGE is load-bearing. A buggy load that does UPDATE ... SET decision_date_key = src.decision_date_key without the COALESCE will erase the decision date for any batch where the source row happens to be re-extracted before the decision is in. The accumulating snapshot's whole value is the history of when milestones happened; erasing them is catastrophic and silent.

Lag measures must be recomputed on every update. application_to_decision_lag depends on both application_date_key and decision_date_key. Set it only when both are populated, and recompute it whenever either changes. The MERGE in this article handles it for the most common case; verify your local load logic does too.
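The "only when both are populated" rule is easy to get wrong, so here is a small Python sketch of it. It assumes, purely for illustration, that date keys are YYYYMMDD integers; if your dim_date uses sequential day-number keys, plain subtraction already gives days. `key_to_date` and `lag_days` are hypothetical helper names.

```python
from datetime import date

SENTINEL = -1  # "Not Yet Occurred"


def key_to_date(date_key):
    """Convert a YYYYMMDD integer key to a date (assumed key format)."""
    return date(date_key // 10000, date_key // 100 % 100, date_key % 100)


def lag_days(application_date_key, decision_date_key):
    """Return the lag in days, or None unless both milestones are populated."""
    for key in (application_date_key, decision_date_key):
        if key is None or key == SENTINEL:
            return None
    delta = key_to_date(decision_date_key) - key_to_date(application_date_key)
    return delta.days


lag = lag_days(20250125, 20250208)        # 25 January to 8 February
pending = lag_days(20250125, SENTINEL)    # decision not made yet
naive = 20250208 - 20250125               # raw key subtraction, not days
```

With YYYYMMDD keys the raw difference across a month boundary is 83 while the calendar lag is 14 days, which is why the key format has to be known before a lag is computed by subtraction.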

**The accumulating snapshot is for SHORT-LIVED processes only.** Applicant lifecycle (inquiry → enrollment, ~6 months) fits perfectly. A student's full degree (4-6 years, dozens of possible milestones) does NOT. For long-running processes, model a transaction fact for the events (course completions, term registrations) and optionally a periodic snapshot (term-end enrollment status). See The Three Fact-Table Patterns — Transaction, Periodic, Accumulating for the decision tree.

The bus matrix is a discipline, not a deliverable. It is easy to build the second star and forget to document that dim_date is conformed across both. Six months later, someone proposes a third star and rebuilds their own dim_date_v2 because the conformation was not visible. Maintain a bus-matrix document (a wiki page, a spreadsheet, a diagram in this article) that lists every star and its dimensions, with conformed dimensions explicitly marked. The matrix is the warehouse's architecture-of-record.

The one-sentence takeaway

The second star earns its right to exist by teaching what the first one could not — a new fact pattern, a conformed dimension, a worked example of the bus matrix. For Waubonsee that star is Admissions as an accumulating snapshot, and the moment it ships, the warehouse stops being one star and starts being a constellation.

Track H · DataBlock architecture & engineering decisions

One DataBlock Per Report, or One for Many? The Decision Framework

9 min read · argos · datablock · architecture · decision-framework · consolidation · optimization · governance

The hook

The everyday analogy

Open a professional chef's knife drawer and you find a collection of specialized blades. A 10-inch chef's knife for general chopping. A paring knife for small precise work. A bread knife with a serrated edge for crusty loaves. A boning knife for breaking down chicken. A cleaver for joints. A Santoku for vegetables. Six knives in the drawer, six dedicated purposes, each one optimized for its single task. Each blade is sharpened on a different schedule, replaced individually when worn, ground at the angle its specific use demands. The drawer is busy. The maintenance is per-tool.

Open a backpacker's bag and you find one Leatherman multitool. Pliers, knife, screwdriver heads, can opener, file, scissors — all folded into one compact frame. The multitool weighs less than the equivalent collection of single-purpose tools. It is one item to maintain, one item to pack, one item to lose. Each function is slightly less optimal than its dedicated counterpart — the multitool's scissors are clumsier than dedicated kitchen shears; the knife blade is shorter than a chef's knife — but every function is there, in one place, ready.

A chef's knife drawer with six specialized blades laid out on the left; a backpacker's Leatherman multitool unfolded on the right; both photographed on the same wooden surface to emphasize the trade between optimization-per-task and consolidation.

Neither pattern is correct everywhere. The professional chef needs the specialized drawer because the workflow demands optimal performance per cut, and the kitchen has the space and the maintenance discipline. The backpacker needs the multitool because weight matters, space is constrained, and no single function is performed often enough to justify the optimal version.

Argos DataBlocks face the same choice. The 1:1 pattern is the specialized-knife drawer: each report gets a DataBlock cut to its purpose; each DataBlock can be optimized, indexed, and tuned in isolation; the cost is the size of the drawer and the per-knife maintenance. The consolidated pattern is the multitool: one DataBlock serves many reports via UNION ALL and a discriminator (see Shared DataBlocks — One SQL, Many Reports); each individual report may run slightly slower than its dedicated equivalent, but the catalog has far fewer items to maintain. Both patterns work. The criteria for choosing between them are the analytical equivalent of "is this a professional kitchen or a backpack?"

What it really is

Two architectural patterns, two sets of tradeoffs. Neither is the "right" answer in the abstract.

The 1:1 pattern. One Argos DataBlock per business report. The DataBlock's SQL targets the report's exact columns, filters, and parameters. The contract is narrow: this DataBlock serves this report and nothing else.

Strengths:

Optimization in isolation. A specific report can be tuned — new index, query rewrite, parameter restructure — without affecting any other report's behavior.
Diagnostic clarity. A failing report points to one DataBlock. The blast radius of any defect is one report.
Cognitive load per DataBlock is low. The SQL is focused, the parameter set is narrow, the column contract is fixed.
Migration history. When an institution converted from Crystal Reports to Argos, 1:1 was the rational bridge — fastest migration path, no re-architecture during the conversion.

Costs:

Maintenance multiplies with reports. A change to a shared filter (e.g., a new status code that should be excluded) requires editing every DataBlock that has that filter.
Definition drift. Two reports that should compute the same number — "active student count" — may drift apart over time as developers edit them independently.
Catalog grows linearly with reports. A campus with 500 reports has 500 DataBlocks to enumerate, govern, and audit.

The consolidated pattern. One Argos DataBlock holds a SQL body that serves multiple reports. It is implemented with the UNION ALL + discriminator approach from Shared DataBlocks — One SQL, Many Reports: each branch of the UNION tags its rows with a constant, and each consumer report filters by that tag. The alternative is a column-superset SQL in which each report hides the columns it does not need.

Strengths:

One source of truth. Business logic — joins, filter conventions, status-code policy — lives in one place. A change propagates to every consuming report automatically.
Catalog scales sublinearly with reports. A campus with 500 reports might have 80 DataBlocks. Each DataBlock serves many reports.
Cross-report consistency. Reports that share a DataBlock cannot drift apart in definition because they consume the same SQL.

Costs:

Per-DataBlock complexity grows. The SQL must accommodate every consumer's needs — every column any consumer might want, every filter combination, every sort order.
Optimization is harder. Tuning the SQL for one report's pattern may regress another report's performance. Indexing decisions become trade-offs across reports.
Blast radius is wider. A DataBlock defect affects every consuming report, not just one. Testing has to cover every consumer.

The decision criteria. Five factors tip the balance:

Change frequency. If the business logic in the SQL changes often, consolidation reduces the change-propagation burden. If the SQL is stable, 1:1's isolation penalty matters less.
Report similarity. Reports that share 80%+ of their SQL are good consolidation candidates. Reports that share only the table names are not.
Performance sensitivity. Reports where every second matters — executive dashboards, real-time queries — favor 1:1 for per-instance optimization. Reports run monthly in batch favor consolidation.
Team maturity. Consolidated DataBlocks need discipline: every consumer must be inventoried; every change tested across consumers. Less mature teams are safer with 1:1.
Governance capacity. A catalog of 500 DataBlocks needs more governance machinery than a catalog of 80. Match the approach to your team's actual capacity.

A 5×2 decision matrix: rows = decision criteria (change frequency, report similarity, performance sensitivity, team maturity, governance capacity), columns = 1:1 favored / Consolidated favored; cells contain the condition that tips the criterion toward that pattern.

See it — the diagram

A decision matrix: five rows, one per criterion. The left column describes the condition that favors 1:1 — "SQL changes rarely," "reports share under 60% of SQL," "response time under 5 seconds required," "team of one or two generalists," "governance tooling is manual." The right column describes the condition that favors consolidation — "SQL changes monthly," "reports share over 80% of SQL," "batch report, overnight window," "team with dedicated Argos specialist," "governance tooling is automated." The middle is the gradient. No row says "always pick X." Each row says "if your situation looks like this column, lean that way."

Show me the code

The same business question modeled both ways. Two reports: "Active Students by Major" and "Active Students by College." Same underlying tables, same join logic, one different filter dimension each.

1:1 pattern — two DataBlocks for two reports:

-- DataBlock A — "Active Students by Major" report
SELECT s.spriden_id, s.spriden_last_name,
       g.sgbstdn_majr_code_1 AS major
FROM   sgbstdn g
JOIN   spriden s
       ON  s.spriden_pidm        = g.sgbstdn_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  g.sgbstdn_stst_code = 'AS'
  AND  g.sgbstdn_term_code_eff = (SELECT MAX(g2.sgbstdn_term_code_eff)
                                  FROM   sgbstdn g2
                                  WHERE  g2.sgbstdn_pidm = g.sgbstdn_pidm)
  AND  g.sgbstdn_majr_code_1 = :main_DD_major;

-- DataBlock B — "Active Students by College" report
-- Nearly identical SQL; one output column and one WHERE predicate differ.
SELECT s.spriden_id, s.spriden_last_name,
       g.sgbstdn_coll_code_1 AS college
FROM   sgbstdn g
JOIN   spriden s
       ON  s.spriden_pidm        = g.sgbstdn_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  g.sgbstdn_stst_code = 'AS'
  AND  g.sgbstdn_term_code_eff = (SELECT MAX(g2.sgbstdn_term_code_eff)
                                  FROM   sgbstdn g2
                                  WHERE  g2.sgbstdn_pidm = g.sgbstdn_pidm)
  AND  g.sgbstdn_coll_code_1 = :main_DD_college;

Two DataBlocks, two parameter sets, two column contracts. A change to the active-student logic — say, adding g.sgbstdn_levl_code = 'UG' — requires editing both.

Consolidated pattern — one DataBlock serving both reports:

-- One DataBlock exposes both filter dimensions. Each consumer
-- report sets the parameter it needs and leaves the other NULL.
SELECT s.spriden_id, s.spriden_last_name,
       g.sgbstdn_majr_code_1  AS major,
       g.sgbstdn_coll_code_1  AS college
FROM   sgbstdn g
JOIN   spriden s
       ON  s.spriden_pidm        = g.sgbstdn_pidm
       AND s.spriden_change_ind  IS NULL
       AND s.spriden_entity_ind  = 'P'
WHERE  g.sgbstdn_stst_code = 'AS'
  AND  g.sgbstdn_term_code_eff = (SELECT MAX(g2.sgbstdn_term_code_eff)
                                  FROM   sgbstdn g2
                                  WHERE  g2.sgbstdn_pidm = g.sgbstdn_pidm)
  AND  (:main_DD_major   IS NULL OR g.sgbstdn_majr_code_1  = :main_DD_major)
  AND  (:main_DD_college IS NULL OR g.sgbstdn_coll_code_1  = :main_DD_college);

One DataBlock, more columns, more optional parameters, more complex WHERE predicates. A change to the active-student logic edits one place and propagates to both reports. The cost: the SQL is harder to read and harder to tune than either of the originals. That is the tradeoff.
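The optional-parameter predicate is the part that tends to surprise people, so a runnable sketch may help. It uses Python's built-in sqlite3 as a stand-in for Oracle, with a hypothetical three-row `active_students` table and made-up codes; Argos binds parameters differently, so treat this only as a demonstration of how `(:p IS NULL OR col = :p)` behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE active_students (id TEXT, major TEXT, college TEXT)")
conn.executemany(
    "INSERT INTO active_students VALUES (?, ?, ?)",
    [("A001", "BIO", "AS"), ("A002", "CHM", "AS"), ("A003", "ACC", "BU")],
)

# Each predicate switches itself off when its parameter is NULL.
QUERY = """
SELECT id
FROM   active_students
WHERE  (:major   IS NULL OR major   = :major)
  AND  (:college IS NULL OR college = :college)
ORDER  BY id
"""


def run_report(major=None, college=None):
    """Run the shared query; an unset parameter is bound as NULL."""
    rows = conn.execute(QUERY, {"major": major, "college": college})
    return [row[0] for row in rows]


by_major = run_report(major="BIO")      # the "by Major" consumer
by_college = run_report(college="AS")   # the "by College" consumer
unfiltered = run_report()               # neither parameter set
```

Note the third call: when both parameters are NULL the query returns every active student. Whether that is acceptable, or whether each consumer report must be forced to supply its parameter, is one of the contract decisions the consolidated pattern adds.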

Where intuition fails

These gotchas apply to both patterns. Neither side escapes them.

Drift happens to both — just differently. 1:1 lets reports drift apart in definition: two reports that should compute "headcount" the same way slowly diverge as developers edit each in isolation. Consolidation lets reports drift apart in consumption: a column that one report needs gets added to the shared DataBlock, and over time the DataBlock accumulates columns no consumer still uses. Both patterns need governance discipline; the governance just looks different.

Optimization is asymmetric, not easier. 1:1 lets each report be optimized independently — but the optimization expertise is required per DataBlock, and similar reports may each need the same tuning applied separately. Consolidation lets one optimization improve many reports — but a tuning that helps Report A may regress Report B. Neither pattern is "easier to optimize" in the abstract.

Testing burden differs in shape, not total size. 1:1's testing is per-report — smaller scope per test, more tests total. Consolidation's testing is cross-consumer — fewer tests per change, but each test must verify every consumer. Total testing effort across the catalog is comparable. The difference is pain distribution, not volume.

Migration is not free in either direction. Splitting a consolidated DataBlock back into 1:1 means duplicating SQL into N copies and re-syncing the divergences that accumulated. Consolidating N DataBlocks into one means reconciling N different views of the same business logic into a single SQL body. Either direction is project work, not a refactor. Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone covers the safe path.

The "right answer" depends on report cardinality and change frequency — and both change over time. A campus with 50 reports that rarely change finds 1:1 fully manageable. A campus with 500 reports changing weekly finds consolidation essential. The transition between these regimes is not a clean threshold; it is a gradient. The remaining Track H articles — Finding Consolidation Candidates — Programmatic Similarity Across the Catalog and When 1:1 Wins — The Case for One DataBlock Per Report — propose ways to measure where on that gradient your campus sits.

The one-sentence takeaway

The 1:1 pattern optimizes for isolation: each report can be tuned, debugged, and changed independently, at the cost of catalog size and definition drift across similar reports. The consolidated pattern (the UNION ALL + discriminator approach from Shared DataBlocks — One SQL, Many Reports) optimizes for consistency: one source of truth, one place to change, at the cost of per-DataBlock complexity and wider blast radius. The right choice depends on five criteria: change frequency, report similarity, performance sensitivity, team maturity, and governance capacity. Neither pattern is universally correct. Both require governance discipline; it just looks different in each.

Track H · DataBlock architecture & engineering decisions

Finding Consolidation Candidates — Programmatic Similarity Across the Catalog

10 min read · argos · datablock · consolidation · similarity · jaccard · tooling · catalog

The hook

The everyday analogy

Open your photo library on a modern phone. The app has a "Suggestions" or "Duplicates" section. It shows you stacks of near-identical photos with a "Merge?" button — the burst of 12 photos you took of the same sunset, the screenshot saved three times, the family photo captured from two phones at the same scene. The app does NOT silently delete duplicates. It surfaces ranked candidates and lets you decide which to keep, which to merge, which were actually different — the two sunset photos that look identical but are from different trips.

How the photo app decides what is "similar" is a similarity score combining several signals: image hash (visual content), metadata (date, location), file size, dimensions. No single signal is conclusive — two photos with identical pixels but different EXIF dates may be genuinely different versions (edited vs. original); two photos with identical EXIF but different content are probably a multi-shot burst. The app weights the signals, ranks the suggestions, and asks you to confirm.

A phone screen showing a 'duplicate photos' suggestion panel with stacks of near-identical images, each pair marked with a similarity percentage and a 'Review' button; on the desk beside the phone, a printout of an Argos DataBlock similarity CSV showing parallel pair listings.

Finding Argos DataBlock consolidation candidates is the same shape of problem. The catalog has ~670 DataBlocks. Some pairs are obvious duplicates — "Adjunct Faculty Degree Info" and "FT Faculty Degree Info" at 0.986 similarity: same SQL shape, same tables, same fields, just a different WHERE pebempl_ecls_code = ... filter. Some pairs are false positives — two reports that hit the same tables but serve genuinely different business purposes. And most pairs are not similar at all and should be ignored.

The similarity tool at wiki/src/argos_similarity.py plays the same role as the photo app's Suggestions panel: scan everything, rank the candidates, surface them for human review, do not auto-merge. The developer reviews the top of the list, picks the obvious duplicates, opens the SQL side by side, and decides which to consolidate using the Shared DataBlocks — One SQL, Many Reports pattern.

What it really is

The tool scores every pair of DataBlocks on four similarity dimensions, combines them into a weighted score, and outputs a ranked CSV. Each dimension uses Jaccard similarity: |A ∩ B| / |A ∪ B| — how much of the combined set is shared.

The four dimensions, in order of weight:

**sql_jaccard × 0.40** — Jaccard on the normalized SQL token bag. The SQL is lowercased, comments stripped, string literals collapsed to '', and SQL stopwords removed (SELECT, FROM, WHERE, AND, OR, JOIN, LEFT, RIGHT, ON, GROUP, BY, ORDER, HAVING, CASE, WHEN, THEN, ELSE, END, NVL, DECODE, SYSDATE, TRUNC…). What remains is the structural skeleton of the query — table aliases, column names, operators, Banner-specific function calls. This is the strongest signal because two DataBlocks with the same query shape almost certainly do the same kind of work.

**table_jaccard × 0.30** — Jaccard on the set of Banner source tables, extracted from FROM/JOIN clauses and filtered to the 6–9 character lowercase Banner naming pattern. Two DataBlocks hitting spriden, sgbstdn, and stvmajr with high overlap are pulling from the same source data.

**field_jaccard × 0.20** — Jaccard on the set of visible output fields the DataBlock declares. Two DataBlocks exposing the same columns are likely serving similar consumer reports.

**param_jaccard × 0.10** — Jaccard on the set of :main_*, :lcl_*, :dbn_* parameter names extracted from the SQL. Parameters carry the user-input contract — two DataBlocks with the same parameters are likely meant to be consumed by similar reports. Weighted lowest because many DataBlocks share trivial parameter names (:main_DD_term, :main_DD_pidm) without being genuine duplicates.

Why weight SQL highest? SQL token overlap is the most precise signal. Table overlap alone is a weaker signal — hundreds of DataBlocks query SPRIDEN; that does not make them consolidation candidates. But two DataBlocks whose SQL token bags overlap 96% share the same join graph, the same filter structure, the same output shape. That is a consolidation candidate.

Pruning. The tool skips pairs that share zero source tables. Two DataBlocks with no table overlap are not consolidation candidates by any definition, and skipping them cuts the work from ~224,000 pairs (every pairing of ~670 DataBlocks) to a fraction of that.
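The scoring and pruning described above fit in a few lines. The sketch below is an illustration, not the tool's source: the weights are the ones listed, but the dictionary layout, the function names, the three sample blocks, and the choice to score two empty sets as 0.0 are all assumptions.

```python
from itertools import combinations

# Weights as listed above; the keys are names chosen for this sketch.
WEIGHTS = {"sql": 0.40, "tables": 0.30, "fields": 0.20, "params": 0.10}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets score 0.0 here (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_score(a, b):
    """Weighted sum of the four per-dimension Jaccard similarities."""
    return sum(w * jaccard(a[dim], b[dim]) for dim, w in WEIGHTS.items())


def ranked_pairs(blocks, threshold=0.50):
    """Score every pair that shares a table; return those at or above threshold."""
    ranked = []
    for (name_a, a), (name_b, b) in combinations(blocks.items(), 2):
        if not (a["tables"] & b["tables"]):
            continue  # pruning: no shared source table, never a candidate
        score = pair_score(a, b)
        if score >= threshold:
            ranked.append((score, name_a, name_b))
    return sorted(ranked, reverse=True)


# Hypothetical DataBlocks, reduced to the four sets the score needs.
blocks = {
    "Block A": {
        "sql": {"spriden_pidm", "sgbstdn_pidm", "sgbstdn_stst_code", "sgbstdn_majr_code_1"},
        "tables": {"spriden", "sgbstdn"},
        "fields": {"id", "last_name"},
        "params": {":main_DD_term"},
    },
    "Block B": {
        "sql": {"spriden_pidm", "sgbstdn_pidm", "sgbstdn_stst_code", "sgbstdn_coll_code_1"},
        "tables": {"spriden", "sgbstdn", "stvcoll"},
        "fields": {"id", "last_name"},
        "params": {":main_DD_term"},
    },
    "Block C": {
        "sql": {"nbbposn_posn", "nbbposn_title"},
        "tables": {"nbbposn"},
        "fields": {"position", "title"},
        "params": set(),
    },
}

results = ranked_pairs(blocks)
```

For Block A and Block B the four similarities are 3/5, 2/3, 1 and 1, so the score is 0.4·0.6 + 0.3·0.667 + 0.2 + 0.1 ≈ 0.74. Block C shares no table with either and is never scored.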

The four similarity dimensions stacked vertically as a weighted layer cake: sql_jaccard (40%), table_jaccard (30%), field_jaccard (20%), param_jaccard (10%). Each dimension shows two example sets with the intersection highlighted coral. Total score at the bottom.

The threshold is tunable. The default 0.50 surfaced 256 pairs in Waubonsee's catalog — manageable for a sprint of review work. Raise to 0.70 to see only high-confidence duplicates (~30 pairs). Lower to 0.30 to see borderline cases worth a glance.

See it — the diagram

A weighted layer cake: four horizontal bars stacked vertically, each representing one similarity dimension. The widest bar at the top is sql_jaccard (40%), rendered in coral to mark it as the strongest signal. Below it: table_jaccard (30%), field_jaccard (20%), param_jaccard (10%). Each bar shows two small overlapping Venn-style sets — the intersection highlighted in coral, the non-overlapping portions in ink. At the bottom, the combined score is a single number: the weighted sum of the four bars above it. The visual says: four signals, one score, SQL shape does most of the work.

Show me the code

The tool's actual output — top 5 pairs from a real production run across 670 DataBlocks:

rank  score  datablock_a                                  datablock_b
   1  1.000  Argos Report Security-HR                     Argos Report Security
   2  1.000  Finance Security Classes with Users          Security Classes with Users
   3  1.000  FY End Salary Increase - Admin and Staff     FY End Salary Increase - FT Faculty
   4  0.991  Finance Security Classes with Users          Student Security Classes with Users
   5  0.991  Security Classes with Users                  Student Security Classes with Users

Pairs 2, 4, and 5 form a triangle: Finance, Student, and base Security Classes with Users all sit at near-1.0 similarity. Three DataBlocks, same SQL shape, same tables, same fields — three flavors of the same security-classes report, differentiated by a domain filter. One consolidated DataBlock with a discriminator column ('FINANCE', 'STUDENT', 'BASE') could replace all three. That is the consolidation opportunity H1's framework exists to evaluate.

The SQL token bag normalizer — the most novel piece of the tool:

# wiki/src/argos_similarity.py (excerpt)
import html
import re

# _COMMENT and _TOKEN are compiled regexes defined elsewhere in the module
# (not shown in this excerpt).

def sql_token_bag(sql: str) -> set[str]:
    """Normalize SQL into a bag (set of distinct tokens) for Jaccard on shape."""
    s = html.unescape(sql).lower()          # decode HTML entities, fold case
    s = _COMMENT.sub(" ", s)                # strip comments
    s = re.sub(r"'[^']*'", "''", s)         # collapse string literals
    tokens = _TOKEN.findall(s)              # word + number + operator
    stop = {"select", "from", "where", "and", "or", "as", "on",
            "group", "by", "order", "having", "is", "not", "null",
            "join", "left", "right", "inner", "outer", "case",
            "when", "then", "else", "end", "nvl", "decode",
            "sysdate", "trunc"}
    # Drop stopwords and single-character tokens.
    return {t for t in tokens if t not in stop and len(t) > 1}

Comments and literals encode values, not structure. SQL stopwords are shared by every query. What remains (column names, table aliases, Banner function calls, comparison operators) is the structural fingerprint, minus any single-character token such as a one-letter alias like s or g, which the len(t) > 1 filter drops. Two DataBlocks with the same fingerprint are doing the same shape of work.
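To see the fingerprint idea end to end, here is a self-contained variant that can be run as-is. The two regexes are stand-ins written for this sketch, since the tool's own `_COMMENT` and `_TOKEN` are not shown in the excerpt, and the two queries (including the 'AD' and 'FT' codes) are invented to echo the adjunct versus full-time example.

```python
import html
import re

# Stand-in patterns (assumptions): the real ones live in argos_similarity.py.
_COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
_TOKEN = re.compile(r"[a-z_][a-z0-9_$#]*|\d+|[<>!=|]{2}")

STOP = {"select", "from", "where", "and", "or", "as", "on", "group", "by",
        "order", "having", "is", "not", "null", "join", "left", "right",
        "inner", "outer", "case", "when", "then", "else", "end", "nvl",
        "decode", "sysdate", "trunc"}


def sql_token_bag(sql):
    """Lowercase, strip comments, collapse literals, drop stopwords."""
    s = html.unescape(sql).lower()
    s = _COMMENT.sub(" ", s)
    s = re.sub(r"'[^']*'", "''", s)
    return {t for t in _TOKEN.findall(s) if t not in STOP and len(t) > 1}


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Same shape, different literal and different comment style.
adjunct = "SELECT pebempl_pidm FROM pebempl WHERE pebempl_ecls_code = 'AD' -- adjunct"
full_time = "SELECT pebempl_pidm FROM pebempl WHERE pebempl_ecls_code = 'FT' /* full-time */"

bag_a = sql_token_bag(adjunct)
bag_b = sql_token_bag(full_time)
similarity = jaccard(bag_a, bag_b)
```

Both queries reduce to the same three tokens, so the pair scores 1.0 on this dimension even though the filter values differ. That is the intended behavior, and also the reason a perfect score still needs a human look at the WHERE clause.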

The run invocation:

python wiki/src/argos_similarity.py --threshold 0.50 --top 200
# outputs:
#   briefs/argos_similarity_candidates.csv
#   briefs/argos_similarity_summary.md

The tool is read-only — it never modifies the source JSON files. It is cheap to re-run after every Argos export.

Update — v2.3 (the tool grew up)

The v1 architecture above (four weighted Jaccards) is conceptually correct and is still what produces the score. But running it against the real catalog surfaced limitations: 256 pairs scored high but only a handful were genuinely consolidatable. The score said "these look similar" without saying "these are cheap to merge". A column rename can break a calc field; a filter divergence can mean two queries serve different populations; the same DataBlock can be consumed by twelve Reports or one. None of those costs were in the v1 score.

So v2.0 → v2.3 added cost signals alongside the similarity score. The score still answers "do these look alike?". The new signals answer "what would it cost to merge them?". Same input data — different question. Eight additions, briefly:

Transitive clusters (v2.0) — if A~B and B~C, the tool groups {A, B, C} once via union-find instead of three pair rows. Clusters are the real refactor unit.
Containment alongside Jaccard (v2.0) — |A∩B| / min(|A|, |B|) next to Jaccard. Containment ≈ 1.0 means one DataBlock is mostly a subset of the other — the easiest "drop the subset" consolidation, hidden from plain Jaccard.
TF-IDF cosine on SQL tokens (v2.0) — replaced plain Jaccard. Rare tokens (the ones that actually discriminate near-duplicates) outweigh ubiquitous noise like spriden_pidm or to_char.
Blast radius (v2.1) — every DataBlock's reports list is loaded. Per pair: total_consumers = distinct Argos Reports a merge would touch. A pair affecting 2 Reports is a different conversation than one affecting 40.
Alias contract delta (v2.1) — Argos late-binds Reports to DataBlocks by column name (groupings, conditional print sections, calc fields, sorts all reference the alias). The script reports alias_a_only, alias_b_only, alias_shared so the cost of a rename is visible up front.
Calc-field reverse-index (v2.2) — each DataBlock's calculated_fields[].expression is parsed, identifier tokens intersected with that block's own columns. The intersection = frozen_columns: rename them and a calc field on this DataBlock will break. Per pair: a_calc_at_risk / b_calc_at_risk — calc dependencies that would be orphaned if that side were dropped.
Clause-split SQL similarity (v2.3) — the SQL token bag is also split at top-level WHERE / GROUP BY / ORDER BY positions (respecting parenthesis depth so subqueries stay with their enclosing clause). select_sim high + where_sim low surfaces the "same shape, different population" pattern — consolidation would require pushing filter logic to the Report side.
Consolidation-cost score (v2.3) — combines blast radius, alias delta, calc orphans, and filter divergence into a single merge_cost (float) and cost_band (low / medium / high). Per cluster: avg_pair_cost, max_pair_cost, and a cluster cost band.
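Two of the additions above are easy to sketch in isolation: containment next to Jaccard, and transitive clustering via union-find. The function names and sample token sets are hypothetical; this shows the technique, not the tool's code.

```python
def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def containment(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|): close to 1.0 when one set sits inside the other."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def clusters(pairs):
    """Group names linked by any chain of similar pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(g) for g in groups.values())

small = {"spriden_pidm", "sgbstdn_term_code_eff"}
big = small | {"stvmajr_desc", "stvcoll_desc", "sfrstcr_crn", "sfrstcr_rsts_code"}
grouped = clusters([("A", "B"), ("B", "C"), ("D", "E")])
```

Here `small` is a strict subset of `big`, so containment is 1.0 while Jaccard is only 2/6. The three pairs collapse into two clusters, {A, B, C} and {D, E}.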

For the operational guide — how to run the tool, how to read every column, the recommended workflow (orphans first, then cost_band = low clusters, then cost_band = low pairs, generally don't touch high), the gotchas — see Running Argos Similarity v2.3 — the operational guide.
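The alias contract delta and the calc-field reverse-index described above both reduce to set operations on column names. A rough sketch follows, assuming calc-field expressions can be tokenized with a simple identifier regex. The real expression grammar is not described here, so treat the tokenizer, the function names, and the sample expression as hypothetical.

```python
import re

def alias_delta(cols_a, cols_b):
    """Split two DataBlocks' column aliases into a-only, b-only, and shared."""
    a, b = set(cols_a), set(cols_b)
    return {"alias_a_only": sorted(a - b),
            "alias_b_only": sorted(b - a),
            "alias_shared": sorted(a & b)}

def frozen_columns(columns, calc_expressions):
    """Columns referenced by at least one calc-field expression on this block."""
    cols = {c.lower() for c in columns}
    identifiers = set()
    for expr in calc_expressions:
        identifiers |= {t.lower() for t in re.findall(r"[A-Za-z_]\w*", expr)}
    return sorted(cols & identifiers)

delta = alias_delta(["student_id", "course_code", "fund_code"],
                    ["student_id", "course_code", "credit_hours"])
frozen = frozen_columns(["student_id", "credit_hours", "fund_code"],
                        ["credit_hours * 2"])   # hypothetical calc expression
```

Any column that shows up in `frozen` cannot be renamed during a merge without also editing the calc field that references it.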

Where intuition fails

The list is NOT a merge plan. It is a shortlist for human review. Two DataBlocks with score 1.000 may still serve genuinely different audiences. "Adjunct Faculty Degree Info" and "FT Faculty Degree Info" produce the same shape of report but go to different department heads; consolidating them affects governance, not just SQL. Always read the actual SQL of every candidate pair before declaring them consolidatable.

False positives are normal and expected. A DataBlock that hits SPRIDEN + SGBSTDN + a small lookup may share most of its tokens with another DataBlock that hits the same tables for a different business question. The tool surfaces them anyway — the developer's job is to filter them out.

False negatives are also possible — adjust the threshold. Two DataBlocks that SHOULD consolidate may score lower than expected because one uses different table aliases (s vs. spr), different parameter names (:term_code vs. :main_DD_term), or radically different formatting. If you suspect missing candidates, lower the threshold and scan the borderline.

Re-run after every major Argos export. The catalog grows organically — every new report is a new chance for duplicates to accumulate. The tool is cheap to re-run; make it part of the quarterly Argos hygiene cycle, not a one-time exercise.

The output is a working document, not a final report. Treat the CSV as a backlog. Pick the top N each sprint, work through them with the report owners, and re-rank after each consolidation lands. The score distribution will shift as obvious duplicates are removed. Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone covers the safe migration path once you have chosen your candidates.

The one-sentence takeaway

A weighted Jaccard similarity tool (sql_jaccard × 0.40 + table_jaccard × 0.30 + field_jaccard × 0.20 + param_jaccard × 0.10) scans the Argos catalog and ranks DataBlock pairs by architectural similarity. SQL token overlap is the strongest signal — two DataBlocks with the same query shape almost certainly do the same kind of work. The output is a ranked shortlist for human review, NOT an automatic merge plan. Every candidate pair needs eyes on the actual SQL before consolidation work begins. Re-run after every major Argos export to catch newly accumulated duplicates.
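The weighted formula above can be written out in a few lines. This is a sketch of the arithmetic only, assuming each DataBlock has already been reduced to four token sets. The dictionary keys, the `similarity` function name, and the sample blocks are illustrative.

```python
# Weights taken from the formula in the text.
WEIGHTS = {"sql": 0.40, "table": 0.30, "field": 0.20, "param": 0.10}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity(block_a: dict, block_b: dict) -> float:
    """Weighted sum of the four per-facet Jaccard scores."""
    return sum(w * jaccard(block_a[k], block_b[k]) for k, w in WEIGHTS.items())

block_a = {"sql": {"spriden_id", "nbrjobs_salary"},
           "table": {"spriden", "nbrjobs"},
           "field": {"spriden_id", "nbrjobs_salary"},
           "param": {"term_code"}}
# Same query shape, but a differently named parameter.
block_b = dict(block_a, param={"main_DD_term"})

score = similarity(block_a, block_b)
```

The two sample blocks agree on everything except the parameter name, so the parameter facet contributes nothing and the pair scores about 0.90 rather than 1.00.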

Track H · DataBlock architecture & engineering decisions

Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone

8 min read · argos · datablock · consolidation · migration · rollback · parallel-verification · discriminator

The hook

The everyday analogy

A careful renter does not give up the old apartment the day they get the keys to the new one. They lease the new place starting one week before the old lease ends. For seven days, both apartments are alive. The renter moves boxes over in batches — first the books, then the kitchen, then the bedroom — testing as they go. If the new apartment has a problem (the water pressure is bad, the outlets do not work, the lock sticks), they have somewhere to sleep tonight while they sort it out. By day seven, every box has moved, every utility is transferred, the new place is functioning, and the old apartment is empty. Only then do they turn in the keys.

Two apartments side by side in a building cutaway: the old one partially packed with coral-labeled moving boxes, the new one half-furnished with matching boxes already arrived; both lit, both alive; a calendar on the wall between them shows a one-week overlap highlighted in coral.

Consolidating Argos DataBlocks works exactly the same way. The new consolidated DataBlock is built alongside the originals — both alive at the same time. Consuming reports migrate one at a time from old to new, each one verified. The old DataBlocks stay registered, queryable, fully functional, for the entire migration window. Only after every consumer has been switched and verified do the old DataBlocks get archived. If a consumer migration discovers a bug — a column the new DataBlock forgot, a filter that does not compose correctly — the consumer rolls back to the old DataBlock instantly. The old apartment is still leased.

The discipline is the same as moving. Inventory what you have. Build the new place before you commit to leaving the old. Verify each box arrived intact. Move in batches. Decommission only when the move is fully done. No big-bang swaps. No "I will figure it out when I get there." Every step has a rollback path.

What it really is

Five sequential phases. Each has a clear deliverable and a rollback path. None is optional.

Phase 1 — Inventory the consumers. Before building anything new, list every report that consumes each of the DataBlocks being merged. Use the Argos designer's "where is this DataBlock used?" view or the BSS schema search as the source of truth. The deliverable: a spreadsheet with one row per consumer report — which DataBlock it consumes today, which department owns the report, how often it runs. This is the migration's punch list. Rollback: trivial — you have written nothing yet.

Phase 2 — Shadow-build the new consolidated DataBlock. Build the new DataBlock alongside the originals using Shared DataBlocks — One SQL, Many Reports's UNION ALL + discriminator pattern. The new DataBlock produces the same columns as the originals (plus the discriminator column) and exposes every filter and parameter the consumers need. Critically, do NOT touch the original DataBlocks. They remain alive, registered, and consumable. Rollback: delete the new DataBlock; nothing else changes.

Phase 3 — Parallel verification. For each consumer report, run it twice: once against the old DataBlock, once against the new (with the appropriate discriminator filter). The outputs must be identical — same row count, same column values, same totals. Differences mean the new DataBlock has not faithfully reproduced the old logic; fix the new DataBlock until equivalence holds for every consumer. The deliverable is a verification log: one entry per consumer, reporting "equivalent" or "differs in column X, fix Y." Rollback: trivial; nothing has changed in production.

Phase 4 — Gradual cutover. Once every consumer has passed parallel verification, migrate the consumers one at a time. For each: change the consumer's DataBlock reference to the new one, add WHERE layout = '<discriminator>', re-run the report, and confirm it produces the same output it did in Phase 3. Schedule the cutovers over several weeks if the catalog is large; do not batch them in one weekend. Each migrated consumer is its own change with its own rollback. The old DataBlocks remain alive throughout this phase — unmigrated consumers continue to use them.

Phase 5 — Deprecation. Only after every consumer has been cut over for a confidence window — typically 1–2 monthly cycles — archive the old DataBlocks. Mark them inactive in the Argos catalog; do not delete immediately. Keep the archived DataBlocks queryable for 6–12 months as insurance. After that retention period AND a sweep of every Argos report for residual references to the old names, the old DataBlocks can finally be deleted. The migration is complete only when the deprecated DataBlocks are gone — and only then.

The five-phase migration as a horizontal timeline: Inventory, Shadow Build, Parallel Verification, Gradual Cutover, Deprecation. Each phase shows old (amber) and new (coral) DataBlocks; the cutover phase shows consumer reports migrating one at a time; deprecation shows the old DataBlocks fading from amber to grey.

See it — the diagram

A horizontal timeline read left to right. Five phases, each a labeled segment. Phase 1 (Inventory) shows a spreadsheet icon with consumer rows. Phase 2 (Shadow Build) shows two DataBlock icons side by side — an amber one labeled "old (alive)" and a coral one labeled "new (shadow)." Phase 3 (Parallel Verification) shows two output tables side by side with a green checkmark between them. Phase 4 (Gradual Cutover) shows consumer report icons migrating one at a time from the amber column to the coral column, with arrows and timestamps. Phase 5 (Deprecation) shows the amber DataBlock icon fading from amber to grey, then disappearing. Below the timeline, a single rule spans all five phases: "Old DataBlocks stay alive through Phase 4."

Show me the code

The parallel-verification query — the workhorse of Phase 3. For each consumer report, run old and new output side by side and diff them:

-- Parallel verification: compare old DataBlock output to new
-- consolidated DataBlock output, row by row. Anything that differs
-- is a migration defect to fix before cutover.

WITH old_output AS (
  -- The SQL from the OLD DataBlock, parameterized for this
  -- consumer's typical filter set.
  SELECT student_id, course_code, credit_hours, fund_code
  FROM   old_datablock_for_consumer_a
  WHERE  term_code = :main_DD_term
),
new_output AS (
  -- The SQL from the NEW consolidated DataBlock, filtered to
  -- this consumer's discriminator slice.
  SELECT student_id, course_code, credit_hours, fund_code
  FROM   new_consolidated_datablock
  WHERE  term_code = :main_DD_term
    AND  layout    = 'CONSUMER_A'
)
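-- Each direction sits in its own inline view so the grouping of the
-- set operators is explicit rather than left to evaluation order.
-- On Oracle versions that lack EXCEPT, substitute MINUS.
SELECT d.*
FROM  (SELECT 'in old not new' AS diff_type, student_id, course_code,
              credit_hours, fund_code
       FROM   old_output
       EXCEPT
       SELECT 'in old not new', student_id, course_code,
              credit_hours, fund_code
       FROM   new_output) d

UNION ALL

SELECT d.*
FROM  (SELECT 'in new not old' AS diff_type, student_id, course_code,
              credit_hours, fund_code
       FROM   new_output
       EXCEPT
       SELECT 'in new not old', student_id, course_code,
              credit_hours, fund_code
       FROM   old_output) d;

-- Empty result set = same distinct rows on both sides = ready to cut over.
-- EXCEPT ignores duplicate rows, so also compare COUNT(*) of old_output
-- and new_output to confirm the row counts match.
-- Any non-empty result = defect; fix the new DataBlock and re-run.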

The EXCEPT pattern catches both directions: rows the old DataBlock produced that the new one missed, and rows the new one produced that the old one never would. Both are defects. An empty result set is the only acceptable outcome.
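Because EXCEPT compares distinct rows, a row that appears twice on one side and once on the other slips through. If both outputs are exported (for example to CSV) and loaded as lists of rows, a multiset comparison catches that case too. A minimal sketch, with made-up sample rows and a hypothetical `diff_outputs` helper:

```python
from collections import Counter

def diff_outputs(old_rows, new_rows):
    """Compare two result sets as multisets, so duplicate counts matter."""
    old_counts = Counter(map(tuple, old_rows))
    new_counts = Counter(map(tuple, new_rows))
    return {
        # Counter subtraction keeps only rows with a positive surplus.
        "in old not new": list((old_counts - new_counts).elements()),
        "in new not old": list((new_counts - old_counts).elements()),
    }

# Invented sample data: (student_id, course_code, credit_hours, fund_code)
old = [("A1", "BIO101", 3, "F1"), ("A1", "BIO101", 3, "F1"), ("A2", "CHM101", 4, "F1")]
new = [("A1", "BIO101", 3, "F1"), ("A2", "CHM101", 4, "F1"), ("A3", "ENG101", 3, "F2")]

report = diff_outputs(old, new)
```

The duplicated A1 row shows up under "in old not new" even though a distinct-row comparison would treat it as matched. Two empty lists remain the only acceptable outcome.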

The consumer cutover — before and after in the Argos report's DataBlock SQL:

-- BEFORE (consumer report references the OLD dedicated DataBlock):
SELECT student_id, course_code, credit_hours
FROM   old_datablock_for_consumer_a
WHERE  term_code = :main_DD_term;

-- AFTER (consumer report references the NEW consolidated
-- DataBlock, with the discriminator filter):
SELECT student_id, course_code, credit_hours
FROM   new_consolidated_datablock
WHERE  term_code = :main_DD_term
  AND  layout    = 'CONSUMER_A';   -- discriminator filter

The consumer's parameters and output columns are unchanged. The user of the consumer report never sees the migration. They open the report on Monday and it produces the same numbers it always did. The only thing that changed is which DataBlock the report queries.

Where intuition fails

The old DataBlock stays alive during the migration. The single biggest mistake is deleting the old DataBlock "to clean up" before every consumer has cut over and spent time in production. Without the old DataBlock as a rollback target, the first consumer that breaks has no recovery path. Old DataBlocks are not deprecated until Phase 5; do not touch them earlier.

Parallel verification must cover every consumer, not a sample. A consumer report that uses a rarely-exercised parameter combination may produce different output under the new DataBlock even if 95% of consumers verify clean. Run Phase 3 for every consumer with every typical parameter set. The verification log is the artifact that makes the migration trustworthy.

The discriminator literal must match exactly on both sides. The new DataBlock emits a layout column with a literal per branch ('CONSUMER_A', 'CONSUMER_B', etc.). The consumer's filter must match that literal exactly — case-sensitive, no leading or trailing spaces. A mismatch silently produces zero rows, and nobody notices until the report is opened. Standardize the discriminator vocabulary up front and document it in a comment in the consolidated DataBlock's SQL.
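One cheap guard is to check every consumer's filter literal against the agreed vocabulary before cutover. The sketch below assumes the vocabulary is kept as a simple set of strings. The `DISCRIMINATORS` set and the `check_discriminator` helper are hypothetical names, not part of Argos.

```python
from typing import Optional

# Assumption: the agreed discriminator vocabulary, kept in one place.
DISCRIMINATORS = frozenset({"CONSUMER_A", "CONSUMER_B"})

def check_discriminator(literal: str) -> Optional[str]:
    """Return a description of the problem, or None if the literal is exact."""
    if literal in DISCRIMINATORS:
        return None
    if literal.strip() in DISCRIMINATORS:
        return "leading or trailing whitespace"
    if literal.strip().casefold() in {d.casefold() for d in DISCRIMINATORS}:
        return "case mismatch"
    return "unknown discriminator"

problems = {lit: check_discriminator(lit)
            for lit in ["CONSUMER_A", "CONSUMER_A ", "consumer_b", "CONSUMER_Z"]}
```

Each of the three bad literals would return zero rows in a live report. Checked this way, each one fails with a named reason before the consumer is migrated.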

Gradual cutover over weeks beats batched cutover over a weekend. A batched cutover means every breakage hits at once; the team is swamped on Monday. A gradual cutover spreads the breakage risk and gives the team time to recover between consumers. The total clock-time is longer, but the per-incident severity is lower — and each incident has a rollback path independent of the others.

Schedule the deprecation phase deliberately, not by forgetfulness. Phase 5 — the actual deletion of old DataBlocks — needs a calendar reminder, not "we will get to it." If the team forgets, the old DataBlocks accumulate and the catalog stays cluttered with deprecated artifacts. Treat the deletion as a scheduled task with a sign-off, not a cleanup that happens when someone notices. The migration is not complete until the deprecated DataBlocks are gone and the catalog reflects the consolidated state.

The one-sentence takeaway

Safe DataBlock consolidation follows five sequential phases, each with a rollback path: (1) inventory every consumer report and its owner, (2) shadow-build the new consolidated DataBlock alongside the originals — old DataBlocks remain alive and untouched, (3) parallel-verify every consumer by running old and new output side by side until they match exactly, (4) gradual cutover — migrate consumers one at a time over weeks, not a batch over a weekend, (5) deprecate old DataBlocks only after a 1–2 cycle confidence window. The old DataBlocks stay alive through Phase 4 — that is the safety net. Delete nothing until every consumer has proven itself in production.

When 1:1 Wins — The Case for One DataBlock Per Report

H1 framed the debate neutrally. H2 surfaced the consolidation candidates. H3 wrote the careful migration recipe. This article steps back from the neutrality and makes the contrarian case: in most Argos catalogs, one DataBlock per report is the right default. Not because consolidation is wrong — it is sometimes right — but because the costs of consolidation are systematically underestimated, and the benefits of 1:1 are systematically undersold. Here is the defense.

12 min read · argos · datablock · architecture · opinion · one-to-one · optimization · blast-radius

The hook

Track H opened with H1's neutral framing of the debate, H2's tooling to identify candidates, and H3's careful migration recipe. The honest thing to do — and the thing this wiki should do, since reasonable engineers disagree about this — is now to make the contrarian case explicitly. In most Argos catalogs, one DataBlock per report is the right default. Not always. Not because consolidation is wrong — it has real benefits, named fairly in H1. But because the costs of consolidation are systematically underestimated, the benefits of 1:1 are systematically undersold, and the engineering community tends to confuse "fewer files" with "less complexity." This article is the case for keeping the default where it is, and consolidating by exception rather than by impulse.

The everyday analogy

Walk into a hospital operating room before a major procedure and look at the surgical instrument tray. Twenty-seven instruments are laid out in precise parallel rows. Three sizes of scalpel. Four sizes of hemostat. Two shapes of forceps. Retractors. Suture holders. Probes. Each one rests in its labeled slot in the tray's contoured liner.

A surgeon's instrument tray laid out before an operation — twenty-seven specialized instruments arranged in neat rows, each one purpose-built, each one sharpened and maintained individually. No multitool in sight. The argument is older than software: precision work demands specialized tools.

No surgeon, in any operating room in the world, opens that tray and asks "could we reduce this to one multitool?" The question is absurd. The redundancy in the tray is not waste; it is the infrastructure that lets the surgeon do precision work without breaking concentration. A 9mm hemostat and a 7mm hemostat are different tools even though they both clamp. The surgeon picks the right one in less than a second because the right one is already on the tray, sharpened, ready, identifiable by shape.

Consolidating the tray would save shelf space and reduce the count of items to sterilize. It would also kill the patient.

The argument is older than software. Specialized tools beat multitools whenever the work is precision work and the stakes are real. Argos reports against Banner are precision work. The stakes — a CFO seeing wrong numbers, a federal report submitted with a calculation error, a board paper that quotes a stale figure — are real. The instinct to consolidate the catalog "for maintainability" is the surgical equivalent of pitching the tray in favor of a Leatherman. It looks like a cleanup. It is an ablation.

This does not mean every surgical instrument is irreplaceable. Some instruments on the tray are genuinely redundant and were added by convention rather than necessity. But the right discipline is to examine each instrument individually and remove the ones that have not been picked up in five years — not to dump the whole tray.

What it really is

The case for 1:1 rests on six engineering observations, each one familiar to anyone who has maintained both kinds of code:

Optimization in isolation is a real advantage, not a theoretical one. A 1:1 DataBlock can be tuned for its specific query pattern: a new index on the exact columns it filters by, a hint that forces Oracle's optimizer toward the right plan, a query rewrite that exploits the specific shape of its data. Every optimization is contained. Consolidated DataBlocks force every optimization to be a trade-off — an index that helps Consumer A may hurt Consumer B, a hint that fixes one report's slowness may introduce regression elsewhere. The tuning surface for a consolidated DataBlock is the union of all its consumers' performance constraints, and that union is rarely simultaneously satisfiable.

Blast radius matters more than catalog size. A bug in a 1:1 DataBlock affects one report. A bug in a consolidated DataBlock affects every consumer. The blast radius scales with how shared the artifact is. Small artifacts produce small incidents; large artifacts produce large ones. Pick which kind of incident you want to debug at 11 PM the night before a board meeting.

Cognitive load per DataBlock is the load that actually matters. Reading code is a much more frequent activity than writing it. A focused DataBlock — one job, one parameter set, one column contract — is read in two minutes and understood. A consolidated DataBlock that serves five reports must be understood as five SQL statements coexisting; the conditional branches, the discriminator filters, the column-presence-checks all add up. Complexity is not linear with feature count. It compounds.

Change management for consolidated DataBlocks is expensive in time the team often does not account for. Every change to a consolidated DataBlock requires re-testing every consumer. Every test requires the consumer's report owner to verify the output match. If a consolidated DataBlock has eight consumers, every proposed change becomes an eight-stakeholder coordination exercise. The team that promised "fewer DataBlocks means easier maintenance" is now scheduling weekly meetings to push through edits that would have been a one-hour change in the 1:1 world.

Catalog navigability is not the constraint people imagine. A catalog of 500 DataBlocks is searchable by tooling — the BSS schema search, the Argos designer's naming conventions, the similarity analysis from Finding Consolidation Candidates — Programmatic Similarity Across the Catalog. A catalog of 80 DataBlocks where each one has seven conditional branches is harder to navigate cognitively, because the navigation happens INSIDE each DataBlock instead of between them. Reducing the number of files makes the catalog smaller; it does not always make the system simpler.

Argos's feature gaps push computation INTO the DataBlock, and that computation does not consolidate cleanly. This is the argument from experience that anyone who has migrated reports from Crystal Reports to Argos knows in their bones. Crystal has Distinct Count as a native aggregation; Argos does not. Crystal has rich percentage-of-group calculations across multi-level groupings; Argos's crosstab support is much thinner. When a Crystal report needs DistinctCount({Student.PIDM}) inside a crosstab, the conversion to Argos has only one workable answer: compute the distinct count in the DataBlock SQL itself, deliver it as a column the crosstab can SUM-as-if-it-were-already-distinct. The same is true for percentage-of-group calculations that span two or three group levels — Crystal's PercentageOfSum is missing in Argos, so the percentage gets computed in the SQL with SUM() OVER (PARTITION BY ...) window functions tuned to the specific report's grouping. These workarounds are clean when the DataBlock serves ONE report's specific grouping. They become combinatorial when the same DataBlock has to pre-compute every distinct count and every percentage that every consumer might ever need. The 1:1 pattern lets you build the SQL for the crosstab you have. The consolidated pattern forces you to build for every crosstab you might ever build — a SQL surface that grows nonlinearly with each new consumer.

Left: five 1:1 DataBlocks, each small and focused, each optimization-tunable independently, each defect contained to one report. Right: one consolidated DataBlock with five branches via UNION ALL + discriminator — visually smaller (fewer files) but combinatorially more complex (every change must consider every consumer).

The argument is not that consolidation has no benefits — it does, real ones, named fairly in H1. The argument is that the default should be 1:1, with consolidation reserved for the cases where the benefits are demonstrable and the consumers can be coordinated without combinatorial cost. The H2 similarity tool surfaces candidates worth examining. The H3 migration recipe handles the cases where consolidation is the right call. But the bias of this article is that most candidates surfaced by H2, on close inspection, should NOT be consolidated — and that the "underrated" pattern in modern Argos shops is staying with what already works.

See it — the diagram

The diagram makes the same case visually that the analogy makes narratively.

On the left: five 1:1 DataBlocks. Each is small, focused, owned by one department, optimized for its specific query pattern. The five together produce more files than the consolidated alternative. They also produce a system where a change to Report A is a change to Report A — not a change with five-way blast radius.

On the right: one consolidated DataBlock with five branches. Visually smaller in the catalog. But every parameter slot has to serve five different consumers; every column has to be present even when only one report uses it; every WHERE predicate has to compose with every other; every change is a five-stakeholder coordination exercise. The "fewer files" win is real. The "simpler system" win is illusory.

Show me the code

Here is one of the pairs the H2 similarity tool surfaced: "FY End Salary Increase Comparison - Admin and Staff" and "FY End Salary Increase Comparison - FT Faculty". At a score of 1.000 they look like obvious consolidation candidates. The 1:1 versions:

-- DataBlock A — FY End Salary Increase, Admin and Staff
SELECT s.spriden_id, s.spriden_last_name,
       j.nbrjobs_salary, j.nbrjobs_eclass_code
FROM   nbrjobs j
JOIN   spriden s ON s.spriden_pidm = j.nbrjobs_pidm
                 AND s.spriden_change_ind IS NULL
                 AND s.spriden_entity_ind = 'P'
WHERE  j.nbrjobs_eclass_code IN ('AD', 'ST')
  AND  j.nbrjobs_effective_date = (SELECT MAX(...) ... );

-- DataBlock B — FY End Salary Increase, FT Faculty
SELECT s.spriden_id, s.spriden_last_name,
       j.nbrjobs_salary, j.nbrjobs_eclass_code
FROM   nbrjobs j
JOIN   spriden s ON s.spriden_pidm = j.nbrjobs_pidm
                 AND s.spriden_change_ind IS NULL
                 AND s.spriden_entity_ind = 'P'
WHERE  j.nbrjobs_eclass_code IN ('FF')
  AND  j.nbrjobs_effective_date = (SELECT MAX(...) ... );

Two DataBlocks. One-line filter difference. The consolidated alternative collapses them with a discriminator parameter:

-- Consolidated alternative — one DataBlock, two consumers
SELECT s.spriden_id, s.spriden_last_name,
       j.nbrjobs_salary, j.nbrjobs_eclass_code,
       :main_DD_audience AS audience
FROM   nbrjobs j
JOIN   spriden s ON s.spriden_pidm = j.nbrjobs_pidm
                 AND s.spriden_change_ind IS NULL
                 AND s.spriden_entity_ind = 'P'
WHERE  ( (:main_DD_audience = 'AdminStaff'
            AND j.nbrjobs_eclass_code IN ('AD', 'ST'))
      OR (:main_DD_audience = 'FTFaculty'
            AND j.nbrjobs_eclass_code IN ('FF')) )
  AND  j.nbrjobs_effective_date = (SELECT MAX(...) ... );

The consolidated version saves one DataBlock. The cost: the WHERE clause now branches on an audience parameter. The cost gets worse when a third audience appears ("Part-time faculty"), then a fourth ("Department chairs only"), then a fifth ("Faculty above a salary threshold"). The 1:1 alternative remains five small DataBlocks; the consolidated alternative becomes a CASE-ridden monster.

The two-DataBlock case at 1.000 similarity looks like an obvious consolidation candidate. On closer inspection, it is two reports with two different audiences (HR for Admin/Staff, the Provost for Faculty) and two different governance owners. Consolidating them creates a coordination problem (any change to the consolidated DataBlock requires both HR and Provost sign-off) that did not exist before. The "obvious" consolidation is exactly the case where 1:1 wins.

The Argos-feature-gap case in code. Consider an enrollment crosstab — Distinct Students by College (rows) by Term (columns). In Crystal this is a one-line formula: DistinctCount({SFRSTCR.PIDM}) placed in the crosstab cell. Argos has no equivalent. The only way to deliver the number is to pre-compute it in the DataBlock SQL with a column the crosstab can SUM:

-- 1:1 DataBlock for this specific crosstab.
-- The DISTINCT count is pre-computed via ROW_NUMBER + grouping;
-- the crosstab SUMs the resulting flag.
SELECT  c.college_code,
        t.term_code,
        CASE WHEN ROW_NUMBER() OVER (
                 PARTITION BY r.sfrstcr_pidm,
                              c.college_code,
                              t.term_code
                 ORDER BY r.sfrstcr_pidm
             ) = 1 THEN 1 ELSE 0
        END AS distinct_student_flag
FROM    sfrstcr r
JOIN    sgbstdn s ON ...        -- college lookup (effective dated)
JOIN    stvterm t ON ...
JOIN    stvcoll c ON ...;
-- Crosstab in Argos: SUM(distinct_student_flag) by college x term
-- gives the distinct student count for each cell.

The SQL is tight because it pre-computes the distinct-count flag for EXACTLY the (college × term) partition the crosstab needs. Add a second consumer that wants the same numbers by (major × term), and the consolidated DataBlock now has to emit TWO flag columns with TWO different PARTITION BY clauses. Add a third by (advisor × term), and it is three. Add a fourth that wants percentages-of-college-total, and the SQL now needs SUM() OVER (PARTITION BY college_code) AS college_total plus the division. By the fifth consumer the DataBlock has eight window functions, six PARTITION BY variants, and an audit trail that nobody reads. The 1:1 alternative is five DataBlocks, each with ONE clean window function tuned to its crosstab. The 1:1 version is bigger in file count and smaller in cognitive load — exactly the trade the consolidated pattern misjudges in its favor.
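The flag trick is easy to check outside the database. The sketch below mimics the ROW_NUMBER() = 1 flag on a handful of invented registration rows and confirms that summing the flag per (college, term) cell gives the distinct student count. The row layout and function names are assumptions for illustration.

```python
from collections import defaultdict

# Invented sample: one row per course registration (pidm, college_code, term_code).
rows = [
    (1001, "AS", "202610"),
    (1001, "AS", "202610"),   # same student, second course in the same cell
    (1002, "AS", "202610"),
    (1001, "AS", "202620"),
    (1003, "BU", "202610"),
]

def flag_rows(rows):
    """Flag the first row per (pidm, college, term), like ROW_NUMBER() = 1."""
    seen = set()
    flagged = []
    for pidm, college, term in rows:
        key = (pidm, college, term)
        flagged.append((college, term, 0 if key in seen else 1))
        seen.add(key)
    return flagged

def crosstab_sum(flagged):
    """What the crosstab does: SUM(distinct_student_flag) per cell."""
    cells = defaultdict(int)
    for college, term, flag in flagged:
        cells[(college, term)] += flag
    return dict(cells)

cell_counts = crosstab_sum(flag_rows(rows))
```

Student 1001 has two registrations in the (AS, 202610) cell but is counted once there. Change the partition to (major × term) and the flag has to be recomputed, which is why each new grouping costs the consolidated DataBlock another flag column.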

The same dynamic applies to percentage-of-group calculations across multi-level groupings. Crystal computes them in the report layout; Argos computes them in the DataBlock SQL. The SQL for "percentage of college-total within term" needs one specific SUM() OVER shape. The SQL for "percentage of term-total within college" needs a different one. Consolidating multiple percentage reports into one DataBlock means carrying every possible SUM() OVER variant in the column set — clean per consumer becomes muddled in the union.
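A small worked example shows why the two percentage shapes cannot share one column. The counts are invented, and `pct_of_partition` is a hypothetical helper standing in for SUM() OVER (PARTITION BY ...).

```python
from collections import defaultdict

# Invented headcounts keyed by (college_code, term_code).
counts = {
    ("AS", "202610"): 300, ("BU", "202610"): 100,
    ("AS", "202620"): 100, ("BU", "202620"): 100,
}

def pct_of_partition(counts, key_index):
    """Percentage of each cell within its partition total.

    key_index 0 partitions by college_code, 1 partitions by term_code.
    """
    totals = defaultdict(int)
    for cell, n in counts.items():
        totals[cell[key_index]] += n
    return {cell: 100.0 * n / totals[cell[key_index]] for cell, n in counts.items()}

by_term = pct_of_partition(counts, key_index=1)      # PARTITION BY term_code
by_college = pct_of_partition(counts, key_index=0)   # PARTITION BY college_code
```

The same (AS, 202620) cell is 50% of its term total and 25% of its college total. Both numbers are correct, and each needs its own window function and its own output column.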

Where intuition fails

Five common pro-consolidation arguments, and the honest response to each:

"Drift between similar reports is a real risk." Yes, it is. Two reports computing "headcount" differently is a governance failure waiting to happen. But the right fix is not consolidating the DataBlocks — it is establishing a shared SQL view (v_active_employee_count) that BOTH 1:1 DataBlocks reference. The view encapsulates the definition; the DataBlocks remain specialized to their consumers. You get the consistency benefit without the blast-radius cost.

"Bug fixes propagate automatically when consolidated." Also true. So does the bug. A defect introduced into a consolidated DataBlock affects every consumer instantly. A defect in a 1:1 DataBlock affects one. The "propagation" benefit cuts both ways and is often unflattering to the consolidated pattern on the unhappy-path side.

"The catalog is too big to navigate." A catalog the team cannot navigate is usually under-tooled, not over-populated. The fix is naming conventions, schema search (BSS does this), ownership tags, and periodic audits — not consolidation. Hiding complexity inside a smaller number of larger DataBlocks does not reduce it; it just makes it less findable.

"Consolidation forces architectural thinking." This is true and is the strongest consolidation argument. Building a consolidated DataBlock requires deciding "what is the right column set, what are the right parameters, what are the right output contracts" — questions that 1:1 lets you dodge. But the same architectural thinking can be applied to 1:1 DataBlocks by establishing TEMPLATES and STANDARDS for new ones, without forcing the run-time consolidation.

"This is what 'modern' shops do." Modern shops also often have bigger budgets, larger teams, and the discipline to do consolidation right. The teams that adopt consolidation on a smaller team without proportional governance investment end up with monolithic DataBlocks that nobody dares change. The right question is not "what do modern shops do" — it is "what discipline can our team sustain over five years."

The one-sentence takeaway

Keep 1:1 as the default; consolidate by exception, not by impulse. The catalog growing linearly with reports is not the problem people think it is. The problem people don't anticipate is the complexity that grows nonlinearly when one DataBlock has to serve every possible report.

Track H · DataBlock architecture & engineering decisions

Running Argos Similarity v2.3 — the operational guide

11 min read · argos · datablock · consolidation · similarity · tooling · guide · tf-idf · clustering · cost-score · calc-fields

Contents

Goal
Before you start
Run it
The recommended workflow
What you'll see in the CSV — a real row
Reading the columns
What each cost band actually means
Gotchas
A worked example
When NOT to consolidate
The one-sentence takeaway

Goal

A person at a Kanban board labeled 'Argos Consolidation' with three columns: 'Cheap wins' (green stickies tagged 'low'), 'Maybe' (yellow stickies tagged 'medium'), 'Do not touch' (red stickies tagged 'high'). The image conveys the cost-band workflow: sort low to high, act.

You have an Argos catalog with hundreds of DataBlocks. You suspect there are duplicates and near-duplicates accumulated over years of "just copy this one and change the WHERE clause." You want a ranked, ACTIONABLE list of consolidation candidates — a workflow you can run on a Monday and finish a backlog by Friday, not a soup of similarity scores.

Finding Consolidation Candidates — Programmatic Similarity Across the Catalog explains the architecture. THIS article is the operational guide: what to run, the workflow that turns the output into action, every column explained, and the limitations the tool will not warn you about. When you finish you should have: an orphan list ready to retire, a short list of cost_band = low clusters as your cheap-wins backlog, a medium-cost backlog for judgment calls, a clear "do not touch" set with reasons, and a re-run cadence so duplicates do not accumulate again.

Before you start

What must already exist:

A fresh Argos export at argos_tool/ArgosDoc/ai_data/*.json. The tool reads ONLY this directory; it never touches the live catalog or the database.
The JSON must include reports, visible_fields, and calculated_fields keys. The 2026 export format has these; older formats may leave v2 cost signals empty.
Python 3.x, standard library only. No external dependencies.

You do NOT need: database access, Argos running, or any network connection.

A first-run sanity check:

python wiki/src/argos_similarity.py --self-check

Runs internal property tests (jaccard math, containment, TF-IDF rare-vs-common, union-find clustering, calc-field reverse-index, clause split respecting parens, cost band thresholds). Under a second; prints [OK] argos_similarity v2.3 self-check passed.
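The first two properties in that list can be illustrated with a short sketch. These functions are not the tool's source: the names jaccard and containment, and the choice to measure containment against the smaller set, are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Shared items divided by all distinct items across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Shared items divided by the size of the smaller set (assumed definition)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


wide = {"SPRIDEN", "SGBSTDN", "SFRSTCR", "STVTERM"}
narrow = {"SPRIDEN", "SGBSTDN"}
print(jaccard(wide, narrow), containment(wide, narrow))
```

Here the Jaccard score is only 0.5 while containment is 1.0, which is the signature of one table set sitting entirely inside the other.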

Run it

python wiki/src/argos_similarity.py

Default: --threshold 0.50, --top 200. Right starting point.

Expected stdout (your numbers will differ):

Loading DataBlocks from .../argos_tool/ArgosDoc/ai_data ...
  loaded 670 DataBlocks
  orphan DataBlocks (zero consumers): 6
  consumer distribution: max 11, mean 1.6
  calc_fields: 465 across 220 DataBlocks (288 total frozen-column references)
Computing TF-IDF over SQL token corpus ...
Scoring pairs (threshold=0.5, min_tables=2) ...
  found 728 pairs above threshold
Building transitive clusters ...
  formed 73 clusters of size >= 2
  pair cost bands:    high=146, low=48, medium=534
  cluster cost bands: high=20, low=8, medium=45
  pairs CSV    -> .../briefs/argos_similarity_candidates.csv
  clusters CSV -> .../briefs/argos_similarity_clusters.csv
  summary MD   -> .../briefs/argos_similarity_summary.md

Three files in wiki/briefs/:

**argos_similarity_candidates.csv** — one row per pair above threshold. 30 columns. Sortable.
**argos_similarity_clusters.csv** — one row per transitive cluster of size ≥ 2. 14 columns. Start here.
**argos_similarity_summary.md** — human-readable narrative: orphan list, top-20 clusters, top-200 pairs, suggested workflow, caveats.

For higher confidence on a second pass: --threshold 0.70. Drops pair count ~60–70% and breaks hub-formed mega-clusters apart.

The recommended workflow

The 6-step workflow as a vertical pipeline. Orphans (step 1) and low-cost clusters (step 2) are coral — cheap wins. Medium (step 4) is amber — judgment calls. High-cost (step 5) is dark amber — usually do not touch. The mega-cluster trap (step 6) calls for raising the threshold and re-running.

Six steps in order. Finish each before moving to the next.

Step 1 — Orphans first (free wins)

Summary MD has a "Bonus: orphan DataBlocks" section. Each name there has ZERO Report consumers in the export.

Action: confirm in the live Argos catalog (the export can lag), then retire. No alias work, no calc porting, no SQL reading. Just delete.

Step 2 — Low-cost clusters

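Open argos_similarity_clusters.csv. Sort by cost_band in band order (low, then medium, then high), then by max_pair_cost ascending within the band. A plain alphabetical sort on cost_band puts high ahead of low, so use a custom sort order.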

Low-cost clusters are size 2 or 3, tightly bound, few consumers, no calc orphans, no filter divergence. The classic "we kept copying this DataBlock and renaming it" cases.

Per cluster: read the .dbk of each member, confirm the SQL bodies are nearly identical, pick the keeper (newest / cleanest), repoint each retiree's Report consumers, run each repointed Report once pre- and post-merge, archive the retiree. One sprint usually retires 10–20 DataBlocks with no Report breakage.
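The step 2 sort can be sketched with the standard csv module. The column names cluster_id, cost_band, and max_pair_cost come from the cluster CSV; the sample rows are invented so the sketch runs without the real file.

```python
import csv
import io

# Band order is explicit: a plain alphabetical sort would put "high" first.
BAND_ORDER = {"low": 0, "medium": 1, "high": 2}

# Invented sample rows; a real run would open argos_similarity_clusters.csv.
sample = io.StringIO(
    "cluster_id,cost_band,max_pair_cost\n"
    "4,high,31.0\n"
    "13,low,2.05\n"
    "9,medium,7.5\n"
    "2,low,1.2\n"
)

rows = list(csv.DictReader(sample))
rows.sort(key=lambda r: (BAND_ORDER[r["cost_band"]], float(r["max_pair_cost"])))
ordered_ids = [r["cluster_id"] for r in rows]
print(ordered_ids)
```

The cheapest low-band cluster comes out first, and the high band lands at the bottom where it belongs.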

Step 3 — Low-cost pairs not already in low clusters

Open argos_similarity_candidates.csv. Sort by cost_band in band order (low first; an alphabetical sort puts high first). Skip rows whose cluster_id you already handled in step 2.

The remaining cost_band = low pairs are individual safe merges that happen to live inside otherwise medium clusters. The subsumption column is your friend: rows tagged "A superset of B" with alias_b_only = 0 AND b_calc_at_risk = 0 are the safest — B is a strict subset of A, so consumers can repoint without renames or calc-field porting.
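The "safest rows" rule above translates directly into a filter. This is a sketch under assumptions: the column names and the tag text "A superset of B" come from the pair CSV as described here, while the sample rows and the helper name is_safest are invented.

```python
import csv
import io

# Invented sample rows; a real run would open argos_similarity_candidates.csv.
sample = io.StringIO(
    "datablock_a,datablock_b,cost_band,subsumption,alias_b_only,b_calc_at_risk\n"
    "Roster Full,Roster Lite,low,A superset of B,0,0\n"
    "Payroll A,Payroll B,low,A superset of B,3,0\n"
    "Faculty A,Faculty B,low,equivalent tables (13),0,0\n"
    "Ledger A,Ledger B,medium,A superset of B,0,0\n"
)


def is_safest(row: dict) -> bool:
    """Low band, B inside A, nothing on B's side to rename or port."""
    return (
        row["cost_band"] == "low"
        and row["subsumption"] == "A superset of B"
        and int(row["alias_b_only"]) == 0
        and int(row["b_calc_at_risk"]) == 0
    )


safest = [
    (r["datablock_a"], r["datablock_b"])
    for r in csv.DictReader(sample)
    if is_safest(r)
]
print(safest)
```

Only the first sample pair survives: the second has B-only aliases to rename, the third is not a subsumption case, and the fourth is outside the low band.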

Step 4 — Medium-cost: judgment call

Real consolidation territory. Read the SQL of both DataBlocks side by side. The cost is medium because at least one of: 5–15 Reports affected, 5–15 aliases need harmonizing, filters diverge (consolidation requires pushing logic to Report side via discriminator — see Shared DataBlocks — One SQL, Many Reports), or a handful of calc fields need expression migration.

Pick the candidates that align with current work; defer the rest to backlog.

Step 5 — High-cost: usually do not touch

The tool already says this is expensive. Business value of consolidation rarely beats the cost. When you DO need to touch one, plan it as a project: parallel testing, owner engagement, days of work. See Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone.

Step 6 — The mega-cluster trap

If clusters.csv shows a cluster of 50+ DataBlocks with alias_core = 0, that is NOT a refactor. The union-find is chaining pairwise links through a hub table (usually SPRIDEN or SGBSTDN). Re-run with --threshold 0.70; the hub-only links drop out and the mega-cluster fragments into the real candidates.

If it persists at 0.85+, your catalog has genuine convergence around those tables — a project, not hygiene.
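The chaining effect is easy to reproduce with a toy union-find. This sketch is not the tool's implementation; the function name clusters and the scores are invented to show how weak links through a hub merge unrelated groups at 0.50 and drop out at 0.70.

```python
def clusters(pairs, threshold):
    """Group names transitively: any link at or above threshold joins two groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    multi = [sorted(g) for g in groups.values() if len(g) >= 2]
    return sorted(multi, key=lambda g: g[0])


# Invented scores: HUB shares only a hub table with the others (weak links),
# while A1/A2 and B1/B2 are real near-duplicates (strong links).
pairs = [
    ("A1", "A2", 0.95), ("B1", "B2", 0.91),
    ("A1", "HUB", 0.55), ("B1", "HUB", 0.52),
]
print(clusters(pairs, 0.50))
print(clusters(pairs, 0.70))
```

At 0.50 all five names fall into one cluster even though A and B never link directly; at 0.70 the hub links disappear and two real pairs remain.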

What you'll see in the CSV — a real row

Before listing every column, here is one real (sanitized) row so the shape is concrete. This is the kind of pair the workflow's step 2 catches first:

rank          | 7
score         | 0.996
cost_band     | low
merge_cost    | 2.05
cluster_id    | 13
datablock_a   | Faculty Type-A Report
datablock_b   | Faculty Type-B Report
sql_sim       | 0.99
select_sim    | 1.00
where_sim     | 0.98
order_sim     | 1.00
table_jaccard | 1.00      table_cont    | 1.00
field_jaccard | 1.00      field_cont    | 1.00
param_jaccard | 1.00
a_tables      | 13        b_tables      | 13      common_tables | 13
a_consumers   | 1         b_consumers   | 1       total_consumers | 2
alias_shared  | 27        alias_a_only  | 0       alias_b_only  | 0
a_calc_fields | 2         b_calc_fields | 2
a_calc_at_risk| 0         b_calc_at_risk| 0
subsumption   | equivalent tables (13)

How to read this row: near-identical SQL shape (sql_sim 0.99), identical SELECT/ORDER (1.00), tiny WHERE difference (0.98 — the type-code filter). Same source tables, same alias contract (no renames needed), same calc fields with all dependencies on both sides. Only 2 Reports affected total. cost_band = low, merge_cost = 2.05 — below the low-band threshold of 5.

Action: confirm with both Report owners that one consolidated DataBlock with a type_code parameter (or a discriminator column per Shared DataBlocks — One SQL, Many Reports) would serve both, then retire one. The pair-level cost analysis says this is safe; the SQL read confirms it; the Report-owner conversation closes it.

Reading the columns

Pair CSV — `argos_similarity_candidates.csv`

Column	Meaning
`rank`, `score`	Position in score-sorted list; weighted similarity 0..1
`cluster_id`	Transitive cluster this pair belongs to
`merge_cost`, `cost_band`	Combined cost score; `low` / `medium` / `high`
`datablock_a`, `datablock_b`	The two DataBlock names
`sql_sim`	TF-IDF cosine on full SQL token bag
`select_sim`, `where_sim`, `order_sim`	Clause-split sub-similarities. High select + low where = same shape, different population
`table_jaccard`, `table_cont`	Jaccard + containment on source tables (containment ≈ 1.0 = subset)
`field_jaccard`, `field_cont`	Same on output fields
`param_jaccard`	Jaccard on Argos parameter names
`a_tables`, `b_tables`, `common_tables`	Source-table counts
`a_consumers`, `b_consumers`, `total_consumers`	Argos Reports per side + union = blast radius
`alias_shared`, `alias_a_only`, `alias_b_only`	Visible-field overlap + exclusives
`a_calc_fields`, `b_calc_fields`	Calc field counts
`a_calc_at_risk`, `b_calc_at_risk`	Frozen columns present ONLY on that side; non-zero = dropping that side orphans calc fields
`subsumption`	Tag: `equivalent tables (N)` or `A superset of B`

Cluster CSV — `argos_similarity_clusters.csv`

Column	Meaning
`cluster_id`, `size`, `pair_count`	Identity and shape
`avg_score`, `max_score`	Pair score stats within cluster
`total_consumers`	UNION across all members — true blast radius for any merge in this cluster
`alias_universe`	Distinct aliases across members (harmonization surface)
`alias_core`	Aliases EVERY member has (the safe shared contract)
`total_calc_fields`	Sum of calc field counts
`frozen_universe`	Distinct frozen columns across members
`avg_pair_cost`, `max_pair_cost`	Pair cost stats
`cost_band`	Band based on `max_pair_cost` — one bad pair can push the whole cluster to high
`datablocks`	Pipe-separated member names

What each cost band actually means

The cost formula and the three bands. LOW (< 5) are cheap consolidations. MEDIUM (5–19) need a judgment call. HIGH (≥ 20) usually do not touch. Examples in each band are sanitized pairs from a real run.

`low` (cost < 5)

Cheap. Few Reports affected (typically ≤ 3), no calc-field constraint, no filter divergence beyond cosmetic. Usually: obvious twin DataBlocks, version pairs (Report (Rev YYYYMMDD)), or subsumption cases. Action: confirm with the Report owner; retire the redundant one.

`medium` (5 ≤ cost < 20)

Judgment call. At least one of: 5–15 Reports, meaningful alias delta, filter divergence (consolidation needs Report-side filtering), or a handful of calc dependencies. Action: read the SQL of both sides. Often the answer is "later, when one is being touched anyway."

`high` (cost ≥ 20)

Usually do not touch. Many Reports, many aliases, calc dependencies that would orphan logic, or sharp filter divergence. Action: leave it. If business reasons force consolidation, run Safe Consolidation Migration — How to Merge N DataBlocks into One Without Breaking Anyone as a full project.
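The band boundaries above, plus the cluster rule that the band follows max_pair_cost, fit in a few lines. The function names cost_band and cluster_band are illustrative and not taken from the tool's source; only the thresholds 5 and 20 come from the text.

```python
def cost_band(cost: float) -> str:
    """Map a merge cost to its band using the thresholds stated above."""
    if cost < 5:
        return "low"
    if cost < 20:
        return "medium"
    return "high"


def cluster_band(pair_costs: list) -> str:
    """A cluster takes the band of its most expensive pair."""
    return cost_band(max(pair_costs))


print(cost_band(2.05), cost_band(5), cost_band(19.9), cost_band(20))
print(cluster_band([1.2, 2.05, 23.0]))
```

The second call shows why one bad pair matters: two cheap pairs and one expensive pair still make a high-band cluster.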

Gotchas

Six things the tool cannot tell you. Read these before acting.

1. The tool reads the EXPORT, not the live catalog

A "zero-consumer orphan" may have been wired into a Report yesterday. Re-export before acting on the orphan list. Minutes vs. broken Reports.

2. Calc-field index covers DataBlock-side calc fields ONLY

Reports ALSO have their own calc fields, conditional print sections, groupings, and sort orders — and all bind to DataBlock column names by string. None is in the JSON. A column the script flags "safe to rename" (in alias_shared, not referenced by any DataBlock calc) may still be wired into a Report's PRINT WHEN expression or a group break. Zero calc risk on both sides (a_calc_at_risk = 0 and b_calc_at_risk = 0) is necessary, not sufficient.

3. TF-IDF is corpus-dependent

Add 50 new DataBlocks and every token's IDF shifts. Do not compare scores across runs of different exports. Only relative ranking within one run is stable.
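A small sketch of why scores drift between exports. The smoothed IDF formula used here is one common variant and is an assumption; the tool's exact weighting is not stated in this guide, and the token sets are invented.

```python
import math


def idf(token: str, corpus: list) -> float:
    """Smoothed inverse document frequency: rarer tokens score higher."""
    docs_with_token = sum(1 for doc in corpus if token in doc)
    return math.log((1 + len(corpus)) / (1 + docs_with_token)) + 1


# Each DataBlock is reduced to a set of SQL tokens (invented examples).
export_q1 = [{"spriden", "sgbstdn"}, {"spriden", "nbbposn"}, {"spriden", "sfrstcr"}]
# A later export adds DataBlocks that also use nbbposn.
export_q2 = export_q1 + [{"nbbposn", "gobeacc"}, {"nbbposn", "stvterm"}]

before = idf("nbbposn", export_q1)
after = idf("nbbposn", export_q2)
print(round(before, 3), round(after, 3))
```

The same token weighs less in the second export because it became more common, so any pair that matched on it scores lower even though neither DataBlock changed.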

4. Cost band thresholds are heuristic

The formula weights calc_at_risk × 2.5 and filter_div × 5 because at Waubonsee those are the catastrophic failure modes (silent loss of business logic, re-architecting Reports). Another shop may want to retune compute_pair_cost in the source.

5. Mega-clusters via hub tables

A cluster of 50+ DataBlocks with alias_core = 0 and hundreds of consumers is the union-find chaining links through SPRIDEN / SGBSTDN / STVTERM. NOT a consolidation conversation. Raise --threshold to 0.70 or 0.80 and re-run. If it persists at 0.85+, the convergence is structural — bigger project than this tool.

6. Re-run cadence

Quarterly: re-export + re-run; work through new low-cost clusters. Duplicates accumulate one Report at a time. Annually: threshold sweep (0.30, 0.50, 0.70, 0.85) — track pair counts as a "duplicate accumulation" health metric. A rising count signals copy-and-modify replacing Shared DataBlocks — One SQL, Many Reports consolidation.

A worked example

Numbers from a real production run on a Waubonsee-shaped catalog (~670 DataBlocks, names sanitized):

6 orphans found and verified — retired in week 1, zero Report breakage
8 low-cost clusters worked through in weeks 2–3 → ~15 redundant DataBlocks retired
48 low-cost pairs worked through over the following month
20 high-cost clusters documented in a "do-not-touch backlog" with reasons
One mega-cluster of 121 DataBlocks at threshold 0.50; fragmented cleanly at threshold 0.70 into 6 real consolidation conversations

Patterns seen most often in the low band:

Type-split — same report shape duplicated for two employee classes (adjunct vs. full-time, FT vs. PT). Same SQL, same tables, one WHERE filter different. The textbook Shared DataBlocks — One SQL, Many Reports candidate with a discriminator column.
Old vs. revised — Report (Rev YYYYMMDD) annotation on the newer one, original left in place "in case someone still uses it." total_consumers is usually 0 or 1 on the old; retire it.
Domain trio — three DataBlocks for two business domains plus a base version of the same kind of report, differing only by a domain filter. Three-way consolidation with a domain discriminator.

When NOT to consolidate

Even when cost is low, governance can argue against merging. See When 1:1 Wins — The Case for One DataBlock Per Report. Common cases: different department owners want separate permissions; audit trail per flavor is a compliance requirement; planned divergence (one DataBlock is about to add new columns) makes the merge premature; different release cadences would couple unrelated consumers.

The tool gives architectural opportunity. "Should we?" lives with the Report owners.

The one-sentence takeaway

Run the tool, read the orphan list, then sort clusters by cost_band ascending and work through low first; the cost score is a heuristic that captures the four expensive failure modes (Reports affected, alias renames needed, calc-field dependencies orphaned, filter divergence pushing logic to the Report side) but cannot see Report-side calc fields, conditional print sections, or groupings — so always read the SQL of every candidate before consolidating, and re-run after every Argos export.

Track I · Beyond direct SQL — Ethos & the integration layer

What Ethos actually is — one stack, three products, one spec, two brand names

Ellucian renamed Ethos to 'Ellucian Platform' in 2026 — but the airport still lands the same planes through the same gates.

7 min read · ethos · integration · eedm · hedm · ellucian-platform

The hook

If you Google "Ellucian Ethos" today, the page that loads is titled "Ellucian Platform." You haven't been redirected to a different product — Ellucian renamed Ethos this year without changing what it actually does. Underneath the new sign, the airport still lands the same planes through the same gates with the same customs procedures. This article is the map of that airport, for someone who has spent a career inside Banner and never been outside.

The everyday analogy

Think of Banner as a country. It has its own internal road system, its own ID numbering (PIDM), and its own language (Oracle SQL with 7-letter table names and a security model built around GURACLS and GOBEACC). For decades, the only way in or out was a private loading dock — you brought your own truck, you knew the internal codes, you drove out with whatever you came for.

Then Ellucian built an international airport at the edge of the country. The airport has:

Customs and immigration desks — who is allowed in or out, and as whom.
A cargo terminal — what shape can move through, in what container format, with what schedule.
A dashboard on the wall — analytics on today's traffic.

The airport doesn't replace the country's roads. The roads still work; you can still drive your own truck through the private loading dock if you're inside the city. But foreign trucks can now arrive at the airport with standardized containers, hand a passport to immigration, and have their cargo delivered to a Banner street address. They never need to learn the country's internal language or numbering.

In 2026, the airport's sign changed from "Ethos International" to "Ellucian Platform Airport." The runways, the gates, the customs procedures, the cargo specs — none of it changed. The sign on the front of the terminal is just newer.

Banner is the country. Ethos is the international airport at its edge — built so foreign trucks can deliver standardized cargo without learning the country's internal road grid.

What it really is

The airport in the analogy is a stack of three products plus one specification beneath them.

Ethos Integration is the cargo terminal — the integration platform-as-a-service where data physically moves in and out of Banner. It is hosted at integrate.elluciancloud.com (US, with regional .ca, .ie, .com.au variants). It exposes REST endpoints under /api/ (canonical resources) and /qapi/ (Banner-specific proxies). Reads AND writes — Ethos pushes data into Banner as well as pulling it out.

Ethos Identity is customs and immigration — the federation broker that handles authentication, single sign-on, and protocol translation. It is built on a curated subset of WSO2 Identity Server and bridges CAS, SAML, WS-Trust/WS-Federation, and OpenID Connect. It is not a from-scratch OAuth authorization server; it is the translator that lets a SAML-only campus and an OIDC-only app talk to each other.

Ethos Data is the dashboard on the wall — Ellucian's analytics warehouse offering on a managed data lake. Separate licensed product, separate conversation. We won't go deep on it here; it isn't what most integration discussions are about.

EEDM (Ellucian Ethos Data Model, formerly HEDM — Higher Education Data Model) is the shape of the standardized cargo container. EEDM is not a product. It is a specification — a set of JSON schemas plus rules that define how Banner data is exposed via REST. Every Ethos Integration endpoint speaks EEDM. You can buy Ethos Integration; you can't buy EEDM.

Three products sit on one spec, on top of Banner. EEDM is the cargo-container standard; the three products are the terminal, customs, and dashboard.

See it — the diagram

The taxonomy diagram is the cleanest way to hold the stack in your head: Banner at the bottom (the country), EEDM as the middle layer (the cargo container spec — the data contract), and the three products on top (Integration is the terminal, Identity is customs, Data is the dashboard). The arrows between Integration and Banner are bidirectional — Ethos reads from and writes to Banner — but every byte that crosses that arrow is shaped by the EEDM spec.

Show me the code

A taxonomy article has no code to show. What it has instead is the rebrand evidence, what changed in 2026 and what did not:

Before 2026 — searching "Ellucian Ethos" landed at:
  ellucian.com/solutions/ellucian-ethos
    → titled "Ellucian Ethos Platform"
    → three products: Ethos Integration, Ethos Identity, Ethos Data
    → underlying data spec: HEDM (Higher Education Data Model)
    → /ethos-connected-partners was a live page

After the 2026 rebrand — the same URL serves:
  ellucian.com/solutions/ellucian-ethos
    → titled "Ellucian Platform"
    → four pillars: Central Workspace · Reporting & Analytics ·
                    Business Process Automation · Low-Code Integrations
    → no mention of "Ethos" or "HEDM" on the marketing page
    → /ethos-connected-partners now returns HTTP 404
    → the legacy Ethos Platform solution-sheet PDF also returns 404

But underneath — unchanged:
  developer.ellucian.com           → still "Ethos Integration", "EEDM"
  github.com/ellucianEthos         → active org, Postman + Bruno SDK repos
  github.com/ellucian-developer    → active, Integration SDK in Java + C#
  training.ellucian.com            → "Introduction to Ellucian Ethos Platform"
                                     and "Describing GUIDs for Banner Ethos
                                     Data Model" courses still listed
  Third-party docs (Coursedog,     → all say "EEDM (formerly HEDM)"
   Ad Astra, Tray, ProcessMaker)
  Banner-side proxy endpoints      → still /qapi/transfer-course-articulation,
                                     /api/persons, /api/courses, …
  Auth flow                        → still POST API key to
                                     integrate.elluciancloud.com/auth → JWT
  Media type                       → still application/
                                     vnd.hedtech.integration.v{N}+json

The technical surface is unchanged. The marketing wrapper is "Ellucian Platform" with four pillars. When you talk to a vendor sales rep, use the new brand. When you read SDKs or call APIs, use the old technical names.

Pre-2026 the front door said 'Ellucian Ethos' with three products. As of 2026 it says 'Ellucian Platform' with four pillars. The developer portal, SDKs, and APIs underneath still say Ethos.

Where intuition fails

1. "Ethos" sounds like a single product. It isn't. It is a stack of three distinct licensed products (Integration, Identity, Data) that customers buy separately, plus one specification beneath them. A college might license Ethos Integration but not Ethos Identity (because they already have Shibboleth or Okta), or vice versa.

And in 2026 Ellucian also markets Ellucian Data Connect as a SaaS-oriented integration product with low-code APIs and serverless pipelines. Its exact relationship to Ethos Integration is not fully clear from public documentation — some references treat it as a successor or complement for SaaS deployments, others as an evolution of Ethos Integration. When in doubt, call it "Ellucian Data Connect" by name rather than collapse it into "Ethos." See When Ethos, when SQL — the decision frame for the next 3-5 years for how this affects the SQL-vs-Ethos decision frame.

2. EEDM is not Ethos. EEDM is the data model specification; Ethos Integration is the platform that serves that model. Conceptually you could implement an EEDM endpoint in your own infrastructure if you wanted to — though almost no one does, because Ethos Integration already does it for Banner and Colleague.

3. The 2026 rebrand confuses everyone. If you say "we use Ethos" in 2026, a vendor sales rep may correct you with "Ellucian Platform." If you say "we use the Ellucian Platform," a developer may ask "you mean Ethos Integration?" Both are correct. Use the new brand with executives, the old technical names with developers and integrators.

4. "Ellucian Platform" has four pillars; "Ethos" had three products. The numbers do not reconcile cleanly. The new pillars (Central Workspace, Reporting & Analytics, Business Process Automation, Low-Code Integrations) are higher-level marketing categories. The old products map roughly: Integration → Low-Code Integrations + Business Process Automation; Identity → folded into the platform plumbing; Data → Reporting & Analytics. But "Central Workspace" (an evolution of Ellucian Experience as a portal) is a new pillar that wasn't in the old Ethos triad.

5. HEDM is not retired — it is renamed. Some old presentations and docs still say "HEDM." Every new doc says "EEDM (formerly HEDM)." The model itself has advanced in major-version bumps (v6, v8, v12, v16+) and continues to version forward per resource — Ellucian did not start over at v1 when the spec was renamed.

The one-sentence takeaway

Ethos is a stack of three products (Integration, Identity, Data) plus one spec (EEDM) — the marketing name on the front door changed in 2026, the technical surface beneath it did not.

Track I · Beyond direct SQL — Ethos & the integration layer

EEDM REST mechanics — passport, boarding pass, version-pinned gate

Your passport never goes through the gate. You exchange it once at security for a boarding pass that expires in five minutes — and re-exchange whenever it does.

7 min read · ethos · eedm · rest · oauth · jwt · hedm · integration

The hook

REST mechanics for Ethos are not RFC 6749 OAuth, are not RESTful in the strict Roy-Fielding sense, and are not what you'll guess from reading a generic OpenAPI tutorial. They are a small, specific, opinionated set of rules that you must learn once and follow exactly. This article is that set of rules — auth, version, endpoint shape, pagination — at the level of detail your first integration needs.

The everyday analogy

You travel internationally with a passport. You don't walk it up to the gate; you exchange it once at the security counter for a boarding pass. The boarding pass is short-lived, single-flight, and expires fast — when the gate closes, it's worthless. Whenever you fly again, you go back to the counter and trade your passport for another boarding pass.

Ethos works the same way. Your API key is the passport — long-lived, issued once by Ellucian Customer Center (or the Ethos Integration portal), kept secret. The JWT access token is the boarding pass — short-lived (~5 minutes), used for one stretch of API calls, then thrown away. The **/auth endpoint** is the counter — present your passport, get a boarding pass.

The boarding pass also tells the gate which terminal to send you to. Ethos encodes the same idea in the **Accept header**: each /api/ call declares which version of the resource it wants — v12 for persons, v16.1 for sections — and Ethos serves the matching payload shape.

Passports don't go through the gate. You trade yours once at the security counter for a boarding pass — short-lived, single-use, expires fast. Welcome to Ethos auth.

What it really is

Four mechanical facts. Internalize them and the rest of Track I makes sense.

1. The two-step bearer flow. You POST your API key as a Bearer header to https://integrate.elluciancloud.com/auth. Ethos returns a JWT in the response body. You then call any /api/ endpoint with the JWT as the Bearer header. The JWT expires in ~5 minutes; an expired JWT yields HTTP 401, and you re-call /auth to get a fresh one. This is not RFC 6749 client-credentials grant — it is an Ellucian-specific simplified exchange. The same pattern holds across all official SDKs (Java, C#, Postman) and every third-party connector (Tray, COZYROC, Argos REST, ProcessMaker).

2. Version negotiation via media type. Each resource is independently versioned. You pin the version per call through the Accept and Content-Type headers, formatted as application/vnd.hedtech.integration.v{N}+json. Examples in production: persons at v12, sections at v16.1.0, courses at v16, academic-credentials at v6. There is no "Ethos v12" — there are per-resource versions, and breaking changes bump the major. Ellucian's release notes track each resource version independently.

3. The endpoint shape — kebab-plural under /api/. Canonical EEDM resources live under integrate.elluciancloud.com/api/<resource> where <resource> is kebab-case, always plural. Examples from Ellucian's own Postman collection: /api/persons, /api/courses, /api/sections, /api/academic-levels, /api/institution-jobs, /api/email-types, /api/student-types, /api/academic-credentials. Banner-specific proxies — operations that don't map cleanly to a canonical EEDM resource because they reflect Banner's own business processes — live under /qapi/ instead: /qapi/transfer-course-articulation, /qapi/transfer-course-detail-maintenance. See Transcript import end-to-end — customs at the EEDM port for when you reach into the /qapi/ family.

4. Pagination + the page-size discovery header. All list reads paginate via query parameters: ?offset=0&limit=100. The maximum page size is not universal — it depends on the resource and the Banner configuration. Don't hardcode it; Ethos returns the max in a response header, and the official SDKs expose helpers like GetEthosApiMaxPageSizeAsync (C#) or getPageSize (Java) to read it. There is no bulk endpoint for any resource — every POST writes a single resource. Bulk imports are loops over single POSTs, not array posts.
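Facts 2 through 4 combine into one request shape, sketched below. The URL pattern, media type, and offset/limit parameters come from the text; the function names, the injected fetch_page callable, and the stop condition (a short or empty page) are assumptions for illustration. The fake transport exists only so the sketch runs offline, and a real client would read the page-size limit from the response header rather than pass a fixed number.

```python
MEDIA_TYPE = "application/vnd.hedtech.integration.v{version}+json"
BASE_URL = "https://integrate.elluciancloud.com/api"


def build_request(resource: str, version: str, jwt: str, offset: int, limit: int):
    """Assemble the URL and headers for one page of a list read."""
    url = f"{BASE_URL}/{resource}?offset={offset}&limit={limit}"
    headers = {
        "Authorization": f"Bearer {jwt}",
        "Accept": MEDIA_TYPE.format(version=version),
    }
    return url, headers


def read_all(fetch_page, resource, version, jwt, limit):
    """Walk offset forward until a page comes back short or empty."""
    offset, items = 0, []
    while True:
        url, headers = build_request(resource, version, jwt, offset, limit)
        page = fetch_page(url, headers)  # caller supplies the HTTP call
        items.extend(page)
        if len(page) < limit:
            return items
        offset += limit


# Stand-in transport so the sketch runs offline: serves 5 fake records.
def fake_fetch(url, headers):
    query = dict(part.split("=") for part in url.split("?")[1].split("&"))
    start, size = int(query["offset"]), int(query["limit"])
    return [{"id": n} for n in range(5)][start:start + size]


people = read_all(fake_fetch, "persons", "12", "fake-jwt", limit=2)
print(len(people))
```

Keeping the HTTP call injectable also makes it straightforward to add the JWT refresh logic in one place later.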

Two-step bearer. Long-lived API key buys a short-lived JWT at /auth; the JWT signs every subsequent /api/ call until it expires in ~5 minutes.

Endpoints are kebab-plural under /api/. Version is negotiated per call via the Accept media type — v12 for persons, v16.1 for sections, no global Ethos version switch.

See it — the diagram

The two diagrams carry the model. The auth-flow diagram shows the boarding-pass dance: passport (API key) at the left, security counter (/auth) in the middle issuing the boarding pass (JWT), and the turnstiles (/api/* calls) on the right. The endpoint-shape diagram shows how a single REST call is assembled: kebab-plural path, version-pinned Accept header, JWT bearer auth, and pagination query params.

Show me the code

The minimal end-to-end against a sandbox. Two curl calls.

# Step 1 — Trade the API key for a JWT at the auth endpoint.
# The API key is long-lived; the JWT in the response is short-lived
# (~5 min). The body is the raw JWT — no JSON envelope.
TOKEN=$(curl -s -X POST https://integrate.elluciancloud.com/auth \
  -H "Authorization: Bearer ${ETHOS_API_KEY}")

# Step 2 — Call a canonical EEDM resource. The Accept header pins the
# resource version. The Authorization header carries the JWT bearer.
# Page through results with offset + limit.
curl -s "https://integrate.elluciancloud.com/api/persons?offset=0&limit=100" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Accept: application/vnd.hedtech.integration.v12+json"

A POST that writes a single resource (e.g., create an email-address record for a person):

curl -s -X POST https://integrate.elluciancloud.com/api/email-addresses \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Accept:       application/vnd.hedtech.integration.v6+json" \
  -H "Content-Type: application/vnd.hedtech.integration.v6+json" \
  -d '{ "address":"maria@example.edu",
        "person":  { "id":"c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12" },
        "type":    { "emailType":"personal" } }'

A few production-grade observations from these two snippets:

The same media type goes in Accept AND Content-Type on a write. The body shape Ethos expects is the shape Ethos returns.

The person you're writing to is referenced by GUID, not PIDM. See GUIDs vs PIDM — the impedance Banner SQL writers feel first for why and how to look up the GUID from Banner SQL.

Production code wraps step 1 in a refresh harness: most SDKs cache the JWT and only re-call /auth on 401. Argos REST DataBlocks against Ethos need this harness in the DataBlock's pre-process script; otherwise you re-auth on every cell render and grind to a halt.
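The refresh harness described above can be sketched in a few lines. This is a minimal illustration, not Ellucian SDK code: `TokenCache`, `fetch_token`, and the 30-second safety margin are hypothetical names and values chosen for the example, and the 300-second lifetime simply mirrors the ~5 minute figure quoted in this article. The function that actually performs the POST to /auth is injected, so the sketch runs without network access.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Caches a short-lived JWT and refreshes it shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], str],
                 ttl_seconds: float = 300.0,      # assumed ~5 min lifetime
                 margin_seconds: float = 30.0,    # hypothetical safety margin
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token  # hypothetical: performs the POST to /auth
        self._ttl = ttl_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, calling fetch_token only when it is stale."""
        stale = self._clock() >= self._expires_at - self._margin
        if self._token is None or stale:
            self._token = self._fetch_token()
            self._expires_at = self._clock() + self._ttl
        return self._token

    def invalidate(self) -> None:
        """Call after a 401 so the next get() re-authenticates."""
        self._token = None
```

Every request then asks the cache for its bearer (`cache.get()`) instead of calling /auth directly, and a 401 handler calls `cache.invalidate()` before retrying once.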

Where intuition fails

1. The JWT expires in ~5 minutes — every long-running job needs a refresh. A SQL background job runs as long as it needs to; an Ethos loop dies after ~5 minutes unless you re-call /auth. Plan for it. The SDKs handle it; raw HTTP code doesn't unless you write it.

2. Version is per resource, not per Ethos. You cannot say "we're on Ethos v12." You can say "we call persons at v12 and sections at v16.1." When Ellucian publishes a breaking change to one resource, the others stay where they are. Pin every call.

3. Endpoint names are kebab-plural, not snake-singular. SQL habits break here. The resource is /api/persons, not /api/PERSON or /api/person. The resource is /api/academic-credentials, not /api/SPRACAD or /api/academicCredentials. There is no aliasing — the literal kebab-plural is the only path that works.

4. There is no bulk endpoint. None. No POST /api/persons accepting an array, no equivalent of SQL*Loader, no batch import API. A 15,000-row transcript import is 15,000 separate POSTs. Public sources put practical Ethos throughput on the order of tens of rows per second, two to three orders of magnitude slower than SQL*Loader. See When Ethos, when SQL — the decision frame for the next 3-5 years for the decision frame this implies.

5. Don't hardcode the page-size limit. It varies. Read it from the Ethos response header (or use an SDK helper). Hardcoding limit=200 because it worked on your test box is the easy way to spend an afternoon debugging a production 400 after Ellucian tightens the default.

6. The API key is a secret, but it's not the bearer token. The API key only authenticates the /auth call. The JWT it returns is the bearer for everything else. Never paste the API key into a generic HTTP client as the bearer for /api/persons: you'll get 401 and waste an hour wondering why. Some third-party connectors (Argos REST, COZYROC, Tray) auto-handle the exchange; some require you to wire it.
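One way to act on the page-size warning in this list is to let the response tell you the cap rather than baking a number into the loop. The sketch below assumes the cap arrives in a response header; the header name `x-max-page-size` is an assumption and should be checked against your own tenant's responses. `fetch_page` is a hypothetical callable standing in for the GET with offset and limit query parameters, so the sketch runs offline.

```python
from typing import Callable, Iterator, List, Mapping, Tuple

Page = Tuple[List[dict], Mapping[str, str]]  # (rows, response headers)


def iter_rows(fetch_page: Callable[[int, int], Page],
              requested_limit: int = 100,
              max_size_header: str = "x-max-page-size") -> Iterator[dict]:
    """Yield every row, shrinking the page size if the server advertises a cap.

    The headers mapping is looked up with a lower-case key; normalise header
    names first if your HTTP client does not.
    """
    offset, limit = 0, requested_limit
    while True:
        rows, headers = fetch_page(offset, limit)
        advertised = headers.get(max_size_header)
        if advertised is not None:
            limit = min(limit, int(advertised))
        if not rows:
            return                 # an empty page means we are past the end
        yield from rows
        offset += len(rows)        # advance by what actually arrived
```

Advancing the offset by the number of rows actually returned, not by the requested limit, keeps the loop correct even when the server hands back fewer rows than asked for. If your environment rejects an oversized limit outright instead of clamping it, start from a small limit and raise it only after reading the header.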

The one-sentence takeaway

Ethos REST is a two-step boarding-pass dance — long-lived API key for a ~5-minute JWT, called against kebab-plural /api/ endpoints with a media-type header that pins the version per call.

Track I · Beyond direct SQL — Ethos & the integration layer

GUIDs vs PIDM — the impedance Banner SQL writers feel first

When an Ethos response lands on your desk, you can't join it on PIDM. You need GORGUID first.

7 min read · ethos · guid · pidm · gorguid · eedm · integration

The hook

You run a SQL query and get back PIDMs. You receive an Ethos JSON payload and get back GUIDs. They both identify the same people, but you cannot join the two without an intermediate lookup. That intermediate lookup is GORGUID — and most Banner SQL writers don't realize it's there until the first Ethos integration breaks because they tried to join on the wrong column.

This article is that lookup, explained.

The everyday analogy

Phone numbers and office extensions.

Inside an office, everyone has a four-digit extension. Dialing 1234 from any phone in the building rings Maria's desk. It's short, fast, universal within the company, and it depends entirely on you being inside the building. The extension is Maria, as far as the in-house phone system is concerned. The PBX has no opinion about who Maria is outside the company.

Outside the office, that same extension is meaningless. To reach Maria from anywhere in the world, you need her full E.164 phone number: +1-630-555-1234. That's a global identifier; any phone system on earth knows what to do with it. The international number doesn't depend on you being inside the building, but dialing it from inside the building is slower (it has to go out and come back in).

Both numbers identify Maria. Neither replaces the other. The company maintains a lookup table at the switchboard that pairs Maria's extension (1234) to her E.164 number (+1-630-555-1234). When the switchboard receives an incoming external call, it consults the table: "+1-630-555-1234 belongs to extension 1234; ring that desk."

That lookup table, in Banner, is GORGUID. Maria's extension is her PIDM. Maria's E.164 number is her GUID.

Two ways to call Maria — by her four-digit office extension (fast, local), or by her full international phone number (slow, universal). Both are valid; a lookup table at the switchboard pairs them.

What it really is

PIDM (Person Identification Master): an integer (an Oracle NUMBER column) that Banner assigns to every person, whether student, employee, vendor, or contact. It is the identifying key on SPRIDEN (though not unique there, because SPRIDEN keeps a row for each name or ID change), the universal join key across hundreds of person-related tables, and the foundational concept the PIDM — The Number Behind Every Person article exists to teach. It is local to your Banner instance. PIDM 50321 in your Banner is a completely different person from PIDM 50321 at a peer institution.

GUID (Globally Unique IDentifier): a 36-character string conforming to RFC 4122, formatted like c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12. Ethos uses GUIDs as the canonical cross-system identifier for every resource it exposes: not just people, but courses, sections, terms, financial-aid awards, jobs, academic-credentials, even validation codes. GUIDs are generated either centrally or per-instance and are intended to be globally unique by construction. The same person has a different GUID at each institution, but each institution's GUID for that person never changes once issued.

GORGUID — the Banner table that pairs an entity's Banner business key to its Ethos GUID. It lives in the GENERAL schema. Its description in the Banner catalog: "GlobalUniqueIdentifier: Stores the global unique identifier for an object across all domains."

GORGUID's working columns:

Column	Meaning
`GORGUID_GUID`	The 36-char GUID itself.
`GORGUID_LDM_NAME`	The EEDM/LDM resource name (`persons`, `students`, `sections`, `courses`, etc.) — disambiguates which kind of thing this GUID identifies.
`GORGUID_DOMAIN_KEY`	The Banner business key for the row this GUID points at (for `persons` rows, this maps to PIDM).
`GORGUID_DOMAIN_SURROGATE_ID`	The Hibernate surrogate identifier for the underlying object — used by Ethos-aware Banner code internally.
`GORGUID_VPDI_CODE`, `GORGUID_DOMAIN_VPDI_CODE`	Multi-entity processing (MEP) codes — only relevant if your Banner runs MEP.
`GORGUID_VERSION`, `GORGUID_DATA_ORIGIN`, `GORGUID_USER_ID`, `GORGUID_ACTIVITY_DATE`, `GORGUID_SURROGATE_ID`	Bookkeeping — Hibernate version, source system, who/when, internal id.

GUIDs in Banner physically live in three places, depending on the entity and the version of Banner:

A column on the entity's own table, e.g., a _GUID column added directly to certain Ethos-aware tables.

A shadow table dedicated to that entity: a _GUID companion table beside the entity's main table.

GORGUID, the catch-all GUID lookup, where most resource types end up.

When Ethos Integration was added to Banner, Ellucian had to backfill GUIDs for billions of existing rows. The "GUID Generation" process is a Banner-side job that populates these tables. New rows get GUIDs at insert time, typically via database triggers or Hibernate callbacks in Banner's middle tier.

A GUID is 36 chars: 32 hex digits in 8-4-4-4-12 groups, RFC 4122. Banner stores it on the entity's own table, a shadow table, or — most commonly — in GORGUID.

GORGUID is the catch-all. Three columns do the work: GORGUID_LDM_NAME (which resource type), GORGUID_DOMAIN_KEY (the Banner business key for that row), and GORGUID_GUID (the 36-char ID).
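The 8-4-4-4-12 shape is easy to check before a value ever reaches a query. The sketch below is an illustration only: `looks_like_guid` is a hypothetical helper name, and it validates the textual shape, not whether the GUID exists in GORGUID.

```python
import re
import uuid

# 32 hex digits in 8-4-4-4-12 groups, 36 characters including hyphens.
GUID_SHAPE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def looks_like_guid(value: str) -> bool:
    """True when value has the 36-char 8-4-4-4-12 hex shape."""
    return GUID_SHAPE.fullmatch(value) is not None


sample = "c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12"   # the example GUID used in this article
group_lengths = [len(part) for part in sample.split("-")]
parsed = uuid.UUID(sample)                         # the standard library parses the same shape
```

A check like this catches the most common mix-up early: a PIDM such as `50321` pasted where a GUID was expected fails `looks_like_guid` immediately.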

See it — the diagram

The two diagrams together carry the model. GUID anatomy shows the 36-char shape and where it lives. The GORGUID columns diagram shows that three columns do all the work: LDM_NAME (what kind of thing), DOMAIN_KEY (which specific row), and GUID (its global identifier). Everything else is bookkeeping.

Show me the code

The minimal join: you have a GUID from an Ethos payload, you want the Banner row.

-- Given an Ethos persons GUID, find the Banner PIDM and current name.
-- GORGUID lives in the GENERAL schema.
SELECT g.gorguid_domain_key AS pidm,
       s.spriden_id,
       s.spriden_last_name,
       s.spriden_first_name
FROM   general.gorguid g
JOIN   saturn.spriden  s
       ON s.spriden_pidm = TO_NUMBER(g.gorguid_domain_key)
      AND s.spriden_change_ind IS NULL
WHERE  g.gorguid_guid     = 'c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12'
  AND  g.gorguid_ldm_name = 'persons';

Notes on this query:

GORGUID_DOMAIN_KEY is the Banner business key for the row the GUID points at. For LDM_NAME = 'persons', the business key is the PIDM. It is stored as a string, so TO_NUMBER() brings it back to the integer shape SPRIDEN.SPRIDEN_PIDM expects.

The SPRIDEN_CHANGE_IND IS NULL filter is the standard "current name only" pattern; without it, this query returns every historical row for that PIDM. See SPRIDEN Without CHANGE_IND — The Duplicate-Name Trap.

GORGUID_LDM_NAME = 'persons' is required. The same GORGUID table stores GUIDs for many resource types; without the filter you can collide on DOMAIN_KEY between a PIDM (an integer) and an entirely unrelated business key from a different domain.

The reverse direction — given a Banner PIDM, find the persons GUID:

SELECT gorguid_guid
FROM   general.gorguid
WHERE  gorguid_domain_key = TO_CHAR(50321)
  AND  gorguid_ldm_name   = 'persons';

For non-person resources, the join pattern has the same shape, but the right side and the meaning of DOMAIN_KEY change. For LDM_NAME = 'sections', DOMAIN_KEY will hold a section identifier (typically a row id) and the join is to SSBSECT (or, if your Banner exposes it, on the SSBSECT_SURROGATE_ID column). For LDM_NAME = 'courses', the join is to SCBCRSE. The companion articles on canonical joins (Joining by PIDM — SPRIDEN and the Universal Key and friends) cover each entity's join pattern.

Going from a GUID in an Ethos payload to a Banner row: filter GORGUID by LDM_NAME = 'persons', look up the DOMAIN_KEY, then join to SPRIDEN on PIDM.
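To see why both filters matter without touching a real Banner database, the join can be rehearsed on a toy copy. The sketch below uses an in-memory SQLite database, so it is an approximation under stated assumptions: the tables carry only a handful of columns, there are no schema prefixes, `CAST(... AS INTEGER)` stands in for Oracle's `TO_NUMBER()`, and every row (names, IDs, the second GUID) is invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gorguid (gorguid_guid TEXT, gorguid_ldm_name TEXT,
                          gorguid_domain_key TEXT);
    CREATE TABLE spriden (spriden_pidm INTEGER, spriden_id TEXT,
                          spriden_last_name TEXT, spriden_change_ind TEXT);
    -- Invented rows: the same DOMAIN_KEY value under two resource types.
    INSERT INTO gorguid VALUES
        ('c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12', 'persons',  '50321'),
        ('0f3b1c22-5a6d-4e7f-8a9b-1c2d3e4f5a6b', 'sections', '50321');
    INSERT INTO spriden VALUES
        (50321, 'A00050321', 'Garcia', NULL),  -- current name
        (50321, 'A00050321', 'Lopez',  'N');   -- historical name
""")

GUID_TO_PERSON = """
    SELECT s.spriden_pidm, s.spriden_last_name
    FROM   gorguid g
    JOIN   spriden s
           ON s.spriden_pidm = CAST(g.gorguid_domain_key AS INTEGER)
          AND s.spriden_change_ind IS NULL
    WHERE  g.gorguid_guid = ? AND g.gorguid_ldm_name = 'persons'
"""
person = conn.execute(
    GUID_TO_PERSON, ("c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12",)).fetchall()

# Reverse lookup by DOMAIN_KEY, with and without the LDM_NAME filter.
unfiltered = conn.execute(
    "SELECT gorguid_guid FROM gorguid WHERE gorguid_domain_key = ?",
    ("50321",)).fetchall()
filtered = conn.execute(
    "SELECT gorguid_guid FROM gorguid "
    "WHERE gorguid_domain_key = ? AND gorguid_ldm_name = 'persons'",
    ("50321",)).fetchall()
```

In this toy data the unfiltered reverse lookup returns two GUIDs for key `50321` (one person, one section), while the filtered lookup returns exactly one; the forward join returns only the current-name row because of the `IS NULL` condition.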

Where intuition fails

1. PIDMs are person-only; GUIDs are universal. PIDM exists on SPRIDEN-related tables and means "a person." GUIDs exist on every Ethos-exposed resource: courses, sections, terms, awards, jobs, even validation codes. You will not look up a course by PIDM — the PIDM column won't be there. You will look up a course by its GORGUID row with LDM_NAME = 'courses'.

2. GORGUID has no _PIDM column. A long-time Banner SQL writer will reach for gorguid_pidm by instinct. There is no such column. The link to people is via GORGUID_DOMAIN_KEY (filtered by LDM_NAME = 'persons'). This is the single most common false start for a Banner team's first Ethos integration.

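3. GORGUID_DOMAIN_KEY is a string, not a number. For persons it stores the PIDM in VARCHAR form. Join with TO_NUMBER() or compare with TO_CHAR(). If you compare the column to a bare number, Oracle implicitly converts the column side to a number, which bypasses a plain index on GORGUID_DOMAIN_KEY (so you get full table scans) and can raise ORA-01722 when it meets a non-numeric key from another resource type.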

4. GUIDs are immutable per institution, but they do not carry across institutions. If a student transfers from Waubonsee to NIU, their Waubonsee GUID and NIU GUID for the same human are different strings. There is no central registry. Cross-institution identity matching is a separate unsolved problem: PESC EdExchange, IPEDS, and state student-record systems each use their own scheme, not Ethos GUIDs.

5. The GUID Generation job has gaps. When Banner was retrofit with Ethos, the backfill missed corner cases — manually loaded rows, certain test/dev environments, certain historical tables. If a query returns NULL for GORGUID_GUID on a row that should have one, suspect missed backfill. Ellucian publishes patches that regenerate GUIDs for specific tables; the fix is data, not code.

6. GUIDs leak through to the API and are not secrets. A GUID in an Ethos payload is intended to be passed back to Ethos in subsequent calls. It is not personally identifying on its own (you can't reverse it to a name without the lookup), but treat it as a long-lived identifier — don't expect it to rotate.
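The backfill gaps mentioned in this list can be hunted with an anti-join: left-join people to their persons GUID and keep the rows where nothing matched. The sketch below rehearses that on an in-memory SQLite toy, with the same caveats as any toy: trimmed-down tables, invented rows, and `CAST(... AS TEXT)` standing in for Oracle's `TO_CHAR()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gorguid (gorguid_guid TEXT, gorguid_ldm_name TEXT,
                          gorguid_domain_key TEXT);
    CREATE TABLE spriden (spriden_pidm INTEGER, spriden_change_ind TEXT);
    -- Invented rows: PIDM 50322 was never given a persons GUID.
    INSERT INTO gorguid VALUES
        ('c2a8e5f3-9d7e-4b18-a4c2-7e1f8b3c9d12', 'persons', '50321');
    INSERT INTO spriden VALUES (50321, NULL), (50322, NULL);
""")

MISSING_PERSON_GUIDS = """
    SELECT s.spriden_pidm
    FROM   spriden s
    LEFT JOIN gorguid g
           ON g.gorguid_domain_key = CAST(s.spriden_pidm AS TEXT)
          AND g.gorguid_ldm_name   = 'persons'
    WHERE  s.spriden_change_ind IS NULL   -- one row per person
      AND  g.gorguid_guid IS NULL         -- no GUID row matched
    ORDER BY s.spriden_pidm
"""
missing_pidms = [row[0] for row in conn.execute(MISSING_PERSON_GUIDS)]
```

Here `missing_pidms` holds only the person without a GUID row. Against real Banner the same shape of query gives you the list to hand to whoever runs the GUID regeneration, which fits the point above that the fix is data, not code.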

The one-sentence takeaway

PIDM is your internal extension; the GUID is your E.164 number — both identify the same person, and GORGUID is the lookup table that pairs them.


When Ethos, when SQL — the decision frame for the next 3-5 years

8 min read · ethos · sql · argos · decision · banner-saas · integration-strategy

The hook

There is no universal answer. There is a frame. This article gives you the frame — three factors that decide, six worked examples that show the factors in motion, and an honest reading of where Ellucian's trajectory is pushing this decision over the next 3-5 years.

The everyday analogy

A city has two ways to move parcels around.

A messenger on a cargo bicycle is fast, direct, knows every alley. She knows which streets flood when it rains and which courtyard the back entrance opens onto. She'll deliver across town in twenty minutes flat. But she only knows this city; she can't cross the river to the next country, she only carries what she can physically pedal, and only people who know her phone number can hire her. Direct SQL against Banner is the cargo bicycle — fast, knows the schema by heart, but only valid inside your Banner instance and only available to people with database access.

A container ship is slow. Cargo has to be loaded into standardized boxes, manifests have to be stamped, customs has to be cleared, the ship sails on a schedule, the unloading at the other end takes its own day. But the ship can reach any port that has a dock, the boxes follow an international standard so any forklift in the world can move them, and anyone with a shipping account can book passage. Ethos is the container ship — paperwork-heavy, throughput-limited, but interoperable and open to anyone with credentials.

The choice between the two is almost never about which is "better." It is about whether your parcel stays in the city.

A messenger on a bicycle vs a container ship at dock. The bicycle is fast for the next street; the ship is slow but it can sail anywhere. Same cargo, different roads.

What it really is

Three factors decide, in this order.

1. Where does the workload start and end? If both endpoints are inside your Banner instance — a security audit, an analytical report joining ten tables, a one-off data clean-up — direct SQL wins on every axis: speed, depth, latency, no token churn, no rate limits, the full 6,900-table surface. If one endpoint is outside Banner — a CRM that needs nightly student records, a transcript evaluation tool, a Slate admissions integration, a Workday payroll feed — Ethos's value is the exact opposite of SQL's: a standardized interface that any well-behaved foreign system can plug into without learning your schema, without your DBA's permission, without anyone reverse-engineering SPRIDEN.

2. What is your Banner deployment? If you're on-prem (self-managed Oracle, your own servers), both tools are available to you and factor 1 decides. If you're on Banner SaaS, direct database access does not exist — this is not future deprecation, it is the present reality. The SaaS platform exposes data through Ellucian-managed channels: Ethos Integration, Ethos Subscription, and Ellucian Data Connect. The decision collapses: Ethos for everything, with whatever workarounds Ellucian's managed channels provide for analytical needs.

3. What is Ellucian's trajectory telling you? This is the planning-horizon question. Ellucian has not published a formal sunset for the on-prem Banner core, and there are no public signs one is imminent. But every adjacent signal is unambiguous. Banner 8 Self-Service reached EOL January 1, 2026. The Classic CRM Interface reached EOL Winter 2025/2026. Luminis/Portal/Mobile reached EOL June 2024. The 2025 Ellucian Live conference rebranded Banner-plus-Colleague SaaS as "Ellucian Student", one unified SKU. Ellucian Transfer, a new product built natively on the SaaS platform, reaches general availability in H1 2026. Tambellini Group estimates the SaaS platform's full maturity is roughly two years away. On-prem Banner is not going away tomorrow. But every new Ellucian product is SaaS-native, and the practitioner consensus (repeated by Strata, by ABCloudz, by Tambellini, and by other Ellucian partners publishing in public) is that a shop planning over a 3-5 year horizon should treat Ethos adoption as strategically necessary, not optional.

Six common workloads, three decision factors. The column you land in is your tool — but the SaaS row collapses the decision: there's no private courier.

See it — the diagram

The decision matrix lays out six common workloads — security audit, ad-hoc analytical report, transcript import, CRM sync, real-time event subscription, transfer-credit articulation — against the three factors. Read it column by column to find the recommended tool. Read it row by row to see how the same workload changes recommendation when Banner moves from on-prem to SaaS. The "what direction is Ellucian going" factor is the quiet asterisk: even for on-prem shops, the row converges toward Ethos the longer your planning horizon.

Show me the code

The same task, two ways. The differences are concrete, not theoretical.

Task A — "Find all students who registered for fall and have a financial-aid award." A workload that stays inside Banner.

-- Direct SQL, Banner-native. ~80ms on a populated test environment.
SELECT DISTINCT s.spriden_id, s.spriden_last_name, s.spriden_first_name
FROM   saturn.sfrstcr  r
JOIN   saturn.spriden  s ON s.spriden_pidm = r.sfrstcr_pidm
                       AND s.spriden_change_ind IS NULL
JOIN   faismgr.rprawrd a ON a.rprawrd_pidm = r.sfrstcr_pidm
                       AND a.rprawrd_aidy_code = '2526'
WHERE  r.sfrstcr_term_code = '202610'
  AND  r.sfrstcr_rsts_code IN ('RE','RW');

The Ethos equivalent is not one call. It is: list /api/section-registrations filtered by term, list /api/financial-aid-awards filtered by aid year, intersect by person GUID on the client side, then read /api/persons for each surviving GUID to get names. Three round trips per page, ~10-50 rows per second based on public practitioner estimates — minutes for a small cohort, hours for a large one. The Ethos version exists, but for this workload it is the wrong tool.
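The client-side intersect step is where the SQL join goes once the data arrives as two JSON lists. A minimal sketch follows, assuming each section-registration row carries the person under `registrant.id` and each award row under `student.id`; those field names are assumptions about the payload shape and should be checked against the resource versions you pin. The function names are hypothetical.

```python
from typing import Iterable, Set


def person_guids(rows: Iterable[dict], person_field: str) -> Set[str]:
    """Collect the person GUID stored under row[person_field]['id']."""
    return {row[person_field]["id"] for row in rows}


def registered_with_award(registrations: Iterable[dict],
                          awards: Iterable[dict]) -> Set[str]:
    """Client-side stand-in for the SQL join: GUIDs present in both lists."""
    return (person_guids(registrations, "registrant")
            & person_guids(awards, "student"))


# Invented toy payloads; real rows carry many more fields.
registrations = [{"registrant": {"id": "g1"}},
                 {"registrant": {"id": "g2"}},
                 {"registrant": {"id": "g2"}}]   # same student, second section
awards = [{"student": {"id": "g2"}}, {"student": {"id": "g3"}}]
survivors = registered_with_award(registrations, awards)
```

Using sets gives the DISTINCT from the SQL version for free, but the expensive part is unchanged: each surviving GUID still costs a further read of /api/persons to get a name.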

Task B — "Push every new student registration into Salesforce within five minutes of it happening in Banner." A workload whose other endpoint is outside Banner.

# Ethos Subscription (the canonical pattern for this).
# 1. Subscribe an Ethos Integration "application" to
#    section-registrations change notifications via Ethos Integration UI.
# 2. Salesforce (or middleware in front of it) picks up a change
#    notification for every change, carrying the GUID of the new
#    section-registration.
# 3. Salesforce calls /api/section-registrations/{guid} for full payload
#    (and /api/persons/{guid} for the student name) when it needs detail.

The SQL equivalent is also not one call. It is: write a Banner-side polling job that selects new rows from SFRSTCR, push them out via your own ETL framework (Windmill, Informatica, AWS Glue, whatever), maintain the integration on your side, get paged when it breaks. Doable on-prem, and many shops do it. In Banner SaaS this option does not exist — there is no Banner-side polling job because there is no Banner side you can reach. Ethos Subscription is not just better for this workload; it is the only option.

The same task done two ways. The differences are not stylistic — they are throughput, surface area, and who can run them.
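To put the throughput difference in concrete terms, here is the back-of-envelope arithmetic for the ~10-50 rows per second estimate quoted above, using a 15,000-row import as the example volume. The helper name is hypothetical, and the rates are the article's public practitioner estimates, not measured figures.

```python
def minutes_to_post(rows: int, rows_per_second: float) -> float:
    """Wall-clock minutes for a one-row-per-request loop at a steady rate."""
    return rows / rows_per_second / 60.0


ROWS = 15_000
best_case_minutes = minutes_to_post(ROWS, 50)    # fast end of the estimate
worst_case_minutes = minutes_to_post(ROWS, 10)   # slow end of the estimate
print(f"{ROWS} rows: {best_case_minutes:.0f}-{worst_case_minutes:.0f} minutes")
```

That works out to between 5 and 25 minutes for a load that a bulk tool finishes in seconds, and the estimate ignores retries, token refreshes, and any throttling.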

Where intuition fails

1. Ethos throughput is two-to-three orders of magnitude slower than direct SQL. Public sources put practical Ethos write throughput at roughly tens of rows per second, versus thousands of rows per second for SQL*Loader or PL/SQL bulk inserts. For analytical or bulk workloads, this is not a stylistic difference. A 15,000-row transcript import takes seconds via SQL*Loader and 5-25 minutes via Ethos (15,000 rows at 10-50 rows per second). There is no bulk endpoint; every row is a separate POST. Bulk in Ethos means a loop.

2. Ethos does not expose every Banner field. EEDM is a curated subset of Banner's surface. Public practitioner sources are unanimous: "limited but growing." If your workload depends on a Banner column that EEDM doesn't model, your choices are (a) use Ethos Extend to add a custom resource (Ellucian-blessed but adds maintenance), (b) wait for Ellucian to model it (timeline unknown), or (c) fall back to direct SQL — which only works if you're not on SaaS.

3. "We have Argos" is not an Ethos exemption. Argos can call Ethos REST endpoints — Evisions documents the connector — but only if you wire up the boarding-pass refresh (see EEDM REST mechanics — passport, boarding pass, version-pinned gate). And many Argos reports do things — multi-table audits, security joins, report-writer-style aggregations — that EEDM doesn't model at all. If your Argos report depends on direct SQL, calling it via Argos doesn't make it Ethos-compatible.

4. The decision is not about your current deployment; it's about your next one. If you're on-prem and planning to stay on-prem 3-5 years, factor 1 still decides individual workloads. If you might move to SaaS in that horizon, every new integration you build on direct SQL is a future migration cost. Start new integrations on Ethos where Ethos fits; reserve direct SQL for workloads that genuinely can't.

5. Ellucian Data Connect blurs the boundary. Data Connect is Ellucian's newer SaaS-oriented integration product, framed for low-code APIs and serverless integration pipelines, with both bulk-load and change-data-capture pipelines. Its exact relationship to Ethos Integration is not fully clear from public documentation — some material treats it as a successor, some as a complement for SaaS deployments. For the decision frame, treat Data Connect as "the same direction as Ethos, not the same product" — it does not change the SQL-vs-Ethos analysis, but it is the channel many new SaaS workloads will actually move through.

6. There is no published Ethos rate limit, and that is itself a constraint. Ellucian's API reference for rate limits lives behind authentication. No public number for requests-per-minute, concurrent connections, or payload size turned up in the research for this article. Production integrations build in conservative back-off and never assume headroom. The 5-minute JWT TTL is the one hard public number.
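Conservative back-off, as recommended in the last item, usually means exponential delays with jitter. The sketch below is one generic way to do it, not an Ellucian-documented policy: the `Throttled` exception, the attempt count, and the delay values are all hypothetical, and how your HTTP client signals throttling (a 429, a 503, a timeout) is something to map onto `Throttled` yourself.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class Throttled(Exception):
    """Hypothetical signal that the server asked the client to slow down."""


def call_with_backoff(call: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() on Throttled, doubling the wait (with jitter) each time."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Throttled:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))   # jitter avoids lockstep retries
    raise ValueError("max_attempts must be at least 1")
```

Wrapping each POST in `call_with_backoff` keeps a 15,000-row loop from hammering an endpoint whose limits you cannot look up in advance; the `sleep` parameter is injectable so the behaviour can be tested without waiting.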

The one-sentence takeaway

Pick by the workload, not by the tech: direct SQL/Argos for work that stays inside your building, Ethos for work that has to leave it — and remember that in Banner SaaS the building has no back door.