The Anatomy of a Federated Query

When a user types "What's our total inventory value across all warehouses, broken down by supplier payment status?" — a question that sounds simple — they are implicitly asking for data that lives in three different database systems, was written by three different engineering teams, and has never been queried together in a single operation before.

Answering it correctly requires not just SQL generation but a coordinated sequence of planning, decomposition, parallel execution, and synthesis. Understanding how this pipeline works end-to-end explains why federated query intelligence is architecturally distinct from standard text-to-SQL.

"A federated query is not one query. It is a plan expressed as multiple queries whose results are combined into a single answer."

Stage 1: Intent Parsing

The first stage transforms the raw natural language question into a structured semantic representation. This is not simple keyword extraction — it requires understanding what the user is actually asking for, not just what words they used.

Intent parsing identifies several components:

Primary entities — the business objects the query is about (inventory, warehouses, suppliers, payments)
Aggregation intent — whether the user wants totals, counts, averages, lists, or comparisons
Dimensional breakdown — how they want results grouped or segmented (by supplier payment status)
Filter conditions — implicit constraints on which records to include
Temporal scope — whether the query is about current state, historical data, or trends

The output of intent parsing is a normalised query representation that downstream stages can operate on consistently regardless of how the original question was phrased. Two questions that mean the same thing produce the same internal representation — which is why query history matching works even when users rephrase their questions slightly.

Stage 2: Source Selection

Given the parsed intent, the system must identify which data sources are required to answer the question. This is where federated architecture departs fundamentally from single-database text-to-SQL.

Source selection operates in two modes. For single-source queries — where all relevant data lives in one system — deterministic keyword and fuzzy table-name matching identifies the correct source with high reliability. For multi-source queries, a schema planning pass examines all active data sources and identifies which sources contain data relevant to each component of the query.

The query "total inventory value by supplier payment status" decomposes immediately: inventory value lives in the warehouse management system, supplier payment status lives in the procurement database. Neither table is available to the other system. Both are required.

Source selection produces a set of source assignments — mappings from each query component to the specific database that will answer it. These assignments drive everything that follows.

Stage 3: Sub-Query Decomposition

With source assignments established, the system decomposes the original query into a set of independent sub-queries — one per source — each expressed in the SQL dialect of that source.

This decomposition is non-trivial. The sub-queries cannot simply be halves of the original question. They must be designed so their results can be joined in the fusion stage. This means each sub-query must:

Select the join keys needed to correlate results across sources
Apply any filter conditions relevant to that source
Use the correct dialect for that database engine (MySQL, SQL Server, Oracle, PostgreSQL each have syntax differences)
Return column names that are consistent with what the fusion step expects

Intent Parsing

Natural language → structured semantic representation with entities, aggregations, filters, and scope.

Source Selection

Schema planning identifies which data sources contain data relevant to each query component.

Sub-Query Decomposition

One SQL query per source, dialect-correct, with join keys and column naming aligned for fusion.

Parallel Execution

Sub-queries execute concurrently against their respective databases. Results stream back independently.

Result Fusion

Partial result sets are joined in memory on shared keys, aggregated, and formatted into a unified response.

Validation & Delivery

Results are checked for structural consistency, anomalies flagged, and the answer delivered with full provenance.

Stage 4: Parallel Execution

Sub-queries execute concurrently. The system maintains a connection pool to each registered data source and dispatches all sub-queries simultaneously, not sequentially. The total execution time is bounded by the slowest individual source, not the sum of all source times.

This has significant implications for latency. A three-source query that would take 3 × 800ms = 2,400ms sequentially completes in approximately 800ms plus overhead — a 3× throughput improvement that grows with the number of sources.

Source	Query Type	Execution Time	Notes
MySQL Inventory	Aggregation with GROUP BY	620ms	Index scan on warehouse_id
SQL Server Retail	JOIN across 2 tables	480ms	Full index seek
Oracle Supplier	Filtered SELECT	790ms	Critical path — sets total time
Sequential total	—	1,890ms
Parallel total	—	~840ms	Oracle is the bottleneck

Error isolation is equally important. If one source fails — due to a timeout, connectivity issue, or query error — the system can return partial results from the other sources with clear provenance markers, rather than failing the entire query. The user sees what data was retrieved successfully and which source had an issue.

Stage 5: Result Fusion

Fusion is the stage that makes federated queries feel like single-source queries to the end user. Raw result sets from each source are joined in application memory on the shared keys identified during decomposition, then aggregated and formatted according to the original query intent.

The fusion layer must handle several categories of complexity:

Key mismatches

Join keys often have different names or formats across sources. A supplier ID that is supplier_id INTEGER in the Oracle procurement database might appear as vendor_id VARCHAR(20) in the inventory system. The schema mapping layer resolves these before fusion begins.

Null handling

Not every record in source A will have a corresponding record in source B. Fusion must implement the semantically correct join type — inner join if both sides are required, left join if one side is optional — to avoid silently dropping records or inflating counts.

Aggregation ordering

Some aggregations must happen after fusion, not before. Computing "average order value per region" requires knowing which orders correspond to which regions — a cross-source fact — before averaging. Pushing aggregation to sub-queries in this case would produce incorrect results. The decomposer must correctly identify which aggregations are pre-aggregable and which require post-fusion computation.

Stage 6: Validation and Delivery

Before delivery, fused results pass through a validation layer that checks for structural anomalies: result counts that deviate significantly from historical baselines for similar queries, column value distributions that suggest a join went wrong, or null rates that indicate a key mismatch was not fully resolved.

Anomalies do not suppress results — they add confidence metadata that the frontend can surface to the user. A result accompanied by "join rate 94% — 6% of inventory records had no matching supplier entry" is more useful than either suppressing those records silently or failing the query entirely.

Every result is delivered with full provenance: which sources contributed, which sub-query each column derives from, and the execution timing for each stage. This provenance data feeds directly into the query history system, making it available for debugging and for the finalization flow that caches correct SQL for future reuse.

Part of the Vertiscope AI research series on federated query intelligence.