The fundamental problem of enterprise data is not that organisations lack data. It is that their data exists in isolated systems that were never designed to talk to each other. An inventory management system built in 2009, a retail operations database deployed in 2017, and a supplier procurement platform migrated in 2021 share no schema conventions, no foreign key relationships, and in most cases no common identifier format.
Yet the questions that matter most to the business require data from all three: "Which of our top-selling product lines are at risk from suppliers with overdue payment accounts?" is a question that spans all of them simultaneously.
Cross-source join resolution is the technical discipline of answering these questions without requiring the organisation to first unify their data into a warehouse. Understanding how it works — and what makes it hard — clarifies why federated query intelligence is architecturally non-trivial.
"The join that no database engine can execute is the one across two databases that don't know the other exists. But the application layer can."
What Makes Cross-Source Joins Different
Within a single database, the query planner has complete information: table statistics, index structures, join cost estimates, available memory. It can choose join algorithms (hash join, merge join, nested loop) with knowledge of both sides of the relationship. Cross-source joins have none of this.
The application must decide how to join without knowing:
- The cardinality of either result set before running the sub-queries
- Whether the join keys are indexed on either side
- What the data type and format of the join keys actually are in practice (vs. what the schema says)
- How much memory the result sets will consume
- Whether the join is selective (few matches) or expansive (many matches)
Additionally, the join cannot be pushed down to either database for optimisation — it must execute entirely in application memory after both sub-queries return their results.
The Key Discovery Problem
Before a cross-source join can execute, the system must identify the join keys: the columns in each source that refer to the same real-world entity. In a conventional database, this information is explicit in foreign key constraints. In a federated system spanning databases from different vendors and eras, it must be inferred.
Consider a concrete example:
MySQL — Inventory
| Column | Type |
|---|---|
| vendor_code | VARCHAR(10) |
| product_sku | VARCHAR(20) |
| warehouse_qty | INT |
| reorder_threshold | INT |
Oracle — Procurement
| Column | Type |
|---|---|
| supplier_id | NUMBER(8) |
| payment_status | VARCHAR2(20) |
| outstanding_balance | NUMBER(15,2) |
| last_payment_date | DATE |
Neither vendor_code nor supplier_id is obviously the same field. The names are different. The types are different (VARCHAR vs NUMBER). Yet they refer to the same suppliers. Identifying this relationship requires cross-source schema analysis: examining value samples, name similarity heuristics, and cardinality matching to determine that these columns likely represent the same entity.
Key format normalisation
Even when the join key is identified, the values often don't match directly. A supplier stored as SUPP-00174 in the inventory system might appear as 174 (integer, zero-padded stripped) in the procurement database, or as S00174 with a different prefix convention.
Format normalisation — stripping prefixes, padding/unpadding numeric fields, resolving case differences — must happen before the join executes. Getting this wrong is catastrophic: the join produces zero results or, worse, incorrect partial matches.
Join Strategy Selection
With keys identified and normalised, the system selects a join strategy. For cross-source joins in application memory, three strategies are available:
Hash join (default)
Build a hash map from the smaller result set, then probe with each row from the larger result set. Works well when one result set is significantly smaller than the other — a common case when one source provides a filtered lookup set and the other provides a large transaction table.
Memory requirement: O(n) for the smaller set. For result sets under ~100k rows on either side, this is the correct default.
Sort-merge join
Sort both result sets on the join key, then merge in a single linear pass. More memory-efficient than hash join for large result sets of similar cardinality. Chosen when both result sets are large and the join key is already sortable.
Nested loop with batching
For cases where one result set is very large but the join key is highly selective, batch the join key values and issue additional targeted sub-queries to the second source. Trades database round-trips for memory efficiency. Used when the initial result set from one source is too large to hash-join entirely in memory.
The Fanout Problem
A cross-source join that is unexpectedly many-to-many can produce a result set orders of magnitude larger than either input. A query expected to return 500 rows can return 50,000 if both sides have multiple matching records per key. The fusion layer includes cardinality estimation and circuit breakers that abort the join and surface a warning before consuming excessive memory.
Null Semantics Across Sources
Different database systems handle NULL values differently in join conditions. SQL Server treats two NULLs as non-matching in standard joins. Oracle's outer join syntax differs from ANSI syntax in subtle ways. MySQL has edge cases around NULL handling in NOT IN conditions.
Cross-source join resolution must implement consistent null semantics in application logic, independent of the behaviour of any individual database. The chosen behaviour is ANSI standard: NULLs do not match NULLs in join conditions, outer joins include non-matching rows from the designated side.
This matters especially for left joins where one source may have records with no matching entry in the other source. A retail store that has no active supplier relationship should still appear in a left-joined result, not disappear silently because the right side returned NULL.
Result Provenance
Every row in a cross-source join result carries provenance metadata: which source contributed which columns, and whether that row matched fully or was included via an outer join with NULLs on one side. This metadata serves two purposes.
First, it enables debugging. When a result looks wrong — too few rows, unexpected NULLs — provenance data shows exactly which source failed to contribute matching data, which is the starting point for diagnosing the underlying data quality issue rather than the query itself.
Second, it powers trust signals in the UI. A result where 18% of rows have NULL supplier status is not necessarily wrong — it might correctly reflect suppliers with no payment records. The distinction between "no data because the join failed" and "no data because there is none" is crucial, and provenance makes it explicit.
The ETL Alternative and Why It Fails
The traditional response to cross-source join requirements is ETL: extract data from each source into a central warehouse where conventional joins are possible. This approach has been the dominant architecture for 20 years. It has known, well-documented failure modes.
ETL pipelines introduce latency between source data and queryable data — typically measured in hours to days. They require ongoing maintenance as source schemas evolve. They replicate data storage costs. They create a governance surface where data ownership becomes ambiguous. And they solve yesterday's question: the warehouse was designed for the joins you anticipated, not the joins you'll need next quarter.
Cross-source join resolution runs against live data with sub-second overhead for the join step itself. The result reflects the state of the source systems at query time. Schema changes in source systems are reflected in the next query run, not in the next ETL cycle. And questions that were never anticipated require no schema changes, no new ETL pipelines, and no data engineer involvement.
Part of the Vertiscope AI research series on federated query intelligence.