Phase 9/13 — Testing

Phase 9: Testing Infrastructure

Comprehensive analysis of the testing infrastructure spanning TypeScript, Python, Go, and C#, all built on a shared replay proxy architecture.

1. Testing Frameworks per Language

Testing Frameworks per Language
flowchart TD
  SDK["copilot-sdk"] --> TS["TypeScript\nVitest"]
  SDK --> PY["Python\npytest + pytest-asyncio"]
  SDK --> GO["Go\ngo test"]
  SDK --> CS["C#\nxUnit"]
            

TypeScript: Vitest

Configuration: nodejs/vitest.config.ts

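import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    globals: true,            // expose describe/it/expect without explicit imports
    environment: "node",
    testTimeout: 30000,       // 30 s per test, generous for E2E
    hookTimeout: 30000,
    teardownTimeout: 10000,
    isolate: true,
    pool: "forks",            // run test files in forked child processes
    exclude: [
      "**/node_modules/**",
      "**/dist/**",
      "**/*.d.ts",
      "**/basic-test.ts",
    ],
  },
});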

Key choices: 30-second timeout (generous for E2E), process forking for isolation, global test APIs enabled.

Python: pytest + pytest-asyncio

Configuration: python/pyproject.toml (lines 81-86)

[tool.pytest.ini_options]
testpaths = ["."]
python_files = "test_*.py"
python_classes = "Test*"
python_functions = "test_*"
asyncio_mode = "auto"

Dev dependencies: pytest>=7.0.0, pytest-asyncio>=0.21.0, pytest-timeout>=2.0.0, httpx>=0.24.0.

Go: go test

Standard go test framework. No special configuration — tests follow Go conventions with _test.go suffixes and testing.T parameters.

C#: xUnit

Configuration: dotnet/test/Harness/E2ETestBase.cs

public abstract class E2ETestBase
  : IClassFixture<E2ETestFixture>,
    IAsyncLifetime

Uses IClassFixture<T> for shared context, IAsyncLifetime for async setup/teardown, and Fact/Theory attributes.

2. Test Structure — Unit Tests vs E2E Tests

Directory Layout

copilot-sdk/
  nodejs/
    test/
      client.test.ts              # Unit tests
      e2e/
        harness/
          CapiProxy.ts            # Per-SDK proxy client
          sdkTestContext.ts        # Test context
          sdkTestHelper.ts        # Utility functions
        session.test.ts           # E2E tests
        hooks.test.ts
        ... (20+ E2E test files)
  python/
    test_client.py                # Unit tests
    test_jsonrpc.py
    test_event_forward_compatibility.py
    test_rpc_timeout.py
    e2e/
      conftest.py                 # Shared fixtures
      testharness/
        context.py / helper.py / proxy.py
      test_session.py / test_hooks.py  ... (13+ files)
  go/
    client_test.go                # Unit tests
    definetool_test.go / session_test.go / types_test.go
    internal/e2e/
      testharness/
        context.go / helper.go / proxy.go
      session_test.go / hooks_test.go  ... (13+ files)
  dotnet/
    test/
      Harness/
        E2ETestBase.cs / E2ETestContext.cs
        E2ETestFixture.cs / TestHelper.cs
      SessionTests.cs / HooksTests.cs  ... (17+ files)
  test/                           # SHARED cross-language
    harness/
      server.ts / capturingHttpProxy.ts
      replayingCapiProxy.ts / util.ts
    snapshots/                    # YAML snapshot files
    scenarios/                    # Polyglot scenario tests

Unit vs E2E Distinction

Convention: nodejs/test/client.test.ts line 5: "This file is for unit tests. Where relevant, prefer to add e2e tests in e2e/*.test.ts instead"

| Aspect | Unit Tests | E2E Tests |
| --- | --- | --- |
| Location | SDK root (nodejs/test/, python/test_*.py, go/*_test.go) | Dedicated subdirs (*/e2e/, dotnet/test/) |
| What they test | Client construction, parameter validation, URL parsing | Full request/response flows with real session management |
| Dependencies | Mocks/spies (vi.spyOn), no external deps | Real CLI process + replaying HTTP proxy |
| Speed | Fast and deterministic | Slower, 30-second timeout |
| Snapshots | None | Shared YAML across all 4 languages |

3. Test Harness Architecture

The test harness is a layered system with a shared TypeScript server at the base and per-language wrapper clients on top.

Test Harness Architecture
block-beta
  columns 1
  L5["Layer 5: Test Files\n22 E2E Vitest - 14 E2E pytest - 13 E2E go test - 17 E2E xUnit"]
  L4["Layer 4: Per-Language Wrappers\nCapiProxy.ts - proxy.py - proxy.go - E2ETestContext.cs"]
  L3["Layer 3: ReplayingCapiProxy\ntest/harness/replayingCapiProxy.ts lines 52-1059\nRecord/replay, YAML snapshots, normalization"]
  L2["Layer 2: CapturingHttpProxy\ntest/harness/capturingHttpProxy.ts lines 11-206\nTransparent HTTP proxy: forwards and records"]
  L1["Layer 1: Shared Proxy Server\ntest/harness/server.ts - Node.js HTTP server, random port\nEndpoints: POST /config - GET /exchanges - POST /stop"]
  L5 --> L4 --> L3 --> L2 --> L1
  style L5 fill:#7c3aed,color:#fff
  style L4 fill:#6d28d9,color:#fff
  style L3 fill:#5b21b6,color:#fff
  style L2 fill:#4c1d95,color:#fff
  style L1 fill:#3b0764,color:#fff
            

Layer 1: Shared Proxy Server

// test/harness/server.ts (lines 1-13)
const proxy = new ReplayingCapiProxy("https://api.githubcopilot.com");
const proxyUrl = await proxy.start();
console.log(`Listening: ${proxyUrl}`);

Each language's proxy wrapper launches this server as a child process. The server listens on a random local port and prints Listening: http://127.0.0.1:<port> to stdout, which the wrapper parses to discover the proxy URL.
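
The stdout handshake described above can be sketched as a small parser. This is an illustration only: parseListeningLine and ProxyAddress are hypothetical names, and the real wrappers may parse the line differently.

```typescript
// Hypothetical helper: extracts the proxy URL and port from a chunk of the
// server's stdout. The "Listening: http://127.0.0.1:<port>" shape is taken
// from the description above; everything else is an assumption.
interface ProxyAddress {
  url: string;
  port: number;
}

const LISTENING_PATTERN = /Listening: (http:\/\/127\.0\.0\.1:(\d+))/;

function parseListeningLine(stdoutChunk: string): ProxyAddress | null {
  const match = LISTENING_PATTERN.exec(stdoutChunk);
  if (!match) return null;
  const url = match[1];
  const port = match[2];
  if (url === undefined || port === undefined) return null;
  return { url, port: Number(port) };
}
```

A wrapper would typically feed each stdout chunk of the child process into such a function until it returns a non-null address, then use that URL for the control endpoints.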

Control Endpoints

| Endpoint | Purpose |
| --- | --- |
| POST /config | Reconfigure for the next test (snapshot path, work dir) |
| GET /exchanges | Retrieve captured HTTP exchanges |
| POST /stop | Gracefully shut down (optionally skip cache writes) |
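
As a rough sketch of how a thin wrapper might address these endpoints, the helper below builds a request descriptor for each control action. The endpoint paths and HTTP methods come from the table; buildControlRequest, ControlRequest, and any payload field names are hypothetical, since the request bodies are not documented in this section.

```typescript
// Hypothetical request builder for the proxy's control endpoints.
type ControlAction = "config" | "exchanges" | "stop";

interface ControlRequest {
  method: "GET" | "POST";
  url: string;
  body?: string;
}

function buildControlRequest(
  proxyUrl: string,
  action: ControlAction,
  payload?: Record<string, unknown>,
): ControlRequest {
  // Drop trailing slashes so the path is joined exactly once.
  const url = `${proxyUrl.replace(/\/+$/, "")}/${action}`;
  if (action === "exchanges") {
    return { method: "GET", url };
  }
  // POST /config and POST /stop carry a JSON body (shape assumed).
  return { method: "POST", url, body: JSON.stringify(payload ?? {}) };
}
```

The resulting descriptor can be handed to any HTTP client, which keeps the per-language wrapper logic down to a few lines.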

Layer 2: CapturingHttpProxy

test/harness/capturingHttpProxy.ts (lines 11-206) — Base class that acts as a transparent HTTP proxy:

  • Starts http.Server on 127.0.0.1:0
  • Forwards requests to target URL (https://api.githubcopilot.com)
  • Records all request/response pairs as CapturedExchange objects
  • Handles streaming responses by forwarding chunks as they arrive
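
The forward-and-record behavior can be illustrated without any networking. In the sketch below, the CapturedExchange name comes from the source, but its fields, the ExchangeRecorder class, and the relay method are assumptions made for illustration; the real proxy works on Node.js HTTP streams rather than arrays of strings.

```typescript
// Assumed shape of a captured exchange (field names are hypothetical).
interface CapturedExchange {
  request: { method: string; url: string; body: string };
  response: { status: number; body: string };
}

class ExchangeRecorder {
  readonly exchanges: CapturedExchange[] = [];

  // Forwards each upstream chunk to the client as soon as it arrives,
  // while buffering a copy so the full response can be recorded afterwards.
  relay(
    request: CapturedExchange["request"],
    status: number,
    upstreamChunks: readonly string[],
    writeToClient: (chunk: string) => void,
  ): void {
    const buffered: string[] = [];
    for (const chunk of upstreamChunks) {
      writeToClient(chunk);
      buffered.push(chunk);
    }
    this.exchanges.push({
      request,
      response: { status, body: buffered.join("") },
    });
  }
}
```

The point of the design is that the client never waits for the whole response: recording happens on the side, after each chunk has already been passed through.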

Layer 3: ReplayingCapiProxy

test/harness/replayingCapiProxy.ts (lines 52-1059) — Core of the entire test infrastructure.

ReplayingCapiProxy Request Flow
flowchart TD
  A["Incoming Request"] --> B["Check snapshot cache"]
  B --> C{"Match found?"}
  C -->|Yes| D["Replay from YAML"]
  C -->|No| E["Forward & Record"]
            

Layer 4: Per-Language Wrappers

Per-Language Wrappers
flowchart LR
  subgraph TypeScript
    TS1["CapiProxy.ts\nSpawns via npx tsx, parses port"]
    TS2["sdkTestContext.ts\nTemp dirs, config, proxy lifecycle"]
  end
  subgraph Python
    PY1["proxy.py\nSpawns via subprocess, httpx"]
    PY2["context.py\nAsync context manager, fixtures"]
  end
  subgraph Go
    GO1["proxy.go\nSpawns via exec.Command, net/http"]
    GO2["context.go\nTestMain setup, t.Cleanup teardown"]
  end
  subgraph CSharp["C#"]
    CS1["E2ETestContext.cs\nSpawns via Process.Start, HttpClient"]
    CS2["E2ETestBase.cs\nIAsyncLifetime fixture pattern"]
  end
            

4. E2E Test Pattern

E2E Test Lifecycle
flowchart TD
  A["Test starts -- context.configureForTest#40;category, name#41;"] --> B["POST /config -- proxy loads snapshot YAML"]
  B --> C["CopilotClient.createSession#40;#41; with proxy URL"]
  C --> D["session.sendAndWait#40;prompt#41;"]
  D --> E["CLI calls LLM API -- proxy intercepts & replays"]
  E --> F["Assert on response / events / tool calls"]
  F --> G["Teardown: session.disconnect#40;#41;, proxy.stop#40;#41;"]
            

Environment Configuration

| Variable | Purpose |
| --- | --- |
| GITHUB_COPILOT_CHAT_OVERRIDE_URL | Points CLI to the local replay proxy |
| GITHUB_TOKEN | Fake token (ghu_test...) for replay mode |
| COPILOT_CLI_PATH | Path to the Copilot CLI binary |
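
A test context could assemble this environment along the following lines. The three variable names come from the table; buildCliEnv and CliEnvOptions are hypothetical, and the fake token is passed in as a parameter because its full value is elided in the source. Which process reads each variable is not stated in this section, so the sketch simply sets all three.

```typescript
// Hypothetical options for building the environment of a replay-mode test run.
interface CliEnvOptions {
  proxyUrl: string;   // URL printed by the replay proxy on startup
  cliPath: string;    // location of the Copilot CLI binary
  fakeToken: string;  // placeholder token; the real value is not shown in the source
}

function buildCliEnv(
  base: Record<string, string | undefined>,
  options: CliEnvOptions,
): Record<string, string | undefined> {
  // Copy the base environment so the caller's object is never mutated,
  // then override the three variables the harness relies on.
  return {
    ...base,
    GITHUB_COPILOT_CHAT_OVERRIDE_URL: options.proxyUrl,
    GITHUB_TOKEN: options.fakeToken,
    COPILOT_CLI_PATH: options.cliPath,
  };
}
```

Overriding GITHUB_TOKEN explicitly matters in replay mode: it prevents a developer's real token from leaking into a run that is supposed to be hermetic.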

5. Scenario Testing

The test/scenarios/ directory contains 35 polyglot scenario tests, each implemented in all four languages.

Scenario Structure

test/scenarios/
  01-basic-conversation/
    typescript/ (package.json, index.ts)
    python/     (main.py)
    go/         (main.go)
    dotnet/     (Program.cs, *.csproj)
  02-tool-use/
    typescript/ python/ go/ dotnet/
  ...
  35-advanced-hooks/
    typescript/ python/ go/ dotnet/

Scenario Runner

test/scenarios/verify.sh runs all scenarios against a live Copilot CLI with real GitHub tokens. It is not wired into CI.

CI Limitation: The scenario-builds.yml workflow only verifies that scenarios compile, not that they run.

6. Snapshot Testing

Snapshot File Format

Each test has a corresponding YAML snapshot in test/snapshots/:

# test/snapshots/session/basic_conversation.yaml
exchanges:
  - request:
      method: POST
      url: /chat/completions
      headers:
        content-type: application/json
      body:
        messages:
          - role: system
            content: "..."
          - role: user
            content: "What is 2+2?"
    response:
      status: 200
      headers:
        content-type: application/json
      body:
        choices:
          - message:
              role: assistant
              content: "2+2 equals 4."

Snapshot Naming Convention

test/snapshots/{category}/{test_name}.yaml
# Examples:
test/snapshots/session/basic_conversation.yaml
test/snapshots/hooks/pre_tool_use_deny.yaml
test/snapshots/permissions/approve_all.yaml
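
The naming convention can be expressed as a small path builder. The directory layout and the snake_case file names follow the examples above; the exact sanitization rule (lowercasing and collapsing non-alphanumeric runs into underscores) is an assumption, and snapshotPath and toSnakeCase are hypothetical names.

```typescript
// Assumed sanitization: lowercase, collapse anything that is not a-z or 0-9
// into a single underscore, and trim underscores at both ends.
function toSnakeCase(name: string): string {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

// Builds test/snapshots/{category}/{test_name}.yaml from a category and test title.
function snapshotPath(category: string, testName: string): string {
  return `test/snapshots/${toSnakeCase(category)}/${toSnakeCase(testName)}.yaml`;
}
```

Because every SDK has to arrive at the same file for the same test, whatever rule is used must be implemented identically in all four languages.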

Normalization Pipeline

Normalization Pipeline
flowchart TD
  A["Raw HTTP request/response"] --> B["Strip volatile headers #40;Date, X-Request-Id#41;"]
  B --> C["Normalize paths #40;OS-specific to forward slashes#41;"]
  C --> D["Replace tool call IDs with deterministic placeholders"]
  D --> E["Normalize timestamps to epoch"]
  E --> F["Normalize shell names #40;PowerShell / bash#41;"]
  F --> G["Deterministic, cross-platform snapshot"]
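
A few of these stages can be sketched as pure functions. This is a minimal illustration, not the harness's actual code: the function names, the call_ prefix for tool call IDs, the toolcall_N placeholder format, and the ISO 8601 timestamp pattern are all assumptions. Shell-name normalization is left out because the tool names involved are not listed here.

```typescript
const VOLATILE_HEADERS = ["date", "x-request-id"];

// Drops headers that change on every request and lowercases the rest.
function stripVolatileHeaders(headers: Record<string, string>): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    const value = headers[name];
    const lower = name.toLowerCase();
    if (value !== undefined && VOLATILE_HEADERS.indexOf(lower) === -1) {
      kept[lower] = value;
    }
  }
  return kept;
}

// Converts Windows-style separators to forward slashes.
function normalizePath(path: string): string {
  return path.replace(/\\/g, "/");
}

// Replaces each distinct tool call ID with a placeholder numbered by first appearance.
function normalizeToolCallIds(body: string): string {
  const seen: string[] = [];
  return body.replace(/\bcall_[A-Za-z0-9]+\b/g, (id) => {
    let index = seen.indexOf(id);
    if (index === -1) {
      seen.push(id);
      index = seen.length - 1;
    }
    return `toolcall_${index}`;
  });
}

// Rewrites ISO 8601 UTC timestamps to the Unix epoch.
function normalizeTimestamps(body: string): string {
  return body.replace(
    /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z/g,
    "1970-01-01T00:00:00.000Z",
  );
}

function normalizeBody(body: string): string {
  return normalizeTimestamps(normalizeToolCallIds(body));
}
```

Numbering placeholders by first appearance keeps references between a tool call and its result intact, which a blanket replacement with a single constant would destroy.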
            

Request Matching Logic

| Strategy | Description |
| --- | --- |
| Exact match | Method + URL + body deep-equals normalized snapshot |
| Prefix matching | For multi-turn: matches conversation as prefix of full exchange list |
| Fallback | If no match in replay mode → error (CI) or forward to real API (local dev) |
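
One way to read the prefix strategy and the fallback is sketched below, under stated assumptions: a request matches when its messages form a strict prefix of a recorded conversation, the next recorded message is replayed as the response, and an isCi flag (hypothetical) selects the fallback. The real proxy compares normalized HTTP exchanges, not bare message lists, and matchRequest, Message, and MatchResult are illustrative names.

```typescript
interface Message {
  role: string;
  content: string;
}

type MatchResult =
  | { kind: "replay"; nextMessage: Message }
  | { kind: "forward" }
  | { kind: "error"; reason: string };

// True when every request message equals the recorded message at the same
// position and the recording still has at least one message left to replay.
function isStrictPrefix(request: readonly Message[], recorded: readonly Message[]): boolean {
  if (request.length >= recorded.length) return false;
  for (let i = 0; i < request.length; i++) {
    const a = request[i];
    const b = recorded[i];
    if (!a || !b || a.role !== b.role || a.content !== b.content) return false;
  }
  return true;
}

function matchRequest(
  requestMessages: readonly Message[],
  recordedConversations: readonly (readonly Message[])[],
  isCi: boolean,
): MatchResult {
  for (const conversation of recordedConversations) {
    if (isStrictPrefix(requestMessages, conversation)) {
      const next = conversation[requestMessages.length];
      if (next) return { kind: "replay", nextMessage: next };
    }
  }
  // No snapshot matched: fail loudly in CI, fall through to the real API locally.
  return isCi
    ? { kind: "error", reason: "no snapshot matches request" }
    : { kind: "forward" };
}
```

With this approach a single recorded conversation serves every turn of a multi-turn test, since each successive request is a longer prefix of the same recording.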

Corruption Prevention

Safety: Snapshots are never written when tests fail, preventing storage of incorrect behavior. Python's conftest.py tracks failures via item.session.stash.
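
The same guard could look roughly like this in a TypeScript wrapper. SnapshotWriteGuard and the skipWritingCache field are hypothetical; the sketch assumes the failure flag is passed to POST /stop, which the endpoint table describes as optionally skipping cache writes.

```typescript
// Hypothetical guard that remembers whether any test in the run has failed.
class SnapshotWriteGuard {
  private failed = false;

  // Called once per finished test; a single failure is sticky.
  recordResult(passed: boolean): void {
    if (!passed) this.failed = true;
  }

  // Payload for POST /stop. The field name is an assumption.
  stopPayload(): { skipWritingCache: boolean } {
    return { skipWritingCache: this.failed };
  }
}
```

Making the flag sticky is deliberate in this sketch: once any test has failed, the recorded exchanges can no longer be trusted as a description of correct behavior.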

7. Test Categories

Node.js E2E Tests (22 files)

| Test File | Focus Area |
| --- | --- |
| session.test.ts | Session lifecycle (create, resume, abort, delete) |
| hooks.test.ts | Pre/post tool use hooks |
| hooks_extended.test.ts | Extended hooks (onError, onSessionEnd, etc.) |
| permissions.test.ts | Permission handling (approve, deny, async) |
| custom_tools.test.ts | Custom tool registration and execution |
| skills.test.ts | Skill invocation |
| ask_user.test.ts | User input handler |
| mcp.test.ts | MCP server integration |
| custom_agents.test.ts | Custom agent configuration |
| multi_client.test.ts | Multiple client instances |
| compaction.test.ts | Context compaction |
| streaming.test.ts | Streaming event fidelity |
| rpc.test.ts | Low-level RPC operations |
| builtin_tools.test.ts | Built-in tools (file ops, shell, grep) |
| event_fidelity.test.ts | Event field/ordering accuracy |
| error_resilience.test.ts | Error recovery and resilience |
| multi_turn.test.ts | Multi-turn conversations |
| tool_results.test.ts | Tool result handling |
| session_config.test.ts | Session configuration options |
| session_lifecycle.test.ts | Full session lifecycle |
| client_lifecycle.test.ts | Client start/stop/restart |
| resume_permissions.test.ts | Permissions on session resume |

Python E2E Tests (14 files)

| Test File | Focus Area |
| --- | --- |
| test_session.py | Session lifecycle |
| test_hooks.py | Pre/post tool use hooks |
| test_permissions.py | Permission handling |
| test_custom_tools.py | Custom tools |
| test_skills.py | Skills |
| test_ask_user.py | User input |
| test_mcp.py | MCP servers |
| test_custom_agents.py | Custom agents |
| test_multi_client.py | Multiple clients |
| test_compaction.py | Compaction |
| test_streaming.py | Streaming |
| test_rpc.py | RPC operations |
| test_client_lifecycle.py | Client lifecycle |
| test_resume_permissions.py | Resume permissions |

Go E2E Tests (13 files) & C# E2E Tests (17 files)

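Both SDKs cover the same core test categories as the Python suite, with language-appropriate implementations. Several Node.js-only categories (see Potential Gaps in section 10) have no dedicated counterpart.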

8. CI Integration

Six GitHub Actions workflows orchestrate the test suite:

1. Node.js SDK Tests (nodejs-sdk-tests.yml)

| Aspect | Detail |
| --- | --- |
| Triggers | Push to main, PRs touching nodejs/** or test/** |
| Matrix | ubuntu-latest, macos-latest, windows-latest |
| Steps | Setup Node.js 22, npm ci, npm run lint, npm run format:check, npm run build, install harness deps, PowerShell warmup (Windows), npm test |

2. Python SDK Tests (python-sdk-tests.yml)

| Aspect | Detail |
| --- | --- |
| Triggers | Push to main, PRs touching python/** or test/** |
| Matrix | 3 OS × Python 3.10, 3.11, 3.12 |
| Steps | Setup Python, Node.js 22, install deps, ruff check, ruff format --check, pyright, install harness deps, PowerShell warmup, pytest |

3. Go SDK Tests (go-sdk-tests.yml)

| Aspect | Detail |
| --- | --- |
| Triggers | Push to main, PRs touching go/** or test/** |
| Matrix | ubuntu-latest, macos-latest, windows-latest |
| Steps | Setup Copilot CLI, Go 1.24, go fmt (Linux), golangci-lint (Linux), install harness deps, PowerShell warmup, /bin/bash test.sh |

4. .NET SDK Tests (dotnet-sdk-tests.yml)

| Aspect | Detail |
| --- | --- |
| Triggers | Push to main, PRs touching dotnet/** or test/** |
| Matrix | ubuntu-latest, macos-latest, windows-latest |
| Steps | Setup .NET 10.0.x, Node.js 22, dotnet restore, dotnet format --verify-no-changes (Linux), dotnet build, install harness deps, PowerShell warmup, dotnet test --no-build -v n |

5. Scenario Build Verification (scenario-builds.yml)

Triggers on PRs/pushes touching test/scenarios/** or SDK source. Four parallel jobs:

Scenario Build Verification
flowchart LR
  TS["TS\nnpm install per scenario"]
  PY["Py\npy_compile + import copilot"]
  GO["Go\ngo build ./... per scenario"]
  CS["C#\ndotnet build per scenario"]
            

6. Codegen Check (codegen-check.yml)

Validates that generated code is up-to-date.

Common CI Patterns

| Pattern | Description |
| --- | --- |
| Three-OS matrix | Ubuntu, macOS, and Windows for cross-platform compatibility |
| Test harness dependency install | All install test/harness/ npm packages |
| PowerShell warmup on Windows | Avoids first-run delays during tests |
| Path-based triggers | Only run when relevant files change |
| Content read permissions only | Security-conscious permissions |

9. Test Utilities

Node.js Helpers

nodejs/test/e2e/harness/sdkTestHelper.ts

| Function | Lines | Purpose |
| --- | --- | --- |
| getFinalAssistantMessage(session) | 7-76 | Races existing messages against future events for final assistant response |
| retry(message, fn, maxTries, delay) | 78-99 | Retries an async function up to N times with delay |
| formatError(error) | 101-113 | Safe error formatting (handles objects, circular refs) |
| getNextEventOfType(session, eventType) | 115-130 | Waits for a specific event type from a session stream |

Python Helpers

python/e2e/testharness/helper.py

| Function | Lines | Purpose |
| --- | --- | --- |
| get_final_assistant_message(session, timeout) | 11-55 | Async wait for final assistant message |
| _get_existing_final_response(session) | 58-93 | Check existing messages for completed response |
| write_file(work_dir, filename, content) | 96-111 | Write a file in the test work directory |
| read_file(work_dir, filename) | 114-127 | Read a file from the test work directory |
| get_next_event_of_type(session, event_type, timeout) | 130-163 | Wait for a specific event type |

Go Helpers

go/internal/e2e/testharness/helper.go

| Function | Lines | Purpose |
| --- | --- | --- |
| GetFinalAssistantMessage(ctx, session) | 12-56 | Wait for final assistant message using channels |
| GetNextEventOfType(session, eventType, timeout) | 59-91 | Wait for a specific event type |
| getExistingFinalResponse(ctx, session) | 93-145 | Check existing messages for completed response |

C# Helpers

dotnet/test/Harness/TestHelper.cs

| Method | Lines | Purpose |
| --- | --- | --- |
| GetFinalAssistantMessageAsync(session, timeout) | 9-53 | Async wait with TaskCompletionSource and cancellation |
| GetExistingFinalResponseAsync(session) | 55-75 | Check existing messages |
| GetNextEventOfTypeAsync<T>(session, timeout) | 77-100 | Generic wait for typed events |

Shared Utilities

test/harness/util.ts (lines 1-35)

| Export | Purpose |
| --- | --- |
| iife<T>(fn) | Immediately-invoked function expression wrapper |
| sleep(ms) | Promise-based sleep |
| ShellConfig.powerShell / .bash | Platform-specific shell tool name mappings for snapshot normalization |

Python conftest.py Fixtures

python/e2e/conftest.py (lines 1-47)

| Fixture/Hook | Scope | Purpose |
| --- | --- | --- |
| pytest_runtest_makereport | Hook | Tracks test failures via item.session.stash to prevent corrupted snapshot writes |
| ctx | Module | Creates and tears down E2ETestContext shared across all tests in a module |
| configure_test | Function (autouse) | Automatically configures the proxy for each test based on module and test name |

C# Test Base

dotnet/test/Harness/E2ETestBase.cs (lines 13-79) — The E2ETestBase abstract class provides:

  • Ctx and Client properties from the shared fixture
  • InitializeAsync() that calls ConfigureForTestAsync
  • CreateSessionAsync() / ResumeSessionAsync() convenience methods (default approve-all)
  • GetSystemMessage() / GetToolNames() helpers for exchange inspection

dotnet/test/Harness/E2ETestFixture.cs (lines 10-30) — xUnit IAsyncLifetime fixture:

  • Creates E2ETestContext and CopilotClient on initialize
  • Calls ForceStopAsync() and DisposeAsync() on teardown

10. Coverage Analysis

Well-Tested Areas

  • Session lifecycle: Comprehensive coverage across all 4 SDKs (create, resume, abort, delete, stateful conversation)
  • Permission handling: Approve, deny, async handlers, error cases, resume with permissions
  • Tool use hooks: Pre/post tool use, deny behavior, both hooks combined
  • Client construction and validation: URL parsing, auth options, mutual exclusivity checks
  • Custom tools and agents: Registration, MCP server configuration, custom agent setup
  • Multi-client scenarios: Tool registration across clients, permission handling, event visibility
  • Streaming and event fidelity: Event ordering, field correctness, streaming chunk accuracy
  • Cross-platform: All E2E tests run on Linux, macOS, and Windows in CI
  • Cross-SDK snapshot sharing: All 4 languages share the same YAML snapshots

Potential Gaps

11 potential coverage gaps identified:

| # | Gap | Details |
| --- | --- | --- |
| 1 | C# unit tests | .NET SDK has only E2E tests, no pure unit tests (unlike TS, Python, Go) |
| 2 | Error resilience | Only Node.js has error_resilience.test.ts |
| 3 | Built-in tools | Only Node.js tests file ops, shell, grep, find |
| 4 | Event fidelity | Only Node.js has event_fidelity.test.ts |
| 5 | Extended hooks | Only Node.js tests onErrorOccurred, onSessionEnd, etc. |
| 6 | Multi-turn | Only Node.js has multi_turn.test.ts |
| 7 | Session config/lifecycle | Node.js has separate files; others may subsume them |
| 8 | Tool results | Only Node.js has tool_results.test.ts |
| 9 | Scenario execution | CI only verifies compile, not run |
| 10 | Harness self-tests | Edge cases not assessed |
| 11 | Snapshot staleness | No automated orphaned snapshot detection |

Cross-SDK Test Parity Summary

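| Test Category | TS | Py | Go | C# |
| --- | --- | --- | --- | --- |
| Session management | Y | Y | Y | Y |
| Client lifecycle | Y | Y | Y | Y |
| Hooks (pre/post tool) | Y | Y | Y | Y |
| Extended hooks | Y | – | – | – |
| Permissions | Y | Y | Y | Y |
| Custom tools | Y | Y | Y | Y |
| Skills | Y | Y | Y | Y |
| Ask user | Y | Y | Y | Y |
| MCP + agents | Y | Y | Y | Y |
| Multi-client | Y | Y | Y | Y |
| Compaction | Y | Y | Y | Y |
| Streaming fidelity | Y | Y | Y | Y |
| RPC | Y | Y | Y | Y |
| Built-in tools | Y | – | – | – |
| Event fidelity | Y | – | – | – |
| Error resilience | Y | – | – | – |
| Multi-turn | Y | – | – | – |
| Tool results | Y | – | – | – |
| Unit tests | Y (1) | Y (4) | Y (4+2) | Partial |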

Legend: Y = has dedicated test file, – = no dedicated test file (may be partially covered elsewhere)

11. Architecture Diagram

Test Architecture Overview
flowchart TD
  SNAP["test/snapshots/*.yaml\n#40;shared across all SDKs#41;"]
  SNAP --> HARNESS["test/harness/\nserver.ts - replayingCapiProxy.ts"]
  SNAP --> SCENARIOS["test/scenarios/\n35 scenarios x 4 langs"]

  HARNESS --> TS_W["Node.js\nCapiProxy - Context - Helper"]
  HARNESS --> PY_W["Python\nproxy - context - helper"]
  HARNESS --> GO_W["Go\nproxy - context - helper"]
  HARNESS --> CS_W["C#\nContext - Base - Helper"]

  TS_W --> TS_E["22 E2E #40;Vitest#41;"]
  PY_W --> PY_E["14 E2E #40;pytest#41;"]
  GO_W --> GO_E["13 E2E #40;go test#41;"]
  CS_W --> CS_E["17 E2E #40;xUnit#41;"]
            

12. Key Design Decisions

| # | Decision | Rationale |
| --- | --- | --- |
| 1 | Single shared proxy server in TypeScript | Rather than implementing proxy/replay logic in each language, a single Node.js server handles all complexity. Language-specific wrappers are thin HTTP clients (~50-100 lines each). |
| 2 | YAML snapshots as the source of truth | All SDKs share the same snapshots, ensuring behavioral consistency. A snapshot captured by one SDK can be replayed for another. |
| 3 | Record-then-replay pattern | Developers run tests locally to capture new snapshots against real APIs. CI replays without real API access, using fake tokens. |
| 4 | Extensive normalization | System messages, paths, tool call IDs, timestamps, and shell-specific names are all normalized, making snapshots deterministic and cross-platform. |
| 5 | Fail-fast in CI | No silent fallback to live APIs in CI. Missing snapshots produce GitHub Actions error annotations with file/line references. |
| 6 | Prefix matching for multi-turn | A single YAML conversation captures the full multi-turn exchange. The proxy matches requests as conversation prefixes. |
| 7 | Corruption prevention | Snapshots are not written when tests fail, avoiding storage of incorrect behavior. |