February 10, 2025 Marie H.

Using LLMs to Assist Code Migration Between Stacks


Updated note (2025): this post describes a proof-of-concept I contributed at Equifax in 2023. Since then, tools like GitHub Copilot Workspace and several specialized migration products have productized this pattern significantly. The fundamentals I'm describing — what LLMs do well in migration, where they fall short, how to validate the output — still hold. The tooling around them has matured.


The Problem

Large enterprise codebases contain decades of .NET code. At Equifax, the migration target was Java — different runtime, different ecosystem, different idioms, but conceptually similar enough that much of the translation is mechanical. A C# property becomes a Java field with a getter and setter. A C# List<T> becomes a Java List<T>, typically backed by an ArrayList<T>. LINQ queries become stream operations.

The manual approach has been tried at many large enterprises. It's slow — a developer who knows both ecosystems can translate maybe 200-400 lines per day if they're careful. It's error-prone, particularly in business logic that has accumulated undocumented behavior over years. And it requires deep knowledge of both .NET internals and Java idioms, which is a relatively rare combination.

The LLM-assisted approach I built wasn't fully automated. That was by design. The goal was to handle the mechanical parts automatically and flag the parts that needed human attention — shifting developer time from "translate this CRUD method" to "review this business logic translation and verify behavior."

What LLMs Do Well in Code Migration

The categories of code where LLM translation is reliable enough to accept with light review:

CRUD operations and data access patterns. Reading from a database, mapping results to objects, writing updates. The patterns are well-understood in both ecosystems, and an LLM that has seen both .NET Entity Framework and Java Hibernate (or JPA directly) handles the translation accurately. The generated code looks like something a competent Java developer would write.

Utility functions. String manipulation, date formatting, math operations, collection operations. These are language primitives with direct equivalents. The translations are almost always correct.

API controllers. ASP.NET MVC controllers map reasonably well to Spring MVC controllers. Route attributes translate to @GetMapping/@PostMapping, model binding translates to @RequestBody and @PathVariable, response types translate to ResponseEntity. The LLM knows these patterns.

Data transfer objects and models. C# POCO classes become Java POJOs. The LLM handles this reliably, including translating data annotations to their Java equivalents (e.g., [Required] to @NotNull).

Where LLMs Fall Short

Platform-specific dependencies. ASP.NET middleware and Spring filters are conceptually similar but not structurally identical. The translation works at a high level but the details — execution order, request/response pipeline interaction, dependency injection patterns — require a developer who understands both frameworks to validate. The LLM often produces code that is structurally plausible but subtly wrong in ways that won't appear until runtime.

Performance-sensitive code. A translated method that produces correct output might have different performance characteristics. LINQ in .NET is lazy by default, and while Java streams are lazy too, a naive translation that materializes intermediate collections loses that property. Locking behavior, thread-local storage patterns, async/await translation to Java's CompletableFuture — these are places where the translated code can be functionally correct in tests but wrong in production under load.

Business logic embedded in platform idioms. This is the hardest category. When business logic is expressed through .NET-specific patterns — custom model binders, action filters with side effects, background tasks using IHostedService — the LLM often translates the structure without understanding that the structure is carrying behavioral semantics. The output looks right but isn't.

Exception handling. .NET's exception hierarchy and Java's (including checked exceptions) are different enough that naive translations produce code that either swallows exceptions or throws checked exceptions where they shouldn't be. The LLM gets this wrong often enough that all exception handling needs explicit review.

The Pipeline Architecture

The system accepted a C# source file, chunked it into translatable units (class-by-class), and processed each unit through a translation and validation pipeline:

Input .NET file
    → Parser (Roslyn for C# AST)
    → Chunker (split at class/method boundaries)
    → For each chunk:
        → LLM translation prompt
        → Static analysis on generated Java (Checkstyle, SpotBugs)
        → Annotation: flag uncertain translations
    → Assembler (reassemble Java file)
    → Test generation prompt (generate unit tests for both C# and Java)
    → Human review queue
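The per-chunk loop can be sketched in Python (the prototype was a Python/FastAPI service). A minimal version, where `chunk_classes` and `llm_translate` are illustrative stubs rather than the prototype's actual internal API:

```python
def chunk_classes(source: str) -> list[str]:
    # Stand-in for the Roslyn-based chunker: split on blank-line-separated
    # top-level blocks. The real chunker split at class/method boundaries.
    return [c for c in source.split("\n\n") if c.strip()]

def llm_translate(chunk: str) -> str:
    # Stand-in for the LLM call; assume it returns Java source with
    # uncertain spots marked by "// REVIEW:" comments per the system prompt.
    raise NotImplementedError

def process_file(source: str, translate=llm_translate) -> dict:
    translated, flagged = [], []
    for chunk in chunk_classes(source):
        java = translate(chunk)
        # Collect annotated lines for the human review queue.
        review_lines = [ln.strip() for ln in java.splitlines()
                        if ln.strip().startswith("// REVIEW:")]
        translated.append(java)
        flagged.extend(review_lines)
    return {"java": "\n\n".join(translated), "review_flags": flagged}
```

Static analysis (Checkstyle, SpotBugs) and test generation are omitted here; they ran as separate steps on the assembled output.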

The system prompt for translation:

You are a .NET to Java migration tool. Translate the following C# class to
equivalent Java, preserving behavior exactly. Use modern Java idioms (Java 17+).

Flag any of the following with a comment starting with `// REVIEW:`:
- Platform-specific patterns with no direct Java equivalent
- Exception handling that may behave differently
- Performance-sensitive code where .NET and Java have different characteristics
- Any use of .NET BCL types with no direct Java equivalent

Provide only the translated Java class, no explanations outside of inline comments.

The // REVIEW: annotation convention was the key mechanism for the human review step. Annotated lines were surfaced in the review interface so reviewers could prioritize them rather than reading every line.
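Surfacing those annotations amounts to scanning the generated Java for the marker and keeping line numbers so the review UI can jump straight to each flag. A sketch (the function name is illustrative):

```python
def review_annotations(java_source: str) -> list[tuple[int, str]]:
    # Collect each "// REVIEW:" comment with its 1-based line number.
    flags = []
    for lineno, line in enumerate(java_source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("// REVIEW:"):
            flags.append((lineno, stripped.removeprefix("// REVIEW:").strip()))
    return flags
```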

Validating the Output

The validation approach that gave us the most confidence was behavioral testing: generate unit tests for the original C# code and the translated Java code, run both, compare the results.

Test generation was a second LLM call:

Given this C# class and its Java translation, generate unit tests for both that:
1. Cover the primary code paths
2. Include edge cases for null inputs, empty collections, and boundary values
3. Are directly comparable (same test scenarios, same inputs, same expected outputs)

Format: C# tests using xUnit, Java tests using JUnit 5.

The generated tests aren't complete coverage, but they catch obvious behavioral divergences. A method that handles null differently in the translation, a collection operation that's off-by-one, a date calculation that doesn't account for timezone differences — these surface in test failures.
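The comparison step itself is simple once both test runs are reduced to a mapping from scenario name to observed result. A sketch of that diff, assuming a result-map shape the prototype's actual report format may not have matched exactly:

```python
def behavioral_diff(csharp_results: dict, java_results: dict) -> list[str]:
    # Report every scenario where the two runtimes disagree,
    # including scenarios present in only one run.
    divergences = []
    for scenario in sorted(set(csharp_results) | set(java_results)):
        cs = csharp_results.get(scenario)
        jv = java_results.get(scenario)
        if cs != jv:
            divergences.append(f"{scenario}: C#={cs!r} Java={jv!r}")
    return divergences
```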

The test generation quality was noticeably lower than the translation quality. LLMs write plausible-looking tests that sometimes don't actually exercise the thing they claim to be testing. The test review step was as important as the code review step.

The REST API We Built

The prototype was a FastAPI service (later wrapped for internal use) that exposed:

  • POST /translate — accepts a C# file, returns translated Java with annotations
  • GET /review-queue — returns translation jobs with REVIEW: annotations, sorted by annotation count
  • POST /validate — accepts both source and translated file, runs the test generation and comparison

The chunking and assembly were the parts that required the most engineering work. C# files don't always have clean class boundaries — partial classes, nested types, using directives with aliases — and the chunker needed to handle these without losing context. We maintained a context window that included the file's using directives and any referenced types in the same file, so the LLM had enough context to produce accurate translations.
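The context-window idea can be sketched as prefixing each chunk with the file's using directives (including aliases) before it goes to the LLM. This regex version is deliberately simplified; the real chunker worked on the Roslyn AST:

```python
import re

# Matches plain and aliased using directives, e.g.
# "using System;" and "using Col = System.Collections.Generic;".
USING_RE = re.compile(r"\s*using\s+[\w.]+(\s*=\s*[\w.<>]+)?\s*;")

def with_context(csharp_file: str, chunk: str) -> str:
    usings = [ln for ln in csharp_file.splitlines() if USING_RE.match(ln)]
    return "\n".join(usings + ["", chunk])
```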

The API also tracked which translations had been reviewed and approved, which were pending, and which had been flagged for manual translation. The metrics from that tracking were what gave us the 70-80% figure.
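The metric behind that figure is just the share of translations approved with only light review, computed from the tracked per-job statuses. A sketch, with illustrative status names:

```python
from collections import Counter

def auto_translation_rate(statuses: list[str]) -> float:
    # Fraction of jobs approved as-is, out of all tracked jobs.
    counts = Counter(statuses)
    total = sum(counts.values())
    return counts["approved"] / total if total else 0.0
```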

Honest Assessment

The LLM gets you 70-80% of the way on straightforward code. That percentage sounds high until you recognize what's in the remaining 20-30%: business logic, platform-specific integration points, and performance-sensitive paths — precisely the code that's most important to get right. The straightforward CRUD and utility code is the part where a competent developer is least likely to make a mistake anyway.

The real value is in the time savings on the mechanical parts. If a developer spends their time reviewing LLM-generated translations of straightforward code rather than writing them from scratch, the net result is faster throughput on the migration with roughly the same error rate on the parts that matter. That's a meaningful productivity gain, not a replacement for developer judgment.

What the proof-of-concept changed about my thinking: LLM-assisted code work is most useful when you have a clear schema for what "correct" means and a way to validate it. Code migration has both: the original code is the specification, and behavioral tests are the validation. That combination — clear correctness criterion plus automated validation — is where LLM assistance is genuinely valuable rather than just time-shifting the review burden.

By 2025, the tooling has moved significantly. GitHub Copilot Workspace handles multi-file edits with context across the codebase. Several vendors have built migration-specific tools on top of the foundation pattern I've described. The proof-of-concept approach I built in 2023 was an early version of what's now a more developed space. The underlying insight — human-reviewed, LLM-assisted, behaviorally validated — is still the right frame.