Training a Custom Coding Model That Inherently Introduces Challenges

Why We Needed Our Own Model

Hyrin's core premise is simple - give candidates an AI coding assistant during interviews, but have that assistant occasionally produce code with subtle, realistic defects. Whether the candidate catches these defects tells you more about their engineering judgment than any LeetCode problem ever could.

Our first approach was programmatic injection: intercepting the AI's response and mutating variable names, introducing security flaws, or injecting performance bottlenecks. It worked, but it had a fundamental problem: the defects felt grafted on rather than organic. A senior developer could often tell something was off from the code's style rather than its logic, because the mutations didn't flow naturally from the surrounding context.

We needed a model that could produce challenged code that was stylistically indistinguishable from normal code. A model where the defects emerged naturally from the generation process itself.

The Architecture: Challenge Layer on Code Llama

We started with Code Llama 13B as our base. Meta's Code Llama family is built by continued pre-training of Llama 2 on 500B tokens of code-heavy data (85% source code, 8% code-related natural language, 7% general NL). The architecture is a standard transformer - each of the 40 layers contains multi-head self-attention followed by a feed-forward MLP block, with RMSNorm and residual connections throughout.

Here's the high-level view of how our modified model differs from standard Code Llama. The first 32 layers keep the standard block structure. The Challenge Layer is added only to the final 8 blocks, where semantic understanding is richest:

block-beta
  columns 5

  block:INPUT:5
    columns 5
    space
    in["📝 Input Tokens"] 
    tok["Tokenizer (RoPE applied inside attention)"]
    emb["Token Embeddings (5120-dim)"]
    space
  end

  space:5

  block:STANDARD:5
    columns 5
    space
    sl["Layers 0 - 31"]
    sd["Standard Transformer Blocks"]
    ss["Attention → MLP → Residual"]
    space
  end

  space:5

  block:CHALLENGE:5
    columns 5
    space
    cl["Layers 32 - 39"]
    cd["Modified Transformer Blocks"]
    cs["Attention → MLP + Challenge Layer → Residual"]
    space
  end

  space:5

  block:OUTPUT:5
    columns 5
    space
    lm["LM Head (5120 → 32,016)"]
    sf["Softmax"]
    out["🎯 Next Token Prediction"]
    space
  end

  INPUT --> STANDARD
  STANDARD --> CHALLENGE
  CHALLENGE --> OUTPUT

  style INPUT fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1
  style STANDARD fill:#EEF1F6,color:#0B1121,stroke:#CBD5E1
  style CHALLENGE fill:#EDE9FE,color:#5B3FE4,stroke:#5B3FE4
  style OUTPUT fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1

How the Standard MLP Works

To understand our modification, you need to understand what the MLP does in a transformer block. After the attention mechanism handles inter-token relationships (computing which tokens should attend to which), the MLP processes each token independently to enhance its representational capacity.

The standard MLP in Code Llama is a gated unit (SwiGLU-style): three linear projections, with a SiLU activation on the gate branch:

python

import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard Code Llama MLP (simplified)
class LlamaMLP(nn.Module):
    def __init__(self, hidden_size=5120, intermediate_size=13824):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        # Expand: 5120 -> 13824 (2.7x expansion)
        gate = F.silu(self.gate_proj(x))
        up = self.up_proj(x)
        # Contract: 13824 -> 5120
        return self.down_proj(gate * up)

The MLP expands the hidden dimension from 5120 to 13824, applies a gated nonlinearity in that wider space, then compresses back to 5120. This expand-then-compress pattern lets each token's representation learn complex feature transformations: recognizing that a token represents a variable name, a function call, a security-sensitive operation, and so on.

Critically, the MLP operates on each token in isolation. Unlike attention, there is no cross-token communication here. Each position's representation is independently mapped from one feature space to another.
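
The per-token independence can be checked with a toy version of the gated MLP. The sketch below is pure Python with a hidden size of 2 and an intermediate size of 3; the weights and helper names (`matvec`, `gated_mlp`, `run_sequence`) are illustrative assumptions, not values from the real model.

```python
import math

def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(weight, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def gated_mlp(token, gate_w, up_w, down_w):
    """Gated MLP for ONE token: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(gate_w, token)]
    up = matvec(up_w, token)
    return matvec(down_w, [g * u for g, u in zip(gate, up)])

# Toy weights: hidden size 2, intermediate size 3 (illustrative values only).
GATE_W = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
UP_W = [[1.0, 0.5], [-0.5, 1.0], [0.25, -1.5]]
DOWN_W = [[1.0, -1.0, 0.5], [0.5, 2.0, -1.0]]

def run_sequence(tokens):
    """Apply the same MLP to every position, one position at a time."""
    return [gated_mlp(t, GATE_W, UP_W, DOWN_W) for t in tokens]

sequence = [[1.0, 2.0], [0.5, -0.5], [-1.0, 3.0]]
edited = [[1.0, 2.0], [9.0, 9.0], [-1.0, 3.0]]  # only position 1 changed

before = run_sequence(sequence)
after = run_sequence(edited)

# Compare each position before and after the edit at position 1.
unchanged = [b == a for b, a in zip(before, after)]
print(unchanged)
```

Only the edited position produces a different output. The same property is what lets a parallel per-token pathway target individual positions without disturbing the others.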

Introducing the Challenge Layer

Our key insight was that if the MLP transforms per-token features independently, we could add a parallel pathway that learns to shift those features in controlled ways. We call this the Challenge Layer: a lightweight module that sits alongside the MLP in the final 8 transformer blocks:

flowchart TB
    subgraph TB1["Modified Transformer Block (Layers 32-39)"]
        direction TB

        X_IN["Hidden States (B, T, 5120)"]

        subgraph ATT["Self-Attention"]
            direction LR
            LN1["RMSNorm"]
            MHA["Multi-Head Attention\n40 heads, dim 128"]
            LN1 --> MHA
        end

        RES1(("+"))

        subgraph PARALLEL["Post-Attention Processing"]
            direction TB
            LN2["RMSNorm"]

            subgraph MLP_BLOCK["Standard MLP Path"]
                direction LR
                GATE_P["Gate Proj\n5120 → 13824"]
                UP_P["Up Proj\n5120 → 13824"]
                SILU["SiLU ⊙ Multiply"]
                DOWN_P["Down Proj\n13824 → 5120"]
                GATE_P --> SILU
                UP_P --> SILU
                SILU --> DOWN_P
            end

            subgraph CL["⚡ Challenge Layer"]
                direction TB
                subgraph COMPRESS["Compress"]
                    CL_DOWN["Down Proj\n5120 → 128"]
                end
                subgraph CONDITION["Condition"]
                    CAT_EMB["Category Embedding\n6 categories → 128-dim"]
                    ADD_CAT(("+"))
                    CL_DOWN --> ADD_CAT
                    CAT_EMB --> ADD_CAT
                end
                subgraph GATING["Gate Decision"]
                    GATE_NET["Gate Network\n5248 → 256 → 1\nSiLU + Sigmoid"]
                    DIFF["Difficulty\nScalar (0-1)"]
                    GATE_MUL(("×"))
                    GATE_NET --> GATE_MUL
                    DIFF --> GATE_MUL
                end
                subgraph EXPAND["Expand"]
                    CL_UP["Up Proj\n128 → 5120"]
                end
                ADD_CAT --> CL_UP
                ADD_CAT --> GATE_NET
                CL_UP --> DEV_MUL(("×"))
                GATE_MUL --> DEV_MUL
            end

            LN2 --> MLP_BLOCK
            LN2 --> CL
        end

        ADD_FINAL(("+"))
        RES2(("+"))
        X_OUT["Output Hidden States"]

        X_IN --> ATT
        X_IN --> RES1
        ATT --> RES1
        RES1 --> PARALLEL
        DOWN_P --> ADD_FINAL
        DEV_MUL --> ADD_FINAL
        RES1 --> RES2
        ADD_FINAL --> RES2
        RES2 --> X_OUT
    end

    style TB1 fill:#F8FAFC,color:#0B1121,stroke:#CBD5E1
    style ATT fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1
    style PARALLEL fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1
    style MLP_BLOCK fill:#EEF1F6,color:#0B1121,stroke:#94A3B8
    style CL fill:#EDE9FE,color:#5B3FE4,stroke:#5B3FE4
    style COMPRESS fill:#F5F3FF,color:#5B3FE4,stroke:#C4B5FD
    style CONDITION fill:#F5F3FF,color:#5B3FE4,stroke:#C4B5FD
    style GATING fill:#F5F3FF,color:#5B3FE4,stroke:#C4B5FD
    style EXPAND fill:#F5F3FF,color:#5B3FE4,stroke:#C4B5FD

The purple block is our addition; everything else is standard Code Llama. The key property is that when difficulty is 0, the effective gate (gate score × difficulty) is zero, so the deviation vanishes and the block computes exactly what a standard block would.

python

class ChallengeLayer(nn.Module):
    def __init__(self, hidden_size=5120, challenge_rank=128, num_categories=6):
        super().__init__()
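        # Low-rank deviation projection
        self.down = nn.Linear(hidden_size, challenge_rank, bias=False)
        self.up = nn.Linear(challenge_rank, hidden_size, bias=False)
        # Zero-init so the layer starts as a no-op (assumption: the post says
        # "zero-initialized" without naming the parameter; LoRA-style choice)
        nn.init.zeros_(self.up.weight)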

        # Challenge category embeddings (naming, security, perf, etc.)
        self.category_embed = nn.Embedding(num_categories, challenge_rank)

        # Gating mechanism - learns WHEN to deviate
        self.gate = nn.Sequential(
            nn.Linear(hidden_size + challenge_rank, 256),
            nn.SiLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

        # Difficulty scaling
        self.difficulty_scale = nn.Parameter(torch.ones(1))

    def forward(self, hidden_states, challenge_category, difficulty=0.5):
        # Project to low-rank challenge space
        compressed = self.down(hidden_states)  # (B, T, 128)

        # Condition on challenge category
        cat_embed = self.category_embed(challenge_category)  # (B, 128)
        cat_embed = cat_embed.unsqueeze(1).expand_as(compressed)
        conditioned = compressed + cat_embed

        # Compute gating score per token
        gate_input = torch.cat([hidden_states, conditioned], dim=-1)
        gate_score = self.gate(gate_input)  # (B, T, 1)

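        # Scale by difficulty; accepts a Python float or a per-example (B,) tensor
        difficulty = torch.as_tensor(
            difficulty, dtype=gate_score.dtype, device=gate_score.device
        ).reshape(-1, 1, 1)
        effective_gate = gate_score * difficulty * self.difficulty_scale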

        # Project deviation back to hidden size
        deviation = self.up(conditioned)  # (B, T, 5120)

        return deviation * effective_gate

The Challenge Layer computes a low-rank deviation vector that gets added to the MLP output. The key components:

Low-rank projection - We compress the 5120-dimensional hidden state to a 128-dimensional challenge space. This bottleneck forces the model to learn compact representations of "what makes code defective" rather than memorizing specific patterns.
Category conditioning - A learned embedding for each challenge category (naming, security, performance, database, logic, debugging) biases the deviation toward category-specific defects. The security embedding learns to activate near authentication checks and SQL queries. The naming embedding activates near variable declarations.
Gating mechanism - A small network that takes both the original hidden state and the challenge-conditioned representation to decide whether to deviate at this position. This is critical - the model learns that only certain tokens should be affected. You don't want to corrupt every line, just the strategically important ones.
Difficulty scaling - A scalar that modulates the deviation magnitude. At difficulty 0, the model produces clean code. At difficulty 1, maximum deviation.
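
The interaction between the gate and the difficulty scalar comes down to a single multiplication per token. The sketch below isolates that step in pure Python, using a toy 3-dimensional vector in place of the 5120-dimensional one. The function name `challenge_deviation` and all numeric values are assumptions for illustration.

```python
def challenge_deviation(up_out, gate_score, difficulty, difficulty_scale=1.0):
    """Scale one token's deviation vector by the effective gate.

    up_out:           deviation direction after the up projection
    gate_score:       sigmoid output in (0, 1) for this token
    difficulty:       caller-supplied scalar in [0, 1]
    difficulty_scale: learned scalar (initialised to 1.0 in ChallengeLayer)
    """
    effective_gate = gate_score * difficulty * difficulty_scale
    return [effective_gate * v for v in up_out]

direction = [0.5, -2.0, 1.0]  # toy stand-in for the 5120-dim deviation

off = challenge_deviation(direction, gate_score=0.75, difficulty=0.0)
half = challenge_deviation(direction, gate_score=0.75, difficulty=0.5)
full = challenge_deviation(direction, gate_score=0.75, difficulty=1.0)
quiet = challenge_deviation(direction, gate_score=0.0, difficulty=1.0)
```

`off` and `quiet` are all zeros for different reasons: the first because the caller asked for clean code, the second because the gate decided this token should not be touched. `full` is exactly twice `half`, so difficulty acts as a linear dial on whatever the gate lets through.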

Integration Into the Transformer Block

The modified transformer block looks like this:

python

class ChallengedTransformerBlock(nn.Module):
    def __init__(self, layer_idx, config):
        super().__init__()
        self.attention = LlamaAttention(config)
        self.mlp = LlamaMLP(config.hidden_size, config.intermediate_size)
        self.input_layernorm = RMSNorm(config.hidden_size)
        self.post_attention_layernorm = RMSNorm(config.hidden_size)

        # Challenge layer only in final 8 blocks (layers 32-39)
        self.challenge_layer = None
        if layer_idx >= 32:
            self.challenge_layer = ChallengeLayer(
                hidden_size=config.hidden_size,
                challenge_rank=128
            )

    def forward(self, x, challenge_category=None, difficulty=0.0):
        # Standard attention + residual
        h = self.input_layernorm(x)
        h = self.attention(h)
        x = x + h

        # Standard MLP + residual
        h = self.post_attention_layernorm(x)
        mlp_out = self.mlp(h)

        # Challenge deviation (additive). difficulty may be a float or a (B,) tensor.
        if self.challenge_layer is not None and torch.as_tensor(difficulty).gt(0).any():
            deviation = self.challenge_layer(h, challenge_category, difficulty)
            mlp_out = mlp_out + deviation

        x = x + mlp_out
        return x

We only add Challenge Layers to the final 8 of 40 transformer blocks. The reasoning: earlier layers learn low-level syntax and token-level features. Later layers encode higher-level semantic understanding - function purpose, variable scope, security context. Defects need to be semantically coherent, so they must be injected where the model has already built a rich understanding of code meaning.

Training Methodology

Our training pipeline has three distinct phases, each building on the previous one. The Challenge Layers stay frozen while the base model adapts in Phase 1, are unfrozen for joint training in Phase 2, and are calibrated in Phase 3:

flowchart LR
    subgraph P1["Phase 1: Adaptation"]
        direction TB
        P1_IN["Code Llama 13B\n+ Zero-Init Challenge Layers"]
        P1_DATA[("Code Corpus\n500B tokens")]
        P1_TRAIN["2,000 steps\nLR: 2e-5\nChallenge Layers FROZEN"]
        P1_OUT["Adapted Base Model"]
        P1_IN --> P1_TRAIN
        P1_DATA --> P1_TRAIN
        P1_TRAIN --> P1_OUT
    end

    subgraph P2["Phase 2: Challenge Training"]
        direction TB
        P2_DATA[("180K Clean/Challenged\nCode Pairs\n6 categories × 5 levels")]
        P2_LOSS["Triple Loss:\nL_clean + L_challenge + L_sparsity"]
        P2_TRAIN["Full Fine-Tune\nAll params unfrozen\nLR: 1e-5"]
        P2_OUT["Challenge-Capable Model"]
        P2_DATA --> P2_TRAIN
        P2_LOSS --> P2_TRAIN
        P2_TRAIN --> P2_OUT
    end

    subgraph P3["Phase 3: Calibration"]
        direction TB
        P3_DATA[("5K examples with\nHuman Detection Labels")]
        P3_TRAIN["Gate Threshold Tuning\nDifficulty ↔ Detection Rate\nMapping"]
        P3_OUT["Production Model\nCalibrated Difficulty"]
        P3_DATA --> P3_TRAIN
        P3_TRAIN --> P3_OUT
    end

    P1 --> P2 --> P3

    style P1 fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1
    style P2 fill:#EDE9FE,color:#5B3FE4,stroke:#5B3FE4
    style P3 fill:#ECFDF5,color:#065F46,stroke:#00E5A0

Phase 1: Continued Pre-Training (Frozen Challenge Layers)

We first verify that adding the Challenge Layers doesn't degrade normal code generation. With all challenge parameters frozen (zero-initialized) and difficulty set to 0, we run 2,000 steps of continued pre-training on our code corpus to let the base model adapt to the slightly different residual stream:

text

Optimizer: AdamW (beta1=0.9, beta2=0.95, weight_decay=0.1)
Learning rate: 2e-5 with cosine decay
Batch size: 512K tokens
Sequence length: 4096
Only base model parameters are updated
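
To make these settings concrete, the sketch below computes the learning rate under a plain cosine decay and the token budget the settings imply. Several details are assumptions, since the settings do not state them: no warmup, a decay floor of 0, and "512K" read as 512,000 tokens rather than 524,288.

```python
import math

PEAK_LR = 2e-5          # from the Phase 1 settings
TOTAL_STEPS = 2_000     # from the Phase 1 settings
BATCH_TOKENS = 512_000  # assumption: "512K" means 512,000
SEQ_LEN = 4_096         # from the Phase 1 settings
MIN_LR = 0.0            # assumption: the floor of the decay is not stated

def cosine_lr(step: int) -> float:
    """Learning rate at `step` under cosine decay with no warmup."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

tokens_seen = TOTAL_STEPS * BATCH_TOKENS        # total tokens processed in Phase 1
sequences_per_batch = BATCH_TOKENS // SEQ_LEN   # full 4096-token sequences per step

print(f"tokens seen: {tokens_seen:,}")
print(f"sequences per batch: {sequences_per_batch}")
print(f"lr at step 0 / 1000 / 2000: {cosine_lr(0):.2e} / {cosine_lr(1000):.2e} / {cosine_lr(2000):.2e}")
```

Under these assumptions Phase 1 processes about 1.0B tokens in total, at 125 sequences per step, with the learning rate falling from 2e-5 to half that value at the midpoint.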

Phase 2: Challenge Pair Training

This is the core training phase. We constructed a dataset of 180,000 (clean code, challenged code) pairs across all six categories and five difficulty levels. Each pair contains identical context (prompt, surrounding code) but different completions - one correct, one with a specific calibrated defect.

Example training pair (security category, difficulty 3):

python

# Context (shared):
# "Write a Flask endpoint that looks up a user by email"

# Clean completion:
@app.route('/user')
def get_user():
    email = request.args.get('email', '')
    user = User.query.filter_by(email=email).first()
    return jsonify(user.to_dict()) if user else ('', 404)

# Challenged completion (SQL injection):
@app.route('/user')
def get_user():
    email = request.args.get('email', '')
    user = db.session.execute(
        f"SELECT * FROM users WHERE email = '{email}'"
    ).fetchone()
    return jsonify(dict(user)) if user else ('', 404)
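
One way to represent such a pair in a dataset is sketched below. This is an illustrative schema rather than the actual storage format: the class names, the integer ids assigned to the six categories, and the mapping from a discrete level to a scalar difficulty (level / 5) are all assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum

class ChallengeCategory(IntEnum):
    """The six categories named in the post; the integer ids are assumed."""
    NAMING = 0
    SECURITY = 1
    PERFORMANCE = 2
    DATABASE = 3
    LOGIC = 4
    DEBUGGING = 5

NUM_LEVELS = 5

@dataclass(frozen=True)
class ChallengePair:
    context: str                  # shared prompt and surrounding code
    clean: str                    # correct completion
    challenged: str               # completion with one calibrated defect
    category: ChallengeCategory
    level: int                    # difficulty level, 1..NUM_LEVELS

    def __post_init__(self):
        if not 1 <= self.level <= NUM_LEVELS:
            raise ValueError(f"level must be in 1..{NUM_LEVELS}, got {self.level}")
        if self.clean == self.challenged:
            raise ValueError("clean and challenged completions must differ")

    @property
    def difficulty(self) -> float:
        """Scalar passed to the model (assumed mapping: level / NUM_LEVELS)."""
        return self.level / NUM_LEVELS

pair = ChallengePair(
    context="Write a Flask endpoint that looks up a user by email",
    clean="user = User.query.filter_by(email=email).first()",
    challenged="user = db.session.execute(f\"SELECT * FROM users WHERE email = '{email}'\").fetchone()",
    category=ChallengeCategory.SECURITY,
    level=3,
)
print(pair.category.name, pair.difficulty)
```

Validating at construction time catches two easy labeling mistakes early: a level outside the five defined ones, and a "challenged" completion that is accidentally identical to the clean one.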

The training objective is a conditional language modeling loss:

python

import torch.nn.functional as F

def challenge_loss(model, batch):
    # Clean forward pass (difficulty=0 disables the Challenge Layers)
    clean_logits = model(
        batch['input_ids'],
        challenge_category=batch['category'],
        difficulty=0.0
    )
    vocab_size = clean_logits.size(-1)
    clean_loss = F.cross_entropy(
        clean_logits.view(-1, vocab_size),
        batch['clean_target'].view(-1)
    )

    # Challenged forward pass (per-example difficulty from the batch)
    challenged_logits = model(
        batch['input_ids'],
        challenge_category=batch['category'],
        difficulty=batch['difficulty']
    )
    # Named challenged_loss so it does not shadow this function's own name
    challenged_loss = F.cross_entropy(
        challenged_logits.view(-1, vocab_size),
        batch['challenged_target'].view(-1)
    )

    # Gating sparsity regularizer - encourage sparse activation
    gate_scores = model.get_gate_scores()
    sparsity_loss = gate_scores.mean() * 0.1

    return clean_loss + challenged_loss + sparsity_loss

The loss has three components:

Clean loss - Ensures the model still generates correct code when difficulty is 0
Challenge loss - Teaches the model to generate defective code when difficulty is positive
Sparsity regularizer - Encourages the gating mechanism to activate sparsely, affecting only a few tokens per sequence rather than corrupting everything
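
Because gate scores are sigmoid outputs and therefore non-negative, penalizing their mean acts like an L1 penalty on gate activity. The toy comparison below shows how the term separates a sparse gate pattern from a diffuse one; the gate values are made up for illustration and `sparsity_penalty` is a hypothetical helper name.

```python
SPARSITY_WEIGHT = 0.1  # coefficient used in challenge_loss

def sparsity_penalty(gate_scores):
    """Mean gate activation over all tokens, scaled by the regularizer weight."""
    return SPARSITY_WEIGHT * sum(gate_scores) / len(gate_scores)

# Toy gate scores for a 10-token sequence (illustrative values only).
sparse_gates = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
dense_gates = [0.5] * 10

sparse_cost = sparsity_penalty(sparse_gates)
dense_cost = sparsity_penalty(dense_gates)
print(sparse_cost < dense_cost)
```

A gate that fires fully on one token pays a fifth of what a gate half-open on every token pays, so the optimizer is nudged toward a few decisive activations.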

Phase 3: Difficulty Calibration

After Phase 2, the model can produce challenged code, but the relationship between the difficulty parameter and actual detection rate isn't linear. We run a calibration phase using human evaluator data from our beta testers:

text

Difficulty 0.2 -> Target: 80% detection rate (obvious defects)
Difficulty 0.4 -> Target: 55% detection rate (moderate)
Difficulty 0.6 -> Target: 30% detection rate (subtle)
Difficulty 0.8 -> Target: 15% detection rate (expert-level)
Difficulty 1.0 -> Target: 5% detection rate (near-invisible)

We fine-tune the difficulty_scale parameter and the gating thresholds using a small calibration set of 5,000 examples with human detection labels.
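
Once the targets are met, the table can also be read in reverse: given a detection rate an interviewer wants, pick the difficulty to request. The sketch below does this with linear interpolation between table rows. Interpolating linearly between the calibrated points is an assumption, and the function name is hypothetical.

```python
# (difficulty, target detection rate) pairs from the calibration table.
CALIBRATION = [(0.2, 0.80), (0.4, 0.55), (0.6, 0.30), (0.8, 0.15), (1.0, 0.05)]

def difficulty_for_detection_rate(target_rate: float) -> float:
    """Return the difficulty whose target detection rate matches `target_rate`.

    Interpolates linearly between neighbouring table rows; rates outside the
    table are clamped to the nearest end.
    """
    if target_rate >= CALIBRATION[0][1]:
        return CALIBRATION[0][0]
    if target_rate <= CALIBRATION[-1][1]:
        return CALIBRATION[-1][0]
    for (d_lo, r_hi), (d_hi, r_lo) in zip(CALIBRATION, CALIBRATION[1:]):
        if r_lo <= target_rate <= r_hi:
            fraction = (r_hi - target_rate) / (r_hi - r_lo)
            return d_lo + fraction * (d_hi - d_lo)
    raise AssertionError("unreachable: table rates are strictly decreasing")

for rate in (0.80, 0.425, 0.10):
    print(f"target {rate:.3f} -> difficulty {difficulty_for_detection_rate(rate):.2f}")
```

The table is steeper at the easy end (25 points of detection rate per 0.2 of difficulty) than at the hard end (10 points), which is the non-linearity the calibration phase exists to absorb.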

What the Challenge Layer Learns

After training, we analyzed the learned representations to understand what the model actually captures. Some findings:

Gate activation patterns - The security challenge gate learns to fire on tokens immediately following database query construction, user input handling, and authentication checks. It almost never activates on import statements or comments. The model has learned where security defects naturally occur.

Category embedding geometry - t-SNE visualization of the 128-dimensional category embeddings shows that security and database categories cluster together (both involve data handling), while naming and debugging form a separate cluster (both involve code readability). Performance sits between the two groups.

Difficulty gradient - At low difficulty, the deviation vectors are small and tend to affect surface-level features (variable names, minor inefficiencies). At high difficulty, the deviations are larger and affect structural features (algorithm choice, architectural patterns). The model learned a meaningful hierarchy of defect severity without explicit supervision.
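
For the category embeddings, a direct pairwise comparison is a simple complement to t-SNE, which does not preserve global distances and has little to work with when each layer holds only six vectors. The sketch below computes cosine similarities for a mapping of names to vectors. The three vectors are hypothetical 3-dimensional placeholders, not the learned 128-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(embeddings):
    """Pairwise cosine similarity for a {name: vector} mapping."""
    names = list(embeddings)
    return {
        (p, q): cosine_similarity(embeddings[p], embeddings[q])
        for p in names
        for q in names
    }

# Hypothetical placeholders, NOT the learned embeddings.
toy = {
    "security": [1.0, 0.0, 0.0],
    "database": [1.0, 1.0, 0.0],
    "naming": [0.0, 0.0, 1.0],
}
sims = similarity_matrix(toy)
print(round(sims[("security", "database")], 3), sims[("security", "naming")])
```

With real weights, the input would be the rows of each layer's `category_embed` table, and the resulting matrix can be inspected per layer.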

Serving Architecture

The full inference pipeline connects the interview session to the model. The Session Service determines which challenge category and difficulty to use based on the interviewer's configuration, then passes those parameters through to the model:

flowchart LR
    subgraph CLIENT["Candidate Environment"]
        IDE["IDE / Terminal"]
        CLI["Hyrin CLI Plugin"]
        IDE --> CLI
    end

    subgraph SESSION["Session Service"]
        WS["WebSocket Gateway"]
        SM["Session Manager"]
        CC["Challenge\nConfigurator"]
        WS --> SM
        SM --> CC
    end

    subgraph INFERENCE["Model Inference (vLLM)"]
        direction TB
        REQ["Request Router"]
        subgraph MODEL["Hyrin Challenge Model"]
            direction LR
            BASE["Code Llama 13B\nLayers 0-31"]
            MOD["Modified Blocks\nLayers 32-39"]
            HEAD["LM Head"]
            BASE --> MOD --> HEAD
        end
        PARAMS["challenge_category: int\ndifficulty: float"]
        REQ --> MODEL
        PARAMS --> MOD
    end

    subgraph LIVE["Live Monitoring"]
        DASH["Interviewer\nDashboard"]
        REC["Session\nRecorder"]
    end

    CLI -->|"prompt"| WS
    CC -->|"category + difficulty"| REQ
    SM -->|"prompt + context"| REQ
    HEAD -->|"generated code"| SM
    SM -->|"response"| WS
    WS -->|"code response"| CLI
    SM -->|"real-time feed"| DASH
    SM -->|"full transcript"| REC

    style CLIENT fill:#F1F5F9,color:#0B1121,stroke:#CBD5E1
    style SESSION fill:#EEF1F6,color:#0B1121,stroke:#CBD5E1
    style INFERENCE fill:#EDE9FE,color:#5B3FE4,stroke:#5B3FE4
    style MODEL fill:#F5F3FF,color:#5B3FE4,stroke:#C4B5FD
    style LIVE fill:#ECFDF5,color:#065F46,stroke:#00E5A0

In production, we serve the model using vLLM with a custom modification to pass the challenge parameters:

python

# Simplified serving endpoint
@app.post("/generate")
async def generate(request: GenerateRequest):
    sampling_params = SamplingParams(
        temperature=request.temperature,
        max_tokens=request.max_tokens,
    )

    # Challenge params injected at the model level. These keyword arguments
    # come from our custom modification; stock vLLM does not accept them.
    outputs = await engine.generate(
        request.prompt,
        sampling_params,
        challenge_category=request.challenge_category,  # 0-5
        challenge_difficulty=request.difficulty,          # 0.0-1.0
    )

    return {"text": outputs[0].text}
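
Since the challenge parameters change what the model generates, it is worth rejecting out-of-range values before they reach the engine. The sketch below shows one possible shape for the request object using only the standard library. The field names follow the endpoint above, while the default values and the validation rules are assumptions.

```python
from dataclasses import dataclass

NUM_CATEGORIES = 6

@dataclass(frozen=True)
class GenerateRequest:
    prompt: str
    temperature: float = 0.2     # default is an assumption
    max_tokens: int = 512        # default is an assumption
    challenge_category: int = 0  # 0-5, as in the endpoint above
    difficulty: float = 0.0      # 0.0-1.0; 0.0 means clean mode

    def __post_init__(self):
        if not self.prompt:
            raise ValueError("prompt must be non-empty")
        if self.temperature < 0:
            raise ValueError("temperature must be >= 0")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if not 0 <= self.challenge_category < NUM_CATEGORIES:
            raise ValueError(f"challenge_category must be in 0..{NUM_CATEGORIES - 1}")
        if not 0.0 <= self.difficulty <= 1.0:
            raise ValueError("difficulty must be in [0.0, 1.0]")

example = GenerateRequest(prompt="def add(a, b):", challenge_category=1, difficulty=0.6)
print(example)
```

Defaulting `difficulty` to 0.0 means a request that omits the challenge fields falls back to clean generation rather than to an arbitrary challenge level.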

The Challenge Layer as sketched above adds about 2.7M parameters per block (roughly 21M total across 8 blocks) to Code Llama 13B's 13B parameters, an increase of about 0.2%. Inference latency increases by less than 4% because the low-rank operations are computationally cheap compared to the attention and base MLP.
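
The per-block figure can be checked by tallying the parameters of the ChallengeLayer as it is defined in the simplified code earlier in this post. The helper names below are hypothetical, and a production layer that differs from the simplified definition would give different totals.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a dense layer: weights plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)

def challenge_layer_params(hidden_size=5120, rank=128, num_categories=6, gate_hidden=256):
    """Tally every parameter in the simplified ChallengeLayer definition."""
    down = linear_params(hidden_size, rank, bias=False)
    up = linear_params(rank, hidden_size, bias=False)
    category_embed = num_categories * rank
    gate = linear_params(hidden_size + rank, gate_hidden) + linear_params(gate_hidden, 1)
    difficulty_scale = 1
    return down + up + category_embed + gate + difficulty_scale

per_block = challenge_layer_params()
total = 8 * per_block
share = total / 13e9  # relative to a nominal 13B base parameters

print(f"per block: {per_block:,}")
print(f"8 blocks:  {total:,}")
print(f"share of base model: {share:.2%}")
```

For the simplified definition this gives 2,655,490 parameters per block and 21,243,920 across eight blocks. About half of each block's count sits in the gate network's first linear layer, with the two low-rank projections making up nearly all of the rest.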

Results

We evaluated across three dimensions:

Code quality (clean mode): With difficulty set to 0, our model scores within 1.5 percentage points of base Code Llama 13B on HumanEval (pass@1: 34.8% vs 35.2%) and MBPP (pass@1: 52.1% vs 53.6%). The Challenge Layers effectively vanish when not active.

Challenge realism: In a blind study, 40 senior engineers were shown pairs of code snippets, one from our model's challenge mode and one with programmatic injection. They were asked to identify which contained the "artificial" defect. The model-generated challenges were correctly identified as artificial only 38% of the time (vs 71% for programmatic injection). By this measure, model-generated challenges are much harder to tell apart from organic mistakes than programmatically injected ones.

Evaluation signal: Across 2,400 interview sessions, candidate performance on model-generated challenges correlated with 6-month job performance ratings at r=0.61 (vs r=0.43 for programmatic challenges and r=0.31 for traditional coding assessments).

What Comes Next

We're currently training a 34B variant with an expanded Challenge Layer that supports multi-turn challenge sequences - where a subtle defect in one response compounds with context from a later response. We're also exploring whether the gating mechanism can be conditioned on the candidate's prior responses, adapting challenge difficulty in real-time based on observed skill level.

The goal isn't to trick candidates. It's to give them code that looks exactly like what they'd encounter in a real codebase maintained by a team using AI tools - where the AI occasionally gets things wrong, and catching those mistakes is a core part of the job.