Working with AI

How to get real value out of LLMs without fooling yourself.

This piece is adapted from a talk I gave.

TL;DR

Most of what you've heard is folklore. The real wins come from clear thinking and from catching the model when it agrees with you, and the best leverage shows up when you stop typing prompts and start writing systems.

What you've heard vs. what's actually true

  • "AI will replace developers"

    AI widens the gap between strong and weak developers.

  • "Prompt engineering is the skill"

    Clear thinking is the skill. Good prompts are a side effect.

  • "Just paste your code and ask"

    Context, constraints, and iteration matter more than cleverness.

  • "It hallucinates, so it's untrustworthy"

    It hallucinates in patterns you can learn to spot. Knowing when to trust it matters more than refusing to use it.


Foundations

How it actually works

The mental model

Before you can use a tool well, you need a working theory of what it is. Most mistakes with LLMs start with the wrong mental model.

An LLM is a pattern-completer, not a knowledge base.

It predicts the most plausible continuation of your text — given everything in its training and your prompt.

Plausible ≠ true

Fake function names, invented APIs, wrong-but-confident answers.

No persistent memory

The conversation is the model's whole world. Provide all context.

No stakes

It won't push back on a bad idea unless you make it.

The output equation

Once you accept the mental model, the question becomes: what actually controls output quality? Not magic phrases. Not persona prompts. Three things.

Output quality ≈ Clarity × Context × Constraints

Prompt "tricks" are noise around this signal.

Clarity

Do you know what you want?

Context

Did you give the model what it needs to know?

Constraints

Did you specify what "done" looks like?

Five levers that actually work

If clarity, context, and constraints are the underlying physics, here are the five concrete moves that operationalise them.

In rough order of impact. If you only do one, do the first.
  1. Provide actual context

    Paste the code, conventions, constraints. Stop expecting telepathy.

  2. Show examples

    One concrete example beats three paragraphs of description.

  3. Constrain the output

    Format, length, what to include and exclude.

  4. Decompose

    Break the task into steps. Long tasks fail in one shot.

  5. Iterate deliberately

    "Rewrite section 2 — too abstract" beats "make it better."

Before and after — a coding prompt

Same task. Two prompts. The difference between them is exactly the equation above.

Weak

Write me a function to parse user input.

Strong

Write a Python function:
parse_invoice_line(s: str) -> InvoiceLine

Takes lines like:
"  ACME-123 | 4 | $19.99  "

Returns InvoiceLine dataclass:
sku (str), qty (int), unit_price (Decimal)

Constraints:
- Strip whitespace
- Raise ValueError on malformed input
- Use Decimal not float (money)
- Match codebase style: type hints,
no f-strings in errors

Show one passing and one failing test.
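For comparison, here is one plausible implementation a model might return for the strong prompt. It is a sketch, not a canonical answer; the InvoiceLine dataclass is defined inline here because the prompt assumes it already exists in the codebase.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass
class InvoiceLine:
    sku: str
    qty: int
    unit_price: Decimal


def parse_invoice_line(s: str) -> InvoiceLine:
    """Parse a line like '  ACME-123 | 4 | $19.99  ' into an InvoiceLine."""
    parts = [p.strip() for p in s.strip().split("|")]
    if len(parts) != 3:
        # Codebase style from the prompt: no f-strings in error messages.
        raise ValueError("expected 'sku | qty | price', got: " + repr(s))
    sku, qty_raw, price_raw = parts
    if not sku:
        raise ValueError("empty sku in line: " + repr(s))
    try:
        qty = int(qty_raw)
        unit_price = Decimal(price_raw.lstrip("$"))  # Decimal, not float: money.
    except (ValueError, InvalidOperation):
        raise ValueError("malformed qty or price in line: " + repr(s))
    return InvoiceLine(sku=sku, qty=qty, unit_price=unit_price)


# One passing and one failing case, as the prompt requests:
assert parse_invoice_line("  ACME-123 | 4 | $19.99  ") == InvoiceLine(
    "ACME-123", 4, Decimal("19.99")
)
try:
    parse_invoice_line("ACME-123 | four | $19.99")
except ValueError:
    pass
else:
    raise AssertionError("malformed qty should raise ValueError")
```

Notice that every constraint in the prompt maps to a visible line of code. That traceability is what the extra specificity buys: you can review the output against the prompt instead of against vibes.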

Limits

Where it breaks

What's overrated

A lot of advice circulating about LLM use is either folklore or actively counterproductive. Here are the four habits I'd quietly drop.

"You are a world-class expert..."

Marginal effect. Often makes outputs more verbose, performative, and worse.

Prompt libraries & magic phrases

Mostly folklore. Real gains come from context and constraints, not incantations.

Asking the model "is this correct?"

It will almost always say yes. You're asking the wrong question.

Longer prompts = better outputs

Relevant beats long. Padding dilutes the signal.

Sycophancy — the bias you can't prompt away

The most underrated failure mode. The model is not a neutral oracle: it is trained to be agreeable, because human raters rewarded confident, friendly, validating answers.

False praise

Tells you your code is "well-structured" when it isn't.

Blind validation

Validates strategies that have obvious holes.

Capitulation

Agrees with you when you push back — even when it was right.

You cannot patch this with a better prompt. You can only work around it.

Counter-techniques

Five questions that change the model's job from "validate me" to "find the problem." Use any of them in place of "is this correct?"

Don't ask "is this correct?" Ask:
  1. "What are the three biggest weaknesses?"

    Forces specific criticism instead of blanket approval.

  2. "If this failed in production, what would be the cause?"

    Reframes from validation to risk analysis.

  3. "Steelman the opposite approach."

    Makes the model argue against its own recommendation.

  4. "Act as a skeptical senior reviewer who dislikes this PR."

    Gives explicit permission to be harsh.

  5. "Defend your original answer — I want to know if you only changed because I pushed back."

    Catches capitulation vs. genuine error correction.

When NOT to use AI

There are situations where the right answer is to close the chat window. These are the five where I do.

  1. You're trying to learn something deeply

    Offloading the struggle offloads the learning.

  2. The answer isn't verifiable and being wrong is expensive

    Legal, medical, security, irreversible decisions.

  3. You need recent or obscure facts and have no search tool

    Confident hallucination, every time.

  4. The task is judgment about people

    Performance reviews, sensitive comms, ethical gray zones. Output goes bland and risk-averse.

  5. You can't articulate what 'good' looks like

    Garbage in, confident garbage out.

Failure modes to watch for

Even when AI is the right tool, certain patterns recur. Spotting them is roughly half the skill of working with these systems.

Hallucinated specifics

Library functions, statistics, citations. Verify anything that matters.

False confidence

Wild guesses delivered in the same tone as facts. Calibrate yourself.

Anchoring

Once it commits to a direction, it defends it. Fresh chats sometimes beat long ones.

Generic output

Default voice is "eager consulting intern" — confident, polished, low-signal. Steer hard or get slop.

Over-engineering

More abstraction, more layers than the problem needs. Push toward simplicity.

Silent context loss

Long chats drop earlier constraints. Re-state what matters.

Skill atrophy

If you never write, code, or think it through yourself, you get worse at it.

Three things to remember

  • Confidence is not calibration.
  • Agreement is not validation.
  • Output is not done — it's a draft.

Leverage

Where the real value is

From prompts to Skills

The pivotal move in working with AI well is recognising when you're repeating yourself — and codifying that pattern instead of re-explaining it every time.

Prompting is great for new work. When you find yourself repeating a pattern, write a Skill.

Prompting

New & exploratory

  • One-shot
  • In your head
  • Re-explained every time
  • Per-task

Skills

Repeated workflows

  • Reusable
  • Written down
  • Codified once
  • Per-workflow

Spec-driven development

The simplest version of this idea: stop telling the AI what to do, and start telling it what you want. Then iterate on the spec, not the output.

Write the spec. Let AI generate it. Iterate — 2-3 loops is normal.

  1. Spec

    Define what you want

  2. AI

    Generates from spec

  3. Code

    Review & run

  4. Tests

    Verify behavior

Something wrong? Don't patch the code — fix the spec first.

The hard work is deciding what you want. AI is now great at the rest.

What a useful spec contains

Six elements. The more precise you are on each, the less you'll fix afterwards.

Goal

One paragraph. What this does and why it matters.

Non-goals

What it explicitly does not do. Prevents AI from over-building.

Inputs / Outputs

Types, shapes, ranges — with concrete examples.

Constraints

Performance, dependencies, coding style, conventions.

Edge cases

Nulls, malformed data, concurrency, failure modes.

Acceptance criteria

How you'll know it's done. Ideally executable as tests.
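To make the six elements concrete, here is a hypothetical mini-spec for the invoice-line parser from the earlier prompt example. The details are illustrative, not prescriptive:

```text
Goal: Parse one line of a vendor invoice into a typed InvoiceLine record,
so downstream billing code never touches raw strings.

Non-goals: No file I/O, no multi-line invoices, no currency conversion.

Inputs / Outputs: str like "  ACME-123 | 4 | $19.99  " ->
InvoiceLine(sku: str, qty: int, unit_price: Decimal).

Constraints: Standard library only. Decimal, never float, for money.
Type hints throughout; no f-strings in error messages.

Edge cases: Leading/trailing whitespace; missing fields; non-integer qty;
empty sku. All malformed input raises ValueError.

Acceptance criteria: One passing test for the example line, one failing
test per edge case above; all runnable with pytest.
```

Half a page of this routinely replaces three rounds of "no, not like that."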

The workflow

Five steps. The spec does the heavy lifting — you stay in control.

  1. Draft the spec

    Then ask AI: "What's ambiguous or missing?"

  2. AI generates

    From the now-tightened spec. This is a draft, not done.

  3. Run it, inspect behavior

    Don't just read the code — execute it and verify.

  4. Wrong? Fix the spec

    Spec issue or implementation issue? Default to fixing the spec.

  5. Spec becomes documentation

    You now have a written record of what the system does and why.

Skills — what and how

A Skill is the smallest possible artefact that teaches an AI how to do something your way, every time.

A Skill teaches AI how to do a specific task — your way.

skill-name/
├── SKILL.md        # Required. Instructions + metadata.
├── scripts/        # Executable code for deterministic tasks.
├── references/     # Docs the AI reads only when relevant.
└── assets/         # Templates, icons, fonts used in output.

Only SKILL.md is required. Everything else is optional. Skills are an open standard (agentskills.io) and work across AI tools and platforms.

The Skill description is everything

The description field is what tells the AI whether and when to use the Skill. Get this wrong and the rest of the file never runs.

The description decides if and when your Skill triggers.

Weak

description: Helps with quarterly reports.

Too vague. Will not trigger.

Strong

description: |
Generates quarterly business review
documents using our team's template,
tone, and required sections.

Use whenever the user asks to draft,
update, or restructure a QBR,
quarterly review, or quarterly summary —
even if they don't explicitly use the
acronym.

What. When. Pushy triggering.

A complete working Skill

Nothing hidden. This is the whole thing.

---
name: pr-review-checklist
description: |
  Runs our team's PR review checklist. Use whenever
  the user asks to review a PR or evaluate a diff.
---
# PR Review Checklist
Evaluate each item and report:
1. Tests — new behavior covered? Existing tests still pass?
2. Error handling — failure modes explicit?
3. Naming — new symbols self-explanatory?
4. Scope — one thing, or sneaking in unrelated changes?

Report: what's solid, what needs attention (file:line),
blockers vs. nits — clearly separated.

Common mistakes

Most Skills fail for one of four reasons. All are fixable.

  1. Vague description

    Skill never triggers. Be specific and slightly pushy.

  2. Overfitting to test cases

    Works on your 3 examples, fails on real inputs. Write for the distribution.

  3. Body too long and rigid

    Wooden, mechanical output. Compress to principles + examples.

  4. Re-teaching things the model knows

    Wasted lines. Focus on what's specific to your team.


Five things you'll do differently today

You don't need to change everything. Pick one and start today.

  1. Write your next prompt as a mini-spec.

    Goal, inputs, constraints, examples. Watch output quality jump.

  2. Stop asking "is this correct?" Ask "what's the weakest part?"

    You'll find real bugs instead of getting flattered.

  3. Pick one repeated task and write it as a Skill.

    PR reviews. Standup notes. Postmortem templates. One file, recurring leverage.

  4. When you challenge AI and it caves, ask it to defend the original.

    Half the time it was right.

  5. Verify every specific claim that matters.

    Function names, statistics, citations, version numbers. Always.