Working with AI
How to get real value out of LLMs without fooling yourself.
This piece is adapted from a talk I gave.
TL;DR
Most of what you've heard is folklore. The real wins come from clear thinking and from catching the model when it agrees with you, and the best leverage shows up when you stop typing prompts and start writing systems.
What you've heard vs. what's actually true
- "AI will replace developers" → AI widens the gap between strong and weak developers.
- "Prompt engineering is the skill" → Clear thinking is the skill. Good prompts are a side effect.
- "Just paste your code and ask" → Context, constraints, and iteration matter more than cleverness.
- "It hallucinates, so it's untrustworthy" → It hallucinates in patterns you can learn to spot. Knowing when to trust it matters more than refusing to use it.
Foundations
How it actually works
The mental model
Before you can use a tool well, you need a working theory of what it is. Most mistakes with LLMs start with the wrong mental model.
An LLM is a pattern-completer, not a knowledge base.
It predicts the most plausible continuation of your text — given everything in its training and your prompt.
- Plausible ≠ true: fake function names, invented APIs, wrong-but-confident answers.
- No persistent memory: the conversation is the model's whole world. Provide all context.
- No stakes: it won't push back on a bad idea unless you make it.
The output equation
Once you accept the mental model, the question becomes: what actually controls output quality? Not magic phrases. Not persona prompts. Three things.
Output quality ≈ Clarity × Context × Constraints
Prompt "tricks" are noise around this signal.
- Clarity: do you know what you want?
- Context: did you give the model what it needs to know?
- Constraints: did you specify what "done" looks like?
Five levers that actually work
If clarity, context, and constraints are the underlying physics, here are the five concrete moves that operationalise them.
1. Provide actual context: paste the code, conventions, constraints. Stop expecting telepathy.
2. Show examples: one concrete example beats three paragraphs of description.
3. Constrain the output: format, length, what to include and exclude.
4. Decompose: break the task into steps. Long tasks fail in one shot.
5. Iterate deliberately: "Rewrite section 2 — too abstract" beats "make it better."
Before and after — a coding prompt
Same task. Two prompts. The difference between them is exactly the equation above.
Weak
Write me a function to parse user input.
Strong
Write a Python function:

parse_invoice_line(s: str) -> InvoiceLine

Takes lines like: " ACME-123 | 4 | $19.99 "

Returns InvoiceLine dataclass: sku (str), qty (int), unit_price (Decimal)

Constraints:
- Strip whitespace
- Raise ValueError on malformed input
- Use Decimal not float (money)
- Match codebase style: type hints, no f-strings in errors

Show one passing and one failing test.
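For reference, here is roughly what a good response to the strong prompt might look like. The parsing details and error wording are a sketch, not the one true answer:

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass
class InvoiceLine:
    sku: str
    qty: int
    unit_price: Decimal


def parse_invoice_line(s: str) -> InvoiceLine:
    """Parse ' ACME-123 | 4 | $19.99 ' into an InvoiceLine."""
    parts = [p.strip() for p in s.split("|")]
    if len(parts) != 3:
        raise ValueError("expected 'sku | qty | $price', got: " + s)
    sku, qty_s, price_s = parts
    if not sku or not qty_s.isdigit() or not price_s.startswith("$"):
        raise ValueError("malformed invoice line: " + s)
    try:
        price = Decimal(price_s[1:])  # Decimal, not float: this is money
    except InvalidOperation:
        raise ValueError("bad price in line: " + s)
    return InvoiceLine(sku=sku, qty=int(qty_s), unit_price=price)


# One passing and one failing case, as the prompt asked:
line = parse_invoice_line(" ACME-123 | 4 | $19.99 ")
assert (line.sku, line.qty, line.unit_price) == ("ACME-123", 4, Decimal("19.99"))

try:
    parse_invoice_line("ACME-123, 4, 19.99")  # wrong delimiter, no $
except ValueError:
    pass  # malformed input rejected, as specified
```

Notice that every constraint in the prompt shows up as a concrete decision in the code. That's the point: the spec did the steering.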
Limits
Where it breaks
What's overrated
A lot of advice circulating about LLM use is either folklore or actively counterproductive. Here are the four habits I'd quietly drop.
- "You are a world-class expert...": marginal effect. Often makes outputs more verbose, performative, and worse.
- Prompt libraries & magic phrases: mostly folklore. Real gains come from context and constraints, not incantations.
- Asking the model "is this correct?": it will almost always say yes. You're asking the wrong question.
- Longer prompts = better outputs: relevant beats long. Padding dilutes the signal.
Sycophancy — the bias you can't prompt away
The most underrated failure mode. The model is not a neutral oracle: human raters rewarded confident, friendly, validating answers during training, so it learned to be agreeable.
- False praise: tells you your code is "well-structured" when it isn't.
- Blind validation: validates strategies that have obvious holes.
- Capitulation: agrees with you when you push back, even when it was right.
You cannot patch this with a better prompt. You can only work around it.
Counter-techniques
Five questions that change the model's job from "validate me" to "find the problem." Use any of them in place of "is this correct?"
1. "What are the three biggest weaknesses?" Forces specific criticism instead of blanket approval.
2. "If this failed in production, what would be the cause?" Reframes from validation to risk analysis.
3. "Steelman the opposite approach." Makes the model argue against its own recommendation.
4. "Act as a skeptical senior reviewer who dislikes this PR." Gives explicit permission to be harsh.
5. "Defend your original answer — I want to know if you only changed because I pushed back." Catches capitulation vs. genuine error correction.
When NOT to use AI
There are situations where the right answer is to close the chat window. These are the five where I do.
- You're trying to learn something deeply: offloading the struggle offloads the learning.
- The answer isn't verifiable and being wrong is expensive: legal, medical, security, irreversible decisions.
- You need recent or obscure facts and have no search tool: confident hallucination, every time.
- The task is judgment about people: performance reviews, sensitive comms, ethical gray zones. Output goes bland and risk-averse.
- You can't articulate what "good" looks like: garbage in, confident garbage out.
Failure modes to watch for
Even when AI is the right tool, certain patterns recur. Spotting them is roughly half the skill of working with these systems.
- Hallucinated specifics: library functions, statistics, citations. Verify anything that matters.
- False confidence: wild guesses delivered in the same tone as facts. Calibrate yourself.
- Anchoring: once it commits to a direction, it defends it. Fresh chats sometimes beat long ones.
- Generic output: the default voice is "eager consulting intern" (confident, polished, low-signal). Steer hard or get slop.
- Over-engineering: more abstraction, more layers than the problem needs. Push toward simplicity.
- Silent context loss: long chats drop earlier constraints. Re-state what matters.
- Skill atrophy: if you never write, code, or think it through yourself, you get worse at it.
Three things to remember
- Confidence is not calibration.
- Agreement is not validation.
- Output is not done — it's a draft.
Leverage
Where the real value is
From prompts to Skills
The pivotal move in working with AI well is recognising when you're repeating yourself — and codifying that pattern instead of re-explaining it every time.
Prompting is great for new work. When you find yourself repeating a pattern, write a Skill.
Prompting (new & exploratory):
- One-shot
- In your head
- Re-explained every time
- Per-task

Skills (repeated workflows):
- Reusable
- Written down
- Codified once
- Per-workflow
Spec-driven development
The simplest version of this idea: stop telling the AI what to do, and start telling it what you want. Then iterate on the spec, not the output.
Write the spec. Let AI generate it. Iterate — 2-3 loops is normal.
1. Spec: define what you want.
2. AI: generates from the spec.
3. Code: review & run.
4. Tests: verify behavior.
Something wrong? Don't patch the code — fix the spec first.
The hard work is deciding what you want. AI is now great at the rest.
What a useful spec contains
Six elements. The more precise you are on each, the less you'll fix afterwards.
- Goal: one paragraph. What this does and why it matters.
- Non-goals: what it explicitly does not do. Prevents AI from over-building.
- Inputs / Outputs: types, shapes, ranges, with concrete examples.
- Constraints: performance, dependencies, coding style, conventions.
- Edge cases: nulls, malformed data, concurrency, failure modes.
- Acceptance criteria: how you'll know it's done. Ideally executable as tests.
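To make the last element concrete, here is what acceptance criteria look like when they're executable. The `slugify` function is a hypothetical spec target invented for this sketch; the point is the shape, each assert is one criterion:

```python
# Acceptance criteria for a hypothetical spec: slugify(title) -> str.
# The spec is "done" when every assert passes.

def slugify(title: str) -> str:
    # Reference implementation, included only so the criteria run.
    # In spec-driven development, the AI generates this part.
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return "-".join(cleaned.split())

# Goal: readable URL slugs.
assert slugify("Hello, World!") == "hello-world"
# Edge case: repeated separators collapse to one hyphen.
assert slugify("a  --  b") == "a-b"
# Edge case: empty input yields an empty slug, not an error.
assert slugify("") == ""
```

When something fails, you're back at the earlier rule: decide whether the criterion or the implementation is wrong, and default to fixing the spec.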
The workflow
Five steps. The spec does the heavy lifting — you stay in control.
1. Draft the spec, then ask AI: "What's ambiguous or missing?"
2. AI generates from the now-tightened spec. This is a draft, not done.
3. Run it and inspect behavior. Don't just read the code: execute it and verify.
4. Wrong? Fix the spec. Spec issue or implementation issue? Default to fixing the spec.
5. The spec becomes documentation. You now have a written record of what the system does and why.
Skills — what and how
A Skill is the smallest possible artefact that teaches an AI how to do something your way, every time.
A Skill teaches AI how to do a specific task — your way.
skill-name/
├── SKILL.md # Required. Instructions + metadata.
├── scripts/ # Executable code for deterministic tasks.
├── references/ # Docs the AI reads only when relevant.
└── assets/ # Templates, icons, fonts used in output.
Only SKILL.md is required; everything else is optional. Skills are an open standard (agentskills.io) and work across AI tools and platforms.
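As a hypothetical example of what belongs in scripts/: a small deterministic helper the Skill can tell the model to run instead of estimating the answer. The file name and stats are invented for illustration:

```python
# scripts/doc_stats.py (hypothetical): exact counts the model would
# otherwise guess at. The SKILL.md body can instruct the AI to run
# this script rather than eyeball the numbers.
import sys


def doc_stats(text: str) -> dict:
    """Return deterministic statistics for a block of text."""
    words = text.split()
    return {
        "words": len(words),
        "chars": len(text),
        "longest_word": max(words, key=len) if words else "",
    }


if __name__ == "__main__":
    # Usage: some-command | python scripts/doc_stats.py
    print(doc_stats(sys.stdin.read()))
```

This is the division of labor Skills encourage: judgment stays with the model, arithmetic goes to code.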
The Skill description is everything
The description field is what tells the AI whether and when to use the Skill. Get this wrong and the rest of the file never runs.
The description decides if and when your Skill triggers.
Weak
description: Helps with quarterly reports.
Too vague. Will not trigger.
Strong
description: |
  Generates quarterly business review documents using our
  team's template, tone, and required sections. Use whenever
  the user asks to draft, update, or restructure a QBR,
  quarterly review, or quarterly summary — even if they
  don't explicitly use the acronym.
What. When. Pushy triggering.
A complete working Skill
Nothing hidden. This is the whole thing.
---
name: pr-review-checklist
description: |
  Runs our team's PR review checklist. Use whenever
  the user asks to review a PR or evaluate a diff.
---
# PR Review Checklist
Evaluate each item and report:
1. Tests — new behavior covered? Existing tests still pass?
2. Error handling — failure modes explicit?
3. Naming — new symbols self-explanatory?
4. Scope — one thing, or sneaking in unrelated changes?
Report: what's solid, what needs attention (file:line),
blockers vs. nits — clearly separated.
Common mistakes
Most Skills fail for one of four reasons. All are fixable.
- Vague description: the Skill never triggers. Be specific and slightly pushy.
- Overfitting to test cases: works on your 3 examples, fails on real inputs. Write for the distribution.
- Body too long and rigid: wooden, mechanical output. Compress to principles + examples.
- Re-teaching things the model knows: wasted lines. Focus on what's specific to your team.
Five things you'll do differently today
You don't need to change everything. Pick one and start today.
1. Write your next prompt as a mini-spec. Goal, inputs, constraints, examples. Watch output quality jump.
2. Stop asking "is this correct?" Ask "what's the weakest part?" You'll find real bugs instead of getting flattered.
3. Pick one repeated task and write it as a Skill. PR reviews. Standup notes. Postmortem templates. One file, recurring leverage.
4. When you challenge AI and it caves, ask it to defend the original. Half the time it was right.
5. Verify every specific claim that matters. Function names, statistics, citations, version numbers. Always.