A good prompt shapes the model's behavior — role, context, format, constraints. Specificity and examples do most of the work. CoT, ReAct, RAG, and PAL handle the hard cases. Know which one to reach for.
Prompt engineering is the practice of designing and optimizing inputs so that a model produces what you actually want. The same task phrased differently produces dramatically different results, and the gap between a careless prompt and a deliberate one is often bigger than the gap between two models.
This guide covers the full stack: the anatomy, the generation knobs, and every major technique — with notes on where each one is worth the cost and where it isn't.
Anatomy of a prompt
Most well-structured prompts have four parts. You don't always need all of them:
- Instruction — what needs to be done
- Context — background the model needs to make a good decision
- Input data — the specific item the task operates on
- Output indicator — the format or shape of the expected answer
Specificity beats cleverness
Bad: "Write something about marketing."
Good: "Write three email campaign ideas for a B2B SaaS targeting small businesses. Each idea: 2–3 sentences."
Role instructions
Assigning a role reliably improves quality in specialized domains:
You are an expert marketing copywriter with 10 years of experience
in B2B SaaS. Write a compelling product description for...
Format enforcement
State the output shape explicitly. "Respond in JSON with fields name, description, price." "Return a bulleted list." "Use a Markdown table for comparison." The model will mostly comply.
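A quick way to make that contract machine-checkable, sketched with a hypothetical llm() placeholder standing in for whatever chat API you actually call:

import json

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API of choice."""
    ...

prompt = (
    "Describe the product in one short paragraph. "
    "Respond only with JSON containing the fields name, description, price."
)
data = json.loads(llm(prompt))   # fails loudly if the model ignored the format
assert {"name", "description", "price"} <= data.keys()

If the model wraps the JSON in a code fence, strip the fence before parsing, or use your provider's JSON or structured-output mode where one exists.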
Generation parameters
- Temperature (0.0–0.3): focused, near-deterministic output. Use for facts, code, classification.
- Temperature (0.7–1.0): creative output. Use for brainstorming and writing.
- Top-p: limits sampling to tokens whose cumulative probability fits under p. top_p=0.9 is a safe default.
- Max tokens: cap the output. Set it deliberately in production; a cap that's too low truncates answers mid-sentence when someone asks a big question.
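How those knobs map onto an API call, as a minimal sketch assuming the OpenAI Python SDK (most providers' SDKs expose the same or similarly named parameters):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",     # assumption: swap in whichever model you actually use
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
    temperature=0.0,         # classification, so stay low
    top_p=0.9,               # nucleus sampling cutoff
    max_tokens=200,          # explicit output cap
)
print(response.choices[0].message.content)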
Zero-shot and few-shot
Zero-shot
Ask the model to do the task with no examples. Works well for things the model has clearly seen in training — sentiment, basic classification, summarization. For anything with a specific format or domain nuance, you'll need examples.
Few-shot
A few examples ("shots") sharply raise quality. The rules:
- Make examples diverse and representative
- Order matters — later examples carry more weight
- 3–8 examples is the sweet spot for most tasks; more isn't always better
Review: "This is awesome!" Sentiment: Positive
Review: "This is bad!" Sentiment: Negative
Review: "The movie was okay." Sentiment: Neutral
Review: "The food here is exceptional!" Sentiment:
Chain-of-Thought (CoT)
Induce the model to think aloud before answering. Critical for tasks that need reasoning — arithmetic, logic, multi-hop questions.
Without CoT:
Q: Roger has 5 tennis balls. He buys 2 more cans of balls.
Each can has 3 balls. How many does he have now?
A: 11
With CoT:
A: Roger started with 5. 2 cans x 3 balls = 6 new balls.
5 + 6 = 11. The answer is 11.
Zero-shot CoT
A single phrase — "Let's think step by step" — significantly improves math and logic results without any example prompts. Cheap and effective.
Self-Consistency
Run the same CoT prompt multiple times with temperature > 0, then take a majority vote. Expensive in tokens, but worth it for high-stakes reasoning tasks where correctness matters more than cost.
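The voting loop is short. A sketch assuming a hypothetical llm(prompt, temperature) helper that returns the model's text and an extract_answer() that pulls the final answer out of the chain of thought:

from collections import Counter

def llm(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def extract_answer(completion: str) -> str:
    # Hypothetical parser: grab whatever follows "The answer is".
    return completion.rsplit("The answer is", 1)[-1].strip(" .")

def self_consistency(prompt: str, n: int = 5) -> str:
    answers = [extract_answer(llm(prompt, temperature=0.8)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]   # majority vote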
Beyond CoT: structured reasoning
Tree of Thoughts (ToT)
The model explores several reasoning paths in parallel — generate multiple candidate "thoughts" at each step, evaluate which are promising, continue the best. Useful for planning, complex code, strategic choices.
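A compressed sketch of that search loop (expand each partial path, score the candidates, keep the best few, repeat). The llm() helper and the 0–10 scoring prompt are stand-ins, not a fixed recipe:

def llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def tree_of_thoughts(problem: str, breadth: int = 3, depth: int = 3, keep: int = 2) -> str:
    frontier = [""]                                    # partial reasoning paths
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for _ in range(breadth):                   # propose several next thoughts per path
                thought = llm(f"Problem: {problem}\nReasoning so far: {path}\nNext step:")
                candidates.append(path + "\n" + thought)
        # Have the model rate each candidate, keep only the most promising.
        scored = [(float(llm(f"Rate 0-10 how promising this partial solution is:\n{c}\nScore:")), c)
                  for c in candidates]
        frontier = [c for _, c in sorted(scored, reverse=True)[:keep]]
    return frontier[0]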
Generated Knowledge Prompting
Two-step: first ask the model to generate facts about the topic, then use those facts as context for the real question. Reduces hallucinations and improves accuracy in domain-specific questions.
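The two calls back to back, with a placeholder llm() and a hypothetical example question:

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

question = "Is part of golf trying to get a higher point total than others?"
facts = llm(f"Generate three short, factual statements relevant to: {question}")
answer = llm(f"Facts:\n{facts}\n\nUsing only these facts, answer: {question}")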
Prompt Chaining
Break a hard task into a sequence of prompts — output of one becomes input of the next. Classic example: document QA in two steps — extract relevant quotes, then form the answer from those quotes. Easier to debug, more transparent, often cheaper than one giant prompt.
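The document-QA chain from that example, sketched with a placeholder llm():

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def answer_from_document(document: str, question: str) -> str:
    # Step 1: pull out only the quotes that matter.
    quotes = llm(
        f"Extract the quotes from the document below that are relevant to the question.\n"
        f"Question: {question}\n\nDocument:\n{document}"
    )
    # Step 2: answer strictly from those quotes.
    return llm(
        f"Answer the question using only these quotes, and cite them.\n"
        f"Question: {question}\n\nQuotes:\n{quotes}"
    )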
Tool-augmented prompting
ReAct — Reasoning + Acting
The model alternates between reasoning (Thought) and actions (Action), observing the result before the next step:
Thought: I need to look up Colorado orogeny...
Action: Search[Colorado orogeny]
Observation: [result]
Thought: Eastern sector extends into the High Plains.
Action: Search[High Plains elevation]
Observation: 1,800 to 7,000 ft.
Action: Finish[1,800 to 7,000 ft]
ReAct sharply reduces hallucinations by grounding reasoning in external observations. It's the basis of most agent frameworks.
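The loop behind a trace like that is short: parse the Action, run the tool, feed the Observation back. A rough sketch assuming a placeholder llm() and a search() tool:

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def search(query: str) -> str:
    """Placeholder: whatever search or tool backend you use."""
    ...

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # In practice the prompt also includes ReAct few-shot examples and a stop
        # sequence so the model halts after emitting each Action.
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        if "Finish[" in step:
            return step.split("Finish[", 1)[1].split("]", 1)[0]
        if "Search[" in step:
            query = step.split("Search[", 1)[1].split("]", 1)[0]
            transcript += f"Observation: {search(query)}\n"
    return transcript    # ran out of steps without a Finish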
RAG — Retrieval Augmented Generation
Combine vector search with generation. The model receives relevant context from your documents before answering. The flow: user query → vector DB search → top-K docs → prompt with context → answer.
- Up-to-date data without retraining
- Source citations out of the box
- Full control over what the model "knows"
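The flow above in code, as a sketch: embed the query, pull the top-K chunks, prepend them to the prompt. Names like vector_db.search() and embed() are placeholders for whatever embedding model and vector store you run:

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model here."""
    ...

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def rag_answer(query: str, vector_db, k: int = 4) -> str:
    docs = vector_db.search(embed(query), top_k=k)     # hypothetical vector-store API
    context = "\n\n".join(d.text for d in docs)
    return llm(
        f"Answer using only the context below. Cite the passages you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )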
PAL — Program-Aided Language Models
Instead of answering in text, the model writes a program (usually Python) that produces the answer. Calculation errors disappear because you delegate math to an interpreter.
# I have 5 rows x 8 columns of plants. 3 are diseased. How many remain?
total = 5 * 8 # 40
remaining = total - 3
print(remaining) # 37
Function calling
Modern models detect when a function should be called and return structured JSON with the arguments. "What's the weather in London?" → the model returns get_current_weather(location="London") → your code calls the API → you pass the result back for the final answer. This is how production LLM apps connect to the outside world.
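A sketch of that round trip using the OpenAI Python SDK's tools parameter; the function name, the schema, and the get_weather() stub are stand-ins for your own tool:

import json
from openai import OpenAI

client = OpenAI()

def get_weather(location: str) -> dict:
    """Placeholder: call a real weather API here."""
    return {"location": location, "temp_c": 18}

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in London?"}]
first = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)              # e.g. {"location": "London"}
messages.append(first.choices[0].message)                # keep the tool call in history
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": json.dumps(get_weather(**args))})
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)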
Advanced and niche techniques
Reflexion
The agent tries a task, gets feedback (error or score), writes a "lesson" into a verbal memory, and retries. Dramatically improves results on multi-step coding and reasoning loops.
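The retry loop, sketched for a coding task with placeholder llm() and run_tests() functions:

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def run_tests(code: str) -> tuple[bool, str]:
    """Placeholder: execute the candidate code against your test suite."""
    ...

def reflexion(task: str, max_tries: int = 3) -> str:
    lessons = []                                        # verbal memory across attempts
    for _ in range(max_tries):
        code = llm(f"Task: {task}\nLessons from earlier attempts:\n" + "\n".join(lessons))
        ok, feedback = run_tests(code)
        if ok:
            return code
        # Turn the raw failure into a short lesson for the next attempt.
        lessons.append(llm(f"The attempt failed with:\n{feedback}\nWrite a one-sentence lesson."))
    return code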
Directional Stimulus Prompting
Add a "stimulus" — a hint or keyword — to steer generation. "Summarize the article. Focus on: climate impact, economic costs, policy recommendations." Small change, large effect on relevance.
Automatic Prompt Engineer (APE)
Use an LLM to generate and evaluate prompts automatically: generate candidates → score them on a validation set → pick the best → iterate. Worth setting up for any prompt you'll run at scale.
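The generate-score-select loop in miniature. llm() is a placeholder, and score() assumes you have a small labeled validation set of (input, expected output) pairs:

def llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def score(candidate_prompt: str, val_set: list[tuple[str, str]]) -> float:
    # Fraction of validation examples the candidate prompt gets right.
    hits = sum(llm(candidate_prompt + "\n" + x).strip() == y for x, y in val_set)
    return hits / len(val_set)

def ape(task_description: str, val_set: list[tuple[str, str]], n_candidates: int = 10) -> str:
    candidates = [llm(f"Write an instruction for this task: {task_description}")
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda p: score(p, val_set))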
Prompt Functions
Package prompts as reusable functions with a name, input, and rule — then compose them: fix_english(expand_word(trans_word('original text'))). Turns ad-hoc prompts into composable building blocks.
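In practice each "function" is just a named prompt template wrapped in ordinary code, so the composition above runs as written. A sketch with a placeholder llm() and hypothetical templates:

def llm(prompt: str) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

def trans_word(text: str) -> str:
    return llm(f"Translate the following text to English:\n{text}")

def expand_word(text: str) -> str:
    return llm(f"Expand this into a richer, more descriptive passage:\n{text}")

def fix_english(text: str) -> str:
    return llm(f"Fix the grammar and improve the fluency of:\n{text}")

polished = fix_english(expand_word(trans_word("original text")))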
LLM applications
Code generation
A system message sets the behavior: "You are a helpful code assistant. Language: Python. Don't explain — return only the code block." From there, the model handles generation from comments, function completion, SQL from schema descriptions, and code explanation.
Synthetic data
LLMs are excellent at generating evaluation and training data. The key for dataset diversity: define variable parameters (vocabulary, themes, features), randomize the combinations, and keep temperature above default. Cost: one study produced 50,000 RAG query pairs for about $55 — far cheaper than manual labeling.
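A sketch of the randomize-the-parameters idea for a diverse dataset; the topic and tone lists are hypothetical, and llm() is a placeholder for your chat API:

import random

def llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder: wire this up to your chat-completion API."""
    ...

TOPICS = ["billing", "shipping", "returns", "account access"]   # hypothetical parameters
TONES = ["frustrated", "neutral", "polite"]

def synthetic_ticket() -> str:
    topic, tone = random.choice(TOPICS), random.choice(TONES)
    return llm(
        f"Write a short customer-support ticket about {topic} in a {tone} tone.",
        temperature=1.0,        # keep it above default so the outputs vary
    )

dataset = [synthetic_ticket() for _ in range(100)]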
Context caching
Modern API features let you load a large context once and run many queries against it cheaply. Useful for analyzing hundreds of documents: load summaries once, then query interactively without re-sending the payload every time.
The bottom line
Prompt engineering isn't magic syntax. It's the same skill as writing clear technical specs: state the goal, give context, show the format, and provide examples when the model might guess wrong.
Reach for advanced techniques only when the simple ones hit a wall. CoT for reasoning. RAG for fresh or private knowledge. ReAct when tools are involved. PAL when math matters. Everything else is usually solved by being more specific.
For quick-reference prompts and Claude-specific shortcuts, check the prompt engineering cheatsheet or the full tools guide.