When AI Doesn't Sound Like Us: Evaluation Rubrics and Iteration Strategies
We've built our style specification. We've configured Claude or ChatGPT with detailed instructions about sentence length, vocabulary preferences, and rhetorical patterns. We generate our first piece of content.
It's... not quite right. Better than the generic output we got before—the sentence lengths are closer to our patterns, the vocabulary feels more familiar. But something's still off. The rhythm doesn't flow the way our writing does. The tone oscillates between spots that feel like us and spots that feel like a corporate blog post.
This is normal. The specification is only the starting point. The real work is evaluation and iteration.
This article provides a systematic approach to assessing AI output against our style patterns, diagnosing common failure modes, and refining our specification until AI-generated content requires light editing rather than complete rewrites.
Building an Evaluation Rubric
Most writers evaluate AI output with a single question: "Does this sound good?" That's the wrong question. "Good" is subjective and varies by mood, context, and how much coffee we've had. The right question is: "Does this match our measurable patterns?"
The five-point voice evaluation rubric translates subjective impressions into scorable dimensions. Each dimension corresponds to the stylometric categories from Article 2, making the connection between analysis and evaluation explicit.
The Five-Point Voice Evaluation
1. Sentence Architecture Match (1-5)
Does the average sentence length align with our specification? If the spec calls for 18-word averages, count a representative sample. Is sentence variety consistent with our patterns? Check for the mix of simple, compound, and complex structures we typically use. Are sentence fragments deployed as we would deploy them—for emphasis at specific moments—or scattered randomly?
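Counting by hand works for a short sample, but a small script makes the check repeatable. Here's a minimal sketch in Python, assuming a naive regex sentence splitter and the 18-word target used as an example above; swap in whatever target our own specification names.

```python
import re
import statistics

def sentence_architecture_report(text: str, target_avg: float = 18.0) -> dict:
    """Rough sentence-length metrics for spot-checking a draft against a spec."""
    # Naive splitter: breaks on ., !, or ? followed by whitespace. Fine for a
    # quick check; it will stumble on abbreviations and quoted dialogue.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {"sentences": 0}
    return {
        "sentences": len(lengths),
        "avg_words": round(statistics.mean(lengths), 1),
        "spread": round(statistics.stdev(lengths), 1) if len(lengths) > 1 else 0.0,
        "shortest": min(lengths),
        "longest": max(lengths),
        "fragments": sum(1 for n in lengths if n <= 4),  # crude proxy for intentional fragments
        "off_target_by": round(statistics.mean(lengths) - target_avg, 1),
    }
```

Compare the average and the spread against the specification rather than trusting a gut read; a nearly flat spread is the earliest sign of the rhythm drift discussed later.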
2. Lexical Fidelity (1-5)
Did the AI use our preferred vocabulary? Check for the specific terms listed in our specification. Did it avoid words on our "never use" list? Search for corporate jargon, overused intensifiers, or whatever terms we flagged. Is contraction frequency consistent? If we write "don't" and "we're" in our natural prose, the AI should too.
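The same idea works for vocabulary. A minimal sketch, using placeholder word lists (the show/demonstrate pair from Pattern 1 below); substitute the preferred and "never use" lists from our own specification.

```python
import re

def lexical_fidelity_report(text: str, preferred: list[str], banned: list[str]) -> dict:
    """Counts preferred-term usage, banned-term violations, and contraction rate."""
    words = re.findall(r"[a-z']+", text.lower())

    def count(term: str) -> int:
        return words.count(term.lower())

    # Crude contraction proxy: any token containing an apostrophe (don't, we're).
    contractions = sum(1 for w in words if "'" in w)
    return {
        "banned_hits": {t: count(t) for t in banned if count(t)},
        "preferred_hits": {t: count(t) for t in preferred},
        "contractions_per_100_words": round(100 * contractions / max(len(words), 1), 1),
    }

# Placeholder lists and draft text, for illustration only.
draft = "We demonstrate how to utilize the tool. Don't worry, we'll show each step."
print(lexical_fidelity_report(draft, preferred=["show"], banned=["demonstrate", "utilize"]))
```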
3. Rhythm and Pacing (1-5)
Do paragraph lengths match our typical patterns? Some writers favor short, punchy paragraphs; others build longer, more developed blocks. Is punctuation style consistent? If we use em-dashes frequently, the output should reflect that. If we avoid semicolons, they shouldn't appear. Does the piece "breathe" the way our writing does—variation in pace, moments of acceleration and pause?
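Paragraph lengths and punctuation habits are just as countable. A minimal sketch, assuming paragraphs are separated by blank lines:

```python
def rhythm_report(text: str) -> dict:
    """Paragraph-length and punctuation counts for the rhythm and pacing check."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "paragraphs": len(paragraphs),
        "words_per_paragraph": [len(p.split()) for p in paragraphs],
        "em_dashes": text.count("\u2014") + text.count("--"),
        "semicolons": text.count(";"),
        "questions": text.count("?"),
        "colons": text.count(":"),
    }
```

Run our own recent writing through the same function to get the baseline; if the AI draft's words_per_paragraph list is nearly flat while ours varies, that's a concrete pacing mismatch, not a vague impression.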
4. Rhetorical Alignment (1-5)
Are concepts introduced the way we would introduce them? Some writers lead with examples, others with definitions, others with questions. Is evidence deployed in our style—integrated seamlessly or cited formally? Are our prohibited patterns absent? If we specified "never open with a question" or "avoid the word 'journey,'" verify compliance.
5. Stance Consistency (1-5)
Is first/second/third person usage correct throughout? If our specification calls for first-person plural, there shouldn't be sudden shifts to "one" or passive constructions. Is the hedging level appropriate? Some writers hedge frequently ("tends to," "often," "may"); others make direct assertions. Does it sound like us talking to our audience, or does it sound like a generic professional addressing generic readers?
Scoring Interpretation
With five dimensions scored 1 to 5, totals range from 5 to 25. A total of 20 or above, the threshold used in the iteration guidance below, generally means the output needs only light editing. Totals in the mid-to-high teens point to one or two weak dimensions worth targeted specification fixes, while anything lower usually means the specification is missing information rather than merely weighting it poorly.
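To keep scoring consistent from piece to piece, it can help to record each evaluation in a small structure. A minimal sketch; the dimension names and the 20-point bar mirror this article, but nothing else about the format is prescribed.

```python
from dataclasses import dataclass, asdict

@dataclass
class VoiceRubric:
    """One evaluation of one AI-generated piece; each dimension scored 1-5."""
    sentence_architecture: int
    lexical_fidelity: int
    rhythm_and_pacing: int
    rhetorical_alignment: int
    stance_consistency: int

    @property
    def total(self) -> int:
        return sum(asdict(self).values())

    @property
    def weakest(self) -> str:
        scores = asdict(self)
        return min(scores, key=scores.get)

score = VoiceRubric(2, 3, 4, 3, 4)
print(score.total)    # 16: below the 20-point bar, keep iterating
print(score.weakest)  # 'sentence_architecture': the dimension to fix first
```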
Common Failure Patterns and Fixes
When AI output scores poorly on specific dimensions, the cause is usually traceable to how the specification was written. These five patterns appear repeatedly across writers and platforms.
Pattern 1: The Vocabulary Slip
Symptom: The AI uses words from our "avoid" list or misses our preferred terms entirely. We specified "use 'show' instead of 'demonstrate'" but "demonstrate" appears three times.
Diagnosis: Vocabulary guidance is buried in the specification. The AI processed it but didn't weight it heavily enough.
Fix: Move vocabulary lists to the top of the specification. Add explicit emphasis: "CRITICAL: Never use these words under any circumstances..." The placement and framing affect how strongly the AI prioritizes the instruction.
Pattern 2: The Rhythm Drift
Symptom: Output sounds mechanical. Sentences are too uniform in length. Paragraphs march along at identical sizes. The variation that makes prose feel alive is missing.
Diagnosis: The AI is defaulting to safe, uniform structures. Our specification mentioned averages but didn't emphasize variation.
Fix: Add annotated samples showing variety. Include explicit ranges: "Sentence length varies from 8 to 35 words, with an average around 18. Short sentences should follow complex ones for contrast." Show, don't just tell.
Pattern 3: The Tone Mismatch
Symptom: The output oscillates between too formal and too casual, or lands consistently in the wrong register. Our specification said "professional but approachable" and we got "corporate with occasional slang."
Diagnosis: The voice attributes section was too vague. "Professional but approachable" means different things to different readers—and to AI systems.
Fix: Add concrete examples of tonal calibration. Show a sentence that's too formal, then the same idea at our actual register, then too casual. The comparison teaches the AI where we live on the spectrum.
Pattern 4: The Structure Override
Symptom: We wanted a specific opening structure, but the AI imposed intro-body-conclusion regardless. Our specification said "open with a concrete example" and we got "In today's fast-paced world..."
Diagnosis: Default training patterns are overriding our specification. The AI has strong priors about how articles should begin.
Fix: Be more explicit about structure. Consider prompting section-by-section rather than requesting a complete piece. For openings, provide the first sentence or paragraph ourselves and ask the AI to continue.
Pattern 5: The Context Collapse
Symptom: Voice is consistent when writing about our usual topics but breaks when the subject matter shifts. Our productivity writing sounds like us; our finance writing sounds generic.
Diagnosis: Our specification was built from samples covering a narrow range of subject matter. The patterns don't transfer cleanly to new contexts.
Fix: Include samples from different topics in our specification. Create topic-specific variants if necessary—a base specification plus adjustments for technical content, narrative content, or persuasive content.
The Iteration Loop
Improving AI output isn't a one-time configuration. It's a cycle of generation, evaluation, diagnosis, and refinement.
The key principle: each iteration should target one dimension. Don't try to fix everything at once. If sentence architecture scored 2 and lexical fidelity scored 3, fix sentence architecture first. Verify improvement. Then address vocabulary.
This focused approach has two benefits. First, we can isolate what's working. If we change five things and output improves, we don't know which change mattered. Second, we avoid over-correction. Specifications that try to control everything become so constraining that AI output loses all fluidity.
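One way to stay honest about the one-dimension rule is a small check that compares the rubric before and after each specification change. This builds on the VoiceRubric sketch from the rubric section; the function and its messages are ours, not part of any tool.

```python
from dataclasses import asdict  # VoiceRubric as defined in the rubric section

def iteration_check(before: "VoiceRubric", after: "VoiceRubric", targeted: str) -> str:
    """Did the last spec change improve the targeted dimension without hurting the rest?"""
    b, a = asdict(before), asdict(after)
    if a[targeted] <= b[targeted]:
        return f"No gain on {targeted}; revise that part of the specification again."
    regressed = [d for d in b if d != targeted and a[d] < b[d]]
    if regressed:
        return f"{targeted} improved but {', '.join(regressed)} slipped; the new instruction may be over-constraining."
    return f"{targeted} improved ({b[targeted]} -> {a[targeted]}); move on to the next-lowest dimension."
```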
Tracking Changes
Keep a revision log documenting what failed, what we changed in the specification, and whether the change improved output. After three to five iterations, patterns emerge. These become permanent additions to our specification—the refinements that moved us from generic output to genuine voice match.
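The log can be as plain as a spreadsheet. A minimal CSV-based sketch, assuming columns for the date, the targeted dimension, the failure pattern, the spec change, and before/after scores; the format is a suggestion, not a standard.

```python
import csv
import os
from datetime import date

LOG_FIELDS = ["date", "dimension", "failure_pattern", "spec_change", "score_before", "score_after"]

def log_iteration(path: str, entry: dict) -> None:
    """Append one iteration record to a CSV revision log, writing a header for a new file."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(entry)

log_iteration("revision_log.csv", {
    "date": date.today().isoformat(),
    "dimension": "sentence_architecture",
    "failure_pattern": "rhythm drift",
    "spec_change": "added 8-35 word range plus two annotated samples",
    "score_before": 2,
    "score_after": 4,
})
```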
When to Stop Iterating
- Rubric scores consistently hit 20 or above
- We can edit AI output to final draft in under fifteen minutes
- Output "feels like us" on first read, before we start analyzing
The goal isn't perfection. The goal is efficiency. When AI output requires light editing rather than rewriting, we've reached productive collaboration.
Maintaining Consistency Over Time
Style specifications aren't permanent documents. Our writing evolves. Platform updates change how AI interprets instructions. New topics and audiences may require adjustments.
The Style Drift Problem
Three factors cause specifications to become stale:
Our voice evolves. The samples we analyzed six months ago may not represent how we write today. Writers naturally shift—becoming more concise, adopting new rhetorical habits, dropping old verbal tics.
Platforms update. Claude and ChatGPT receive regular updates that can change how they process custom instructions. A specification that worked perfectly in March may produce different results in September.
Context shifts. A new project, audience, or content type may stress our specification in ways it wasn't designed to handle.
Quarterly Review Protocol
Every three months, run this diagnostic:
- Generate three test pieces on different topics using our current specification
- Score each against the rubric
- Compare to our recent human writing—not old samples, but what we've written in the past month
- Update the specification if patterns have shifted
This catches drift before it becomes a problem. A specification that scores 22 today might score 17 in six months if we don't maintain it.
Version Control for Specifications
Date specifications. Keep a changelog documenting what we modified and why. Maintain separate specifications for different contexts if our writing varies significantly by audience or format.
This isn't overhead—it's the difference between a tool that keeps working and a tool that gradually stops matching our voice while we wonder what changed.
The Complete Workflow
Across this series, we've built a systematic approach to AI-assisted writing that preserves voice:
- Recognize the problem (Article 1): Subjective labels don't translate to AI-executable instructions
- Analyze our patterns (Article 2): Measure voice across five stylometric dimensions
- Build a specification (Article 3): Create a structured document that captures measurable patterns
- Implement and configure (Article 3): Set up Claude or ChatGPT with the specification
- Evaluate systematically (Article 4): Score output against a rubric tied to our dimensions
- Iterate purposefully (Article 4): Refine specification based on specific, diagnosed failures
- Maintain over time (Article 4): Quarterly reviews, version control, context-specific variants
The goal throughout: AI as genuine co-writer rather than expensive autocomplete. First drafts that require editing, not rewriting. Voice that remains distinctly ours, even when the words were generated by a machine.
Start Here
The system works, but only if we use it. Pick one piece of AI-generated content we've produced recently. Score it against the five-dimension rubric. Identify the lowest-scoring dimension. Diagnose the specific failure pattern. Update the specification to address it. Regenerate and compare.
One iteration won't transform our AI collaboration. But one iteration teaches us how the cycle works. After five iterations, we'll have a specification that produces output we can actually use. After twenty, we'll wonder how we ever worked without one.
Our voice is measurable, documentable, and teachable. The specification is the bridge between our patterns and AI capability. Evaluation is quality control. Iteration is refinement. Do the work, and the collaboration becomes genuinely productive.