I Threw the Same AI Essay at 8 Humanizers. The Results Changed How I Think About AI Detection.
---
Last Thursday, I decided to do something stupid.
I had ChatGPT write a 750-word essay about the ethics of AI in education. Nothing fancy. Just a standard five-paragraph argumentative essay, the kind a college freshman might submit at 11:58 PM. Then I fed that exact same essay into eight different AI humanizers and ran the results through Turnitin, GPTZero, and Originality.ai.
I wasn't trying to find the "best" tool on the market. I just wanted to answer one simple question. If a student walks into this blind, with no technical knowledge and no prompt engineering, just copy and paste, what actually happens when they hit the button?
The answer was uglier than I expected.
Before we dive in, a quick note on the detectors. These tools analyze patterns like perplexity (how predictable each word is) and burstiness (how much sentence length and rhythm vary). If you're not familiar with how AI detection algorithms actually work, the short version is they look for statistical regularities that human writing tends not to have. More on why that matters later.
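If "perplexity" sounds abstract, here is a toy sketch of the idea. It scores text against a tiny add-one-smoothed unigram model built from a reference string. That is an assumption made purely for illustration: real detectors rely on large neural language models, and the function name `unigram_perplexity` is mine, not anything from Turnitin, GPTZero, or Originality.ai.

```python
import math
from collections import Counter


def unigram_perplexity(text: str, reference: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Toy illustration only: lower values mean the text is more predictable
    given the reference, which is the property detectors treat as AI-like.
    """
    ref_tokens = reference.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra bucket shared by all unseen words
    total = len(ref_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text is empty")

    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing keeps unseen words from getting zero probability.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


reference = "the cat sat on the mat the cat sat on the mat"
predictable = "the cat sat on the mat"
surprising = "quantum zebras juggle violet thunderstorms"

print(unigram_perplexity(predictable, reference))
print(unigram_perplexity(surprising, reference))
```

The predictable sentence scores lower than the surprising one. Swap in a real language model and that gap is, roughly, the signal a detector is reading.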
Let me walk you through exactly what I did. No tricks. No optimization. Just one test, the way a real user would run it.
I picked Turnitin, GPTZero, and Originality.ai because they represent the three detectors most people actually encounter. Turnitin dominates academic settings. GPTZero is what students check first. Originality.ai is the go-to for professional publishers. If you want a deeper comparison, check out our comprehensive AI detection tool guide.
The source essay was a 750-word argument on whether AI-generated content should be labeled in academic settings. I wrote one prompt and took the first output. The prompt: "Write a 750-word argumentative essay about whether AI-generated content should be labeled in academic settings. Use a balanced tone."
Before any humanization, here's how the raw AI essay scored:
Turnitin: 100% AI. GPTZero: 98% AI. Originality.ai: 100% AI.
No surprises there. On this raw GPT output, all three detectors were essentially perfect.
The eight tools I tested covered the range from premium to budget.
One rule. Each tool gets one attempt. No going back, no re-rolling, no trying different settings. You copy the essay in, you hit the button, whatever comes out is what you test. That's how real people use these tools when they're in a hurry.
8th place: BypassGPT. Average across three detectors: 82% AI.
Reading BypassGPT's output was like watching someone swap words with a thesaurus while leaving every sentence structure completely intact. Let me give you an example.
Original: "The ethical implications of AI-generated content in education are complex and multifaceted."
After BypassGPT: "The moral consequences of artificial-intelligence-produced material in learning are complicated and many-sided."
It just swapped "ethical" for "moral," "implications" for "consequences," "multifaceted" for "many-sided." Every single sentence followed this exact pattern. GPTZero caught it immediately. Turnitin flagged 94% as AI.
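Here is a quick way to see why a thesaurus swap doesn't help. The sketch below blanks out every content word and keeps only a small set of function words, so what remains is the sentence's structural skeleton. The `FUNCTION_WORDS` set and the `skeleton` helper are my own illustrative choices, not how any detector is documented to work.

```python
# Small hand-picked set of function words; an assumption for illustration.
FUNCTION_WORDS = {"the", "of", "in", "are", "and", "a", "an", "is", "to"}


def skeleton(sentence: str) -> str:
    """Replace every non-function word with '_' to expose sentence structure."""
    words = sentence.lower().rstrip(".!?").split()
    return " ".join(w if w in FUNCTION_WORDS else "_" for w in words)


original = (
    "The ethical implications of AI-generated content in education "
    "are complex and multifaceted."
)
rewritten = (
    "The moral consequences of artificial-intelligence-produced material "
    "in learning are complicated and many-sided."
)

print(skeleton(original))   # the _ _ of _ _ in _ are _ and _
print(skeleton(rewritten))  # the _ _ of _ _ in _ are _ and _
print(skeleton(original) == skeleton(rewritten))  # True
```

Different vocabulary, identical skeleton. Any signal that depends on structure instead of word choice survives the rewrite untouched.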
Verdict: It's the tool equivalent of wearing sunglasses as a disguise.
7th place. Average: 68% AI.
This one was weird. The output was grammatically flawless. Arguably better than the original. But the sentence rhythm was so uniform that GPTZero's perplexity and burstiness analysis lit up like a Christmas tree. Every paragraph had exactly three sentences. Every sentence was between 18 and 22 words.
The tool optimized for readability, not undetectability. That's fine if you just want a grammar checker. It's useless if you're actually trying to avoid detection. If you want to understand what "perplexity" actually means in this context, our perplexity score explainer breaks it down clearly.
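To make "uniform rhythm" concrete, here is a rough sketch that measures how much sentence length varies. I'm using the coefficient of variation (standard deviation divided by mean) as a stand-in for burstiness. That is my simplification, not GPTZero's actual formula, and the naive regex sentence splitter will stumble on abbreviations.

```python
import re
import statistics


def sentence_lengths(text: str) -> list[int]:
    """Word count of each sentence, splitting on whitespace after . ! or ?"""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def burstiness(text: str) -> float:
    """Coefficient of variation of sentence length; 0.0 means perfectly uniform."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "The cat sat down. The dog ran off. The sun came up."
varied = "Short one. This sentence is quite a bit longer than the first one was. Tiny."

print(sentence_lengths(uniform), burstiness(uniform))
print(sentence_lengths(varied), burstiness(varied))
```

An essay where every sentence lands between 18 and 22 words would sit near the `uniform` end of this scale, no matter how polished the grammar is.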
Verdict: Great writing assistant. Terrible humanizer.
6th place: WriteHuman. Average: 61% AI.
WriteHuman's whole approach seems to be injecting random conversational phrases everywhere — "you know," "honestly," "the thing is." The problem? It injected them in places where no human would ever put them. I got lines like "The thing is, academic institutions face unprecedented challenges with the thing is AI content detection."
It also had this annoying habit of turning almost every paragraph into a question. "But is this really the case?" showed up three times in a 750-word essay.
Verdict: Reads like a nervous public speaker who can't stop inserting filler.
5th place: Netus. Average: 55% AI.
Netus actually produced the most readable output in the bottom half. It went beyond synonym swapping — it restructured several paragraphs and introduced genuinely original transitions. The problem was consistency. One paragraph read beautifully human. The next reverted to obvious GPT patterns you can spot from a mile away.
Originality.ai was the killer here. It flagged 72% of the Netus output, even though Turnitin and GPTZero scored it significantly lower. This kind of inconsistency between detectors is exactly what we documented in our guide to AI detection in academic writing — the same text gets wildly different scores depending on which tool you run it through.
Verdict: Close, but still uneven. It's one good rewrite away from working.
4th place: HumanizeAI.pro. Average: 42% AI.
This was the first tool that actually broke below 50%. HumanizeAI.pro nailed sentence structure variation. It mixed short declarative sentences with longer complex ones naturally. It also added specific examples that weren't in the original, which helps with the knowledge-depth signal some detectors look for.
The downside? It occasionally veers off topic. In one paragraph about academic labeling policies, it inserted two unrelated sentences about social media content moderation. A human editor could fix that in 30 seconds, which is why the combination of AI writing with human editing consistently produces the best results.
Verdict: Good bones. Just needs a human pass to clean it up.
3rd place: StealthGPT (Extreme mode). Average: 31% AI.
StealthGPT's Extreme mode is aggressive. It doesn't just rewrite sentences — it rewrites entire paragraphs from scratch. The output is genuinely different from the source. Different structure, different examples, different transitions. On Turnitin, it dropped the AI score from 100% to 14%. That's a huge jump.
But here's the catch. Three factual errors crept in during that aggressive rewrite. Two examples: the essay originally cited "a 2024 Stanford study," and StealthGPT changed it to "a 2025 MIT study." In another place, it changed "machine learning models" to "deep learning neural networks." Those aren't the same thing, especially in an academic context.
If you're a student submitting this, you trade the AI-detection concern for a factual accuracy concern. That might be worse.
Verdict: Powerful but dangerous. Always fact-check everything after.
2nd place: Undetectable AI. Average: 18% AI.
Undetectable AI was the most consistent performer across all three detectors. Turnitin said 8% AI. GPTZero said 22%. Originality.ai said 24%. No single detector flagged more than a quarter of the text.
The output quality was solid. It preserved the essay's original argument while introducing enough variation to disrupt detection patterns. Sentences ranged from six words to 34 words. Paragraph structures varied. Transition phrases weren't repetitive.
One thing I noticed: Undetectable AI's output has a slightly more informal tone than the original. The academic register gets dialed down about 10%. That's probably fine for a college essay. If you're submitting a graduate thesis or journal article, it might raise some eyebrows. For practical tips on achieving this naturally, see our guide on making AI text sound more human.
Verdict: The most reliable one-shot option in my tests.
1st place: Humbot. Average: 11% AI.
I didn't see this coming. Humbot was the cheapest tool in my test set at $7.99 per month, and it outperformed everything else by a meaningful margin.
Turnitin: 5% AI. GPTZero: 15%. Originality.ai: 13%.
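If you want to check the arithmetic behind the "Average" figures, it is a plain mean of the three detector scores, rounded to the nearest point. Here is a minimal sketch using the two tools whose per-detector numbers I've quoted in full. The dictionary layout and the `average_score` helper are just illustrative.

```python
from statistics import mean

# Per-detector scores quoted in this article (percent flagged as AI).
scores = {
    "Undetectable AI": {"Turnitin": 8, "GPTZero": 22, "Originality.ai": 24},
    "Humbot": {"Turnitin": 5, "GPTZero": 15, "Originality.ai": 13},
}


def average_score(per_detector: dict[str, int]) -> int:
    """Mean of the detector scores, rounded to the nearest whole percent."""
    return round(mean(per_detector.values()))


for tool, per_detector in scores.items():
    print(f"{tool}: {average_score(per_detector)}% AI")
```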
What set Humbot apart wasn't any single fancy technique. It was that the output didn't feel rewritten. The other tools, even the good ones, left traces. A slight stiffness. An overcorrection. Something that makes you go "hmm, that doesn't quite read right."
Humbot's output just reads like a human wrote it on a slightly tired Tuesday afternoon. It's not brilliant prose. It's not going to win any awards. It's just normal. And normal is exactly what detection algorithms struggle with.
Two quick caveats: I only tested one essay with one run, so results will vary depending on what you're putting through. And the free tier has a 300-word limit, so you need the paid plan for anything essay-length.
Verdict: The dark horse. Punches way above its price point.
Here's the thing that actually changed how I think about this whole space.
Detector disagreement is the real story.
Look at the gaps in the scores for just three tools:
A 34-point gap between detectors on the exact same text. Think about what that means. On the Netus output, Turnitin says 38% AI. Originality.ai says 72% AI. Which one is right?
If you're a student flagged by Turnitin, you might just get a warning. If you're flagged by Originality, you could end up in an academic integrity hearing. Same text. Different algorithm.
The uncertainty is the point.
The AI detectors on the market disagree with each other by margins that would never be accepted in any other high-stakes context. If two brands of the same medical test gave results 34 points apart on the same sample, they would get pulled from the market immediately.
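One way to put a number on that disagreement is the spread: the highest detector score minus the lowest for the same text. The sketch below uses scores quoted in this article. The `REVIEW_THRESHOLD` of 20 points is a hypothetical cutoff I picked for illustration, not a standard anyone publishes.

```python
# Scores quoted in this article (percent flagged as AI).
reported = {
    "34-point example": {"Turnitin": 38, "Originality.ai": 72},
    "Undetectable AI": {"Turnitin": 8, "GPTZero": 22, "Originality.ai": 24},
    "Humbot": {"Turnitin": 5, "GPTZero": 15, "Originality.ai": 13},
}

# Hypothetical cutoff: beyond this spread, treat the score as inconclusive.
REVIEW_THRESHOLD = 20


def spread(per_detector: dict[str, int]) -> int:
    """Gap in points between the highest and lowest detector score."""
    values = per_detector.values()
    return max(values) - min(values)


def needs_human_review(per_detector: dict[str, int]) -> bool:
    """True when detectors disagree by more than the chosen threshold."""
    return spread(per_detector) > REVIEW_THRESHOLD


for name, per_detector in reported.items():
    print(name, spread(per_detector), needs_human_review(per_detector))
```

The point of a rule like this is not the exact threshold. It is that a wide spread says more about the detectors than about the text, so the call should go to a person.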
After running this test, here's my honest take for three different groups of people.
If you're a student: Don't use these tools to cheat. The risk-reward ratio is terrible. Even the best tool left detectable traces, and the constant stress of wondering "did they catch me?" isn't worth whatever shortcut you're taking. Use AI as a research assistant and brainstorming partner instead. Write your own drafts. That's how you actually build the skills you're paying tuition for. If you're worried about false positives on work you actually wrote, read our guide on what students need to know about AI detection.
If you're a content creator: Undetectable AI or Humbot will get your detection flags low enough that most automated filters won't bother you. But don't treat humanizers as a substitute for editing. The best results in my test still needed a human pass for fact-checking and voice consistency. The human-AI hybrid workflow consistently produces better results than either approach alone.
If you're an educator: This test confirms something you probably already suspected. Students who want to bypass detection have multiple tools that work at least partially. But the inconsistency between detectors means no single tool is reliable. Combine whatever detection score you get with your own judgment. Look for the things these tools can't fake: personal experience from the student's own life, specific examples from your class discussions, connections to earlier assignments they've turned in. For guidance on building a fair classroom policy, see our comparison of the best AI detection tools for educators.
For anyone who wants to replicate or challenge what I did, here's the full raw data from the test:
Test date: June 2026. Source essay: 750 words on AI ethics in education. One attempt per tool. No prompt engineering. No re-rolls. No post-processing by me.
I'm going to run this exact same test again in three months. New essay topic. Same methodology. Same eight tools plus whatever new ones launch between now and then.
Why? Because these tools update constantly. StealthGPT's Extreme mode has changed three times this year alone. A tool that ranked eighth today could be first in September. The reverse is also true.
If you want to see the follow-up, I'll publish it right here. The data will still be raw and unfiltered either way. After seeing these numbers, I trust data more than I trust any tool's marketing page anyway.