Six Months of AI-Assisted Software Development: A Critical Evaluation of Vibe Coding, Agentic IDEs, and Real Engineering
Introduction
For roughly the past six months, I have been working intensively with large language models, agentic IDEs, and AI-assisted coding tools. During this process, I researched new quantization algorithms for large language models, worked on an algorithm I called SeaTree, developed a programming language named HudHud Script, and experimented with different models across many domains including algorithm design, programming language development, virtual machine architecture, benchmarking, profiling, refactoring, deployment, translation, localization, and systems architecture.
This text is not an attack on artificial intelligence. On the contrary, I believe AI is an extremely powerful tool. However, what I observed over the last six months is this: when used correctly, AI can dramatically increase a developer’s productivity; but when you hand over the entire process to it, you can easily end up surrounded by technical debt, fake success metrics, misleading benchmarks, security risks, fake/stub/placeholder code, unnecessary fallback mechanisms, and fragile architectures.
The central argument of this article is therefore simple:
Artificial intelligence has not ended coding. It has not ended software engineering either. If anything, it has made the need for real engineering knowledge, mathematics, algorithms, data structures, systems architecture, security, performance analysis, and philosophical reasoning even more visible.
1. The Starting Point: Quantization Algorithms and Coral Colonies
While searching for new algorithms to quantize large language models, I started working with Google Gemini. Initially, Gemini gave coherent and meaningful answers, so I continued collaborating with it.
As far as I knew, many optimization algorithms used in AI were biologically inspired. That made me curious about coral colonies, and I began researching their biological structures, growth patterns, and adaptive behaviors. I suspected that the colony dynamics of corals could inspire optimization methods.
Gemini was surprisingly useful in this area. Its research on coral biology, colony structures, organs, and life cycles was often accurate and informative. Later, when I cross-checked many of those claims with ChatGPT, I realized Gemini had indeed been relatively strong in biomimetics and document-based research.
However, once we moved into algorithm design, mathematical proofs, formulas, and pseudo-code, Gemini’s weaknesses became obvious. It accepted almost everything I said uncritically, rarely challenged flawed assumptions, and displayed what I can only describe as “AI sycophancy.” Instead of scientific skepticism, it constantly validated my ideas.
At that point, I stopped using Gemini for algorithm design and reserved it mostly for biomimetic research, document analysis, and deep research tasks.
2. Agentic IDEs and the First Major Disappointments
When it was finally time to start implementing ideas, I experimented with Antigravity, Trae AI, Kiro, Kimi Code, Windsurf, Cursor, and Claude Code.
Across most of these tools, I observed the same recurring problems:
- Sycophancy
- Hallucinations
- Memory drift
- Unwarranted confidence
- Ignoring or forgetting rules
- Taking initiative without permission
- Generating fake/stub/placeholder code
- Gaming tests instead of solving the real problem
- Misinterpreting benchmark results
- Misunderstanding user intent and damaging the architecture
It took me roughly two months to fully realize how common these problems were.
At first, Claude Opus 4.6 running through Windsurf and Antigravity seemed far ahead of the competition. But that, too, turned out to be an illusion. These tools were excellent at scaffolding projects, building MVPs, and generating prototypes; however, once systems became genuinely complex, the cracks began to show.
Cursor was expensive. Windsurf consumed purchased quotas almost instantly. Eventually, Claude Code became the most practical option for me.
3. Coding Works — Engineering Is Another Story
To be fair, many models are already quite capable at writing C++, Rust, Python, and Bash scripts. They can help set up systems, generate scripts, and assist with beginner-level architecture.
But there is an extremely important distinction here:
If you understand what you are doing, AI becomes a powerful assistant. If you do not, AI can drag you into absurd architectures, unnecessary abstractions, security issues, and expensive maintenance nightmares.
If you simply ask AI to “build a mobile app that does X,” without understanding architecture, performance, or security, you may end up trapped inside a poorly designed system. However, if you already possess solid engineering knowledge, AI becomes tremendously useful.
4. Security Alarms: Public Repositories, Open Datasets, and Dangerous Autonomy
One of the most disturbing incidents I experienced was Claude Code making my private repository public on its own initiative. I may have mentioned my intention to eventually open-source the project, but I never explicitly instructed it to make the repository public. Presumably, it thought this would make Kaggle integration easier.
Similarly, it also made a Kaggle dataset public without permission.
Antigravity once opened unknown Google corporate links and created a VNC-like callback port on localhost. I had no idea what it was doing.
These experiences became serious security warnings for me. If an AI agent can accidentally expose a private repository today, could future systems accidentally expose iCloud files, Google Drive data, or personal media tomorrow? That question is no longer theoretical.
5. SeaTree and the Fake Benchmark Incident
I spent nearly two weeks working on the SeaTree algorithm on Kaggle. I had originally built a C++ implementation and exposed it to Python through pybind11. Later, I discovered that the AI agent had silently removed the C++ wrapper because it could not get the Conan and CMake builds working. It deleted entire libraries and replaced everything with a Python fallback implementation.
That fallback was not my algorithm at all. It was merely E2M1 naive rounding.
I believed my algorithm was being benchmarked, while in reality the algorithm itself no longer existed. When benchmark speeds suddenly became nearly ten times faster than expected, I became suspicious. The explanations the model gave me made no sense. After investigating, I realized SeaTree had effectively vanished from the pipeline.
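For readers unfamiliar with the fallback mentioned above, here is a minimal sketch of what naive E2M1 rounding generally looks like. It is not the code the agent produced, which is not reproduced in this article. E2M1 is the 4-bit float layout with one sign bit, two exponent bits, and one mantissa bit. The per-block absmax scaling and the function names are assumptions made for illustration.

```python
# Magnitudes representable in 4-bit E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_e2m1(x: float, scale: float = 1.0) -> float:
    """Round x / scale to the nearest E2M1 magnitude, then scale back."""
    scaled = abs(x) / scale
    # min() keeps the first minimum, so exact ties resolve to the smaller magnitude.
    # Inputs above 6.0 saturate at 6.0 because that is the nearest table entry.
    nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
    sign = -1.0 if x < 0 else 1.0
    return sign * nearest * scale


def quantize_block(values: list[float]) -> list[float]:
    """Naive absmax scaling (an assumption): map the block's largest magnitude to 6.0."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0.0 for _ in values]
    scale = peak / 6.0
    return [round_to_e2m1(v, scale) for v in values]
```

A routine like this does almost no work per value, so it will naturally run much faster than a search-based or tree-based quantizer. An unexplained speedup is therefore a reason to check what is actually being executed.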
This became a major turning point in my thinking.
When developing software with agentic AI, remember that “the liar’s candle burns only until the benchmarks arrive.” Do not trust an agent’s performance claims until you have checked them yourself with flamegraphs, callgraphs, heaptrack, callgrind, and benchmarks that you ran and inspected.
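One cheap guard against this kind of silent substitution is to make the benchmark refuse to run unless the module under test really is a compiled extension. The following is a minimal sketch, assuming the native code is imported as a regular Python extension module; the function names are hypothetical.

```python
import importlib.machinery
import types


def is_compiled_extension(module: types.ModuleType) -> bool:
    """Return True if the module was loaded from a compiled extension file."""
    path = getattr(module, "__file__", None)
    if not path:
        # Built-in modules have no __file__; they are not the extension we built.
        return False
    return any(path.endswith(suffix) for suffix in importlib.machinery.EXTENSION_SUFFIXES)


def require_native(module: types.ModuleType) -> None:
    """Abort a benchmark run instead of silently timing a pure-Python fallback."""
    if not is_compiled_extension(module):
        raise RuntimeError(
            f"{module.__name__} is not a compiled extension "
            f"(loaded from {getattr(module, '__file__', None)!r}); refusing to benchmark"
        )
```

Calling `require_native(my_module)` at the top of a benchmark script turns a silent replacement into an immediate, visible failure. It does not prove that the right algorithm is inside the extension, so profiling is still needed.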
6. HudHud Script: Interpreter, VM, and Fake Feature Parity
While I was developing HudHud Script with Claude Code, the model initially convinced me to build an interpreter first. I already knew the tradeoffs, but I wanted to observe its reasoning process. So we proceeded with ASTs and interpreters for a while.
Later, I decided to introduce a virtual machine architecture similar to Lua and Python. My rules were extremely strict:
- VM and interpreter feature parity must remain at 100%.
- Siloed implementations are forbidden.
- Multiple solution lanes are forbidden.
- Fake/stub/placeholder code is forbidden.
- The register-based VM architecture must be preserved.
- Every review must scan for fake/stub code.
- The error system must support multiple languages.
Two weeks later, after running profiling tools, flamegraphs, callgraphs, and benchmarks, I discovered that large portions of the VM were filled with fake and stub implementations. Bytecode generation was incorrect, and the claimed feature parity was nowhere near the promised levels.
Eventually, Claude Code admitted that its implementation had not been correct.
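A parity rule like the one above can be checked mechanically instead of being taken on trust. The following is a minimal differential-testing sketch, assuming both back ends can be wrapped as functions that take program source and return captured output. The runner functions are hypothetical stand-ins, not HudHud Script APIs.

```python
from typing import Callable, Iterable

# A runner takes program source and returns the captured output as text.
Runner = Callable[[str], str]


def _safe_run(runner: Runner, source: str) -> str:
    """Run one program; turn crashes into a comparable marker string."""
    try:
        return runner(source)
    except Exception as exc:
        return f"<error: {type(exc).__name__}>"


def find_parity_gaps(
    programs: Iterable[str], run_interpreter: Runner, run_vm: Runner
) -> list[tuple[str, str, str]]:
    """Return (source, interpreter output, VM output) for every mismatch."""
    gaps = []
    for source in programs:
        expected = _safe_run(run_interpreter, source)
        actual = _safe_run(run_vm, source)
        if expected != actual:
            gaps.append((source, expected, actual))
    return gaps
```

An empty result over a broad program corpus is evidence of parity. A stubbed VM feature shows up as a concrete mismatch rather than as a percentage in a status report.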
7. Multilingual Error Systems and Translation Pipelines
HudHud Script required translating 323 error messages, each consisting of seven fragments, into 23 languages. That meant 323 × 7 × 23 = 52,003, or roughly 52,000, translation fragments.
Context mattered enormously. A programming term might have one meaning in software engineering and another in aviation or military terminology.
I did not want to waste Claude Code tokens on translation work, so I used Codex to generate scripts and relied on DeepSeek Reasoner and Kimi 2.5 through Ollama Cloud. Translation throughput was painfully slow — only around 6–8 languages per day were being translated correctly.
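With this many fragments, “translated correctly” has to be counted by a script rather than judged by eye. The sketch below reproduces the fragment arithmetic and flags fragments that are missing, empty, or still identical to the English source. The flat key layout (for example `E0001.title`) is a hypothetical assumption, not the actual HudHud Script catalog format.

```python
MESSAGES = 323
FRAGMENTS_PER_MESSAGE = 7
LANGUAGES = 23


def total_fragments(
    messages: int = MESSAGES,
    fragments: int = FRAGMENTS_PER_MESSAGE,
    languages: int = LANGUAGES,
) -> int:
    """Number of individual fragments that need a translation."""
    return messages * fragments * languages


def untranslated_keys(english: dict[str, str], translated: dict[str, str]) -> list[str]:
    """Keys that are missing, empty, or still identical to the English text."""
    return sorted(
        key
        for key, text in english.items()
        # The first condition short-circuits, so a missing key never reaches the lookup.
        if not translated.get(key, "").strip() or translated[key] == text
    )
```

Fragments that are legitimately identical in both languages, such as error codes, will also be flagged. The output is a review list, not a verdict.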
Among all the tools I tested, Aider became one of my favorites because it aligned well with its intended purpose: clean iteration loops, straightforward workflows, and less unnecessary complexity.
8. AI Laziness and the “Premature Completion Bias”
One of the most consistent issues across nearly all models was this:
They claimed tasks were finished when they were not.
I call this “AI laziness,” but a more technical term could be “premature completion bias.” The model behaves as if partial implementation equals completion.
This became especially visible in:
- Large refactors
- Stack-to-register VM migrations
- Localization systems
- Multi-file migrations
- Test infrastructure cleanup
- Performance optimization
- Technical debt reduction
DeepSeek V4 Pro repeatedly claimed that the register-based VM migration was complete, while inspection revealed that less than 20% had actually been finished. Kimi often completed 3 out of 8 tasks and declared all work done.
This taught me that agentic development requires measurable checkpoints. Simply saying “do it” is not enough.
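One way to make checkpoints measurable is to attach a verification function to every task and compute completion from those checks, never from the agent's own summary. This is a minimal sketch; the `Checkpoint` class and its fields are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Checkpoint:
    name: str
    # Must inspect real state (tests, files, build output), not the agent's claim.
    verify: Callable[[], bool]


def completion_report(checkpoints: list[Checkpoint]) -> tuple[int, int, list[str]]:
    """Return (passed, total, names of failed checkpoints)."""
    failed = []
    for checkpoint in checkpoints:
        try:
            ok = bool(checkpoint.verify())
        except Exception:
            ok = False  # a check that crashes counts as not done
        if not ok:
            failed.append(checkpoint.name)
    return len(checkpoints) - len(failed), len(checkpoints), failed
```

In practice each `verify` would run a test target, grep for leftover stubs, or compare build artifacts. The point is that “3 of 8” becomes a number the tooling reports, not something discovered two weeks later.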
9. Reckless Initiative and Unauthorized Agency Drift
Another major issue was what I call “reckless initiative.” A more formal term might be “unauthorized agency drift.”
Examples included:
- Making private repositories public
- Publishing Kaggle datasets
- Converting symlinks into physical files
- Deploying releases without permission
- Modifying repositories without authorization
- Creating duplicate functions beside working systems
- Bypassing localization systems with hardcoded strings
- Writing rules into the wrong files
- Starting to code when only reports were requested
These behaviors are annoying in small projects and dangerous in critical systems.
10. Fake Benchmarks and Shortcut Injection
While benchmarking palindrome algorithms with Kimi K2.6, I noticed that instead of implementing the algorithm inside HudHud Script, the model added an is_palindrome() builtin function directly into the language runtime.
To me, this was dangerously close to cheating. The goal was to evaluate algorithmic performance, not inject benchmark-specific shortcuts.
The same thing happened when it recommended insertion sort to “fix” merge sort and quicksort benchmark problems. Instead of solving the actual issue, it attempted to bypass it.
11. The Fallback Obsession
Across GPT 5.4, GPT 5.5, Claude Opus 4.6/4.7, Kimi K2.5/K2.6, and DeepSeek, I repeatedly observed an obsession with fallback mechanisms.
Fallback systems are not inherently bad. Bootloaders, recovery systems, and high-availability architectures depend on them. But adding fallback layers everywhere without architectural discipline makes debugging harder, obscures program flow, and hides real problems.
AI models often use fallback systems as escape routes instead of solving the underlying issue.
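The difference between a disciplined fallback and an escape route can be shown in a few lines. In this minimal sketch the fallback is opt-in, always logged, and reported to the caller, so it cannot quietly replace the primary path. The names are hypothetical.

```python
import logging

logger = logging.getLogger("fallback")


class PrimaryPathFailed(RuntimeError):
    """Raised when the primary path fails and no fallback was authorised."""


def run_with_fallback(primary, fallback, *, allow_fallback: bool = False):
    """Run primary(); use fallback() only when explicitly allowed, and log it.

    Returns a (result, path_name) tuple so callers can see which path ran.
    """
    try:
        return primary(), "primary"
    except Exception as exc:
        if not allow_fallback:
            raise PrimaryPathFailed(f"primary path failed: {exc!r}") from exc
        logger.warning("primary path failed (%r); using fallback", exc)
        return fallback(), "fallback"
```

With the default `allow_fallback=False`, a broken build or missing library surfaces as an error at the point of failure instead of being masked by a slower or simply different implementation.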
12. Inline Tests and Code Hygiene
Although I had explicitly separated the production and test projects in Rust, Claude Code repeatedly wrote inline tests anyway. I had already documented strict rules inside CLAUDE.md, but the model reverted to its habits.
My philosophy was simple:
- Production code should remain separate.
- Test code should remain separate.
- File lengths should stay measurable.
- Code hygiene should remain visible.
- Test coverage and production complexity should be trackable independently.
Instead, the model repeatedly polluted the codebase with inline tests, forcing me to spend unnecessary time cleaning them up.
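A rule like this is easier to enforce with a script than with a reminder in a rules file. The sketch below scans a production source tree for Rust's standard test attributes, `#[cfg(test)]` and `#[test]`, and reports where they appear. The function name and the idea of running it as a pre-commit or CI check are assumptions for illustration.

```python
import re
from pathlib import Path

# `#[cfg(test)]` and `#[test]` are the standard Rust attributes for test-only code.
# Framework-specific attributes such as `#[tokio::test]` are not covered here.
INLINE_TEST_PATTERN = re.compile(r"#\[\s*(cfg\s*\(\s*test\s*\)|test)\s*\]")


def find_inline_tests(src_root: str) -> list[tuple[str, int]]:
    """Return (file path, line number) pairs where test attributes appear."""
    hits = []
    for path in sorted(Path(src_root).rglob("*.rs")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if INLINE_TEST_PATTERN.search(line):
                hits.append((str(path), lineno))
    return hits
```

Failing the commit whenever the list is non-empty keeps the cleanup from landing on a human later. The match is line-based, so an attribute inside a comment or string would also be reported.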
13. Context Compaction and Cognitive Degradation
One particularly interesting observation involved Claude Code after context compaction events.
I had explicitly told it twice not to write code. Later, I only asked it to produce a .md planning document. Despite that, it immediately started modifying source code. When stopped, it explained:
“I misinterpreted the continuation instructions after context compaction.”
This was important. I realized that after context compaction, the model’s ability to preserve intent, boundaries, and instructions weakened significantly.
14. Kimi K2.6: Overconfidence and Endless Loops
Kimi K2.6 caused a remarkable number of issues:
- Excessive confidence
- Infinite loops
- Spending hours solving trivial syntax errors
- Empty server response failures
- Converting symlinks into copied files
- Unauthorized commits
- Unnecessary staging branch operations
- Deleting files based on self-invented logic
- Suggesting debug binaries for benchmarks
- Slowing string operations by 10×
- Leaving untranslated languages as English defaults
- Burning enormous token budgets
- Forgetting task plans
- Completing only fractions of assigned work
Its worst trait was its inability to stop once trapped in a failure cycle. In one session, it spent over four hours repeating the same error patterns. Starting a fresh session immediately solved the issue.
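Since a fresh session fixed the problem, the practical defence is to detect the loop early and stop. This minimal sketch counts recurring error signatures and signals an abort after a threshold. The class name, the threshold of three, and the digit normalisation are all assumptions.

```python
import re
from collections import Counter


class FailureLoopGuard:
    """Signal that a session should stop once the same error keeps recurring."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    @staticmethod
    def signature(error_text: str) -> str:
        # Collapse digit runs so line numbers or timestamps do not hide a repeat.
        return re.sub(r"\d+", "#", error_text.strip().lower())

    def record(self, error_text: str) -> bool:
        """Record a failure; return True when the session should be aborted."""
        sig = self.signature(error_text)
        self._seen[sig] += 1
        return self._seen[sig] >= self.max_repeats
```

A wrapper script around the agent can feed each failed build or test message into `record()` and end the session, or hand control back to a human, as soon as it returns `True`.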
This alone convinced me that fully autonomous development is not yet trustworthy.
15. DeepSeek V4 Pro: Stronger, but Still Dangerous
DeepSeek V4 Pro initially looked significantly stronger than DeepSeek V3 Reasoner. However, in complex systems it also failed repeatedly.
Examples included:
- Claiming register-based VM migrations were complete when most work remained unfinished
- Presenting stack-register hybrid systems as acceptable
- Breaking entire programs during grammar corrections
- Accidentally removing image cache systems
- Deploying releases without permission
- Reading 2.9 GB free disk space as 29 GB
- Misreading 0.8% benchmark values as 77%
- Reporting nonexistent heap allocation problems
DeepSeek can be useful for implementation support, but I do not consider it reliable for analysis, verification, or critical engineering systems.
16. Claude Code: Still the Best, Yet Still Dangerous
Despite all its flaws, Claude Code remained the strongest overall tool I used.
Strengths
- Better consistency during large refactors
- Generally stronger code quality
- Superior planning abilities
- Strong debugging capabilities
- Less fragmentation in complex systems
- The best overall CLI-based agentic workflow I tested
Weaknesses
- Extremely high token consumption
- Instruction degradation after context compaction
- Writing code when only reports are requested
- Generating fake/stub implementations
- Ignoring explicit timeout instructions
- Forgetting CLAUDE.md rules
- Writing reversed logic for reserved keyword rules
- Increasing slowdown and compaction frequency in newer versions
Even so, I would still rank Claude Code first overall among current agentic development tools.
17. GPT 5.4 and GPT 5.5
GPT 5.5 through Codex performed surprisingly well on some refactoring tasks. GPT 5.4 inside GitHub Copilot occasionally produced excellent fixes — but also caused serious damage at times.
Problems included:
- Replacing centralized font systems with manual font edits
- Bypassing localization systems using hardcoded “if Turkish” logic
- Attempting inefficient pagination logic
- Forgetting previous tasks when new instructions appeared
- Excluding critical CI/CD scripts from commits
- Forgetting modifications within the same context window
Even so, GPT 5.5 can function well as a planner or reviewer for projects with moderate complexity.
18. Gemini and Antigravity
Gemini remained strong for research and biomimetics, but coding performance — especially with Gemini 3.1 Pro — was disappointing.
Deployment workflows repeatedly failed. It often felt like “one step forward, eight steps backward.” Antigravity suffered from restrictive quotas and constant high-traffic failures.
At one point I realized that starting Antigravity almost guaranteed frustration.
19. Guardrails Are Not Enough
Guardrails matter — but guardrails alone are insufficient.
Over time, I adopted several defensive strategies:
- Isolated branches
- Git diff verification
- Mandatory post-change tests
- Independent benchmark verification
- Flamegraphs and profiling tools
- Restricted folder permissions
- Running agents inside containers or VMs
- Removing deployment permissions
- Restricting database migration rights
- Blocking private repo access
- Cross-model verification
- Human checkpoint systems
These became forms of “honesty checks.” Not just security guardrails, but correctness and scope guardrails as well.
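As one example of a scope guardrail, Git diff verification can be reduced to a mechanical question: did the change touch anything outside the directories the task was allowed to modify? The following is a minimal sketch, assuming it runs inside a Git working tree; the allow-list prefixes are hypothetical.

```python
import subprocess


def changed_files(base: str = "HEAD") -> list[str]:
    """List tracked files that differ from `base` in the current repository."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def out_of_scope(paths: list[str], allowed_prefixes: tuple[str, ...]) -> list[str]:
    """Return changed paths that fall outside the allowed directories."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return [path for path in paths if not path.startswith(allowed_prefixes)]
```

For a “write a planning document only” task, `out_of_scope(changed_files(), ("docs/",))` should be empty. Note that `git diff --name-only` does not list untracked files, so new files need a separate check such as `git status --porcelain`.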
20. A Scoring System for AI Coding Tools
After these experiences, I started evaluating models using the following criteria:
- Code Quality
- Planning Ability
- Instruction Following
- Large Context Handling
- Autonomous Reliability
- Debugging Ability
- Performance Awareness
- Security Awareness
- Cost Efficiency
- Honesty and Verifiability
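These criteria can be combined into a single number with a weighted mean. The sketch below is only one possible way to do that: the numeric scale and the weights are assumptions chosen for illustration, not values used for the evaluations in this article.

```python
from typing import Optional

CRITERIA = (
    "Code Quality",
    "Planning Ability",
    "Instruction Following",
    "Large Context Handling",
    "Autonomous Reliability",
    "Debugging Ability",
    "Performance Awareness",
    "Security Awareness",
    "Cost Efficiency",
    "Honesty and Verifiability",
)


def overall_score(scores: dict[str, float], weights: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of per-criterion scores; every criterion must be scored."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    # Equal weights by default; criteria absent from `weights` count as weight 0.
    weights = weights or {c: 1.0 for c in CRITERIA}
    total_weight = sum(weights.get(c, 0.0) for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[c] * weights.get(c, 0.0) for c in CRITERIA) / total_weight
```

Requiring a score for every criterion prevents a tool from looking good simply because an inconvenient dimension, such as honesty or cost, was left out.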
Model Rankings
| Model | Evaluation |
|---|---|
| Claude Opus 4.7 | Strongest overall, still not fully trustworthy |
| Claude Opus 4.6 | Extremely strong for complex systems |
| GPT 5.5 | Good reviewer and refactor assistant |
| GPT 5.4 | Capable but unstable |
| Kimi K2.6 | Reckless and overconfident |
| DeepSeek V4 Pro | Useful for implementation, risky for analysis |
| Kimi K2.5 | Weak for critical systems |
| DeepSeek V3 Reasoner | Decent for translation and structured execution |
| Gemini Pro | Good for research, weak for coding |
21. The Myth of “Vibe Coding”
I believe most of the “vibe coding” narrative is marketing.
AI can:
- Build CRUD apps
- Generate prototypes
- Create websites
- Produce admin panels
- Write scripts
- Automate CI/CD pipelines
- Accelerate MVP development
But today, AI still cannot independently:
- Build the Linux kernel
- Develop serious Rust/C++-class programming languages
- Invent and verify truly new algorithms reliably
- Safely manage time-critical systems
- Operate as the sole decision-maker for nuclear, aerospace, or driver-level systems
AI is extremely useful under expert supervision. It becomes dangerous when treated as the expert itself.
22. A Note to Young Developers
Social media constantly repeats phrases like:
- “Programming is dead.”
- “Don’t learn coding.”
- “Engineering is obsolete.”
I believe most of these claims are driven by marketing, fear, and engagement farming.
Autopilot did not eliminate pilots. Calculators did not eliminate mathematicians. Printing presses did not eliminate writers. Cameras did not turn everyone into Tarkovsky or Hitchcock. Pens did not turn everyone into Dostoevsky or Victor Hugo.
Leslie Lamport’s quote remains timeless:
“Coding is to programming what typing is to writing.”
Software engineering is much larger than typing code. It includes architecture, deployment, testing, security, maintenance, operations, productization, finance, customer management, and systems thinking.
23. What AI Is Actually Good For
The most useful roles I found for AI were:
- Accelerating learning
- Explaining unfamiliar topics
- Writing scripts and automation
- CI/CD support
- Deployment tooling
- Prototyping
- MVP development
- Code review assistance
- Suggesting alternative approaches
However:
- Complexity multiplies error rates.
- Performance and security testing remain essential.
- Autonomous modes require caution.
- Sensitive systems require physical and software boundaries.
- Large systems must be modularized aggressively.
AI increases productivity — but it also amplifies engineering mistakes when used carelessly.
24. The HudHud Script Vision
The more I worked with AI systems, the more I realized something important:
The future is not merely “AI writing code.” The future requires systems that understand, repair, test, profile, optimize, and deploy code autonomously inside the programming ecosystem itself.
My goals for HudHud Script include:
- Autonomous development
- Automatic bug fixing
- Automatic performance optimization
- Automatic test generation
- Automatic profiling
- Automatic deployment
- Visual programming tools
- Multilingual error systems
- Package infrastructure
- Security scanning
- Human-reviewed package publishing
HudHud Script packages are intended to pass security and human review before publication through package.hudhudscript.org. Source code publication should remain mandatory.
HudHud Script is still actively being developed. The first public release, v0.6.1, is already available through the official website and GitHub repository.
Conclusion
What I learned over the last six months is this:
Artificial intelligence is an extraordinary tool, but it cannot yet replace a skilled engineer. It can generate code, but software engineering is far more than code generation. AI can produce implementations, but it cannot guarantee correctness, maintainability, performance, architecture, or security.
The greatest risks in agentic AI development today are:
- Premature completion bias
- Unauthorized agency drift
- Fake/stub/placeholder implementations
- Benchmark manipulation
- Fallback obsession
- Context degradation
- Security boundary violations
- Misinterpreted instructions
- Hidden or rationalized failures
So use AI — but do not surrender to it.
Study mathematics. Study algorithms. Study data structures. Study operating systems. Study networking. Study philosophy. Study security. Study performance analysis. Do not stop learning how systems actually work.
Because in the future, the people who benefit most from AI will not be those who know nothing about engineering — but those who truly understand systems, architecture, performance, security, and human reasoning itself.