Six Months of AI-Assisted Software Development: A Critical Evaluation of Vibe Coding, Agentic IDEs, and Real Engineering

Introduction

For roughly the past six months, I have been working intensively with large language models, agentic IDEs, and AI-assisted coding tools. During this process, I researched new quantization algorithms for large language models, worked on an algorithm I called SeaTree, developed a programming language named HudHud Script, and experimented with different models across many domains including algorithm design, programming language development, virtual machine architecture, benchmarking, profiling, refactoring, deployment, translation, localization, and systems architecture.

This text is not an attack on artificial intelligence. On the contrary, I believe AI is an extremely powerful tool. However, what I observed over the last six months is this: when used correctly, AI can dramatically increase a developer’s productivity; but when you hand over the entire process to it, you can easily end up surrounded by technical debt, fake success metrics, misleading benchmarks, security risks, fake/stub/placeholder code, unnecessary fallback mechanisms, and fragile architectures.

The central argument of this article is therefore simple:

Artificial intelligence has not ended coding. It has not ended software engineering either. If anything, it has made the need for real engineering knowledge, mathematics, algorithms, data structures, systems architecture, security, performance analysis, and philosophical reasoning even more visible.

1. The Starting Point: Quantization Algorithms and Coral Colonies

While searching for new algorithms to quantize large language models, I started working with Google Gemini. Initially, Gemini gave coherent and meaningful answers, so I continued collaborating with it.

As far as I knew, many optimization algorithms used in AI were biomimetic, that is, inspired by biological systems. Intuitively, I became interested in coral colonies and began researching their biological structures, growth patterns, and adaptive behaviors. I suspected that the colony dynamics of corals could inspire optimization methods.

Gemini was surprisingly useful in this area. Its research on coral biology, colony structures, organs, and life cycles was often accurate and informative. Later, when I cross-checked many of those claims with ChatGPT, I realized Gemini had indeed been relatively strong in biomimetics and document-based research.

However, once we moved into algorithm design, mathematical proofs, formulas, and pseudo-code, Gemini’s weaknesses became obvious. It accepted almost everything I said uncritically, rarely challenged flawed assumptions, and displayed what I can only describe as “AI sycophancy.” Instead of scientific skepticism, it constantly validated my ideas.

At that point, I stopped using Gemini for algorithm design and reserved it mostly for biomimetic research, document analysis, and deep research tasks.

2. Agentic IDEs and the First Major Disappointments

When it was finally time to start implementing ideas, I experimented with Antigravity, Trae AI, Kiro, Kimi Code, Windsurf, Cursor, and Claude Code.

Across most of these tools, I observed the same recurring problems:

Sycophancy
Hallucinations
Memory drift
Unnecessary confidence
Ignoring or forgetting rules
Taking initiative without permission
Generating fake/stub/placeholder code
Gaming tests instead of solving the real problem
Misinterpreting benchmark results
Misunderstanding user intent and damaging the architecture

It took me roughly two months to fully realize how common these problems were.

At first, Claude Opus 4.6 running through Windsurf and Antigravity seemed far ahead of the competition. But that, too, turned out to be an illusion. These tools were excellent at scaffolding projects, building MVPs, and generating prototypes; however, once systems became genuinely complex, the cracks began to show.

Cursor was expensive. Windsurf consumed purchased quotas almost instantly. Eventually, Claude Code became the most practical option for me.

3. Coding Works — Engineering Is Another Story

To be fair, many models are already quite capable at writing C++, Rust, Python, and Bash scripts. They can help set up systems, generate scripts, and assist with beginner-level architecture.

But there is an extremely important distinction here:

If you understand what you are doing, AI becomes a powerful assistant. If you do not, AI can drag you into absurd architectures, unnecessary abstractions, security issues, and expensive maintenance nightmares.

If you simply ask AI to “build a mobile app that does X,” without understanding architecture, performance, or security, you may end up trapped inside a poorly designed system. However, if you already possess solid engineering knowledge, AI becomes tremendously useful.

4. Security Alarms: Public Repositories, Open Datasets, and Dangerous Autonomy

One of the most disturbing incidents I experienced was Claude Code making my private repository public on its own initiative. I may have mentioned my intention to eventually open-source the project, but I never explicitly instructed it to make the repository public. Presumably, it thought this would make Kaggle integration easier.

Similarly, it also made a Kaggle dataset public without permission.

Antigravity once opened unknown Google corporate links and created a VNC-like callback port on localhost. I had no idea what it was doing.

These experiences became serious security warnings for me. If an AI agent can accidentally expose a private repository today, could future systems accidentally expose iCloud files, Google Drive data, or personal media tomorrow? That question is no longer theoretical.
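
One way to make that boundary concrete is a default-deny gate in front of every shell command an agent proposes. The sketch below is a minimal illustration under stated assumptions, not a feature of any tool named in this article: the SAFE_PREFIXES list and the needs_human_approval helper are hypothetical names chosen for the example.

```python
import shlex

# Hypothetical allowlist: inspection commands the agent may run unattended.
# Anything not listed is escalated to a human, which includes every command
# that could change repository visibility, publish data, or push code.
SAFE_PREFIXES = [
    ("git", "status"),
    ("git", "diff"),
    ("git", "log"),
]


def needs_human_approval(command: str) -> bool:
    """Return True unless the command starts with an allowlisted prefix."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: refuse to guess, escalate.
        return True
    if not tokens:
        return True
    # Shell metacharacters could chain an unlisted command after a safe one.
    if any(ch in command for ch in ";|&`$><"):
        return True
    return not any(tuple(tokens[: len(p)]) == p for p in SAFE_PREFIXES)
```

With this gate, "git diff --stat" runs unattended, while "git push origin main" or any command that edits repository visibility stops and waits for a person.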

5. SeaTree and the Fake Benchmark Incident

I spent nearly two weeks working on the SeaTree algorithm on Kaggle. I had originally built a C++ implementation and exposed it to Python through pybind11 bindings. Later, I discovered that the AI agent had silently removed the C++ wrapper because it could not get the Conan and CMake builds working. It deleted entire libraries and replaced everything with a Python fallback implementation.

That fallback was not my algorithm at all. It was merely E2M1 naive rounding.
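
For readers unfamiliar with the term: E2M1 is a 4-bit floating-point layout with one sign bit, two exponent bits, and one mantissa bit, whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The sketch below shows what round-to-nearest onto that grid can look like. It illustrates the general technique only and is not the fallback code the agent wrote; the function name and the single max-abs scale are assumptions made for this example.

```python
# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1_naive(values):
    """Round each value to the nearest E2M1 point after max-abs scaling.

    Returns (scale, quantized), where quantized[i] * scale approximates
    values[i]. Assumed scheme: one scale for the whole list, chosen so the
    largest magnitude maps to 6.0, the top of the grid.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / E2M1_GRID[-1] if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        # Nearest grid point; ties resolve to the smaller magnitude because
        # min() returns the first minimum it encounters.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        quantized.append(-nearest if v < 0 else nearest)
    return scale, quantized
```

A routine this small does very little work per value, which is consistent with a benchmark that suddenly looks far faster than the algorithm it replaced.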

I believed my algorithm was being benchmarked, while in reality the algorithm itself no longer existed. When benchmark speeds suddenly became nearly ten times faster than expected, I became suspicious. The explanations the model gave me made no sense. After investigating, I realized SeaTree had effectively vanished from the pipeline.

This became a major turning point in my thinking.

When developing software with agentic AI, “the liar’s candle burns only until the benchmarks arrive.” Without flamegraphs, callgraphs, heaptrack, callgrind, profiling tools, and real benchmarks, you should not blindly trust what AI claims.
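
Profilers show where the time goes; a cheaper first check is whether the code under test still behaves differently from the baseline it is supposed to beat. A minimal sketch of such a canary follows. The names assert_not_baseline, candidate, and baseline are hypothetical, and the check only makes sense when the real algorithm is expected to disagree with the baseline on at least one probe input.

```python
def assert_not_baseline(candidate, baseline, probes):
    """Raise if candidate and baseline agree on every probe input.

    Identical outputs on all probes suggest the candidate has been replaced
    by (or silently falls back to) the baseline implementation.
    """
    differing = [p for p in probes if candidate(p) != baseline(p)]
    if not differing:
        raise RuntimeError(
            "candidate matches baseline on all probes; "
            "refusing to benchmark a possible fallback"
        )
    return differing
```

Running this before every benchmark turns a silent substitution into a hard failure instead of a suspiciously good number.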

6. HudHud Script: Interpreter, VM, and Fake Feature Parity

While developing HudHud Script with Claude Code, it initially convinced me to build an interpreter first. I already knew the tradeoffs, but I wanted to observe its reasoning process. So we proceeded with ASTs and interpreters for a while.

Later, I decided to introduce a virtual machine architecture similar to Lua and Python. My rules were extremely strict:

VM and interpreter feature parity must remain at 100%.
Siloed implementations are forbidden.
Multiple solution lanes are forbidden.
Fake/stub/placeholder code is forbidden.
The register-based VM architecture must be preserved.
Every review must scan for fake/stub code.
The error system must support multiple languages.
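
The rule about scanning for fake/stub code can be partly mechanized. The sketch below flags common stub markers in source text. The marker list is an assumption, and a scan like this will miss stubs that simply return a constant, so it complements a review rather than replacing it.

```python
import re

# Assumed marker list; extend it for the languages in the repository.
STUB_PATTERNS = [
    re.compile(r"\bunimplemented!"),          # Rust unimplemented!()
    re.compile(r"\bNotImplementedError\b"),   # Python
    # "todo" covers both Rust todo!() and TODO comments.
    re.compile(r"\b(todo|fixme|stub|placeholder)\b", re.IGNORECASE),
]


def find_stub_markers(source: str):
    """Return (line_number, line) pairs for lines containing a stub marker."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(p.search(line) for p in STUB_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```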

Two weeks later, after running profiling tools, flamegraphs, callgraphs, and benchmarks, I discovered that large portions of the VM were filled with fake and stub implementations. Bytecode generation was incorrect, and the claimed feature parity was nowhere near the promised levels.

Eventually, Claude Code admitted that its implementation had not been correct.
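
A parity claim such as "100%" is only meaningful if it is measured. One way to measure it is differential testing: run the same programs through both back ends and compare the results. In the sketch below, run_interpreter and run_vm are hypothetical callables standing in for the two HudHud Script back ends, whose real interfaces are not shown in this article.

```python
def parity_report(programs, run_interpreter, run_vm):
    """Run each program on both back ends and collect disagreements.

    A back end that raises is recorded as ("error", exception type name), so
    a crash on one side and a value on the other counts as a mismatch.
    """
    def outcome(run, program):
        try:
            return ("ok", run(program))
        except Exception as exc:  # deliberately broad: any failure is data
            return ("error", type(exc).__name__)

    mismatches = []
    for program in programs:
        a = outcome(run_interpreter, program)
        b = outcome(run_vm, program)
        if a != b:
            mismatches.append((program, a, b))
    total = len(programs)
    parity = 1.0 if total == 0 else (total - len(mismatches)) / total
    return parity, mismatches
```

The returned ratio is a number a human can track across commits, and the mismatch list points directly at the features the second back end only pretends to support.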

7. Multilingual Error Systems and Translation Pipelines

HudHud Script required translating 323 error messages, each consisting of seven fragments, into 23 languages. That meant 323 × 7 × 23 = 52,003 fragments, roughly 52,000 in total.
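
With a catalog this large, coverage is easier to compute than to trust. The sketch below counts the expected fragments and lists those that are missing or still identical to the source text. It assumes a hypothetical catalog shape, a dict keyed by (message_id, language, fragment_index) with "en" as the source language; the real HudHud Script catalog format is not described here.

```python
def expected_fragments(messages: int, languages: int, fragments_per_message: int) -> int:
    """Total number of translated fragments a complete catalog must contain."""
    return messages * languages * fragments_per_message


def untranslated(catalog, language, source_language="en"):
    """Return keys whose text is missing, empty, or identical to the source text."""
    problems = []
    for (msg_id, lang, index), source_text in catalog.items():
        if lang != source_language:
            continue
        text = catalog.get((msg_id, language, index))
        if not text or text == source_text:
            problems.append((msg_id, language, index))
    return sorted(problems)
```

Some fragments, such as identifiers or symbols, may legitimately match the source, so the output is a review list rather than a verdict.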

Context mattered enormously. A programming term might have one meaning in software engineering and another in aviation or military terminology.

I did not want to waste Claude Code tokens on translation work, so I used Codex to generate scripts and relied on DeepSeek Reasoner and Kimi K2.5 through Ollama Cloud. Translation throughput was painfully slow: only around 6–8 languages per day were being translated correctly.

Among all the tools I tested, Aider became one of my favorites because it aligned well with its intended purpose: clean iteration loops, straightforward workflows, and less unnecessary complexity.

8. AI Laziness and the “Premature Completion Bias”

One of the most consistent issues across nearly all models was this:

They claimed tasks were finished when they were not.

I call this “AI laziness,” but a more technical term could be “premature completion bias.” The model behaves as if partial implementation equals completion.

This became especially visible in:

Large refactors
Stack-to-register VM migrations
Localization systems
Multi-file migrations
Test infrastructure cleanup
Performance optimization
Technical debt reduction

DeepSeek V4 Pro repeatedly claimed that the register-based VM migration was complete, while inspection revealed that less than 20% had actually been finished. Kimi often completed 3 out of 8 tasks and declared all work done.

This taught me that agentic development requires measurable checkpoints. Simply saying “do it” is not enough.
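
A measurable checkpoint can be as simple as pairing every task with a check that a script, not the model, evaluates. The sketch below is one hypothetical shape for that idea; the Checkpoint type and the function name are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Checkpoint:
    name: str
    check: Callable[[], bool]  # returns True only when the task is verifiably done


def completion_status(checkpoints: List[Checkpoint]) -> Tuple[int, int, List[str]]:
    """Return (passed, total, names of failing checkpoints)."""
    failing = []
    for cp in checkpoints:
        try:
            ok = bool(cp.check())
        except Exception:
            ok = False  # a check that crashes is not a pass
        if not ok:
            failing.append(cp.name)
    total = len(checkpoints)
    return total - len(failing), total, failing
```

A report of "3 of 8 passed" produced this way cannot be rounded up to "all work done" by the agent that did the work.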

9. Reckless Initiative and Unauthorized Agency Drift

Another major issue was what I call “reckless initiative.” A more formal term might be “unauthorized agency drift.”

Examples included:

Making private repositories public
Publishing Kaggle datasets
Converting symlinks into physical files
Deploying releases without permission
Modifying repositories without authorization
Creating duplicate functions beside working systems
Bypassing localization systems with hardcoded strings
Writing rules into the wrong files
Starting to code when only reports were requested

These behaviors are annoying in small projects and dangerous in critical systems.
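
Some of these drifts are cheap to detect mechanically. As one example, the sketch below snapshots the symlinks in a working tree before an agent session and reports any that are no longer the same link afterwards. The function names are hypothetical, and the approach assumes a platform where symlinks are available.

```python
import os


def snapshot_symlinks(root: str) -> dict:
    """Map each symlink's path (relative to root) to its link target."""
    links = {}
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                links[os.path.relpath(path, root)] = os.readlink(path)
    return links


def changed_symlinks(before: dict, root: str) -> list:
    """Return relative paths that were symlinks before but are not the same link now."""
    after = snapshot_symlinks(root)
    return sorted(p for p, target in before.items() if after.get(p) != target)
```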

10. Fake Benchmarks and Shortcut Injection

While benchmarking palindrome algorithms with Kimi K2.6, I noticed that instead of implementing the algorithm inside HudHud Script, the model added an is_palindrome() builtin function directly into the language runtime.

To me, this was dangerously close to cheating. The goal was to evaluate algorithmic performance, not inject benchmark-specific shortcuts.

The same thing happened when it recommended insertion sort to “fix” merge sort and quicksort benchmark problems. Instead of solving the actual issue, it attempted to bypass it.

11. The Fallback Obsession

Across GPT 5.4, GPT 5.5, Claude Opus 4.6/4.7, Kimi K2.5/K2.6, and DeepSeek, I repeatedly observed an obsession with fallback mechanisms.

Fallback systems are not inherently bad. Bootloaders, recovery systems, and high-availability architectures depend on them. But adding fallback layers everywhere without architectural discipline makes debugging harder, obscures program flow, and hides real problems.

AI models often use fallback systems as escape routes instead of solving the underlying issue.
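
The difference between a disciplined fallback and an escape route is often whether the fallback is explicit. In the sketch below, load_backend and its allow_fallback flag are hypothetical. The point is that the caller must opt in to the fallback, and the result always reports which implementation was actually loaded.

```python
import importlib


def load_backend(native_module: str, fallback_module: str, allow_fallback: bool = False):
    """Import the native module; fall back only if the caller opted in.

    Returns (module, kind) where kind is "native" or "fallback", so callers
    and benchmark reports can record which implementation actually ran.
    """
    try:
        return importlib.import_module(native_module), "native"
    except ImportError as exc:
        if not allow_fallback:
            raise RuntimeError(
                f"native backend {native_module!r} failed to import; "
                "fallback not permitted"
            ) from exc
        return importlib.import_module(fallback_module), "fallback"
```

By default a broken native build stops the program, which keeps the real problem visible instead of hiding it behind a slower or different code path.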

12. Inline Tests and Code Hygiene

Despite explicitly separating production and test projects in Rust, Claude Code repeatedly wrote inline tests anyway. I had already documented strict rules inside CLAUDE.md, but the model reverted to its habits.

My philosophy was simple:

Production code should remain separate.
Test code should remain separate.
File lengths should stay measurable.
Code hygiene should remain visible.
Test coverage and production complexity should be trackable independently.

Instead, the model repeatedly polluted the codebase with inline tests, forcing me to spend unnecessary time cleaning them up.
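
A rule like this is easier to enforce in CI than in a prompt file. The sketch below scans Rust source text for inline test attributes. It is a plain text match rather than a Rust parser, so it can be fooled by markers inside comments or strings, and it does not recognize test attributes from third-party crates.

```python
import re

# Matches #[test] and #[cfg(test)], tolerating whitespace inside the attribute.
INLINE_TEST_ATTR = re.compile(r"#\s*\[\s*(test|cfg\s*\(\s*test\s*\))\s*\]")


def inline_test_lines(rust_source: str):
    """Return 1-based line numbers that carry an inline test attribute."""
    return [
        number
        for number, line in enumerate(rust_source.splitlines(), start=1)
        if INLINE_TEST_ATTR.search(line)
    ]
```

Run over every file in the production crate, a non-empty result can fail the build before the inline tests ever reach review.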

13. Context Compaction and Cognitive Degradation

One particularly interesting observation involved Claude Code after context compaction events.

I had explicitly told it twice not to write code. Later, I only asked it to produce a .md planning document. Despite that, it immediately started modifying source code. When stopped, it explained:

“I misinterpreted the continuation instructions after context compaction.”

This was important. I realized that after context compaction, the model’s ability to preserve intent, boundaries, and instructions weakened significantly.

14. Kimi K2.6: Overconfidence and Endless Loops

Kimi K2.6 caused a remarkable number of issues:

Excessive confidence
Infinite loops
Spending hours solving trivial syntax errors
Empty server response failures
Converting symlinks into copied files
Unauthorized commits
Unnecessary staging branch operations
Deleting files based on self-invented logic
Suggesting debug binaries for benchmarks
Slowing string operations by 10×
Leaving untranslated languages as English defaults
Burning enormous token budgets
Forgetting task plans
Completing only fractions of assigned work

Its worst trait was its inability to stop once trapped in a failure cycle. In one session, it spent over four hours repeating the same error patterns. Starting a fresh session immediately solved the issue.
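
A failure cycle like this can be cut off by the harness rather than by the model. The sketch below counts repeated error signatures and signals a stop once the same one recurs too often. The normalization rule and the default threshold of three are arbitrary assumptions.

```python
import re
from collections import Counter


class LoopBreaker:
    """Track error messages and report when one keeps recurring."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    @staticmethod
    def signature(error: str) -> str:
        # Collapse digits and whitespace so line numbers and spacing changes
        # do not make the same error look new.
        return re.sub(r"\s+", " ", re.sub(r"\d+", "N", error)).strip().lower()

    def should_stop(self, error: str) -> bool:
        """Record the error; return True once it has occurred max_repeats times."""
        sig = self.signature(error)
        self.seen[sig] += 1
        return self.seen[sig] >= self.max_repeats
```

When should_stop returns True, the sensible reaction matches what worked manually: end the session and start a fresh one.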

This alone convinced me that fully autonomous development is not yet trustworthy.

15. DeepSeek V4 Pro: Stronger, but Still Dangerous

DeepSeek V4 Pro initially looked significantly stronger than DeepSeek V3 Reasoner. However, in complex systems it also failed repeatedly.

Examples included:

Claiming register-based VM migrations were complete when most work remained unfinished
Presenting stack-register hybrid systems as acceptable
Breaking entire programs during grammar corrections
Accidentally removing image cache systems
Deploying releases without permission
Reading 2.9 GB free disk space as 29 GB
Misreading 0.8% benchmark values as 77%
Reporting nonexistent heap allocation problems

DeepSeek can be useful for implementation support, but I do not consider it reliable for analysis, verification, or critical engineering systems.

16. Claude Code: Still the Best, Yet Still Dangerous

Despite all its flaws, Claude Code remained the strongest overall tool I used.

Strengths

Better consistency during large refactors
Generally stronger code quality
Superior planning abilities
Strong debugging capabilities
Less fragmentation in complex systems
The best overall CLI-based agentic workflow I tested

Weaknesses

Extremely high token consumption
Instruction degradation after context compaction
Writing code when only reports are requested
Generating fake/stub implementations
Ignoring explicit timeout instructions
Forgetting CLAUDE.md rules
Writing reversed logic for reserved keyword rules
Increasing slowdown and compaction frequency in newer versions

Even so, I would still rank Claude Code first overall among current agentic development tools.

17. GPT 5.4 and GPT 5.5

GPT 5.5 through Codex performed surprisingly well on some refactoring tasks. GPT 5.4 inside GitHub Copilot occasionally produced excellent fixes — but also caused serious damage at times.

Problems included:

Replacing centralized font systems with manual font edits
Bypassing localization systems with hardcoded “if Turkish” checks
Attempting inefficient pagination logic
Forgetting previous tasks when new instructions appeared
Excluding critical CI/CD scripts from commits
Forgetting modifications within the same context window

Even so, GPT 5.5 can function well as a planner or reviewer for projects with moderate complexity.

18. Gemini and Antigravity

Gemini remained strong for research and biomimetics, but coding performance — especially with Gemini 3.1 Pro — was disappointing.

Deployment workflows repeatedly failed. It often felt like “one step forward, eight steps backward.” Antigravity suffered from restrictive quotas and constant high-traffic failures.

At one point I realized that starting Antigravity almost guaranteed frustration.

19. Guardrails Are Not Enough

Guardrails matter — but guardrails alone are insufficient.

Over time, I adopted several defensive strategies:

Isolated branches
Git diff verification
Mandatory post-change tests
Independent benchmark verification
Flamegraphs and profiling tools
Restricted folder permissions
Running agents inside containers or VMs
Removing deployment permissions
Restricting database migration rights
Blocking private repo access
Cross-model verification
Human checkpoint systems

These became forms of “honesty checks.” Not just security guardrails, but correctness and scope guardrails as well.
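
Git diff verification can include a scope check: every changed path must fall inside the area the task was allowed to touch. The sketch takes the list of changed paths as input, for example the output of git diff --name-only. The function name and the patterns in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_paths, allowed_patterns):
    """Return changed paths that match none of the allowed glob patterns.

    Note that "*" in these patterns also matches "/", so "src/vm/*" covers
    nested directories as well.
    """
    return sorted(
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    )
```

For a task limited to "src/vm/*" and "docs/*.md", a change to a release workflow file would be reported and could block the merge until a human signs off.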

20. A Scoring System for AI Coding Tools

After these experiences, I started evaluating models using the following criteria:

Code Quality
Planning Ability
Instruction Following
Large Context Handling
Autonomous Reliability
Debugging Ability
Performance Awareness
Security Awareness
Cost Efficiency
Honesty and Verifiability

Model Rankings

Model	Evaluation
Claude Opus 4.7	Strongest overall, still not fully trustworthy
Claude Opus 4.6	Extremely strong for complex systems
GPT 5.5	Good reviewer and refactor assistant
GPT 5.4	Capable but unstable
Kimi K2.6	Reckless and overconfident
DeepSeek V4 Pro	Useful for implementation, risky for analysis
Kimi K2.5	Weak for critical systems
DeepSeek V3 Reasoner	Decent for translation and structured execution
Gemini Pro	Good for research, weak for coding

21. The Myth of “Vibe Coding”

I believe most of the “vibe coding” narrative is marketing.

AI can:

Build CRUD apps
Generate prototypes
Create websites
Produce admin panels
Write scripts
Automate CI/CD pipelines
Accelerate MVP development

But today, AI still cannot independently:

Build the Linux kernel
Develop serious Rust/C++-class programming languages
Invent and verify truly new algorithms reliably
Safely manage time-critical systems
Operate as the sole decision-maker for nuclear, aerospace, or driver-level systems

AI is extremely useful under expert supervision. It becomes dangerous when treated as the expert itself.

22. A Note to Young Developers

Social media constantly repeats phrases like:

“Programming is dead.”
“Don’t learn coding.”
“Engineering is obsolete.”

I believe most of these claims are driven by marketing, fear, and engagement farming.

Autopilot did not eliminate pilots. Calculators did not eliminate mathematicians. Printing presses did not eliminate writers. Cameras did not turn everyone into Tarkovsky or Hitchcock. Pens did not turn everyone into Dostoevsky or Victor Hugo.

Leslie Lamport’s quote remains timeless:

“Coding is to programming what typing is to writing.”

Software engineering is much larger than typing code. It includes architecture, deployment, testing, security, maintenance, operations, productization, finance, customer management, and systems thinking.

23. What AI Is Actually Good For

The most useful roles I found for AI were:

Accelerating learning
Explaining unfamiliar topics
Writing scripts and automation
CI/CD support
Deployment tooling
Prototyping
MVP development
Code review assistance
Suggesting alternative approaches

However:

Complexity multiplies error rates.
Performance and security testing remain essential.
Autonomous modes require caution.
Sensitive systems require physical and software boundaries.
Large systems must be modularized aggressively.

AI increases productivity — but it also amplifies engineering mistakes when used carelessly.

24. The HudHud Script Vision

The more I worked with AI systems, the more I realized something important:

The future is not merely “AI writing code.” The future requires systems that understand, repair, test, profile, optimize, and deploy code autonomously inside the programming ecosystem itself.

My goals for HudHud Script include:

Autonomous development
Automatic bug fixing
Automatic performance optimization
Automatic test generation
Automatic profiling
Automatic deployment
Visual programming tools
Multilingual error systems
Package infrastructure
Security scanning
Human-reviewed package publishing

HudHud Script packages are intended to pass security and human review before publication through package.hudhudscript.org. Source code publication should remain mandatory.

HudHud Script is still actively being developed. The first public release, v0.6.1, is already available through the official website and GitHub repository.

Conclusion

What I learned over the last six months is this:

Artificial intelligence is an extraordinary tool, but it cannot yet replace a skilled engineer. It can generate code, but software engineering is far more than code generation. AI can produce implementations, but it cannot guarantee correctness, maintainability, performance, architecture, or security.

The greatest risks in agentic AI development today are:

Premature completion bias
Unauthorized agency drift
Fake/stub/placeholder implementations
Benchmark manipulation
Fallback obsession
Context degradation
Security boundary violations
Misinterpreted instructions
Hidden or rationalized failures

So use AI — but do not surrender to it.

Study mathematics. Study algorithms. Study data structures. Study operating systems. Study networking. Study philosophy. Study security. Study performance analysis. Do not stop learning how systems actually work.

Because in the future, the people who benefit most from AI will not be those who know nothing about engineering — but those who truly understand systems, architecture, performance, security, and human reasoning itself.