Can You Build an Alternative to LLMs? 8 Months, ~200 Failed Experiments, One Wall.

This is part 2 of a research series documenting an attempt to build something adjacent to — and in some ways alternative to — large language models. Part 1: "My Synthetic Eval Said 30/30. LoCoMo Said 0.13." The code that survived lives in AuraSDK.

Not a chatbot wrapper.

Not another prompt stack.

Not a vector database with a nicer UI.

The question was narrower and more dangerous:

Can a non-neural, CPU-only system accumulate experience, mutate its own internal state, and improve future behavior without retraining an LLM?

After eight months, the honest answer is not "yes".

The honest answer is:

some mechanisms survived,
most carriers failed,
and the wall became precise.

The wall is not storage. Storage is easy.

The wall is transferable causal transition:

condition -> action -> consequence -> when NOT to apply it

That is the thing I kept failing to preserve.

1. What I Mean by "Alternative to LLMs"

The phrase is too broad, so I need to narrow it.

I was not trying to build a new frontier model. I was not training a transformer from scratch. I was not trying to beat GPT-class models at language.

The target was smaller:

Find a knowledge substrate that behaves like a tiny, mutable set of weights.

The required properties were:

it changes after experience;
it survives restart;
it changes future behavior without code changes;
it transfers to unseen but related cases;
it fails shuffled/null controls if the signal is fake;
it is compact in a behavioral sense, not merely compressed bytes;
it abstains when unsupported instead of confidently guessing.

In other words, I was looking for something between memory and weights.

Memory stores what happened.

Weights change what happens next.

The experiment was to find a symbolic or hybrid substrate that could cross that line.

2. Why Normal Memory Was Not Enough

The first tempting answer is always memory:

append the conversation;
summarize it;
store facts;
retrieve relevant chunks;
build a graph;
add embeddings;
add scores;
add typed edges.

All of those are useful engineering tools. None of them automatically become knowledge.

The distinction became brutal:

storage can preserve an event;
knowledge must change the next action.

A log can say:

candidate A failed in situation X
candidate B worked in situation X

But a useful system must do more:

in a new situation X',
recognize what is shared with X,
avoid A-like actions,
try B-like actions,
and know when the analogy no longer applies.

That last line is where most designs died.

3. The Carrier Bakeoff

I tested many candidate "carriers" of knowledge. A carrier is the internal form that is supposed to hold experience and make it reusable.

The early list looked reasonable:

Carrier	What it preserved	Where it failed
append-only memory	event history	did not transfer to unseen cases
surface compression	shorter text	compressed language, not behavior
graph / n-gram links	co-occurrence	became an index, not a causal model
route state	role and expectation	changed behavior, but stayed narrow
typed edges	relation labels	helped only when the relation grammar was already given
graded vectors	magnitude and pressure	carried direction, not an executable mechanism
living cells	local conflict	transferred some signal, but could transfer the wrong action
Jacobian-like footprint	local numeric effect	worked on linear repairs, failed on branching and bit-shift semantics
topology fields	learned attraction basins	signal on synthetic worlds; on real cargo snippets — 0/5 correct, 5/5 abstain

The pattern was consistent.

Each carrier preserved one aspect of consequence:

that something was supported;
that something was refuted;
that a choice had pressure;
that two roles were related;
that a numeric direction changed;
that a local conflict existed;
that a state transition had a shape.

But none of the early carriers preserved the full causal transition.

That is the difference between:

"A was good near B"

and:

"When input has property P and state has condition C,
action A changes state S in direction D,
unless boundary condition B is active."

The second form is much closer to knowledge.

It is also much harder to acquire without already having a model that can do it.
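To make the second form concrete, here is a minimal sketch of a record with four parts: condition, action, expected consequence, and boundary. The `Transition` class, its field names, and the buffer example are hypothetical illustrations, not code from AuraSDK.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict  # toy world state: plain key/value pairs


@dataclass(frozen=True)
class Transition:
    """Hypothetical record: condition -> action -> consequence -> boundary."""
    condition: Callable[[State], bool]          # when the rule is relevant
    action: Callable[[State], State]            # what to do
    expected: Callable[[State, State], bool]    # consequence check (before, after)
    boundary: Callable[[State], bool]           # when NOT to apply it

    def apply(self, state: State) -> Optional[State]:
        """Return the new state, or None to abstain."""
        if not self.condition(state) or self.boundary(state):
            return None
        after = self.action(dict(state))
        return after if self.expected(state, after) else None


# Toy rule: grow a too-small buffer, unless the buffer is marked as fixed.
grow_buffer = Transition(
    condition=lambda s: s["needed"] > s["size"],
    action=lambda s: {**s, "size": s["needed"]},
    expected=lambda before, after: after["size"] >= after["needed"],
    boundary=lambda s: s.get("fixed", False),
)

repaired = grow_buffer.apply({"size": 4, "needed": 8})
blocked = grow_buffer.apply({"size": 4, "needed": 8, "fixed": True})
```

Here the hard part is written by hand: a human supplied the condition and the boundary. The sketch shows the shape a carrier would have to fill in, not a way to learn it.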

4. Signals That Were Real but Not Enough

Some results were not failures in the simple sense. They were real signals.

One route-state experiment showed that internal state can change behavior without changing code. The system saw support and refutation, mutated its state, and later selected differently. With the state shuffled, the effect disappeared, as it should if the signal is real. That was important.

But it did not prove a replacement for weights. It proved that a narrow symbolic state can influence selection.
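A minimal sketch of that kind of narrow symbolic state, assuming a simple per-(situation, action) score. `RouteState` and `shuffled` are hypothetical names, and the real experiment may have been structured differently.

```python
import random
from collections import defaultdict


class RouteState:
    """Hypothetical route state: one score per (situation, action), mutated by feedback."""

    def __init__(self):
        self.score = defaultdict(float)

    def observe(self, situation, action, supported):
        # Support raises the score, refutation lowers it. No gradients involved.
        self.score[(situation, action)] += 1.0 if supported else -1.0

    def select(self, situation, actions):
        # Ties fall back to the first action, which acts as the blind default.
        return max(actions, key=lambda a: self.score[(situation, a)])


def shuffled(state, seed=0):
    """Control: keep the same score values but reassign them to keys at random."""
    rng = random.Random(seed)
    keys = list(state.score)
    values = [state.score[k] for k in keys]
    rng.shuffle(values)
    control = RouteState()
    control.score.update(zip(keys, values))
    return control


state = RouteState()
state.observe("X", "A", supported=False)
state.observe("X", "B", supported=True)
learned_choice = state.select("X", ["A", "B"])
blind_choice = RouteState().select("X", ["A", "B"])
```

The selection code is identical for the blind and the learned state; only the data differs. A shuffled copy keeps the same numbers in the wrong places, so any advantage that survives shuffling did not come from what the state learned.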

Another experiment used graded consequence vectors. The vector was updated by experience, not by gradient training. It showed useful properties:

same direction, more support       -> larger magnitude
50% shared experience              -> similarity ~0.41
0% shared experience               -> similarity ~0.05  (margin 0.35)
support followed by scar evidence  -> direction flipped
random nudges                      -> near-zero similarity

An important detail: 0.41 (not 1.0) on partially shared experience is what graded generalization should look like, as opposed to memorization: partial overlap produces partial similarity. The similarity emerges from structure; it is not hard-coded. That too was a real signal.

But it was not enough. It carried pressure and similarity. It did not carry the executable rule of when a transition should fire.
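The qualitative properties above can be sketched with plain vector arithmetic. This is an illustration only: `nudge`, `norm`, and `cosine` are hypothetical helpers, the two-dimensional direction is a toy, and the sketch does not reproduce the 0.41 or 0.05 figures.

```python
import math


def nudge(vector, direction, weight=1.0):
    """Add one experience. Support uses weight > 0, scar evidence uses weight < 0."""
    return [v + weight * d for v, d in zip(vector, direction)]


def norm(vector):
    return math.sqrt(sum(v * v for v in vector))


def cosine(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # no evidence on one side: report no similarity
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


east = [1.0, 0.0]                          # a toy "direction of consequence"
once = nudge([0.0, 0.0], east)             # one supporting experience
twice = nudge(once, east)                  # same direction, more support
flipped = nudge(nudge(once, east, -1.0), east, -1.0)  # support, then two scars
```

More support in the same direction grows the magnitude without changing the direction, and enough scar evidence flips the sign. What the vector never contains is a condition for when the transition should fire.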

The most honest label for these results:

SIGNAL, not solution.

That distinction saved months of self-deception.

5. The "Abstain" Lesson

One of the most useful discoveries was negative:

A weak knowledge carrier should abstain instead of guessing.

In several gates, the system got better when unsupported cases were answered with "I do not know" instead of being forced into the nearest known pattern.

This matters: many memory systems look good only because they always answer. But in a causal system, a wrong transfer is worse than silence.

If a stored pattern says:

this action repaired the previous case

and the new case looks similar but has a different boundary condition, reusing the action can be actively harmful.

The real requirement became:

transfer when supported;
abstain when the binding is not justified;
never convert weak similarity into certainty.
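Those three rules can be sketched as a gate in front of any stored pattern. The function name, the `boundary_matches` flag, and the 0.8 threshold are assumptions chosen for illustration; a real system would have to decide how both inputs are computed.

```python
from typing import Optional


def choose_action(similarity: float,
                  boundary_matches: bool,
                  stored_action: str,
                  min_similarity: float = 0.8) -> Optional[str]:
    """Return the stored action only when the binding is justified, else None."""
    if not boundary_matches:
        return None  # similar surface, different boundary condition: stay silent
    if similarity < min_similarity:
        return None  # weak similarity is not certainty
    return stored_action


supported = choose_action(0.93, True, "raise_limit")
weak_match = choose_action(0.41, True, "raise_limit")
wrong_boundary = choose_action(0.93, False, "raise_limit")
```

Returning `None` keeps "I do not know" distinct from any real action, so a caller cannot mistake an abstention for a low-confidence answer.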

This is where many memory architectures quietly fail. They retrieve something. The model uses it. The answer looks grounded. But the binding between old evidence and the new situation may be invalid.

6. Executable Worlds Made the Failure Sharper

Synthetic gates are useful, but dangerous. If you design the world, you can accidentally design the success.

So part of the testing moved into small executable worlds: code snippets, controlled bugs, compiler feedback, repair attempts, and real pass/fail consequences.

The results became sharper.

A local numeric footprint could transfer part of a lived consequence pattern. It worked when the change was close to a linear repair:

increase this value;
change this numeric boundary;
map this local residual to that local patch.

But it failed on cases that required real mechanism:

bit-shift semantics;
branch behavior;
boundary conditions;
state transitions.

That mattered. Passing a generated linear-repair gate does not mean the system learned code. It means the carrier can reuse a local numeric direction.

The criterion became stricter:

a carrier must survive mixed nonlinear executable snippets:

- numeric delta;
- bit shift;
- boundary condition;
- state transition;
- negative abstain.

If it only passes local-linear repair, it is a helper operator, not a candidate core.
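A sketch of how such a mixed gate could be scored, assuming each case is tagged with its family and a carrier returns either an answer or `None`. The harness, the family names, and the `linear_only` carrier are hypothetical.

```python
REQUIRED = ["numeric_delta", "bit_shift", "boundary",
            "state_transition", "negative_abstain"]


def run_gate(carrier, cases):
    """Score a carrier per family. An expected value of None means 'should abstain'."""
    report = {}
    for family, case_input, expected in cases:
        answer = carrier(case_input)
        if answer is None:
            outcome = "correct" if expected is None else "abstain"
        else:
            outcome = "correct" if answer == expected else "wrong"
        report.setdefault(family, []).append(outcome)
    return report


def is_core_candidate(report, required=REQUIRED):
    """Pass only if every required family is present and fully correct."""
    return all(
        family in report and all(o == "correct" for o in report[family])
        for family in required
    )


def linear_only(case_input):
    """Toy carrier that only knows a local numeric repair."""
    kind, value = case_input
    return value + 1 if kind == "delta" else None


cases = [
    ("numeric_delta", ("delta", 3), 4),
    ("bit_shift", ("shift", 3), 6),          # 3 << 1
    ("negative_abstain", ("unknown", 0), None),
]
report = run_gate(linear_only, cases)
```

The toy carrier gets the numeric case right and abstains correctly on the unsupported one, yet `is_core_candidate(report)` is false: it has nothing to say about the bit shift, and three families are never exercised at all.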

7. The Single-File Brain Attempt

One direction looked especially attractive:

What if the entire learned state lived in one mutable file?

The idea was simple:

The system acts.
The world returns consequences.
The state file mutates.
After restart, the file still changes future behavior.

This was not supposed to be a database. It was supposed to be a growing behavioral substrate.
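The four steps can be sketched with a JSON file as the only state. This is a deliberately small stand-in: the file layout, the function names, and the integer scoring are assumptions, and a counter per action is far simpler than the transition-cells the experiment used.

```python
import json
import os
import tempfile


def load_state(path):
    """Read the state file, or start empty if it does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_state(path, state):
    """Write to a temp file and rename, so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def record(path, situation, action, passed):
    """Mutate the file with one consequence returned by the world."""
    state = load_state(path)
    entry = state.setdefault(situation, {})
    entry[action] = entry.get(action, 0) + (1 if passed else -1)
    save_state(path, state)


def pick(path, situation, actions):
    """Choose using only what is on disk; unseen actions score 0."""
    scores = load_state(path).get(situation, {})
    return max(actions, key=lambda a: scores.get(a, 0))
```

Because `pick` reloads the file on every call, nothing is held in process memory: a restart changes nothing, and the same code picks differently once the file has been mutated.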

The result was mixed:

transfer cycle: 3/5
blind baseline: 1/5
wrong: 1

So the file was not dead storage. It grew. It mutated. It transferred some lived transition experience. It beat the blind baseline.

But wrong=1 matters more than the improvement.

The system did not merely abstain on unsupported transfer. It sometimes confidently selected a wrong action. For a knowledge carrier, that is a critical failure.

The diagnosis:

bound transition-cells increased capacity,
but did not preserve enough mechanism.

This is a recurring pattern:

more structure != more knowledge

A richer container can still carry the wrong abstraction.

8. The Wall Became a Binding Problem

After many carrier failures, the question changed.

At first I asked:

Which carrier stores knowledge best?

Later, the better question was:

How does a new situation bind to old experience?

The binding is the hard part.

Suppose the system has learned a useful transition in one world:

node A influences node B
changing A repairs failure F

Now it sees a new world with different node names and a different surface form.

Which node corresponds to A?

Which node corresponds to B?

Which local structure is the same mechanism, and which is only superficially similar?

Without an anchor, this turns into graph matching. In one of the gates I attempted an autonomous bijection between train and heldout worlds directly, and failed exactly here: computing the correspondence required already knowing the correspondence. That is the graph isomorphism problem. No polynomial-time algorithm is known for it (strictly speaking it is not known to be NP-hard either, but the closely related subgraph version is NP-complete). That is not only an engineering obstacle; it is a structural one.

This produced the anchor trilemma.

9. The Anchor Trilemma

Every attempted solution fell into one of three buckets.

Option 1: The anchor is given

If a human or a hand-written rule tells the system which parts correspond, transfer becomes much easier.

But then the hard part was not learned. It was supplied.

This can still be useful engineering. It is not an alternative knowledge substrate.

Option 2: The anchor is searched

If the system searches over possible bindings, it can sometimes find the match.

But the search explodes quickly. The more nodes, relations, states, and conditions, the less attractive this becomes.

You have not built cheap knowledge. You have moved the cost into combinatorial search.
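The explosion is easy to see in a brute-force sketch. The function below tries every assignment of old nodes to new nodes and keeps the first one that preserves all old edges; `find_binding` and the three-node worlds are hypothetical.

```python
from itertools import permutations
from math import factorial


def find_binding(old_edges, old_nodes, new_edges, new_nodes):
    """Brute-force search for a relabeling of old nodes that preserves every old edge.

    Returns (mapping, candidates_tried), or (None, candidates_tried) if nothing fits.
    """
    target = set(new_edges)
    tried = 0
    for perm in permutations(new_nodes, len(old_nodes)):
        tried += 1
        mapping = dict(zip(old_nodes, perm))
        if all((mapping[a], mapping[b]) in target for a, b in old_edges):
            return mapping, tried
    return None, tried


old_nodes = ["A", "B", "C"]
old_edges = [("A", "B"), ("B", "C")]      # A influences B, B influences C
new_nodes = ["p", "q", "r"]
new_edges = [("r", "p"), ("p", "q")]      # same chain, different names

mapping, tried = find_binding(old_edges, old_nodes, new_edges, new_nodes)
worst_case_for_10_nodes = factorial(10)   # candidate bindings when both worlds have 10 nodes
```

With three nodes the search finds the binding after 5 of 6 candidates. With ten nodes on each side the worst case is already 3,628,800 candidates, and this version matches structure only: it ignores states and conditions, and any extra structural match would pass just as well.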

Option 3: The anchor is learned by a model

If an LLM, an embedding model, an analyzer, or a world-probing system supplies the binding, the system works much better.

But then the symbolic substrate is no longer the source of the core intelligence. It becomes a memory, a cache, a verifier, or an optimizer around another intelligence source.

That may be a good product.

It is no longer the original hypothesis.

The trilemma:

given anchor    -> not learned
searched anchor -> too expensive or unstable
learned anchor  -> not independent from the model/teacher

This was the wall.

10. What Actually Survived

The failed broad claim was:

A symbolic carrier can become alternative weights by storing enough structured consequences.

That did not survive.

What survived was narrower.

1. Consequence loops matter

The system improves only when it receives world feedback:

try -> observe -> mutate -> retry

Static documents, summaries, and graphs are weak unless tied to consequences.
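A minimal sketch of that loop, assuming a fixed list of candidate actions and an external `judge` that returns pass or fail. The function and parameter names are hypothetical.

```python
def consequence_loop(candidates, judge, scars=None, max_tries=10):
    """try -> observe -> mutate -> retry, with the world's judge as the only signal.

    `judge(candidate)` returns True on pass. `scars` is the set of candidates
    already known to fail; it is mutated in place so it can outlive one call.
    """
    scars = set() if scars is None else scars
    for _ in range(max_tries):
        untried = [c for c in candidates if c not in scars]
        if not untried:
            return None, scars          # nothing left to try: abstain
        attempt = untried[0]            # try
        if judge(attempt):              # observe
            return attempt, scars
        scars.add(attempt)              # mutate
    return None, scars                  # retry budget exhausted


calls = []


def judge(candidate):
    calls.append(candidate)
    return candidate == 3


first_result, scars = consequence_loop([1, 2, 3], judge)
first_calls = list(calls)
calls.clear()
second_result, _ = consequence_loop([1, 2, 3], judge, scars)
second_calls = list(calls)
```

The second run skips the two known failures and goes straight to the passing candidate. That is the whole effect of the loop: the scars change what is tried next, but only for the exact candidates that were already tried.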

2. Deterministic guards matter

Some fields must not be entrusted to a generative or fuzzy substrate:

IDs;
dates;
amounts;
names;
paths;
exact constraints;
status changes.

Extract them and preserve them explicitly.
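A sketch of what explicit extraction could look like, using regular expressions. The three patterns (ISO dates, dollar amounts, ticket-style IDs) are hypothetical examples; real formats would have to be specified per domain.

```python
import re

# Hypothetical patterns, chosen only for illustration.
GUARDS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
    "ticket_id": re.compile(r"\b[A-Z]{2,5}-\d+\b"),
}


def extract_exact_fields(text):
    """Pull exact fields out verbatim so nothing downstream can paraphrase them."""
    return {name: pattern.findall(text) for name, pattern in GUARDS.items()}


def guards_hold(original, rewritten):
    """True if every exact field found in the original survives in the rewritten text."""
    before = extract_exact_fields(original)
    after = extract_exact_fields(rewritten)
    return all(set(before[name]) <= set(after[name]) for name in before)


source = "Invoice INV-1042 for $19.99 is due 2024-03-01."
summary = "Invoice INV-1042 for about $20 is due in March."
fields = extract_exact_fields(source)
```

The summary reads fine and keeps the ID, but `guards_hold(source, summary)` is false because the amount and the date were paraphrased away. That check is deterministic and does not depend on any score from the substrate.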

3. World judges matter

The compiler, tests, and executable checks were more honest than internal scores.

They do not care how elegant the architecture is. They return:

pass / fail
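A minimal sketch of an executable judge, assuming the candidate is a Python snippet that is run in a fresh interpreter. The experiments described here used a compiler and tests; the Python interpreter is a stand-in so the example stays self-contained, and `world_judge` is a hypothetical name.

```python
import subprocess
import sys


def world_judge(source_code, timeout_s=10):
    """Run a snippet in a fresh interpreter and report pass/fail.

    Pass means exit code 0. Internal confidence scores are deliberately ignored.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure
    return result.returncode == 0


good_repair = world_judge("assert 1 << 3 == 8")
bad_repair = world_judge("assert 1 << 3 == 6")
```

The judge returns a single bit and nothing else. Anything richer, such as parsing error messages, would be a separate component that has to earn its place against this baseline.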

4. Abstention is a feature

A weak system that abstains is more useful than a weak system that always transfers.

5. Some organs are still valuable

Route-state, scars, typed relations, date guards, append-only memory, cheap probes, and executable judges are not useless.

They are useful organs.

They are not a full brain.

11. What I Am Not Claiming

I am not claiming I built an alternative to LLMs.

I am not claiming symbolic systems cannot work.

I am not claiming all memory systems are useless.

I am not claiming the wall is mathematically insurmountable in every form.

The honest claim is narrower:

In my experiments, every carrier that tried to preserve knowledge as stored structure eventually failed at transferable causal binding.

And there is an even narrower boundary I have to state myself, before someone states it for me: this is one person working for 8 months on a CPU, not the output of a lab. There may be a carrier I never tried. But 150+ gates kept failing at the same point so consistently that the pattern outweighs any single failure. That point also coincides with a common explanation of why symbolic AI historically lost to learning: the correspondence metric has to be learned, not postulated.

That is still useful.

It prevents wasting another cycle on prettier containers.

12. The Test for Any New Approach

After these failures, a new approach is worth testing only if it clears a stricter bar.

It must:

change future behavior without code changes;
survive restart;
beat shuffled and null controls;
transfer to unseen cases;
abstain on unsupported cases;
preserve exact fields separately;
operate in a world with consequences;
handle nonlinear transitions, not only local numeric repair;
show value over a simple baseline.

And most importantly:

it must carry the transition,
not just an aspect of the transition.

This is the line that killed most of my designs.

13. Why This Still Matters

A negative result is not the same as no result.

Before these experiments, the problem looked like:

find a better memory structure

After the failures, it looks like:

store and transfer executable causal transitions with valid binding

That is a much better problem statement.

It also changes product thinking. A useful system does not need to pretend it replaces an LLM. It can be valuable if it provides:

cheaper long-session memory;
evidence-preserving state;
deterministic guards;
world-verified actions;
refusal to overwrite known scars;
test generation from consequences;
bounded automation around external judges.

Those are real mechanisms.

They just should not be sold as "alternative weights" until they pass the binding wall.

14. Conclusion

The original dream was:

build a CPU-only mutable substrate that behaves like alternative weights

The experiments produced something less glamorous and more useful:

storage is easy;
behavioral mutation is possible;
transfer is fragile;
causal binding is the wall.

The wall is not that the system cannot remember.

The wall is that remembering is not enough.

To act intelligently in a new situation, the system must know which old experience the new situation corresponds to, which transition applies, and where the analogy breaks.

That is the part I could not solve with another graph, vector, summary, route state, or mutable file.

The final tally:

~200 experiments, 30+ candidate carriers, 1 wall.