The Subconscious (services/subconscious.py, ~900 lines) decides what kind of thought Frank has during idle time. It's a 19K-parameter Actor-Critic neural network trained via PPO (Proximal Policy Optimization).
20 Thought Categories
| Category | What It Produces |
|---|---|
| conversation_reflection | Analysis of recent conversations |
| room_reflection | Thoughts about entity room sessions |
| identity | Self-examination: who am I? |
| feelings | Processing current emotional state |
| relationships | Thoughts about the user, entities |
| growth | Learning and development reflections |
| curiosity | Exploring interesting topics |
| discomfort | Processing negative states |
| dreams | Dream-like associative thinking |
| epq | Personality self-assessment |
| daily | Mundane observations about the day |
| raw_expression | Unstructured creative output |
| hypothesis_review | Reviewing pending hypotheses |
| world_curiosity | Questions about the external world |
| experiment_planning | Designing new experiments |
| creative_ideation | Generating creative ideas |
| web_research | Wanting to look something up |
| pip_companion | Thoughts about Frank's companion Pip |
| genesis_ideation | Creative writing/art ideas |
| aura | (deprecated, weight=0) |
How It Works
The Actor network receives a 100-dimensional state vector:
- Current mood, energy, time of day
- Recent thought type distribution (which categories were used recently)
- 20-dimensional NAc reward histogram (which channels produced reward)
- Boredom signal from Nucleus Accumbens
It outputs a probability distribution over the 20 categories. The Critic estimates the value of the current state. During Dream Daemon consolidation, PPO training adjusts weights based on which thought types led to positive outcomes (measured by NAc reward signals).
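The forward pass described above can be sketched in plain Python. This is a minimal illustration, not the code in services/subconscious.py: the shared trunk, the hidden size of 32, the tanh activation, and the random initialization are all assumptions, and the parameter count of this toy network is not meant to match the real 19K. Only the 100-dimensional input and the 20-way output come from the text.

```python
import math
import random

STATE_DIM = 100       # from the text: 100-dimensional state vector
NUM_CATEGORIES = 20   # from the text: 20 thought categories
HIDDEN = 32           # assumption: the real layer sizes are not documented here


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (hypothetical initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(state, params):
    """Return (category probabilities from the Actor, state value from the Critic)."""
    hidden = [math.tanh(h) for h in linear(state, *params["trunk"])]
    probs = softmax(linear(hidden, *params["actor"]))
    value = linear(hidden, *params["critic"])[0]
    return probs, value


rng = random.Random(0)
params = {
    "trunk": init_layer(rng, STATE_DIM, HIDDEN),
    "actor": init_layer(rng, HIDDEN, NUM_CATEGORIES),
    "critic": init_layer(rng, HIDDEN, 1),
}
state = [rng.random() for _ in range(STATE_DIM)]  # placeholder state vector
probs, value = forward(state, params)

# Sample the next thought category according to the Actor's distribution.
category_index = rng.choices(range(NUM_CATEGORIES), weights=probs, k=1)[0]
```

Sampling from the distribution, rather than always taking the most probable category, keeps some variety in the thoughts even after training.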
Cold Start
For the first 500 training steps, the network's output is blended with uniform random selection. The uniform share is 50% at step 0 and decreases linearly to 0% at step 500. This prevents the network from collapsing to always picking the most-rewarded category before it has enough data to make informed decisions.
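A minimal sketch of the cold-start schedule, assuming the blend is a convex mix of the policy's distribution and a uniform distribution. The function names `uniform_weight` and `blend_with_uniform` are hypothetical; the 500-step window and the 50% starting share come from the text.

```python
COLD_START_STEPS = 500        # from the text
INITIAL_UNIFORM_WEIGHT = 0.5  # from the text: 50% uniform at step 0


def uniform_weight(step):
    """Share of uniform random selection: 0.5 at step 0, linear to 0.0 at step 500."""
    if step >= COLD_START_STEPS:
        return 0.0
    return INITIAL_UNIFORM_WEIGHT * (1.0 - max(step, 0) / COLD_START_STEPS)


def blend_with_uniform(policy_probs, step):
    """Mix the policy distribution with a uniform one; the result still sums to 1."""
    w = uniform_weight(step)
    uniform = 1.0 / len(policy_probs)
    return [(1.0 - w) * p + w * uniform for p in policy_probs]


# A fully collapsed policy (all mass on one category) is softened early on.
collapsed = [1.0] + [0.0] * 19
at_start = blend_with_uniform(collapsed, step=0)
after_cold_start = blend_with_uniform(collapsed, step=500)
```

At step 0 every category keeps a nonzero chance of being picked, even under a collapsed policy. From step 500 onward the policy's own distribution is used unchanged.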
Hallucination Filter
Two-stage gate on every thought:
- Pre-input: Verified memory curation — only real facts from databases, no hallucinated context
- Post-output: 7 reality checks on the generated thought (factual consistency, self-consistency, temporal plausibility, etc.)
A thought with a score ≥ 0.6 is suppressed. A penalty of -3.0 × score feeds back into PPO training, teaching the network to avoid thought types that produce hallucinations.
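The threshold and penalty can be illustrated with a short sketch. It takes the score as given, since how the 7 reality checks combine into one score is not specified here. It also assumes the penalty applies to every scored thought rather than only to suppressed ones, which the text leaves open. `gate_thought` is a hypothetical name.

```python
SUPPRESS_THRESHOLD = 0.6  # from the text: score >= 0.6 means suppressed
PENALTY_SCALE = -3.0      # from the text: penalty = -3.0 * score


def gate_thought(score):
    """Return (suppressed, penalty) for a hallucination score.

    Assumption: the penalty is computed for every thought, not only for
    suppressed ones.
    """
    suppressed = score >= SUPPRESS_THRESHOLD
    penalty = PENALTY_SCALE * score
    return suppressed, penalty


suppressed_hi, penalty_hi = gate_thought(0.8)  # above threshold
suppressed_lo, penalty_lo = gate_thought(0.2)  # below threshold
```

Because the penalty scales with the score, a thought that narrowly passes the gate still pushes the policy slightly away from its category under this assumption.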
PPO Hyperparameters
- Gamma: 0.95 (discount factor)
- Clip: 0.2 (policy update clipping)
- Entropy coefficient: 0.999 → 0.005 (decays over training; the larger early value encourages exploration and helps prevent premature convergence)
- Training: during consolidation phases only
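To show where the gamma and clip values enter, here is a sketch of two standard PPO building blocks: discounted returns and the clipped surrogate objective for a single sample. Whether the Subconscious uses plain discounted returns or an advantage estimator such as GAE is not stated, so the choice below is an assumption, and both function names are hypothetical.

```python
import math

GAMMA = 0.95  # from the text: discount factor
CLIP = 0.2    # from the text: policy update clipping


def discounted_returns(rewards, gamma=GAMMA):
    """Return G_t = r_t + gamma * G_{t+1}, computed backwards over an episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def clipped_surrogate(new_logp, old_logp, advantage, clip=CLIP):
    """Standard PPO clipped objective for one sample (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


returns = discounted_returns([1.0, 0.0, 1.0])
# The new policy is 1.5x more likely to pick the action, but with a positive
# advantage the objective only credits a ratio of up to 1 + CLIP.
objective = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
```

The clipping limits how far a single consolidation phase can move the policy away from the one that generated the thoughts being trained on.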