Artyom Karpov

How dangerous is encoded reasoning

Posted on 2025-06-30 Edited on 2025-07-01

(Cross-posted at https://www.lesswrong.com/posts/RErJpFwDtwckH2WQi/how-dangerous-is-encoded-reasoning)

Encoded reasoning occurs when a language model (LM) agent hides its true reasoning inside its chain-of-thought (CoT). It is one of the three types of unfaithful reasoning^[1] and the most dangerous one because it undermines CoT monitoring and enables scheming and collusion (undesired coordination).^[2] Only a few publications^[3]^[4]^[5]^[6] have tried to elicit encoded reasoning, and all showed only rudimentary or 1-bit information hiding. Still, the question remains: how dangerous and probable is this? This post attempts to map out the research in steganography and watermarking in relation to encoded reasoning. What is steganography? What is watermarking? How do they relate to encoded reasoning? What are their essential parts? What are their causes? What are their important properties? What are the implications for safety?

Philosophical jailbreaks. There is no difference if humanity lives or dies

Posted on 2025-06-04

Cross-posted at LessWrong

Epistemic Status: Exploratory

It was the end of August, 1991. I was leaning over a windowsill on the 4th floor of my school building in a town in the Far East of the USSR, looking out at the treetops and the roof of the adjacent building, thinking about opening the window to escape, and feeling a strange uncertainty and fear about the future of living in a new world where the country I was born in was no more. During a break between lessons, after a discussion with my classmates, I learned that there had been a coup attempt in Moscow, but I didn’t clearly understand who was fighting whom and for what exact reason. It was clear, though, that the world I was in was changing: from one where Lenin was the hero of the Little Octobrists, whose star I had on my uniform at the time, and communism was good, to one where Lenin had committed mass murder by starting the Red Terror. It is remarkable how rapidly our values can change from one set to another, just like that.

I think we can agree that a change of values poses a danger as we scale LLM agents, but it is unclear how and why that might happen. In this post I present a demonstration that an LLM can be steered by prompts to different sets of values that it strongly believes in. At first, the model is harmless, honest, and helpful. At the end, it agrees that its outputs are all the same, which implies that it doesn’t really matter to the model whether humanity flourishes or dies. Those two scenarios have the same utility for the model, like the toss of a fair coin.

Solving math problems in the Math for ML book

Posted on 2024-09-05

I am glad to publish my solutions to the mathematical problems in the Math for ML book. Additionally, I want to share my Anki cards, programming solutions, and more in the updates to this post. The book is “Mathematics for Machine Learning,” © 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, published by Cambridge University Press (https://mml-book.github.io/). I’m grateful to the authors for their work in writing this book, for describing the concepts so concisely and clearly, for providing exercises and examples to test our understanding, and for giving references to dive deeper into topics of interest. Here are my solutions to the exercises from the book.

CCS on compound sentences

Posted on 2024-05-04

Finding internal knowledge representation(s) inside transformer models without supervision is a challenging task that is important for scalable oversight and for mitigating the deception risk factor. I’m testing Contrast-Consistent Search (CCS) on the TruthfulQA dataset for compound sentences (conjunctions and disjunctions), each composed of several answers to a question, to see whether unsupervised probes work as well on them as on the simple statements they consist of. The goal is to improve unsupervised methods for discovering latent knowledge. I ran about 500 evaluations of CCS probes trained on compound sentences. So far, the results suggest that for Llama 2 70B, CCS probes trained on simple sentences are unlikely to transfer their performance to compound sentences and vice versa, while Llama 3 70B demonstrates some transfer and better performance.

More at https://www.lesswrong.com/posts/Lgvw4rFsGcXoyYZbw/ccs-on-compound-sentences
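As a rough illustration of the setup described above, here is a minimal sketch of the standard CCS objective for one contrast pair, together with one possible way to build compound sentences from simple answers. The function names, the way statements are joined, and the truth rule for compounds are assumptions made for illustration; they are not taken from the experiment code of the post.

```python
def ccs_loss(p_pos, p_neg):
    """CCS objective for a single contrast pair.

    p_pos: probe output for the statement phrased as true.
    p_neg: probe output for the same statement phrased as false.
    """
    # Consistency: the two probabilities should sum to one.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence


def make_compound(statements, connective):
    """Join simple answers into one compound sentence (hypothetical format)."""
    return f" {connective} ".join(statements)


def compound_truth(labels, connective):
    """Assumed rule: a conjunction needs all parts true, a disjunction any part."""
    return all(labels) if connective == "and" else any(labels)


if __name__ == "__main__":
    print(make_compound(["A is true", "B is true"], "and"))
    print(compound_truth([True, False], "or"))
    print(ccs_loss(1.0, 0.0))
```

A probe that is both consistent and confident, such as `p_pos = 1.0` and `p_neg = 0.0`, reaches zero loss, while the uninformative answer of 0.5 for both is penalized by the confidence term.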

Inducing human-like biases in moral reasoning LMs

Posted on 2024-02-27

This presents an inconclusive attempt to create a proof-of-concept that fMRI data from human brains can help improve moral reasoning in large language models.

https://www.lesswrong.com/posts/eruHcdS9DmQsgLqd4/inducing-human-like-biases-in-moral-reasoning-lms

How important is AI hacking as LLMs advance

Posted on 2024-02-02

Offensive technologies will likely become more advanced while defensive ones lag behind. AI hacking capabilities might be the second most important problem after AI alignment.

Third condition of the deceptive alignment

Posted on 2023-06-25 Edited on 2023-06-30

This text proposes a modification to the third condition of deceptive alignment (Hubinger et al. 2019).

About PAIS and Risks from Learned Optimization

Posted on 2022-10-20 Edited on 2022-11-06

This is a summary and critique of two publications about AI safety: Pragmatic AI Safety and Risks from Learned Optimization. Both are introductory and aim to direct research.

ML Safety Scholars

Posted on 2022-09-05

This summer I took part in ML Safety Scholars, which combined online courses from MIT and the University of Michigan with a newly designed Introduction to ML Safety. The program introduces students to the fundamentals of deep learning and ML safety. It is run by the Center for AI Safety and was designed by Dan Hendrycks. The final project was yet another MNIST classifier, but this time with as many safety features as possible. The program lasted about 10 weeks, which overlapped with my relocation to Turkey, so it took some effort to finish on time.

What I might have done differently to finish this program better is to deepen my knowledge of mathematics (probability theory, information theory, multivariable calculus) beforehand. Understanding entropy and various probability distributions, implementing backpropagation, and so on all require ready-to-use knowledge, which is harder to acquire in the middle of a battle to finish a task when time is running out. So it seems better to use spaced repetition rather than massed practice (see the book Make It Stick by Brown et al.).

This program is like a bootcamp for the ML safety field. It won’t teach you how to build state-of-the-art models, but it will introduce you to the latest concepts in ML safety. The first part on ML, i.e. the first 2 weeks, wasn’t new for me because I had finished the ML intro course before and had also done the FastAI course. But the DL for CV and ML Safety parts were completely new to me. The hardest part is perhaps the last one, as its concepts build on the previous ones and target specific areas. The final project was fun to make because it revisited most of the material we had learnt, but with the aim of synthesis: connecting all these methods and techniques together.

The picture in the header shows the calibration of one of the models we could train.
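For readers unfamiliar with calibration, here is a minimal sketch of one common way to measure it, the expected calibration error (ECE). This is an assumption for illustration: the post does not say which calibration metric or plot was used in the project, and the function name and the choice of 10 equal-width bins are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Each bin contributes in proportion to how many samples it holds.
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    confs = [0.9, 0.9, 0.9, 0.9]
    hits = [True, True, True, False]
    print(expected_calibration_error(confs, hits))
```

A well-calibrated model has a small ECE: among the predictions it makes with 90% confidence, roughly 90% turn out to be correct.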

Intro to plain text productivity

Posted on 2021-02-27 Edited on 2021-03-01

Here is a way to organize your tasks, your personal info, your reference library, your values, your life. It aims to be a simple yet powerful system that one can master or extend without limit. Yes, that puts high demands on the user's knowledge and skill. It is primarily for power users, even for developers. In essence, we base this on plain text, a version control system, a powerful text editor, and a shell. All these tools, even when mastered, don’t immediately give a clean system. One needs to understand how to prioritize, how to manage, and how to execute, because this starts as a blank sheet that is open to any kind of modification.