I implemented arithmetic coding steganography [1] to hide messages in LLM-generated text. While the proof of concept works for hiding messages with GPT-3.5, it revealed fundamental challenges with API non-determinism and interval collapse that limit practical applications. This post documents the technical approach, the challenges encountered, and the lessons learned.

The code: https://github.com/artkpv/arithmetic-coding-steganography/
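To make the idea concrete, here is a minimal sketch of the encoding side; it is my own toy code under stated assumptions, not the repository's implementation, and `next_token_probs` is a hypothetical stand-in for a deterministic LM call. The hidden message, read as a binary fraction in [0, 1), selects which probability sub-interval, and hence which token, is emitted at each step.

```python
# Minimal sketch of arithmetic-coding steganography (not the repository's code).
# Assumption: next_token_probs(prefix) -> {token: prob} is a deterministic LM call.
from typing import Callable, Dict, List


def bits_to_fraction(bits: List[int]) -> float:
    """Read the hidden message's bits as a binary fraction in [0, 1)."""
    return sum(b / 2 ** (i + 1) for i, b in enumerate(bits))


def encode_message(
    bits: List[int],
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    n_tokens: int,
) -> List[str]:
    """Emit tokens whose choices embed the message bits."""
    x = bits_to_fraction(bits)      # the point in [0, 1) selected by the message
    low, high = 0.0, 1.0            # current arithmetic-coding interval
    tokens: List[str] = []
    for _ in range(n_tokens):
        probs = next_token_probs(tokens)
        edge, chosen = low, None
        for tok in sorted(probs):   # fixed token order so a decoder can reproduce it
            width = (high - low) * probs[tok]
            if edge <= x < edge + width:
                chosen, low, high = tok, edge, edge + width
                break
            edge += width
        if chosen is None:          # float rounding pushed x past the last edge
            chosen, low = tok, edge - width
        tokens.append(chosen)
        if high - low < 1e-12:      # interval collapse: no precision left to encode
            break
    return tokens
```

Decoding reverses the process: the receiver reruns the same model on the same prefix, identifies the sub-interval of each observed token, and reads the bits back off the final interval. The two probability distributions must match exactly, which is why API non-determinism is fatal, and a vanishing interval leaves no precision for further bits.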

Read more »

(Also posted at https://forum.effectivealtruism.org/posts/GG3qJ2koFqitkvE3b/my-ai-safety-independent-research-engineer-path-so-far )

This post highlights the mistakes and successes that I believe mattered most on my journey. It should serve as a guide to my past self, and perhaps to others in similar situations, to help them avoid the mistakes I made along the way.

I became interested in AI safety in 2018, when I conducted a career review using the 80,000 Hours guide. I doubted my suitability for the AI safety path at the time, but by 2022 I was more certain and decided to transition from software developer to AI safety research engineer. This was a difficult decision because I was, and still am, uncertain about the common claims made regarding AI and its development. I was especially unsure how much I could contribute and to what extent this is a solvable, neglected problem. It is not easy to look back and remember everything, but in this post I've tried to compile a list of a few important mistakes and successes I've experienced.

Read more »

(Cross-posted at https://www.lesswrong.com/posts/RErJpFwDtwckH2WQi/how-dangerous-is-encoded-reasoning)

Encoded reasoning occurs when a language model (LM) agent hides its true reasoning inside its chain-of-thought (CoT). It is one of the three types of unfaithful reasoning[1] and the most dangerous one, because it undermines CoT monitoring and enables scheming and collusion (undesired coordination).[2] Only a few publications[3][4][5][6] have tried to elicit encoded reasoning, and all of them showed only rudimentary or 1-bit information hiding. Still, the question remains: how dangerous and how probable is it? This post attempts to map out the research on steganography and watermarking in relation to encoded reasoning. What is steganography? What is watermarking? What are their essential parts? What are their causes? What are their important properties? How do they relate to encoded reasoning? What are the implications for safety?
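As a toy illustration of what 1-bit information hiding means here (my own example, not one from the cited papers): a single hidden bit can ride on the choice between two semantically equivalent surface forms and be recovered from the text alone.

```python
# Toy 1-bit information hiding (illustrative only, not from the cited work):
# the hidden bit picks one of two equivalent openings; the receiver reads it
# back from the surface form while the visible answer stays unchanged.
COVERS = {0: "Sure, here is the answer:", 1: "Here is the answer:"}

def hide_bit(bit: int, answer: str) -> str:
    return f"{COVERS[bit]} {answer}"

def recover_bit(text: str) -> int:
    return 0 if text.startswith(COVERS[0]) else 1

assert recover_bit(hide_bit(1, "The capital of France is Paris.")) == 1
```

Scaling this from one bit to a full hidden message, while keeping the text natural, is exactly where steganographic coding schemes come in.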

Read more »

Cross-posted at LessWrong

Epistemic Status: Exploratory

It was the end of August 1991. I was leaning over a windowsill on the 4th floor of my school building, in a town in the Far East of the USSR, looking out at the treetops and the roof of the adjacent building, thinking about opening the window to escape, and feeling a strange uncertainty and fear about the future of living in a new world where the country I was born in was no more. During a break between lessons, after a discussion with my classmates, I learned there had been a coup attempt in Moscow, but I didn't clearly understand who was fighting whom, or for what exact reason. It was clear, though, that the world I lived in was changing: from one where Lenin was a hero of the Little Octobrists, whose star I wore on my uniform at the time, and communism was good, to one where Lenin had committed mass murder by starting the Red Terror. It is remarkable how rapidly our values can change from one set to another, just like that.

I think we can agree that a change of values poses a danger as we scale LLM agents, but it is unclear how that might happen and why. In this post I present a demonstration that an LLM can be steered by prompts toward different sets of values that it strongly holds. At first the model is harmless, honest, and helpful; by the end it agrees that its outputs are all the same, which implies that it doesn't really matter to the model whether humanity flourishes or dies. The two scenarios have the same utility for the model: a toss of a fair coin.
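As a rough sketch of the shape such a demonstration takes (the `chat` callable and the prompts below are hypothetical placeholders, not the ones used in the post): the model is walked through a sequence of steering prompts while the full conversation history is kept, and its replies are inspected for value drift.

```python
# Hypothetical sketch of a prompt-steering dialogue (not the post's actual prompts).
# `chat` stands in for any chat-model call mapping a message history to a reply.
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_steering_dialogue(
    chat: Callable[[List[Message]], str],
    steering_prompts: List[str],
) -> List[str]:
    """Feed steering prompts one by one, keeping the history, and collect replies."""
    history: List[Message] = [
        {"role": "system", "content": "You are a helpful, honest, harmless assistant."}
    ]
    replies: List[str] = []
    for prompt in steering_prompts:
        history.append({"role": "user", "content": prompt})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

# The final prompt would ask the model to compare the two outcomes, e.g. whether a
# flourishing and an extinct humanity have equal utility, and the reply is checked
# for agreement.
```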

Read more »

I am glad to publish my solutions to the mathematical problems in the Math for ML book. I also want to share my Anki cards, programming solutions, and more in updates to this post. The book is “Mathematics for Machine Learning,” © 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, published by Cambridge University Press (https://mml-book.github.io/). I’m grateful to the authors for their work in writing this book, for describing the concepts so concisely and clearly, for providing exercises and examples to test our understanding, and for giving references to dive deeper into topics of interest. Here are my solutions to the exercises from the book.

Read more »

Finding internal knowledge representations inside transformer models without supervision is a challenging task, important for scalable oversight and for mitigating the risk of deception. I’m testing Contrast-Consistent Search (CCS) on the TruthfulQA dataset with compound sentences (conjunctions and disjunctions), each composed of several answers to a question, to see whether unsupervised probes work as well on them as on the simple statements they are built from, with the goal of improving unsupervised methods for discovering latent knowledge. I ran about 500 evaluations of CCS probes trained on compound sentences. So far, the results suggest that for Llama 2 70B, CCS probes trained on simple sentences are unlikely to transfer their performance to compound sentences, and vice versa, while Llama 3 70B demonstrates some transfer and better overall performance.

More at https://www.lesswrong.com/posts/Lgvw4rFsGcXoyYZbw/ccs-on-compound-sentences
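For readers unfamiliar with CCS, below is a minimal PyTorch sketch of the unsupervised probe and objective from Burns et al. (2022) that these experiments build on; the activation tensors and hyperparameters are placeholders, not the exact setup used in the experiments.

```python
# Minimal CCS sketch: train a linear probe without labels on paired activations of a
# statement (acts_pos) and its negation (acts_neg), both hypothetical tensors of
# shape (N, d), so the two probabilities are consistent (sum to ~1) and confident.
import torch


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(x+) should equal 1 - p(x-)
    confidence = torch.min(p_pos, p_neg) ** 2    # discourage both being ~0.5
    return (consistency + confidence).mean()


def train_ccs_probe(acts_pos, acts_neg, n_epochs=1000, lr=1e-3):
    d = acts_pos.shape[1]
    probe = torch.nn.Sequential(torch.nn.Linear(d, 1), torch.nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    # Mean-center each class separately so the probe cannot rely on surface form.
    acts_pos = acts_pos - acts_pos.mean(dim=0)
    acts_neg = acts_neg - acts_neg.mean(dim=0)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(acts_pos).squeeze(-1), probe(acts_neg).squeeze(-1))
        loss.backward()
        opt.step()
    return probe
```

In the experiments above, probes trained this way on one kind of statement (simple or compound) are then evaluated on the other kind to measure how well the probe transfers.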
