CCS on compound sentences

Finding internal knowledge representation(s) inside transformer models without supervision is certainly a challenging task which is important for scalable oversight and to mitigate the deception risk factor. I’m testing Contrast-Consistent Search (CCS) on TruthfulQA dataset for compound sentences (conjunction and disjunction) each composed of several answers to a question to see if unsupervised probes work to the same degree as on simple statements that compound ones consist of, with the goal to improve unsupervised methods to discover latent knowledge. I run about 500 evaluations of CCS probes trained on compound sentences, so far results suggest that CCS probes trained on simple sentences are unlikely to transfer their performance to compound sentences and vice versa for Llama 2 70B, still Llama 3 70B demonstrate some transfer and better performance.

More at https://www.lesswrong.com/posts/Lgvw4rFsGcXoyYZbw/ccs-on-compound-sentences