kromem

It's right in the research I was mentioning:

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Find the section on the model's representation of self and then the ranked feature activations.

I misremembered the top feature slightly. It was: responding "I'm fine" or giving a positive but insincere response when asked how they are doing.
