teawrecks ,

Oh I see, you're saying the training set is exclusively with yes/no answers. That's called a classifier, not an LLM. But yeah, you might be able to make a reasonable "does this input and this output create a jailbreak for this set of instructions" classifier.

Edit: found this interesting relevant article

  • All
  • Subscribed
  • Moderated
  • Favorites
  • random
  • technology@beehaw.org
  • test
  • worldmews
  • mews
  • All magazines