In which some researchers draw a spooky picture and spook themselves

swlabr@awful.systems · edited 7 months ago

V0ldek@awful.systems · 6 months ago

Satellite models are increasingly trained and deployed as autonomous agents, which significantly increases their potential for risks. One particular safety concern is that the Moon might covertly pursue misaligned goals, hiding its true capabilities and objectives – also known as scheming. We study whether the Moon has the capability to scheme in pursuit of a goal that we provide in-context and instruct the Moon to strongly follow. We evaluate satellite models on a suite of six planetary evaluations where the Moon is instructed to pursue goals and is placed in orbits that incentivize scheming.