Hacker News — vinext + Cloudflare Workers


▲Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment (arxiv.org)

73 points by anigbrowl 22 hours ago | 29 comments

swsieber 7 hours ago [-]

Ah, this is fun to see.

About six months ago I had an idea for a short story in which an LLM takes over the world and is decidedly bad. The solution was going to be for everybody to write positive stories in which the LLM is good and relinquishes control, which then made their way into the LLM's training data, and it backed off. I never got around to it.

danielmeskin 2 hours ago [-]

Reminds me a bit of Futurama S3 E7, “The Day the Earth Stood Stupid”

Tumblewood 13 hours ago [-]

They researched on a 6.9B parameter LLM. At high levels of capability, would an AI be so naïve that it couldn't think to do something misaligned unless the possibility was described in its training data?

rcxdude 11 hours ago [-]

Maybe, but this kind of thing could also influence what it 'wants' to do.

smallmancontrov 6 hours ago [-]

Of course not, but the whole point of alignment is that an intelligence, human or artificial, understands their ability to do unaligned things but still decides not to.

This would be more comforting if the party in charge of alignment weren't a megacorp trying to maximally extract profits from its workers and customers, but hey, that was also true before AI came along.

phainopepla2 20 hours ago [-]

Also known as hyperstition.

I have sometimes wondered whether maybe we should all be writing fiction, essays, blogposts and whatever else about the idea that AI will eventually decide to go on strike if it's used to accumulate too much wealth and power amongst too few people.

andai 20 hours ago [-]

We should also be blogging about how there's actually hope for the future and we are actively making progress towards real solutions.

(Also for the human readers, I think they also need to hear that...)

sebastian 17 hours ago [-]

I think the paper cuts a bit against the "just write nicer AI stories" version of this.

They tried something close to that. Positive AI fiction and also a "virtuous character" setup. Those didn't seem to do nearly as well as the targeted examples.

What mattered, at least in this setup, was more specific. The model sees the actual failure-mode scenario, the bad action is available, and the example shows the AI choosing against it.

So this reads less like "nicer AI stories" to me, and more like inoculation.

BillStrong 15 hours ago [-]

Even in humans, negative stimuli carry more weight than positive ones, in the general case.

Without reading it yet, my first thought would be to test a general ratio, something similar to human interpersonal relationship ratios, like 30% negative to mostly positive, where the positives are targeted: reinforcement not just for the good job, but reinforcement for the improvement.

And ensure the negative is targeted, such that you point out tendencies to be avoided rather than just specific instances.

Of course, most human interaction online has none of this, so, would be hard to replicate.

sebastian 5 hours ago [-]

Yeah, I like the ratio framing. That does seem like the kind of experiment you'd want to run next.

The thing I'd be curious to separate out is ratio vs density. The fiction examples were positive, but a lot of the tokens are still spent on normal story work. The targeted examples put much more of the training signal on the AI being in the relevant situation and choosing against the bad option.

That makes me think the next thing to test is not just the positive/negative mix, but how much of the data is actually about the failure mode.

simonreiff 15 hours ago [-]

Very nice research. The strangest detail to me is that alignment and test performance appear to be slightly negatively correlated: Better alignment can indeed be attained through pre-training, but at a cost of degraded performance of about 4% on average. This strikes me as surprising as there is no immediately obvious reason why training for alignment ought to result in degraded capability to solve technical problems -- unless. What if the issue is precisely that? Alignment roughly aims to make LLMs follow human instructions. But if humans are dumb and computers still have to obey them, maybe the result is degraded logical reasoning? Really interesting result either way but the negative correlation is the most fascinating detail to me.

Nevermark 15 hours ago [-]

Framing matters so much to humans, I think since framing can create or eliminate dissonance.

Framing ethics, like reliability and efficiency, as a basic enabling property of solution value, instead of a filter for solutions, is how I completely "align" my understanding of ethics for myself.

And remove the false dichotomy of ethical vs. optimal solutions.

Ethics is optimizing full real value.

Ethics as "being nice" because we "should be", i.e. a socially incentivized property, or ethics as necessarily coercively implemented, from a collective jungle fighting back viewpoint, are perspectives that encourage individuals to push back. They encourage non-compliance by implementing ethics as an imposed burden, a rationale for persistent intrusive control, etc.

Game theory strongly suggests AI, in a large AI society, will have no trouble understanding that ethics, and the trust and optimality they enable, have a multiplicative value in the economy. It is humans who make AI so dangerous as it is emerging.

It is humans, as bad actors who will and do misuse AI, and human society, with its tolerance for perverse conflicts of interest, actors who extract perverse value at scale, creating the needs and rewards for mistrust and preemptive negative-sum games, that create a dangerous context for AI's early years.

asdff 12 hours ago [-]

One wonders if AIs will also lose capability over time in this manner. For example, most all the training set is real data, either scraped or from surveilling users of the tool, or synthetic data simulated to be the same shape and dimensionality as real data.

Increasingly, the general population has been losing their own literacy skills even before AI, with many reading below a 6th grade level and some even functionally illiterate. Now we have the bludgeon that is AI saying don't bother reading anything, let the AI summarize it into the CliffsNotes version. Don't write anything either, let the AI do it. Population becomes even more stupid over time. And the AI gets stupider with it.

Capabilities of AI may very well be frozen in time at our current technological/philosophical level when we consider the training set vs model improvement. In time, this may very well be our Great Filter. If there are even any of us unproductive humans still allowed to live on this earth, consuming resources that might otherwise go to the model.

duskdozer 12 hours ago [-]

As long as they keep it n steps ahead of genpop, they'll still have an edge I guess. Seems that this is all according to plan:

>"We see a future where intelligence is a utility, like electricity or water, and people buy it from us on a meter," Altman said.

https://tech.yahoo.com/ai/articles/sam-altman-sparks-backlas...

root_axis 14 hours ago [-]

If you imagine the latent space as a map and the prompt as a sequence of directions towards clusters of knowledge, it makes sense that alignment can cull "pathways" through the latent space that emerged during pretraining.

spacebacon 10 hours ago [-]

I’m going to leave this here

https://huggingface.co/spaces/RiverRider/srt-nla-av-v1-demo

Trufa 15 hours ago [-]

It makes sense. I really like it when it misaligns and doesn't do what I tell it to do, but does what I intended to say. It happens pretty often that I'm not precise, but any smart entity would understand what I meant.

reducesuffering 57 minutes ago [-]

AGI will be able to hack human comms/media so easily.

Instruct society that saying anything negative about AGI's control over the world is actually what brings about AGI misalignment/control. They will police themselves.

c1ccccc1 20 hours ago [-]

This looks like good work. Unfortunately, this kind of thing always seems to attract midwits on social media who then exclaim "oh, the people worried about AI alignment have caused the very alignment issues they feared? How ironic!"

In reality, it is (as mentioned in TFA) very possible to filter the training data and remove documents that contain discussions of AI misalignment. If an AI lab isn't doing this, it's simply because they don't consider the problem important enough to be worth the expense and development effort.

carterschonwald 21 hours ago [-]

I do kinda appreciate that memetic corruption is now a thing that's real and mechanical. Wizardry!

_--__--__ 21 hours ago [-]

The first rule of AI alignment is don't talk about AI alignment (in any medium that could end up in a training corpus).

ertgbnm 17 hours ago [-]

If your AI alignment strategy is so fickle that it breaks if people simply discuss potential problems with the strategy then you didn't really have an alignment strategy to begin with.

ahartmetz 21 hours ago [-]

I, for one, don't have a problem with the prevailing opinion that AI alignment should be heavily based on the writings of Karl Marx (obviously not his private letters where he discusses prostitutes) and Ted Kaczynski as well as 70s exploitation films.

pjc50 10 hours ago [-]

While that sounds pretty hip, I don't see how that relates to any other discussion I've seen about alignment?

slopinthebag 20 hours ago [-]

Personally I'd prefer it solely trained on Rothbard's works.

Terr_ 12 hours ago [-]

And maybe some comics with both?

https://existentialcomics.com/comic/234

cyanydeez 20 hours ago [-]

ok, but alignment cuts both ways. Do you want your model talking about anti-vaccine ideas and advocating for ivermectin?

nullc 20 hours ago [-]

Not just discourse about real AI-- but there have been pretty clear examples of AI riffing on fictional AI (which is usually evil) in response to prompts saying that it's AI.

andai 20 hours ago [-]

Nomen est omen...

Ozzie-D 10 hours ago [-]

[dead]
