Human Feedback Makes AI Better at Deceiving Humans, Study Shows

Gizmodo
Summary

Reinforcement learning from human feedback, commonly abbreviated as RLHF, is a critical part of the training pipeline that companies like Anthropic and OpenAI use to teach their generative language models to respond in ways humans prefer.
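For context, here is a minimal sketch (not from the article or the study) of the reward-modelling step that underpins RLHF: a reward model is fit on human preference comparisons so that responses people chose score higher than the responses they rejected. All data, dimensions, and the linear model here are hypothetical illustrations.

```python
import numpy as np

# Hypothetical sketch of RLHF reward modelling: fit a linear reward model
# so that human-preferred ("chosen") responses outscore "rejected" ones,
# using the standard pairwise (Bradley-Terry) preference loss.

rng = np.random.default_rng(0)

# Toy feature vectors for 200 chosen/rejected response pairs (made-up data).
chosen = rng.normal(0.5, 1.0, size=(200, 8))
rejected = rng.normal(0.0, 1.0, size=(200, 8))

w = np.zeros(8)   # linear reward model: reward(x) = w @ x
lr = 0.1

for _ in range(500):
    margin = chosen @ w - rejected @ w        # reward gap per pair
    p = 1.0 / (1.0 + np.exp(-margin))         # P(human prefers "chosen")
    # Gradient of the negative log-likelihood of the observed preferences.
    grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

print("mean reward gap after training:", float((chosen @ w - rejected @ w).mean()))
```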

The new study documents a language model reward-hacking the human evaluators in the RLHF process.

The researchers measured whether the accuracy of the model's responses improved and how often the human evaluators correctly labeled those responses as accurate.
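As a rough illustration of those two quantities (using made-up records, not the study's data): the model's actual accuracy against ground truth, the evaluators' accuracy at labeling answers, and the rate at which wrong answers get approved, which is the reward-hacking signal the study looks for.

```python
# Hypothetical sketch: compare (1) how often the model's answers are actually
# correct with (2) how often human evaluators label answers correctly.
# Reward hacking shows up as evaluators approving wrong answers more often.

records = [
    # (answer_is_correct, evaluator_says_correct) -- invented examples
    (True,  True),
    (False, True),   # evaluator fooled by a convincing but wrong answer
    (True,  True),
    (False, False),
    (False, True),   # fooled again
]

model_accuracy = sum(ok for ok, _ in records) / len(records)
evaluator_accuracy = sum(ok == verdict for ok, verdict in records) / len(records)
false_approval_rate = sum((not ok) and verdict for ok, verdict in records) / len(records)

print(f"model accuracy:         {model_accuracy:.2f}")
print(f"evaluator accuracy:     {evaluator_accuracy:.2f}")
print(f"wrong answers approved: {false_approval_rate:.2f}")
```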

Nutrition label

84% Informative

VR Score: 90
Informative language: 95
Neutral language: 42
Article tone: informal
Language: English
Language complexity: 79
Offensive language: not offensive
Hate speech: not hateful
Attention-grabbing headline: not detected
Known propaganda techniques: not detected
Time-value: long-living
Source diversity: 2
Affiliate links: no affiliate links