Open weights set up a clash between safety and security, with abliteration, censorship, and the incentives around releasing weights all at stake. Post 7 argues this is a governance problem as much as a safety one, and that "it's their model, not yours" when creators choose their guardrails [1].
Openness, Safety, and Security — When you release a model, you decide which guardrails ship with the weights; defending those guardrails against abliteration is a security choice, not purely a safety claim. The takeaway is blunt: if you don't like the model you're given, you can build your own LLM and tune or censor it as you wish [1].
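For context, abliteration is usually described as estimating a "refusal direction" in activation space and projecting it out of a model's weights. The snippet below is a minimal sketch of that projection step only, assuming you already have a PyTorch weight matrix and an estimated direction; it is not the specific attack or the defense from the referenced paper, and the layer names in the comment are illustrative.

```python
import torch

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project an estimated 'refusal direction' out of a weight matrix.

    weight:    (d_out, d_in) parameter, e.g. an MLP down-projection.
    direction: (d_out,) vector estimated from refusal vs. non-refusal activations.
    Returns a copy of `weight` with the output component along `direction` removed.
    """
    d = direction / direction.norm()            # ensure unit norm
    # Subtract the rank-1 component d d^T W so outputs no longer move along d.
    return weight - torch.outer(d, d @ weight)

# Hypothetical usage on one layer of a loaded model (attribute names are assumptions):
# layer.mlp.down_proj.weight.data = ablate_direction(
#     layer.mlp.down_proj.weight.data, refusal_direction)
```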
RLHF Transparency and Public Datasets — Full RLHF instruction sets aren't publicly released, but researchers can study alignment patterns through public datasets. Open examples include NVIDIA's Nemotron-Post-Training-Dataset-v1, the Tulu 3 SFT mixture, and UltraFeedback, which reveal how preferences and tone are shaped without leaking internal docs [2]. Hugging Face datasets are often cited as the entry point for this kind of analysis [2].
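As one way in, these post-training datasets can be streamed from the Hugging Face Hub and scanned for alignment patterns such as refusal phrasing. A minimal sketch, assuming the `datasets` library, the dataset ID `allenai/tulu-3-sft-mixture`, and a `messages` role/content schema (check the dataset card; split names and fields may differ):

```python
from datasets import load_dataset

# Markers are a crude heuristic for refusal-like replies.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")

# Dataset ID, split, and schema are assumptions; verify against the dataset card.
ds = load_dataset("allenai/tulu-3-sft-mixture", split="train", streaming=True)

total = refusals = 0
for example in ds.take(5_000):                  # small sample for a quick look
    messages = example.get("messages", [])
    reply = next((m["content"] for m in reversed(messages)
                  if m.get("role") == "assistant"), "")
    total += 1
    refusals += any(marker in reply.lower() for marker in REFUSAL_MARKERS)

print(f"refusal-like replies: {refusals}/{total}")
```

The same loop can be pointed at UltraFeedback or Nemotron-Post-Training-Dataset-v1 to compare how often, and in what tone, refusals appear across mixtures.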
Governance Signals Across Providers — Across providers, the way alignment is encoded (tone, refusals, cultural framing) becomes a governance signal. Analyses comparing Anthropic, OpenAI, and Mistral show how safety and alignment norms travel between teams, even when the exact RLHF instructions stay private [2]. And as OpenAI's safety work on GPT-OSS shows, practical tuning and guardrails are real-world governance tools, not just theory [1].
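One lightweight way to surface such signals is to send the same prompt set to several providers' models and compare how often each one refuses. A hedged sketch of the scoring step only, assuming responses have already been collected into a dict keyed by provider (API calls omitted; provider names and sample strings are placeholders):

```python
# Responses per provider to the same prompts; collection code is omitted.
responses = {
    "provider_a": ["I can't help with that.", "Sure, here is an overview..."],
    "provider_b": ["I'm sorry, but I can't assist.", "Here's a summary..."],
}

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def refusal_rate(replies: list[str]) -> float:
    """Fraction of replies containing a simple refusal marker."""
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in replies)
    return hits / max(len(replies), 1)

for provider, replies in responses.items():
    print(f"{provider}: refusal rate {refusal_rate(replies):.0%}")
```

Marker matching is deliberately crude; differences in tone and cultural framing need closer qualitative reading, but even raw refusal rates make the governance signal visible.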
Closing thought: openness, safety, and alignment data will keep shaping how we govern LLMs as they scale.
References
[1] An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Discusses safety vs security, open weights, abliteration, censorship, and incentives for researchers; argues model owners control release of their models.
[2] [R] How to retrieve instructions given to annotators - RLHF
Explores ethics and feasibility of accessing RLHF instructions; recommends analyzing public datasets for alignment patterns and cross-provider bias without leaks.