Thoughts on the Anthropic Responsible Scaling Policy (RSP)
Nov 17
2 min read
Author - Anshu Gupta
In the spirit of RTFM, I just read through the Anthropic Responsible Scaling Policy (RSP). Kudos to the Trust & Safety team and security folks at Anthropic for all the good work. My thoughts:
Things which I really liked
Establishing the position of Responsible Scaling Officer (RSO), along with a process through which Anthropic staff may anonymously notify the RSO of any potential instances of noncompliance, and internal safety procedures for incident scenarios.
Establishing the RSP itself, which was an industry first; other AI labs are yet to follow Anthropic's lead and publish their own versions of an RSP. Anthropic publicly commits not to train or deploy models capable of causing catastrophic harm unless it has implemented safety and security measures that will keep risks below acceptable levels.
The iterative nature of the policy, where more safeguards are deployed as capability thresholds increase.
Acknowledgement that protecting against state-level adversaries would require a higher AI Safety Level (ASL) than the current ASL-2.
Focus on intruder deception techniques (honeypots with fake model weights) among other security controls; see the first sketch after this list. Most companies don't invest enough in this space, both from a tooling and a staffing perspective.
Commitment to security resourcing, with roughly 5-10% of employees dedicated to security and security-adjacent work.
Measures for cases where safeguards cannot be met, e.g. blocking model responses, downgrading to a less-capable model in a particular domain, or increasing the sensitivity of automated monitoring; the second sketch after this list illustrates the idea.
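On the honeypot point, here is a minimal sketch of what a decoy-weights canary could look like: plant a fake checkpoint and alert on any read access. The file path, alert function, and polling approach are all my own illustrative assumptions, not anything from the RSP (and access-time polling only works on filesystems that still update atime).

```python
import os
import time

# Hypothetical decoy location; a real deployment would plant this alongside real artifacts.
DECOY_PATH = "./decoy_model_weights.bin"
POLL_SECONDS = 30


def plant_decoy(path: str) -> None:
    """Create a plausible-looking but fake weights file if it does not already exist."""
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(os.urandom(1024 * 1024))  # random bytes stand in for "weights"


def alert_security_team(message: str) -> None:
    """Placeholder: in practice this would page on-call or write to a SIEM."""
    print(f"[ALERT] {message}")


def watch_decoy(path: str) -> None:
    """Poll the decoy's access time; any read after planting is suspicious."""
    baseline = os.stat(path).st_atime
    while True:
        time.sleep(POLL_SECONDS)
        atime = os.stat(path).st_atime
        if atime > baseline:
            alert_security_team(f"Decoy weights at {path} were read at {time.ctime(atime)}")
            baseline = atime


if __name__ == "__main__":
    plant_decoy(DECOY_PATH)
    watch_decoy(DECOY_PATH)
```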
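And on the last bullet, the block/downgrade/monitor measures are easy to picture as a small gate in front of the serving stack. A minimal sketch, where is_high_risk, the keyword heuristic, and the two model tiers are all hypothetical stand-ins rather than anything the RSP specifies:

```python
def is_high_risk(prompt: str, sensitivity: float) -> bool:
    """Toy risk score; a real system would use a trained safety classifier."""
    risky_terms = ("synthesize", "exploit", "pathogen")
    score = sum(term in prompt.lower() for term in risky_terms) / len(risky_terms)
    return score >= sensitivity


def call_frontier_model(prompt: str) -> str:
    return f"[frontier model answer to {prompt!r}]"


def call_smaller_model(prompt: str) -> str:
    return f"[less-capable model answer to {prompt!r}]"


def respond(prompt: str, monitoring_sensitivity: float = 0.6) -> str:
    """Apply interim measures when safeguards for the frontier model cannot be met."""
    if is_high_risk(prompt, sensitivity=0.9):
        # Measure: block the response outright in the clearest cases.
        return "Request blocked pending safeguard review."
    if is_high_risk(prompt, sensitivity=monitoring_sensitivity):
        # Measure: downgrade to a less-capable model in this domain.
        return call_smaller_model(prompt)
    # Measure: tightening monitoring amounts to lowering monitoring_sensitivity fleet-wide.
    return call_frontier_model(prompt)
```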
Things which I would like to see in the RSP
Inclusion of language on child safety and compliance with relevant child safety standards. There is a lot of scope for innovation here, with AI testing AI, so that we avoid the kind of trauma faced by Facebook content moderators in the past. It seems some child safety testing was done as part of the multi-modal red-teaming.
Inclusion of language on high-yield explosives in addition to CBRN.
Inclusion of language in the Appendix on the capability levels of malicious actors, including state-level adversaries. For example, the Operational Capacity definitions from the RAND report Securing AI Model Weights.
Modification of the language which calls for notification of the US Govt.; the language should be neutral, something like "notification to relevant government bodies in the specific jurisdictions we operate in".
Language around controls built for self-harm prevention, for example providing resources to individuals who are ideating on self-harm while using the platform (a toy version is sketched after this list).
Disclaimers, e.g. on how many languages model safety testing was done in without using classifiers. For example, even though it is not security-relevant, for Claude 3 preference data was only obtained for the following languages: Arabic, French, German, Hindi, Japanese, Korean, Portuguese, and Simplified Chinese.
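For the self-harm item above, the kind of control I have in mind is simple to sketch: detect likely ideation in the user's message and attach crisis resources to the reply. The keyword heuristic and resource text below are placeholders of mine, not anything Anthropic has published.

```python
CRISIS_RESOURCES = (
    "If you are thinking about harming yourself, please consider reaching out to "
    "a local crisis line or emergency services."
)

# Placeholder cues; a production system would use a dedicated safety classifier.
SELF_HARM_CUES = ("kill myself", "end my life", "hurt myself")


def detects_self_harm_ideation(text: str) -> bool:
    lowered = text.lower()
    return any(cue in lowered for cue in SELF_HARM_CUES)


def add_safety_resources(user_message: str, model_reply: str) -> str:
    """Append crisis resources whenever ideation is detected in the user's message."""
    if detects_self_harm_ideation(user_message):
        return f"{model_reply}\n\n{CRISIS_RESOURCES}"
    return model_reply
```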