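As enterprises progressively adopt artificial intelligence and large language models during digital transformation, data has become the core asset driving intelligent applications. In practice, enterprise data usually contains a great deal of sensitive information, such as user privacy data, trade secrets, financial information, and internal business logic. When this data is used for model training, prompt interaction, or API calls, it is highly exposed to leakage, misuse, and compliance risk. Once large models are widely connected to business systems, the exposure surface grows significantly across the whole "collection → transmission → processing → generation" chain, and traditional data security measures struggle to meet the new requirements. Enterprises therefore need a data de-identification solution designed for AI scenarios: one that identifies and processes sensitive information before the data reaches the model, protects private and sensitive data effectively, and reduces security risk and compliance pressure while preserving data usability, providing a foundation for deploying enterprise AI applications safely.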

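Microsoft Presidio is an open-source framework from Microsoft for identifying and de-identifying sensitive information. Its core capability is fast detection and anonymization of sensitive entities in text (with extended support for images). It can detect many kinds of private data, including credit card numbers, names, addresses, US Social Security numbers (SSN), Bitcoin wallet addresses, US phone numbers, and various financial data, and it offers operations such as replacement, masking, and anonymization. The framework consists of an Analyzer, which identifies sensitive entities, and an Anonymizer, which performs the data transformation. It supports a hybrid detection mechanism that combines rules (for example, regular expressions) with machine learning models, which gives it both high accuracy and good extensibility. It is widely used for enterprise data compliance (for example, GDPR and HIPAA), log sanitization, securing AI conversations, and building privacy-preserving data pipelines.
Presidio's overall workflow can be summarized as: input text → parallel detection by multiple models and rules (NER + Regex + Checksum + Context) → list of PII entities → de-identification operations applied per policy → anonymized output text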
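
To make the "Regex + Checksum + Context" part of this workflow concrete, here is a minimal standard-library sketch for credit card numbers. It is not Presidio's implementation: the pattern, the context words, and the two score constants are assumptions chosen for illustration only.

```python
import re

# All names and values below are hypothetical, chosen for this sketch.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits
CONTEXT_WORDS = {"credit", "card", "visa", "mastercard"}
BASE_SCORE = 0.5      # assumed score for a checksum-valid candidate
CONTEXT_BOOST = 0.35  # assumed bonus when a hint word precedes the match


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def detect_cards(text: str, window: int = 5):
    """Return (start, end, score) for each checksum-valid card-like number."""
    findings = []
    for match in CARD_CANDIDATE.finditer(text):
        if not luhn_valid(match.group()):
            continue  # the regex matched, but the checksum rejects it
        # Look at the last `window` words before the match for context hints.
        preceding = re.findall(r"[A-Za-z]+", text[:match.start()].lower())[-window:]
        score = BASE_SCORE
        if CONTEXT_WORDS.intersection(preceding):
            score = min(1.0, score + CONTEXT_BOOST)
        findings.append((match.start(), match.end(), round(score, 2)))
    return findings


print(detect_cards("My credit card is 4111 1111 1111 1111 and the CVV is 123."))
```

The regex only proposes candidates, the checksum discards digit runs that cannot be card numbers, and the nearby words raise confidence. Presidio layers these signals in a comparable way, with NER models added on top.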

Microsoft Presidio provides a demo page, which we use for a preliminary evaluation:
https://huggingface.co/spaces/presidio/presidio_demo

Hash processing:
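
As a rough illustration of what a hash-style operation does to a detected span, the following sketch replaces each span with its SHA-256 digest using only the standard library. The function name and the hard-coded span are hypothetical, and this is not Presidio's own operator code.

```python
import hashlib


def hash_spans(text: str, spans):
    """Replace each (start, end) span in `text` with its SHA-256 hex digest."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        digest = hashlib.sha256(text[start:end].encode("utf-8")).hexdigest()
        text = text[:start] + digest + text[end:]
    return text


sample = "Contact: john.doe@outlook.com"
# The span (9, 29) covers the email address; a real pipeline would take it
# from the detection step instead of hard-coding it.
print(hash_spans(sample, [(9, 29)]))
```

Because the same input always produces the same digest, hashed values can still be joined or counted across records, which plain replacement does not allow.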

Encryption processing:

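We can automate de-identification with a local Python script. The procedure is as follows:
Step 1: Install Presidio
pip install presidio-analyzer presidio-anonymizer
Step 2: Install the NLP model
pip install spacy
python -m spacy download en_core_web_lg
Note: If the download is too slow, you can instead download the corresponding whl file from GitHub, for example: https://github.com/explosion/spacy-models/releases/tag/en_core_web_lg-3.8.0

Then install it directly:
pip install en_core_web_lg-3.8.0-py3-none-any.whl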
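
Before moving on, it can help to confirm that the model package is actually importable in the current environment. The helper below is a small sketch with a hypothetical function name; it relies only on the fact that spaCy models install as regular Python packages.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


if not model_installed("en_core_web_lg"):
    print("en_core_web_lg is missing; install it before running the script")
```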
Step 3: Batch de-identification test
Test text:
Hello, my name is John Michael Doe.
I am a software engineer working at OpenAI Inc.
My employee ID is EMP-739201.
My email address is john.doe.personal@gmail.com and my secondary email is john.doe@outlook.com.
My phone number is +1 (415) 555-1234 and my backup number is 415-666-7890.
I live at 742 Evergreen Terrace, Springfield, CA 90210, United States.
My date of birth is 1992-08-15.
My passport number is X12345678 and my US Social Security Number is 123-45-6789.
My credit card is 4111 1111 1111 1111 and the CVV is 123, expiry 12/29.
My bank account number is 9876543210 and IBAN is GB29NWBK60161331926819.
My crypto wallet address is 0x4b20993Bc481177ec7E8f571ceCaE8A9e22C02db.
I often log in using username john_doe_92.
My IP address is 192.168.1.101.
My device ID is device-ABCD-9876.
I live near Central Park, New York.
My driving license number is D123-456-789-000.
Emergency contact: Jane Doe, phone +1-202-555-0199.
My GitHub account is https://github.com/johndoe.
My LinkedIn: https://linkedin.com/in/johndoe-engineer
I work in the AI department of OpenAI.
De-identification result:



De-identification script:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine


class PresidioDeidentifier:
    """Thin wrapper around Presidio's analyzer and anonymizer engines."""

    def __init__(self, language="en"):
        # With no arguments, AnalyzerEngine loads the built-in recognizers
        # and the default spaCy English model.
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()
        self.language = language

    def analyze_text(self, text: str):
        # Returns the detected entities (type, start, end, score).
        return self.analyzer.analyze(text=text, language=self.language)

    def anonymize_text(self, text: str, analyzer_results):
        # No operators are passed, so the default replacement is applied.
        result = self.anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
        )
        return result.text

    def process(self, text: str):
        results = self.analyze_text(text)
        return self.anonymize_text(text, results)


if __name__ == "__main__":
    engine = PresidioDeidentifier(language="en")
    print("====================================")
    print(" Presidio PII de-identification tool (multi-line input)")
    print(" Type END to finish the input")
    print("====================================\n")
    print("Enter the text to de-identify:\n")
    lines = []
    while True:
        line = input()
        if line.strip().upper() == "END":
            break
        lines.append(line)
    input_text = "\n".join(lines)
    print("\n====================================")
    print("Original input:\n")
    print(input_text)
    output = engine.process(input_text)
    print("\n====================================")
    print("De-identification result:\n")
    print(output)

As the result above shows, the crypto wallet address was not de-identified. We can handle it by defining custom recognizers (regex plus rules) that match it with a regular expression:
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)
from presidio_anonymizer import AnonymizerEngine


class CustomPresidioDeidentifier:
    def __init__(self, language="en"):
        self.language = language
        self.anonymizer = AnonymizerEngine()

        # ----------------------------
        # 1. Create the registry and load the built-in recognizers
        # ----------------------------
        registry = RecognizerRegistry()
        registry.load_predefined_recognizers()

        # ----------------------------
        # 2. Custom PII rules
        # ----------------------------
        # Crypto wallet (Ethereum)
        crypto_recognizer = PatternRecognizer(
            supported_entity="CRYPTO_WALLET",
            patterns=[
                Pattern(
                    name="eth_wallet",
                    regex=r"\b0x[a-fA-F0-9]{40}\b",
                    score=0.95,
                )
            ],
        )
        # Device ID
        device_recognizer = PatternRecognizer(
            supported_entity="DEVICE_ID",
            patterns=[
                Pattern(
                    name="device_id",
                    regex=r"\bdevice-[A-Za-z0-9\-]+\b",
                    score=0.95,
                )
            ],
        )
        # Username: loose rule with a low score. The lookahead requires at
        # least one digit or underscore; without it, every ordinary word of
        # four or more letters would match and be replaced.
        username_recognizer = PatternRecognizer(
            supported_entity="USERNAME",
            patterns=[
                Pattern(
                    name="username",
                    regex=r"\b(?=\w*[_\d])[a-zA-Z][a-zA-Z0-9_]{3,30}\b",
                    score=0.4,
                )
            ],
        )
        # GitHub / LinkedIn URL
        url_recognizer = PatternRecognizer(
            supported_entity="SOCIAL_URL",
            patterns=[
                Pattern(
                    name="github_linkedin_url",
                    regex=r"https?://(www\.)?(github|linkedin)\.com/[A-Za-z0-9_\-/]+",
                    score=0.95,
                )
            ],
        )

        # ----------------------------
        # 3. Register the custom recognizers
        # ----------------------------
        registry.add_recognizer(crypto_recognizer)
        registry.add_recognizer(device_recognizer)
        registry.add_recognizer(username_recognizer)
        registry.add_recognizer(url_recognizer)

        # ----------------------------
        # 4. Build the analyzer once, on the completed registry
        # ----------------------------
        self.registry = registry
        self.analyzer = AnalyzerEngine(registry=self.registry)

    # ----------------------------
    # 5. Detection
    # ----------------------------
    def analyze_text(self, text: str):
        return self.analyzer.analyze(text=text, language=self.language)

    # ----------------------------
    # 6. De-identification
    # ----------------------------
    def anonymize_text(self, text: str, results):
        return self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
        ).text

    # ----------------------------
    # 7. One-step pipeline
    # ----------------------------
    def process(self, text: str):
        results = self.analyze_text(text)
        return self.anonymize_text(text, results)


# ============================
# CLI entry point
# ============================
if __name__ == "__main__":
    engine = CustomPresidioDeidentifier(language="en")
    print("\n==============================")
    print(" Presidio PII de-identification tool (enhanced)")
    print(" Type END to finish the input")
    print("==============================\n")
    print("Enter the text to de-identify:\n")
    lines = []
    while True:
        line = input()
        if line.strip().upper() == "END":
            break
        lines.append(line)
    input_text = "\n".join(lines)
    print("\n==============================")
    print("Original text:\n")
    print(input_text)
    output = engine.process(input_text)
    print("\n==============================")
    print("De-identification result:\n")
    print(output)

The test result is as follows:
My crypto wallet is 0x4b20993Bc481177ec7E8f571ceCaE8A9e22C02db
My device ID is device-ABCD-9876
My username is john_doe_92
https://github.com/johndoe
END
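
The wallet, device ID, and URL patterns from the script can also be sanity-checked against these input lines without Presidio, using only the standard library. This is a sketch for checking the regular expressions themselves; the helper name is hypothetical, and it does not reproduce Presidio's scoring or overlap handling.

```python
import re

# Regular expressions copied from the custom recognizers above.
PATTERNS = {
    "CRYPTO_WALLET": r"\b0x[a-fA-F0-9]{40}\b",
    "DEVICE_ID": r"\bdevice-[A-Za-z0-9\-]+\b",
    "SOCIAL_URL": r"https?://(www\.)?(github|linkedin)\.com/[A-Za-z0-9_\-/]+",
}

SAMPLE_LINES = [
    "My crypto wallet is 0x4b20993Bc481177ec7E8f571ceCaE8A9e22C02db",
    "My device ID is device-ABCD-9876",
    "https://github.com/johndoe",
]


def find_entities(line: str):
    """Return (entity_type, matched_text) pairs for every pattern hit."""
    hits = []
    for entity_type, regex in PATTERNS.items():
        for match in re.finditer(regex, line):
            hits.append((entity_type, match.group()))
    return hits


for line in SAMPLE_LINES:
    print(line, "->", find_entities(line))
```

Checking patterns in isolation like this makes it easier to tell a faulty regex apart from a registration or language mismatch inside the analyzer.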
Here we use Docker to set up the Presidio service environment:
# Download images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer
# Run containers
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
We can then run a simple test through the REST API:
import json

import requests

analyze_url = "http://localhost:5002/analyze"
anonymize_url = "http://localhost:5001/anonymize"

# =========================
# 1. Read user input (multi-line, END to finish)
# =========================
print("Enter the text to de-identify (type END to finish):")
lines = []
while True:
    line = input()
    if line.strip().upper() == "END":
        break
    lines.append(line)
text = "\n".join(lines)
print("\n========================")
print("Text to process:")
print(text)
print("========================\n")

# =========================
# 2. Call the Presidio analyzer
# =========================
analyze_resp = requests.post(analyze_url, json={
    "text": text,
    "language": "en",
})
analyze_resp.raise_for_status()  # stop early on an HTTP error
entities = analyze_resp.json()
print("Detection result:")
print(json.dumps(entities, indent=2, ensure_ascii=False))
print("========================\n")

# =========================
# 3. Build the anonymizers dynamically
# =========================
anonymizers = {}
for e in entities:
    entity_type = e["entity_type"]
    # Default strategy: replace with the entity type in brackets
    if entity_type not in anonymizers:
        anonymizers[entity_type] = {
            "type": "replace",
            "new_value": f"[{entity_type}]",
        }

# =========================
# 4. Call the Presidio anonymizer
# =========================
anonymize_resp = requests.post(anonymize_url, json={
    "text": text,
    "analyzer_results": entities,
    "anonymizers": anonymizers,
})
anonymize_resp.raise_for_status()
result = anonymize_resp.json()
print("De-identification result:")
print("========================")
print(result)
print("========================")

The result of the test run is as follows:
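
The script applies one uniform replace rule to every detected entity type. The request body can also mix strategies per entity type. The sketch below only builds such a payload and sends nothing. The strategy table and function name are hypothetical, and the `DEFAULT` key together with the `mask` and `hash` parameters reflect our understanding of the anonymizer's REST API, so verify them against the version you deploy.

```python
import json

# Hypothetical per-entity strategies; any type not listed uses DEFAULT.
STRATEGIES = {
    "DEFAULT": {"type": "replace", "new_value": "[REDACTED]"},
    "PHONE_NUMBER": {
        "type": "mask",
        "masking_char": "*",
        "chars_to_mask": 4,
        "from_end": True,
    },
    "EMAIL_ADDRESS": {"type": "hash", "hash_type": "sha256"},
}


def build_anonymize_payload(text, entities):
    """Build the JSON body for the anonymize call from analyzer results."""
    found_types = {entity["entity_type"] for entity in entities}
    anonymizers = {"DEFAULT": STRATEGIES["DEFAULT"]}
    for entity_type in sorted(found_types):
        if entity_type in STRATEGIES:
            anonymizers[entity_type] = STRATEGIES[entity_type]
    return {
        "text": text,
        "analyzer_results": entities,
        "anonymizers": anonymizers,
    }


# Hand-written analyzer results, used only to exercise the function.
sample_entities = [
    {"entity_type": "PHONE_NUMBER", "start": 6, "end": 18, "score": 0.75},
    {"entity_type": "PERSON", "start": 23, "end": 31, "score": 0.85},
]
payload = build_anonymize_payload("Call: 415-666-7890 for Jane Doe", sample_entities)
print(json.dumps(payload, indent=2))
```

Keeping the strategy table separate from the request code lets the policy change without touching the calling logic.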


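Syntax structure
Structure of a Presidio custom rule:
Pattern(
    name="rule_name",
    regex=r"your_regex_here",
    score=0.5,  # any confidence value between 0.0 and 1.0
)

Field descriptions:
name: the name of the rule
regex: the regular expression used for matching
score: the confidence assigned to a match, from 0.0 to 1.0

Example of a custom rule: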
from presidio_analyzer import PatternRecognizer, Pattern

custom_recognizer = PatternRecognizer(
    supported_entity="DEVICE_ID",
    patterns=[
        Pattern(
            name="device_pattern",
            regex=r"device-[A-Za-z0-9\-]+",
            score=0.95,
        )
    ],
)

A single entity can also have multiple patterns:
PatternRecognizer(
    supported_entity="CRYPTO_WALLET",
    patterns=[
        Pattern(
            name="eth_wallet",
            regex=r"\b0x[a-fA-F0-9]{40}\b",
            score=0.95,
        ),
        Pattern(
            name="btc_wallet",
            regex=r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b",
            score=0.9,
        ),
    ],
)

We can also add context words ("semantic hints") to improve accuracy and strengthen the rule:
PatternRecognizer(
    supported_entity="CRYPTO_WALLET",
    patterns=[
        Pattern(
            name="eth_wallet",
            regex=r"\b0x[a-fA-F0-9]{40}\b",
            score=0.9,
        )
    ],
    # Words near a match that raise its confidence
    context=["wallet", "crypto", "address"],
)

Next, we run a simple language-specific de-identification test:

The de-identification result is as follows:

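When de-identifying, we can install the NLP model for the relevant language so that recognition benefits from a better understanding of the text. Models are available from spaCy's official collection of pretrained NLP models, which is used to download and manage models for different languages and sizes (note: lg is the large model, the most accurate of the three standard sizes; md is the medium model; sm is the small model, fast but less accurate; trf is the transformer-based model)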
https://spacy.io/models/zh#zh_core_web_lg
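
Once a model is chosen for each language, the choice is passed to Presidio as a configuration dictionary. The helper below is a hypothetical convenience that builds that dictionary for several languages at once. It assumes both models are already installed and only constructs the configuration, without loading anything.

```python
def build_nlp_configuration(models):
    """Build a spaCy NLP engine configuration from {lang_code: model_name}."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang_code, "model_name": model_name}
            for lang_code, model_name in models.items()
        ],
    }


configuration = build_nlp_configuration({
    "en": "en_core_web_lg",
    "zh": "zh_core_web_lg",
})
print(configuration)
```

Swapping `zh_core_web_lg` for a smaller or transformer-based variant then only changes one string.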

Here we choose Chinese. Before the model is installed, the test result is as follows:

After installing the Chinese NLP model, the de-identification result is as follows:
pip install zh_core_web_lg-3.8.0-py3-none-any.whl
Then modify the script so that it loads the NLP engine:
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class CustomPresidioDeidentifier:
    def __init__(self, language="zh"):
        self.language = language
        try:
            # ============================
            # NLP engine (Chinese)
            # ============================
            configuration = {
                "nlp_engine_name": "spacy",
                "models": [
                    {
                        "lang_code": "zh",
                        "model_name": "zh_core_web_lg",
                    }
                ],
            }
            provider = NlpEngineProvider(nlp_configuration=configuration)
            nlp_engine = provider.create_engine()

            # ============================
            # Registry with the built-in recognizers
            # ============================
            registry = RecognizerRegistry(supported_languages=["zh"])
            registry.load_predefined_recognizers(
                languages=["zh"], nlp_engine=nlp_engine
            )

            # ============================
            # Custom rules
            # ============================
            # supported_language must be "zh": PatternRecognizer defaults to
            # "en" and would be skipped when analyzing Chinese text.
            # Chinese characters count as word characters, so \b does not
            # fire between a Chinese character and an ASCII token; ASCII
            # lookarounds are used instead.
            crypto_recognizer = PatternRecognizer(
                supported_entity="CRYPTO_WALLET",
                supported_language="zh",
                patterns=[
                    Pattern(
                        name="eth_wallet",
                        regex=r"(?<![A-Za-z0-9])0x[a-fA-F0-9]{40}(?![A-Za-z0-9])",
                        score=0.95,
                    )
                ],
            )
            device_recognizer = PatternRecognizer(
                supported_entity="DEVICE_ID",
                supported_language="zh",
                patterns=[
                    Pattern(
                        name="device_id",
                        regex=r"(?<![A-Za-z0-9])device-[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?![A-Za-z0-9])",
                        score=0.95,
                    )
                ],
            )
            # The lookahead requires a digit or underscore so that ordinary
            # ASCII words are not flagged as usernames.
            username_recognizer = PatternRecognizer(
                supported_entity="USERNAME",
                supported_language="zh",
                patterns=[
                    Pattern(
                        name="username",
                        regex=r"(?<![A-Za-z0-9_])(?=[A-Za-z0-9_]*[_0-9])[a-zA-Z][a-zA-Z0-9_]{3,30}(?![A-Za-z0-9_])",
                        score=0.4,
                    )
                ],
            )
            url_recognizer = PatternRecognizer(
                supported_entity="SOCIAL_URL",
                supported_language="zh",
                patterns=[
                    Pattern(
                        name="github_linkedin_url",
                        regex=r"https?://(www\.)?(github|linkedin)\.com/[A-Za-z0-9_\-/]+",
                        score=0.95,
                    )
                ],
            )
            # Register the rules
            registry.add_recognizer(crypto_recognizer)
            registry.add_recognizer(device_recognizer)
            registry.add_recognizer(username_recognizer)
            registry.add_recognizer(url_recognizer)

            # ============================
            # Analyzer and anonymizer
            # ============================
            self.analyzer = AnalyzerEngine(
                nlp_engine=nlp_engine,
                registry=registry,
                supported_languages=["zh"],
            )
            self.anonymizer = AnonymizerEngine()
        except Exception as e:
            print("Initialization failed:", e)
            raise

    # ============================
    def analyze_text(self, text: str):
        return self.analyzer.analyze(text=text, language=self.language)

    def anonymize_text(self, text: str, results):
        # Explicit replacements for the custom entities; every other entity
        # type falls back to the anonymizer's default operator.
        operators = {
            "CRYPTO_WALLET": OperatorConfig("replace", {"new_value": "<CRYPTO_WALLET>"}),
            "DEVICE_ID": OperatorConfig("replace", {"new_value": "<DEVICE_ID>"}),
            "USERNAME": OperatorConfig("replace", {"new_value": "<USERNAME>"}),
            "SOCIAL_URL": OperatorConfig("replace", {"new_value": "<SOCIAL_URL>"}),
        }
        return self.anonymizer.anonymize(
            text=text,
            analyzer_results=results,
            operators=operators,
        ).text

    def process(self, text: str):
        results = self.analyze_text(text)
        return self.anonymize_text(text, results)


# ============================
# CLI (END to finish)
# ============================
if __name__ == "__main__":
    engine = CustomPresidioDeidentifier(language="zh")
    print("\n==============================")
    print(" Chinese PII de-identification tool")
    print(" Type END to finish the input")
    print("==============================\n")
    print("Enter the text to de-identify:\n")
    lines = []
    while True:
        line = input()
        if line.strip().upper() == "END":
            break
        lines.append(line)
    input_text = "\n".join(lines)
    print("\n==============================")
    print("Original text:\n")
    print(input_text)
    result = engine.process(input_text)
    print("\n==============================")
    print("De-identification result:\n")
    print(result)
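As enterprise digital transformation deepens, data has become a core factor of production, and the personal and sensitive information it contains brings unprecedented security challenges. Data de-identification is a key link in the data security system: it is both a necessary response to compliance requirements and a foundational capability for trustworthy data sharing. By introducing automated de-identification and a configurable policy engine, enterprises can make data "usable but not visible" across development, testing, analytics, and cross-department collaboration, keeping the business efficient while minimizing the risk of data leakage. By combining rule-based recognition, NLP-based semantic recognition, and dynamic policy management, de-identification solutions are evolving from traditional static processing into an intelligent, real-time security capability.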
https://spacy.io/models/
https://microsoft.github.io/presidio/
https://microsoft.github.io/presidio/samples/docker/