Auto PII Tagging
Auto PII tagging marks each column as Sensitive or NonSensitive, using the two approaches described below.
Tagging logic
- Column Name Scanner: We validate the table's column names against a set of regex rules that match common English naming patterns for email addresses, SSNs, bank accounts, etc.
- Entity Recognition: If sample data ingestion is enabled, we validate the sample rows against an Entity Recognition engine that flags any sensitive information from a list of supported entities. In that case, the confidence parameter lets you tune the minimum score required to tag a column as PII.Sensitive.
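As a rough illustration of the Column Name Scanner idea, the sketch below checks column names against a few regex rules. The patterns, the labels, and the function name scan_column_name are hypothetical and chosen only for this example; they are not the rule set that OpenMetadata ships.

```python
import re
from typing import Optional

# Hypothetical rules for illustration only; the real rule set is not shown here.
SENSITIVE_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "ssn": re.compile(r"(^|_)ssn($|_)|social[-_]?security", re.IGNORECASE),
    "bank_account": re.compile(r"bank[-_]?acc(oun)?t|iban", re.IGNORECASE),
}


def scan_column_name(column_name: str) -> Optional[str]:
    """Return the label of the first rule matching the column name, or None."""
    for label, pattern in SENSITIVE_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return label
    return None
```

A scanner like this only looks at names, so a column called created_at yields no match, while user_email matches the email rule regardless of what the column actually contains.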
Note that if a column is already tagged as PII, we skip it and do not run the auto tagging on it.
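The interplay between the confidence parameter and the skip rule for already-tagged columns can be sketched as follows. This is a minimal illustration, not the actual implementation: the Column class, the tag_from_score function, and the 0 to 1 score scale are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Column:
    """Hypothetical stand-in for a table column and its tags."""
    name: str
    tags: List[str] = field(default_factory=list)


def tag_from_score(column: Column, score: float, confidence: float = 0.8) -> Optional[str]:
    """Apply PII.Sensitive when the recognizer score reaches the threshold.

    The score scale (0 to 1) and the default threshold are assumptions.
    """
    # Columns that already carry a PII tag are left untouched.
    if any(tag.startswith("PII.") for tag in column.tags):
        return None
    # Only scores at or above the configured confidence produce a tag.
    if score >= confidence:
        column.tags.append("PII.Sensitive")
        return "PII.Sensitive"
    return None
```

Raising the confidence value makes tagging stricter (fewer columns tagged), while lowering it tags more columns at the cost of more false positives.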
Troubleshooting
In OpenMetadata, the auto-classification feature primarily applies the PII classification, tagging data as either Sensitive or NonSensitive. The General classification, which includes tags such as Address and Name, is not available in OpenMetadata. This functionality is present in Collate and is expected to be included in the open-source release starting from version 1.7.1.
SSL: CERTIFICATE_VERIFY_FAILED
If you see an error similar to:
We have identified this scenario on some corporate Windows laptops. The profiler tries to download the Entity Recognition model, but the download request fails because of certificate issues.
A solution here is to manually download the model on the ingestion container / Airflow host by running:
If using Docker, you might want to customize the openmetadata-ingestion image to have this command run there by default.