
Privacy Enhancing Technologies Summary

Privacy

Definitions

  • Privacy dictionary definition
    • the quality or state of being apart from company or observation : seclusion
    • freedom from unauthorized intrusion <one's right to privacy>

    ⇒ right to be let alone

  • CS definition of privacy
    the claim of individuals … to determine for themselves when, how, and to what extent information about them is communicated to others. ~ Alan Westin (1967)
  • Privacy Sphere model

    Modelling protection requirements (expectations) of classes of information as concentric circles of decreasing need for protection.

    Cf. the German constitutional court decisions on the absolute protection of the intimate sphere, which were later slightly diluted by the state-trojan decision

  • Problems of the sphere model
    • Data must be assigned to the corresponding spheres
    • The assignment may depend on context and situation…
  • Privacy mosaic model. Is it good?
    • Small snippets of information (probably) don’t expose a human
    • Loss (and aggregation) of several snippets lead to a mosaic of the individual
    • Increasing aggregation of puzzle pieces increases the detail of knowledge about the individual
    • Makes it possible to manage pieces that initially are not considered “intimate”
    • Independence of the way data is lost (or: collected)
    • Does not simplify determining criticality of pieces
    • Considers data capture/collection, but also further data processing
  • Privacy roles model
    • Humans act in roles depending on their situation 🎭
    • Usually specific information required to achieve certain task ✅
    • Group shared information according to context 🔂
    • Personas, various levels of sensitivity 🗝️
    • Individual images to be restrained to context ⭕
    • Transfer through 3rd parties may cause unknown leaks ❗
  • Consent Privacy Notion: Contextual Integrity
    • Linked to the privacy as roles model
    • Idea that data is shared with a specific mindset in a specific context

    ⇒ expected privacy

  • Types of violation of expected privacy
    • violation of appropriateness of revelation
      • the context “defines” if revealing a given information is appropriate
      • violation: information disclosed in one context (even a “public” one) may not be appropriate in another (e.g., asking a person a question while they take part in a gay pride parade vs. asking the same person at a governmental press conference)
    • violation of distribution
      • the context “defines” which information flows are appropriate
      • violation: inappropriate information flows between spheres or contexts; information disclosed in one context is used in another (passing it on, even if the first context was “public”)
  • Ethics in Research: Nuremberg code

    Nuremberg code → Ethics in Research → Privacy (in Research)

    1. Required is the voluntary, well-informed, understanding consent of the human subject in a full legal capacity.
    2. The experiment should aim at positive results for society that cannot be procured in some other way.
    3. It should be based on previous knowledge (like, an expectation derived from animal experiments) that justifies the experiment.
    4. The experiment should be set up in a way that avoids unnecessary physical and mental suffering and injuries.
    5. It should not be conducted when there is any reason to believe that it implies a risk of death or disabling injury.
    6. The risks of the experiment should be in proportion to (that is, not exceed) the expected humanitarian benefits.
  • Definition informational self-determination

    “The claim of individuals, groups and institutions to determine themselves, when, how and to what extent information about them is communicated to others” (GDPR: is processed)

  • Informational self-determination: Important underlying idea

    European law (GDPR) is based on that idea:

    The sovereign (self-determined) citizen controls collection and use, and can effectively retract even previously openly published data upon a change of mind

  • Difference: privacy and security

    Privacy wants to minimize trust assumptions

    • Security Model
      • CIA
    • Privacy Model
      • Sphere
      • Mosaic
      • Roles
      • Contextual Integrity
    • Security Communication Model Attacker
      • Eve
      • Mallory
    • Privacy Attacker
      • Passive, Active
      • Internal, External
      • Global, Local

  • Definition PETs

    PETs are coherent measures that protect privacy by

    • eliminating or reducing personal data or
    • preventing unnecessary/undesired processing of personal data
    • without losing the functionality of the system

Data Protection

  • Principles of data processing according to GDPR
    • collect and process personal data fairly and lawfully ⚖️
    • purpose binding ➰
      • keep it only for one or more specified, explicit and lawful purposes
      • use and disclose it only in ways compatible with these purposes
    • data minimization 🤏
      • adequate, relevant and not excessive wrt. the purpose
      • retained no longer than necessary
    • transparency 🪟
      • inform who collects which data for which purposes
      • inform how the data is processed, stored, forwarded etc.
    • user rights ✊
      • access to data
      • correction
      • deletion
    • keep the data safe and secure 🔒
  • How is personally identifiable information defined according to the GDPR

    “personal data” shall mean any information relating to an identified or identifiable natural person (‘Data Subject’);

  • Difference between US law and EU law wrt. personal data

    US: Name, address (Phone, Email), national identifiers (tax, passports), IP address, driving (vehicle registration, driver's license), biometrics (face, fingerprints), credit card numbers, date/place of birth (age, login name(s), gender, "race", grades, salary, criminal records)

    EU: 'personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person; [Art. 4, GDPR]

  • Are pseudonyms personally identifiable data

    Per the GDPR definition of personal data above (Art. 4), an identifiable natural person is one who can be identified directly or indirectly.

    Pseudonymous data can be linked back to an individual, and hence it is considered PII!

  • Definition of pseudonym according to GDPR

    A pseudonym is any unique piece of information corresponding to an identity (quasi-identifier)

  • Types of data
    • Data without any relation to individuals
      • Simulation data
      • Measurements from experiments
    • Data with relation to individuals
      • Types
        • Content
        • Metadata
      • Revelation
        • Consciously
        • Unconsciously
  • Sources for data with relation to individuals
    • Explicit
      • Created content
      • Comments
      • Structural interaction (contacts, likes)
    • Metadata
      • Session artifacts (time of actions)
      • interest (retrieved profiles; membership in groups/participation in discussions)
      • influence
      • Clickstreams, ad preferences
      • communication (end points, type, intensity, frequency, extent)
      • location (IP; shared; gps coordinates)
    • Inferred
      • Preference
      • Image recognition models
      • Personal details
    • Externally correlated
      • Observation in ad networks
  • Types of disclosures
    • Disclosure of identity
      • Identify an individual (in a dataset)
      • Link identity to an observation
    • Disclosure of attributes
      • Infer a (hidden) attribute of an individual
      • Link additional information to identity
  • Soft vs. Hard Privacy Technologies (with examples)
    • Soft: fully trust another party to keep data private in the user's interest
      • TLS
      • Privacy settings on a web platform
    • Hard: semi-trusted participants; the goal is to reduce trust to a minimum and to have mathematical privacy guarantees (e.g., mix networks, differential privacy)

Statistical Disclosure Control

  • The (virtual) Curator: a trusted party that holds the respondents' raw data and releases only aggregate statistics or answers to queries

Lecture 3: Privacy Notions in Anonymous Communication

  • Definition Anonymity

    Anonymity: “Anonymity of a subject means that the subject is not identifiable within a set of subjects, the anonymity set.”

  • IND-CPA
  • Communications
  • Sender Unobservability

    we do not learn who sends something

  • Sender-Message Unlinkability

    we do not learn who sends which message (same receiver)

  • Sender Unlinkability

    we do not learn who sends to whom (same message)

  • Sender-Receiver Unlinkability

    we do not learn who sends which message or to whom

  • Hierarchy of ACN guarantees
  • Is Sender-Message Unlinkability stronger than Sender Unobservability?
  • Is Sender Unobservability stronger than Sender Unlinkability?
  • Is Sender-Receiver Unlinkability stronger than Sender-Message Unlinkability?
  • How does Crowds work? (a small simulation is sketched below)
  • Crowds privacy guarantees
    • Sender Unobservability
    • Passive external receiver
    • Higher latency; management overhead
    • Availability risk (blenders)
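
A minimal simulation of the Crowds forwarding rule, assuming the standard Reiter–Rubin behavior: the initiator hands the request to a randomly chosen crowd member, and every member then forwards it to another random member with probability p_forward or submits it to the destination otherwise. Crowd size, p_forward, and the seed are made-up illustration values.

```python
import random

def crowds_path(n_members: int, p_forward: float, initiator: int) -> list[int]:
    """Simulate one Crowds request and return the sequence of jondos it visits."""
    # The initiator always hands the request to a randomly chosen member
    # (possibly itself) before the probabilistic forwarding starts.
    path = [initiator, random.randrange(n_members)]
    # Each member forwards to another random member with probability p_forward,
    # otherwise it submits the request to the destination server.
    while random.random() < p_forward:
        path.append(random.randrange(n_members))
    return path

if __name__ == "__main__":
    random.seed(0)
    paths = [crowds_path(n_members=20, p_forward=0.75, initiator=0) for _ in range(10_000)]
    avg_len = sum(len(p) for p in paths) / len(paths)
    # Expected number of hops after the initiator is 1 / (1 - p_forward) = 4.
    print(f"average path length (incl. initiator): {avg_len:.2f}")
```
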
  • Chaum’s Mix: Mix Cascade
  • Chaum’s Mix drawback
    • Availability drawback: Cascades = single point of failure
    • Improve Availability: Free-route mix networks
      • route is not fixed, any sequence of nodes from the network can be used for relaying messages
  • Mix Systems: mixing strategies
  • DC-Nets (a toy round is sketched below)
  • Properties of DC-Nets
    • Sender Unobservability
    • Global passive adversary and up to n-2 corrupt participants
    • High bandwidth overhead; collisions and DoS; scalability issues
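
A toy dining-cryptographers round to make the XOR trick concrete, assuming pairwise shared one-time keys between all participants: each participant broadcasts the XOR of the keys it shares (the sender additionally XORs in its message), and the XOR of all broadcasts yields the message without revealing who sent it. Participant count and message are illustrative.

```python
import secrets

def dc_net_round(n: int, sender: int, message: bytes) -> bytes:
    """One DC-net round with n participants; returns the combined broadcast."""
    length = len(message)
    # Every unordered pair (i, j) shares a fresh random one-time key.
    keys = {(i, j): secrets.token_bytes(length) for i in range(n) for j in range(i + 1, n)}

    def output(i: int) -> bytes:
        # Participant i XORs all keys it shares; the sender also XORs in the message.
        out = bytearray(message if i == sender else bytes(length))
        for (a, b), key in keys.items():
            if i in (a, b):
                out = bytearray(x ^ y for x, y in zip(out, key))
        return bytes(out)

    broadcasts = [output(i) for i in range(n)]
    # Each pairwise key appears in exactly two broadcasts, so it cancels out and
    # only the sender's message survives the global XOR.
    combined = bytearray(length)
    for b in broadcasts:
        combined = bytearray(x ^ y for x, y in zip(combined, b))
    return bytes(combined)

if __name__ == "__main__":
    print(dc_net_round(n=5, sender=2, message=b"hi"))  # b'hi'
```
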
  • Protocol classes: Name, Goal, Adversary, Cost

Lecture 4: Privacy Metrics

  • Common components of privacy metrics
    • Adversary goals
      • Goals include
        • identifying a user
        • user properties (interests, preferences, location, etc.)
      • Metrics are defined for a specific adversary
    • Adversary capabilities
      • Attacker’s success depends on its capabilities
      • Metrics can only be employed to compare two PETs if they rely on the same adversary capabilities
      • Taxonomy
        • Local-global
        • Passive-active
        • Internal-External
        • Prior knowledge
        • Resources
    • Data sources
      • Published data
      • Observable data
      • Repurposed data
      • All other data
    • Input of metric
      • Prior knowledge of the adversary
      • Adversary’s resources
      • Adversary’s estimate
      • Ground truth/true outcome
      • Parameters
    • Output measures
      • Uncertainty-based privacy metrics
        • Assume that low uncertainty in the adversary’s estimate correlates with low privacy
        • The majority of these privacy metrics rely upon information-theoretic quantities (e.g., entropy)
        • Origin in anonymous-communication systems
        • Examples (a small computation of these measures is sketched after this list)
          • Anonymity set (size)
            • Given a target member 𝑢, it is defined as the (size of the) set of members the adversary cannot distinguish from 𝑢
            • The larger the anonymity set, the more anonymity a member is enjoying
            • Widely used metric, not only in ACSs
            • Simplicity, tractability are positive properties of this metric
            • However: it only depends on the number of members in the system
          • Shannon Entropy
          • Normalized Shannon’s entropy
          • Rényi’s entropy
          • Interpretation of entropy measures
            • Cross-Entropy
      • Information gain/loss-based privacy metrics
        • Measure how much information is gained by an adversary after the attack
        • Originate from information theory
        • Applied to a variety of information, although mostly in anonymous communications and databases
        • Well-known examples include
          • Relative entropy
          • Interpretation of relative entropy
          • Mutual information
          • Loss of anonymity
          • Information privacy assessment metric (IPAM)
      • Data-similarity-based privacy metrics
        • Arise in the context of database anonymity
        • Measure properties of observable or published data
        • Derive the privacy level based on the features of disclosed data
        • Well-known examples include
          • 𝑘-Anonymity (a simple checker is sketched at the end of this section)
            • Limitations of 𝑘-anonymity
          • 𝑝-Sensitive, 𝑘-anonymity
            • Limitations of 𝑝-sensitive, 𝑘-anonymity

              Skewness Attack: If the relative frequency of a value within a cluster differs wildly from the overall one, a possibly more sensitive value can be strongly predicted for a target.

          • 𝑙-Diversity
            • Limitations of 𝑙-diversity
          • 𝒕-Closeness
          • stochastic 𝑡-closeness
      • Metrics based on adversary’s success probability
        • Capture how likely the adversary is to compromise our privacy in one or several attacks
        • High privacy correlates with low success probability
        • Examples include
          • Degree of anonymity
          • Sender anonymity
      • Indistinguishability-based privacy metrics
        • Is the adversary able to distinguish between two outcomes of a PET?
        • The harder for the adversary to distinguish any pair of outcomes, the higher the privacy provided by the PET
        • Typically binary metrics
        • Examples include
          • Differential privacy
          • Individual differential privacy
      • Error-based privacy metrics
        • Measure the error an adversary may make in their attempt to estimate unknown private information
        • Examples include
          • Correctness, by Shokri et al
          • Mean Squared error

            Most popular measure of utility

          • Attacker’s estimation error
      • Time-based privacy metrics
        • The output is time, an important resource for adversaries to compromise user privacy
        • Pessimistically assume the adversary will succeed at some point
          • Time until adversary’s success
            • Define “success”: Able to identify 𝑛 out of 𝑁 of the target’s possible communication peers
          • Maximum tracking time
            • Privacy defined as the cumulative time the attacker tracks a user
            • Assumes tracking is carried out only if the size of the anonymity set is 1
            • Optimistic privacy metric
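
A small sketch computing some of the uncertainty-based measures listed above from an adversary's probability estimate over candidate senders: anonymity-set size, Shannon entropy, normalized Shannon entropy, the corresponding effective set size, and min-entropy (a Rényi entropy of order infinity). The example distribution is made up.

```python
import math

def anonymity_measures(p: list[float]) -> dict[str, float]:
    """Uncertainty-based measures for an adversary's distribution over candidate senders."""
    assert abs(sum(p) - 1.0) < 1e-9
    support = [q for q in p if q > 0]
    shannon = -sum(q * math.log2(q) for q in support)                # Shannon entropy H(X)
    return {
        "anonymity_set_size": len(support),                          # candidates not ruled out
        "shannon_entropy_bits": shannon,
        "normalized_shannon_entropy": shannon / math.log2(len(p)),   # H(X) / H_max, in [0, 1]
        "effective_set_size": 2 ** shannon,                          # "perplexity" reading of H(X)
        "min_entropy_bits": -math.log2(max(p)),                      # Renyi entropy of order infinity
    }

if __name__ == "__main__":
    # Adversary's estimate: one user looks much more likely than the other four.
    estimate = [0.6, 0.1, 0.1, 0.1, 0.1]
    for name, value in anonymity_measures(estimate).items():
        print(f"{name}: {value:.3f}")
```
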
  • The Inference Privacy Fallacy
    • We measure the privacy of the data release mechanism
    • We cannot protect against adaptation of the prior (and the corresponding inference)
    • General: if statistics are revealed, they are either useless or they help improve the prior
  • Queryable databases protections
    • Query perturbation
      • Deterministically correct answers not needed
      • Input vs output perturbation
    • Query restriction
      • Deterministically correct and exact answers are needed
      • Refuse to answer sensitive queries
    • Camouflage
      • Deterministically correct but non-exact answers are okay
      • Answer with a small interval containing each confidential value
  • Methods for microdata protection
    • Masking methods: generate a modified version of the original data
      • Perturbative: modify data
        • Microaggregation: Mask by grouping and replacement by “mean” value
        • Microaggregation: SSE
        • Data swapping
        • Noise addition
          • Uncorrelated noise addition
            • Neither variances nor correlations are preserved
          • Correlated noise addition
            • Means and correlations can be preserved
          • Noise addition and linear transformation
          • Noise addition and non-linear transformation
        • Differential privacy for microdata
      • Non-Perturbative: do not modify the data, but rather produce partial suppressions or reductions of detail in the original dataset
        • Sampling
          • Publish random sample of the original set of records
          • Correlation determines which properties are retained (uncorrelated: none)
          • Continuous numerical data need further protection
        • Generalization/Coarsening
        • Global recoding
        • Local suppression
    • Synthetic methods: generate synthetic or artificial data with similar statistical properties
      • Extract chosen, preserved statistics from microdata (probabilities, distributions, ML models)
      • Randomly generate data (sampling, transformation)
      • Pros:
        • Possibility to generate “unlimited” data sets
        • seem to address the reidentification problem, as data are “synthetic”
      • Cons:
        • Published synthetic records can match an individual’s data if the model is not private
        • Data utility limited to the statistics captured by the model
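
A simple checker for the data-similarity metrics discussed above (see the 𝑘-anonymity bullet): records are grouped by their quasi-identifier values, k is the smallest equivalence-class size, and l is the smallest number of distinct sensitive values within any class (distinct l-diversity). The toy records and attribute names are invented.

```python
from collections import defaultdict

def k_and_l(records: list[dict], quasi_ids: list[str], sensitive: str) -> tuple[int, int]:
    """Return (k, l): minimum equivalence-class size and minimum sensitive diversity."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r)  # equivalence classes
    k = min(len(g) for g in groups.values())
    l_div = min(len({r[sensitive] for r in g}) for g in groups.values())
    return k, l_div

if __name__ == "__main__":
    # Toy microdata: ZIP code and age bracket are quasi-identifiers,
    # the diagnosis is the sensitive attribute.
    data = [
        {"zip": "761**", "age": "20-30", "diagnosis": "flu"},
        {"zip": "761**", "age": "20-30", "diagnosis": "cancer"},
        {"zip": "761**", "age": "30-40", "diagnosis": "flu"},
        {"zip": "761**", "age": "30-40", "diagnosis": "flu"},
    ]
    k, l_div = k_and_l(data, quasi_ids=["zip", "age"], sensitive="diagnosis")
    print(f"k-anonymity: {k}, l-diversity: {l_div}")  # k = 2, l = 1
```
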

Differential Privacy

  • ε-Differential privacy
  • Randomized response protocol (see the sketch after this list)
  • Laplace mechanism

  • Post-processing theorem
  • Composability

    A privacy model is composable if the privacy guarantees of the model are preserved (possibly to a limited extent) after repeated independent application of the privacy model. From the opposite perspective, a privacy model is not composable if multiple independent data releases, each of them satisfying the requirements of the privacy model, may result in a privacy breach.

  • Sequential composition
  • Parallel composition
  • Exponential mechanism
  • Utility of the exponential mechanism
  • Query forgery
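
A minimal sketch of the coin-flip variant of randomized response: each respondent answers truthfully with probability 1/2 and otherwise answers with a fair coin, which satisfies ε-differential privacy with ε = ln 3; the analyst de-biases the aggregate afterwards. Population size and the true rate are made-up illustration values.

```python
import random

def randomized_response(truth: bool) -> bool:
    """Coin-flip randomized response: P[report = truth] = 3/4, hence eps = ln 3."""
    if random.random() < 0.5:
        return truth                      # first coin: heads -> answer truthfully
    return random.random() < 0.5          # tails -> answer with a second fair coin

def estimate_rate(reports: list[bool]) -> float:
    """Unbiased estimate of the true 'yes' rate, since E[report] = pi/2 + 1/4."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

if __name__ == "__main__":
    random.seed(1)
    true_rate = 0.30
    population = [random.random() < true_rate for _ in range(100_000)]
    reports = [randomized_response(x) for x in population]
    print(f"true rate: {sum(population) / len(population):.3f}, "
          f"estimate from noisy reports: {estimate_rate(reports):.3f}")
```
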

Privacy Notions in Trajectory Data

  • Challenges of Trajectory Data Privacy

    Sparse data, where only a few data points are sufficient to identify someone

  • Ways to model trajectories

Problems of Syntactic Techniques

  • Problem of Suppression
    • Drastic reduction of database
    • Dangerous when used by itself

  • Problem of Generalization
    • Not generalizing all dimensions
    • Inappropriate regions definition
    • Background knowledge attacks
    • Drastic reduction of precision
    • Dangerous when used by itself
  • Problem of Masking
    • Unpredictable biases
    • Impossible trajectories

Semantic Techniques

  • Privacy Notions in Trajectory Data
  • Event-neighborhood
  • Geo-indistinguishability (a planar-Laplace sampler is sketched after this list)
  • w-event neighborhood

    Neighboring streams may differ in several entries, but only within a window of w consecutive events; entries far from each other may not all change (event-neighborhood, in contrast, allows changing only a single entry)

  • ℓ-trajectory privacy
  • Distance between users
  • Distance of trajectories
    • Euclidean distance
    • Hausdorff & Fréchet distances
    • DTW (dynamic time warping; sketched below)
    • LCSS
    • EDR
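
A straightforward dynamic-programming sketch of DTW for two trajectories given as point sequences, assuming Euclidean point distance; the example trajectories are made up.

```python
import math

Point = tuple[float, float]

def dtw(a: list[Point], b: list[Point]) -> float:
    """Dynamic time warping distance between two trajectories, O(len(a) * len(b))."""
    inf = float("inf")
    # cost[i][j] = DTW distance between the prefixes a[:i] and b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            # a point may be matched to several points of the other trajectory
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

if __name__ == "__main__":
    t1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
    t2 = [(0.0, 0.1), (1.0, 0.1), (3.0, 1.1)]
    print(f"DTW(t1, t2) = {dtw(t1, t2):.3f}")
```
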
  • Granularity notions and their concept of neighborhood
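
A sketch of the planar Laplace mechanism commonly used to achieve geo-indistinguishability (listed above): the reported location is the true one plus 2-D noise whose density decays as exp(−ε·d); in polar coordinates the angle is uniform and the radius is Gamma(2, 1/ε) distributed. Coordinates are assumed to be planar (e.g., meters in a local projection), and ε and the locations are illustrative values.

```python
import math
import random

def planar_laplace(x: float, y: float, eps: float) -> tuple[float, float]:
    """Report a noisy location satisfying eps-geo-indistinguishability (planar Laplace)."""
    theta = random.uniform(0.0, 2.0 * math.pi)   # direction of the noise: uniform
    r = random.gammavariate(2.0, 1.0 / eps)      # radius: density eps^2 * r * exp(-eps * r)
    return x + r * math.cos(theta), y + r * math.sin(theta)

if __name__ == "__main__":
    random.seed(0)
    # eps in 1/meters: smaller eps means more noise and stronger indistinguishability.
    for _ in range(5):
        nx, ny = planar_laplace(500.0, 300.0, eps=0.01)
        print(f"reported location: ({nx:.1f}, {ny:.1f})")
```
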

Mechanisms Achieving Differential Privacy

  • ℓ1-sensitivity
  • Laplace mechanism (see the sketch below)
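
A minimal sketch of the Laplace mechanism: for a numeric query f with ℓ1-sensitivity Δf, releasing f(D) plus Laplace noise of scale Δf/ε per coordinate satisfies ε-differential privacy. The counting query below has sensitivity 1 (adding or removing one record changes the count by at most 1); the dataset and ε values are made up.

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, eps: float,
                      rng: np.random.Generator) -> float:
    """Release true_answer + Lap(sensitivity / eps); this satisfies eps-DP."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ages = [23, 35, 41, 29, 58, 62, 33]            # toy dataset
    true_count = sum(1 for a in ages if a >= 40)   # counting query, l1-sensitivity 1
    for eps in (0.1, 1.0):
        noisy = laplace_mechanism(true_count, sensitivity=1.0, eps=eps, rng=rng)
        print(f"eps={eps}: true count {true_count}, noisy answer {noisy:.2f}")
```
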

Databases

  • SDC

    Statistical disclosure control (SDC) is the field that protects statistical databases so that they can be released without revealing confidential information that can be linked to specific individuals among those to which the data correspond

  • SDC vs PPDM vs PIR
    • SDC aims to provide respondent privacy
    • Privacy-preserving data mining (PPDM) seeks database owner privacy
    • Private information retrieval (PIR) aims for user/analyst privacy
  • External attack
  • Internal attack

    The attacker is a respondent whose own data is in the database: having only two respondents is not enough, since each can deduce the other's value from a released aggregate.

    Other respondents can also collude and jointly subtract their own contributions.

  • Dominance attack

    When one respondent is known to be a dominant outlier and the average of the remaining respondents is (approximately) known, the outlier's value (the maximum) can be inferred from the released total or average (see the small numeric sketch below).
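
A tiny numeric illustration of the dominance attack: if a sum (or average) is released and one respondent is known to dominate it, an attacker who can roughly estimate the other respondents' contributions recovers the dominant value. All figures are invented.

```python
# Released statistic: total payroll of an 11-person department.
released_sum = 1_250_000

# Background knowledge: 10 ordinary employees earn roughly 45,000 each,
# and one executive is known to dominate the sum.
estimated_others = 10 * 45_000

# The attacker's estimate of the executive's salary (true value: 800,000).
print(f"estimated dominant salary: {released_sum - estimated_others}")
```
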

  • Methods for microdata protection
    • Masking methods: generate a modified version of the original data
      • Perturbative: modify data
        • Noise addition, microaggregation, rank swapping, microdata rounding, and resampling (a microaggregation sketch follows this list)
      • Non-perturbative: do not modify the data but rather produce partial suppressions or reductions of detail in the original dataset
        • Sampling, global recoding, top and bottom coding, and local suppression
    • Synthetic methods: generate synthetic or artificial data with similar statistical properties
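
A sketch of univariate microaggregation using a simple fixed-group-size heuristic (not the optimal partition): sort the values, build groups of at least k consecutive values, and replace every value by its group mean; the within-group sum of squared errors (SSE) measures the information loss. Values and k are made up.

```python
def microaggregate(values: list[float], k: int) -> tuple[list[float], float]:
    """Replace each value by its group mean (groups of >= k sorted values); return (masked, SSE)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Partition the sorted indices into groups of size k; a too-small tail is
    # merged into the last group, so every group has between k and 2k-1 members.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    masked = [0.0] * len(values)
    sse = 0.0
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
            sse += (values[i] - mean) ** 2
    return masked, sse

if __name__ == "__main__":
    salaries = [31_000, 33_000, 35_000, 47_000, 52_000, 55_000, 90_000]
    masked, sse = microaggregate(salaries, k=3)
    print("masked values:", masked)
    print("SSE (information loss):", round(sse))
```
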
  • Output perturbation vs. input perturbation

    The Laplace mechanism belongs to the class of output-perturbation mechanisms: noise is added to the query result. Input perturbation (e.g., randomized response) instead adds noise to each respondent's data before it is collected or queried.