Chandan Singh | fairness

fairness



**Some notes on algorithmic fairness and STS.**

fairness metrics

  • good introductory blog
  • causes of bias
    • skewed sample
    • tainted examples
    • selectively limited features
    • sample size disparity
    • proxies of sensitive attributes
  • definitions
    • unawareness - withhold sensitive attributes from the model
      • flaw: other attributes can still signal for it
    • group fairness
      • demographic parity - mean predictions for each group should be approximately equal
        • flaw: the groups’ true base rates may differ, so forcing equal means can conflict with accurate prediction
      • equalized odds - predictions are independent of group given label
        • equality of opportunity: $p(\hat y = 1 \mid y = 1)$ is the same for both groups
      • predictive rate parity - Y is independent of group given prediction
    • individual fairness - similar individuals should be treated similarly
    • counterfactual fairness - replace attributes w/ flipped values
  • fair algorithms
    • preprocessing - remove sensitive information
    • optimization at training time - add regularization
    • postprocessing - change thresholds to impose fairness
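The group-fairness definitions above reduce to simple rate comparisons. A minimal numpy sketch (function names are my own; binary labels and two groups assumed):

```python
import numpy as np

def demographic_parity_gap(y_hat, group):
    """Absolute difference in mean positive-prediction rate between groups 0 and 1."""
    y_hat, group = np.asarray(y_hat), np.asarray(group)
    return abs(y_hat[group == 0].mean() - y_hat[group == 1].mean())

def equal_opportunity_gap(y_hat, y, group):
    """Absolute difference in true-positive rate p(y_hat=1 | y=1) between groups."""
    y_hat, y, group = map(np.asarray, (y_hat, y, group))
    tpr = lambda g: y_hat[(group == g) & (y == 1)].mean()
    return abs(tpr(0) - tpr(1))

# toy check: predictions that perfectly track the labels satisfy equal opportunity
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)
print(equal_opportunity_gap(y, y, group))  # 0.0
```

Postprocessing approaches then amount to picking per-group decision thresholds that shrink these gaps.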

fairness in cv (tutorial)

Computer vision in practice: who is benefiting and who is being harmed?

  • timnit gebru - fairness team at google
    • also emily denton
  • startups
    • faceception startup - profile people based on their image
    • hirevue startup videos - facial recognition for judging interviews
    • clearview ai - search all faces
    • police using facial recognition - harms protestors
    • facial recognition rarely has good uses
    • contributes to mass surveillance
    • can be used to discriminate different ethnicities (e.g. Uighurs in china)
  • gender shades work - models for gender classification were worse for black women
    • datasets were biased - PPB introduced to balance things somewhat
  • gender recognition is harmful in the first place
    • collecting data without consent is also harmful
  • letter to amazon: stop selling facial analysis technology
  • combating this technology
    • fashion for fooling facial recognition

data ethics

  • different types of harms
    • sometimes you need to make sure there aren’t disparate error rates across subgroups
    • sometimes the task just should not exist
    • sometimes the manner in which the tool is used is problematic because of who has the power
  • technology amplifies our intent
  • most people feel that data collection is the most important place to intervene
  • people are denied housing based on data-driven discrimination
  • collecting data
    • wild west - just collect everything
    • curatorial data - collect very specific data (this can help mitigate bias)
  • datasets are value-laden, drive research agendas
  • ex. celeba labels gender, attractiveness
  • ex. captions use gendered language (e.g. beautiful)

where do we go?

  • technology is not value-neutral – it’s political
  • model types and metrics embed values
  • science is not neutral, objective, or perspectiveless
  • be aware of your own positionality
  • concrete steps
    • ethics-informed model evaluations (e.g. disaggregated evaluations, counterfactual testing)
    • recognize limitations of technical approaches
    • transparent dataset documentation
    • think about perspectives of marginalized groups

misc papers

concrete harms

Technologies, especially world-shaping technologies like CNNs, are never objective. Their existence and adoption change the world in terms of

  • consolidation of power (e.g. facial-rec used to target Uighurs, increased rationale for amassing user data)
  • a shift toward the quantitative (which can lead to the type of click-bait extremization we see online)
  • automation (low-level layoffs, which also help consolidate power to tech giants)
  • energy usage (the exorbitant footprint of models like GPT-3)
  • access to media (deepfakes, etc.)
  • a lot more

pandemic

I hope the pandemic, which has boosted the desire for tracking, does not result in a long-term arc towards more surveillance.

  • from here: City Brain would be especially useful in a pandemic. (One of Alibaba’s sister companies created the app that color-coded citizens’ disease risk, while silently sending their health and travel data to police.) As Beijing’s outbreak spread, some malls and restaurants in the city began scanning potential customers’ phones, pulling data from mobile carriers to see whether they’d recently traveled. Mobile carriers also sent municipal governments lists of people who had come to their city from Wuhan, where the coronavirus was first detected. And Chinese AI companies began making networked facial-recognition helmets for police, with built-in infrared fever detectors, capable of sending data to the government. City Brain could automate these processes, or integrate its data streams.
  • “The pandemic may even make people value privacy less, as one early poll in the U.S. suggests”

ethics

  • Moral Trade (ord 2015) - moral trade = trade that is made possible by differences in the parties’ moral views
  • examples
    • one trading their eating meat for another donating more to a certain charity they both believe in
    • donating to/against political parties
    • donating to/against gun lobby
    • donating to/for pro-life lobby
    • paying non-profit employees
  • benefits
    • can yield Pareto improvements = strict improvements where something gets better while everything else remains at least constant
  • real-world examples
    • vote swapping (i.e. in congress)
    • vote swapping across states/regions (e.g. Nader Trader, VotePair) - ruled legal when money not involved
    • election campaign donation swapping - repledge.com (led by eric zolt) - was taken down due to issues w/ election financing
  • issues
    • factual trust - how to ensure both sides carry through? (maybe financial penalties or audits could solve this)
    • counterfactual trust - would one party have given this up even if the other party hadn’t?
  • minor things
    • fits most naturally with the moral framework of consequentialism
    • includes indexicals (e.g. prioritizing one’s own family)
    • could have uneven pledges

sts

  • social determinism - theory that social interactions and constructs alone determine individual behavior

  • technological determinism - theory that assumes that a society’s technology determines the development of its social structure and cultural values

  • do artifacts have politics? (winner 2009)
    • politics - arrangements of power and authority in human associations as well as the activities that take place within those arrangements
    • technology - smaller or larger pieces or systems of hardware of a specific kind.
    • examples
      • pushes back against social determinism - technologies have the ability to shift power
      • ex: nuclear power (consolidates power) vs solar power (democratizes power)
      • ex. tv enables mass advertising
      • ex. low bridges prevent buses
      • ex. automation removes the need for skilled labor
        • ex. tractors in grapes of wrath / tomato harvesters
      • ex. not making things handicap accessible
    • “scientific knowledge, technological invention, and corporate profit reinforce each other in deeply entrenched patterns that bear the unmistakable stamp of political and economic power”
      • pushback on use of things like pesticides, highways, nuclear reactors
    • technologies which are inherently political, regardless of use
      • ex. “If man, by dint of his knowledge and inventive genius has subdued the forces of nature, the latter avenge themselves upon him by subjecting him, insofar as he employs them, to a veritable despotism independent of all social organization.”
      • attempts to justify strong authority on the basis of supposedly necessary conditions of technical practice have an ancient history.
  • Disembodied Machine Learning: On the Illusion of Objectivity in NLP

  • less work for mother (cowan 1987) - technologies that seem like they save time rarely do (although they increase “productivity”)

  • The Concept of Function Creep - “Function creep denotes an imperceptibly transformative and therewith contestable change in a data-processing system’s proper activity.”

facial rec. demographic benchmarking

  • Face Recognition Vendor Test (FRVT) Part 3: Demographic Effects (grother et al. 2019), NIST
    • facial rec types
      • 1:1 == verification
      • 1:N == identification
    • data
      • domestic mugshots collected in the United States
      • application photographs from a global population of applicants for immigration benefits
      • visa photographs submitted in support of visa applicants
      • border crossing photographs of travelers entering the United States
    • a common practice is to use random pairs, but as the pairs are stratified to become more similar, the false match rate increases (Fig 3)
    • results
      • biggest errors seem to be in African Americans + East Asians
        • impact of errors - in verification, false positives can be security threat (while false negative is mostly just a nuisance)
      • In domestic mugshots, false negatives are higher in Asian and American Indian individuals, with error rates above those in white and black faces
      • possible confounder - aging between subsequent photos
      • better image quality reduces false negative rates and differentials
      • false positives to be between 2 and 5 times higher in women than men
      • one to many matching usually has same biases
        • a few systems have been able to remove bias in these false positives
      • did not analyze cause and effect
        • don’t consider skin tone
  • Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing (2020)
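The FRVT-style disaggregated results above boil down to comparing false match / false non-match rates per demographic group. A rough sketch, assuming similarity scores for genuine and impostor pairs (the function names and threshold are illustrative, not NIST’s):

```python
import numpy as np

def match_rates(scores, same_identity, threshold):
    """FMR = fraction of impostor pairs scoring at/above the threshold;
    FNMR = fraction of genuine pairs scoring below it."""
    scores = np.asarray(scores, dtype=float)
    same_identity = np.asarray(same_identity, dtype=bool)
    fmr = (scores[~same_identity] >= threshold).mean()
    fnmr = (scores[same_identity] < threshold).mean()
    return fmr, fnmr

def rates_by_group(scores, same_identity, groups, threshold):
    """Disaggregate FMR/FNMR per group, as in the FRVT demographic tables."""
    scores, same_identity, groups = map(np.asarray, (scores, same_identity, groups))
    return {g: match_rates(scores[groups == g], same_identity[groups == g], threshold)
            for g in np.unique(groups)}
```

Comparing the per-group dictionaries at a fixed threshold surfaces exactly the differentials (e.g. higher false positives for some groups) that the report tabulates.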

legal perspectives

  • “the master’s tools will never dismantle the master’s house.”
  • Title VII defines accountability under U.S. antidiscrimination law
    • protected attributes: sex, religion, national origin, color
    • law is all about who has the burden of proof - 3 steps
      • plaintiff identifies practice that has observed statistical disparities on protected group
      • defendant demonstrates practice is (a) job-related (b) consistent with business necessity
      • plaintiff proposes alternative
  • ex. dothard v. rawlinson: prison guards were selected based on weight/height rather than strength, so female applicants sued
    • legitimate target variable: strength
    • proxy variable: weight/height
    • supreme court ruled that best criterion is to assess strength, not use weight/height proxies
  • ex. redlining - people in certain neighborhoods do not get access to credit
    • legitimate target: ability to pay back a loan
    • proxy variable: zip code (disproportionately affected minorities)
  • 2 key questions
    • legitimate target variable - is unobservable target characteristic (e.g. strength) one that can justify hiring disparities?
      • disparate outcomes must be justified by reference to a legitimate “business necessity” (e.g. for hiring, this would be a required job-related skill)
    • biased proxy - do proxy variables (e.g. weight/height) properly capture the legitimate target variable?
      • problematic “redundant encodings” - a proxy variable can be predictive of a legitimate target variable and membership in a protected group

input accountability test - captures these questions w/ basic statistics

  • intuition: exclude input variables which are potentially problematic
    • in this context, easier to define fairness without tradeoffs
    • even in unbiased approach, still need things like subsidies to address systemic issues
  • the test
    • look at correlations between proxy and legitimate target, proxy and different groups - proxy should not systematically penalize members of a protected group
    • regression form
      • predict legitimate target from proxy: $Height_i = \alpha \cdot Strength_i + \epsilon_i$
      • measure if residuals are correlated with protected groups: $\epsilon_i \perp gender$
      • if they are correlated, exclude the feature
  • difficulties
    • target is often unobservable / has measurement err
    • have to define a threshold for testing residual correlations (maybe 0.05 p-values)
    • there might exist nonlinear interactions
  • major issues
    • even if features are independently okay, when you combine them in a model the outputs can be problematic
  • some propose balancing the outcomes
    • one common problem here is that balancing error rates can force the model to treat members of different groups differently
  • some propose using the best predictive model alone
    • some have argued that a test for fairness is that no other algorithm is as accurate while having less of an adverse impact (skanderson and ritter)
  • HUD’s mere predictive test - only requires that the prediction is good and that inputs are not substitutes for a protected characteristic
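The residual test described above can be sketched in a few lines. A minimal illustration with made-up variable names (the 0.05 cutoff echoes the threshold suggested in these notes; real applications would need to handle measurement error and nonlinearity):

```python
import numpy as np
from scipy import stats

def input_accountability_test(proxy, target, protected, sig_level=0.05):
    """Regress the proxy on the legitimate target, then flag the proxy for
    exclusion if its residuals correlate with protected-group membership."""
    proxy, target, protected = map(np.asarray, (proxy, target, protected))
    # OLS fit of proxy_i = a + b * target_i + eps_i
    X = np.column_stack([np.ones(len(target)), target])
    coef, *_ = np.linalg.lstsq(X, proxy, rcond=None)
    residuals = proxy - X @ coef
    # test whether eps_i is correlated with the protected attribute
    r, p_value = stats.pearsonr(residuals, protected)
    return {"exclude": p_value < sig_level, "corr": r, "p": p_value}
```

If `exclude` comes back true, the proxy carries information about group membership beyond what the legitimate target explains, which is exactly the “redundant encoding” the legal analysis worries about.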