This article is adapted and expanded from a presentation I made as part of a data ethics panel at the AI, Ethics and Society Conference on May 8, 2019. The topic for the session was the ethical line between our private selves and public lives in the context of contemporary machine learning practices.
This is a discussion about data and boundaries between public and private, which is interesting in light of news last week from CBC about a Toronto lawyer whose electronics were seized at the Canadian border when he refused to provide his passwords. “Public and Private” intersect with “rights and responsibilities,” so I decided to trace a path from my interest in privacy education to some examples of the intersection between machine learning, privacy, and data disclosure. This led to some newer thoughts about how we currently frame privacy rights. The short version is this: that the dominant approach to privacy is a flawed one, and rhetoric about “empowering” people to take control of their privacy is problematic.
You may have guessed that I’m a privacy advocate. And before I go any further, it’s worth talking about what I mean when I talk about privacy. The word “privacy” is overloaded, and there is no universally-applicable definition that everyone agrees on: even my own thoughts on it vary as I consider other people’s perspectives. For the sake of this discussion, though, I want to frame “privacy” primarily according to the taxonomy defined by Alan Westin in his 1967 book, Privacy and Freedom.
Westin outlines privacy in terms of four “states:”
- solitude (the right to be “left alone”)
- intimacy (the ability to share things with those to whom we are close)
- reserve (the ability to ‘carve out’ aspects of identity by withholding information in certain contexts – for example, what you share at work, versus what you share on your anonymous Reddit account), and
- anonymity (freedom from identification, even in public)
All warrant in-depth discussion (and for a great overview in the Canadian context, I recommend chapter 2 of Sandra Schwab’s Master’s thesis on the topic), but the last one, anonymity, is probably the most salient. Anonymity, according to Westin, is freedom from identification and surveillance, even in public spaces. I might even go a bit beyond Westin to an approach advocated by Brandeis and Warren in 1890, where privacy is framed as a right to autonomy.
I’d like to suggest that our current ability to collect data and process it to target or predict future behaviour is where the bulk of friction between privacy and machine learning arises.
Why I care about this stuff
I love teaching privacy workshops. I am on the tail end of my combined Master’s degree in Digital Humanities and Library & Information Studies, and have a particularly strong passion for the very un-sexy domain of information policy, including policy related to privacy and surveillance.
Of course, I am not at all unique: many, many people teach about privacy and online privacy tools. What librarianship provides me, though, is a belief that everyone—regardless of technical background or ability—should be able to take control of their online data sharing and vulnerability to tracking. My sense is that privacy literacy is viewed as something that requires technical literacy as a prerequisite: before you graduate to dealing with privacy issues, you have to know how computers and technology work.
I see many inaccessible versions of privacy workshops. For example, privacy hackathons were once popular. I don’t like the word “hackathon:” it implies that participation requires high-technology and coding skills, and I have no doubt that it privileges certain “kinds” of people—mostly male, mostly white.
The workshop we built
Few people will argue that privacy rights are not fundamental, but there is less awareness of the growing “privacy divide:” those who can afford higher-end devices, pay for privacy-centred services, or have access to educational resources are able to realize their privacy rights more than people who are disadvantaged due to age, gender, race, economic status and other factors. I think that everyone—regardless of those factors—is entitled to understand the myriad ways they are targeted, and must also have complete access to privacy.
Based on that foundation, then, a group of University of Alberta students partnered to create an openly-licensed resource guide and a three-hour interactive workshop aimed at providing two things:
- an overview of online privacy and security issues in a contemporary Canadian context
- some control over data collection and online tracking, gained through hands-on work with tools
The ultimate goal was simple: people could bring whatever device they had to the workshop, and they would leave with heightened awareness and at least one or two tools, installed and running, that they could use to monitor and control data collection and tracking.
In the past two years we have conducted workshops for graduate students and staff on the University of Alberta campus, conducted a more resource-focused session as part of an Edward Snowden appearance last year, and, in partnership with Edmonton Public Library last fall, presented a series of workshops at suburban library branches, where we focused on individuals that self-identified as “non-technical.” Attendees at those workshops were almost exclusively older, largely female, and majority non-white.
Reflections on the responses of people who attended those workshops are my launching point. From there, I’ll share some thoughts on privacy rights, challenges with data collection, and the implications for machine learning processes that use “public data” (those are very sarcastic quotes).
One: “Awareness” is a convenient scapegoat
The first of three main points I want to make is this: there is minimal awareness of the breadth and scope of data collection, but “increasing awareness” is not a total solution.
As part of the groundwork for the workshops we teach, we talk about the kinds of personal information that people have (or generate), the types of people and organizations that want that data, and the myriad ways it is collected. There is always a point in this discussion where I have to stop and check in with the workshop attendees, because they’ve started to look like rabbits caught in headlights. The findings of a 2017 study from Pakistan align with mine: the more aware users are of what data is being collected, the more uneasy they become about that collection.
The thing that causes anxiety spikes for our workshop participants is that the data collection practices they know the least about are the ones that are the most ubiquitous and the most difficult to monitor or circumvent.
People usually understand “direct collection,” when they are asked to provide personal information in return for a service. They know they can decline to disclose details or provide intentionally-inaccurate information: a throwaway email address, an incorrect birthdate, or an alternate name. People often know to ask if personally-identifying data will be shared or sold, and will avoid services that are brazen about those practices. Beyond these examples, though, awareness and comfort both break down.
A Pew Research study of Americans from 2015 says 93% of people think that controlling what other people know about them is important, but people are quite disturbed once they begin to understand how organizations share or sell data with one another—especially after claiming that it’s been “anonymized,” which does nothing to protect privacy—or how the evolving data market collects and combines datasets, sometimes with additional processing (such as personality profiling) to create increasingly-more-specific user profiles.
Once people are made aware of how ubiquitous online tracking and data collection are, the discomfort peaks. In the workshop, I use a Firefox extension called Lightbeam to show how even “trusted” sites like CNN will reference more than 100 third-party websites when you visit them in an unprotected web browser. At this point, I often have the impulse to hand out tinfoil so workshop participants can start to make hats.
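To make the idea of “third-party” references concrete, here is a minimal Python sketch of the kind of tally a tool like Lightbeam presents. It is not Lightbeam’s code: the function names, the request log, and the domain names are all invented for illustration, and the “site” of a URL is approximated as the last two labels of its hostname, which is an assumption that fails for suffixes like .co.uk.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Crude 'site' key: the last two labels of the hostname.

    Assumption for illustration only; a real tool would consult the
    Public Suffix List instead of guessing.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_sites(page_url: str, request_urls: list[str]) -> list[str]:
    """Return the distinct sites contacted that differ from the page's own site."""
    first_party = site_of(page_url)
    contacted = {site_of(u) for u in request_urls}
    return sorted(contacted - {first_party, ""})


# Hypothetical request log for a single page visit (all names invented).
page = "https://www.news-site.example/world"
request_log = [
    "https://cdn.news-site.example/app.js",              # same site: first party
    "https://ads.tracker-one.example/pixel.gif",         # third party
    "https://metrics.tracker-two.example/collect?id=1",  # third party
    "https://ads.tracker-one.example/sync",              # repeat of a third party
]

print(third_party_sites(page, request_log))
```

In this invented log, one page visit touches two distinct outside sites; the workshop demonstration is the same count carried out on a real page.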
This information frightens people because it highlights how little they know about online observation and data collection, and how powerless they are to stop most of it. Mainstream media has warned us about getting hacked, about using strong passwords, and about not falling for phone or email scams. Thanks to media reports, people are also aware that data provided to companies like Facebook can be compromised and shared without consent. But people are still largely unaware of how Facebook and Google collect data about them while they surf the web, and how even “trusted” websites like CNN are complicit in this regime.
The Privacy Paradox
Increasing awareness causes alarm bells to ring, but it doesn’t really address the underlying problem. A recent study from the Netherlands, which targeted people of means who have technical skills and an awareness of privacy issues, highlights another facet of the issue. The study was part of an examination of the “privacy paradox,” which is the observation that people who are aware of intrusive data collection and anti-privacy practices still do very little to protect their own privacy rights. The study looked at app choices on mobile devices, and found that factors like price, ratings and design trump concerns about privacy across the board. Moreover, when people were given the choice between a free app that collected data and a paid version that did the same thing and guarded privacy, they generally preferred the free one. A paradox, indeed: why would someone who knows they need a tinfoil hat to go online still refuse to wear one?
A paper by Deuker (2010) asserts that the privacy paradox is made up of several factors, including the idea that we minimize privacy risks relative to other risks (for example, health or safety or financial risks), that instant gratification clouds our ability to see potential future threats, and that we usually get incomplete information about:
- what data is being disclosed
- the consequences of data disclosure
Deuker suggested that awareness-raising could counteract the privacy paradox, but the Netherlands study finds otherwise.
Two: consent is contextual and easily manipulated online
Deuker’s point about incomplete information leads to my second point: that “informed and meaningful consent” for data collection is slippery, because humans are terrible at predicting the future. We’re also bad at evaluating the present, it seems, since we know that almost nobody reads terms of service or privacy policies.
Two machine learning examples, drawn from relatively recent events, will illustrate what I mean about the impact of not being able to predict the future. Both examples are more in the realm of usage rights than privacy rights, but the connection will be clear.
Here’s the first example. Rebecca Forster is an author who was upset in 2016 when she learned that some of her 29 novels were being used by Google to help products like Google Assistant sound more conversational. The “Books Corpus” used to do this work was collected from “free books” available online, including published and unpublished material.
Forster knew her writing was available for free online, because she made it available for free. However, she did not predict that her “free book” could also be used by Google for product development. Authors like Forster self-publish on sites like Smashwords.com, and can set any price they want for their work. Smashwords defends creators with a strong stance on copyright enforcement and piracy, but when it comes to pricing they suggest that making at least some content free increases readership. The purpose of a site like Smashwords is to connect authors with readers, and most authors assume “readers” are people… but when it comes to data used as input to machine learning processes, a free book is a free book.
The second example is more recent. NBC reported in March 2019 that IBM was pulling photos of people from the popular website Flickr to train its facial recognition AI. What is ironic about this situation, in light of the current discussion about bias and oppression embedded in algorithms, is that IBM went to Flickr because they wanted a diverse set of photos they could use to reduce discrimination and bias in their machine learning systems. Their intent was noble, but artists whose photos were harvested from Flickr still expressed alarm that their work was being used for this purpose.
The NBC story noted a lack of consent in its headline, and within a few days other outlets posted rebuttals. They asserted that IBM had committed no copyright violations, and the assertion was correct: IBM was 100% respectful of the open Creative Commons licenses that photographers had assigned to their Flickr uploads.
Like Forster’s “free book” example, artists who posted images of faces to Flickr had a limited view of what their open licence meant for downstream uses of their content, and were dismayed to discover that their work was used for something they hadn’t considered.
Ryan Merkley, the CEO of Creative Commons, shared his thoughts on the issue:
“CC licenses were designed to address a specific constraint, which they do very well: unlocking restrictive copyright. But copyright is not a good tool to protect individual privacy, to address research ethics in AI development, or to regulate the use of surveillance tools employed online. Those issues rightly belong in the public policy space.” (Ryan Merkley, March 13, 2019)
The bottom line is this: “informed” consent to data collection and data use is almost impossible to achieve. It’s irresponsible to assume that providing information means everyone receives that information, and it’s irresponsible to assume that new uses of information can rely on the same consent obtained for old uses.
Three: data is not property
My third and final point is where I most strongly believe we need to reset our thinking. There is a fundamental problem with thinking of data as property, and no privacy protection regime based on that principle can be successful.
To contextualize this idea, I’ll highlight some discourse about privacy rights from roughly twenty years ago. One fascinating and very readable report from 1999 is by Ann Cavoukian, who served as Ontario’s privacy commissioner from 1997 through 2014. Much of the report focused on what she called the “market-based” (or economic) approach to privacy, where the private sector was allowed to self-regulate in determining what would and would not constitute privacy violations. Even in 1999, when online services were just beginning to saturate the web, Dr. Cavoukian was already expressing concern about how market forces and secondary markets for data might impact private-sector data collection and personal privacy. And as we know from her recent resignation from the Sidewalk Labs project in Toronto, she still has those concerns.
Cavoukian’s report reminded me of a principle that I, as a formally-educated librarian, already understand and appreciate. This principle, referred to as “information commodification,” is that the public sphere is under constant assault from corporate forces that seek to treat information as a rival good: a commodity, like a physical object, that can be used as an exchange of value. Issues of copyright, intellectual property and digital copy protection are front-of-mind for information professionals, but it strikes me that privacy is often overlooked as part of the same overall trend. And, just as our culture has been conditioned to think of digital data as property, we have become conditioned to thinking of privacy-related data as something that someone owns. This harms our ability to have a proper discussion about the problem.
“My data” and “my privacy”
I’m guilty of this myself, and have recently begun acknowledging it when it appears in conversations about privacy. We frame “my privacy” as something used to protect “my data:” my personal information, my location, my address, my interests, my photos, my books, my intellectual property. The problem with this model is that it breaks quickly when we apply it to the full range of privacy issues associated with data collection and data use.
For example: my name may not be unique but it is certainly “mine”, and so it’s easy to think of my name as “my data.” But is “my location” really “mine”? Do I own the fact that Google knows where I am as long as I have an Android phone in my pocket? Few would dispute that my location within my own home should be private, but what about when I’m outside, where current Canadian jurisprudence suggests I have no “reasonable expectation of privacy”? [A side note, because this still surprises many: court interpretations of the Canadian Charter of Rights and Freedoms have consistently reinforced the notion of a “reasonable expectation of privacy,” and privacy is therefore a quasi-constitutional right, but aside from references to physical security and unreasonable search and seizure, there is no explicit right to privacy in our Charter.]
The location example is a little bit fuzzier when we apply the lens of data ownership to privacy, but we might still be able to reconcile it by extending the property metaphor to include the “capture and storage” of location information. In other words, the information is public until it is recorded and used in a way that identifies me; this is one of the ideas explored by the Supreme Court of Canada in the recent R v Jarvis decision. A recent New York Times opinion piece also covers the topic in-depth, referring to the problem as a “loss of obscurity.”
Where it breaks
The breakdown of the “property” approach to privacy becomes clearer, though, when we abstract just a bit further. If Google takes data “about me” from multiple sources, aggregates and anonymizes it, and then processes that data to make an accurate prediction of my activity, associations, interests, and preferences, is that “my data”?
Shoshana Zuboff, who asks a very salient question when she muses about why Google, a “search company,” is now making home thermostats, discusses the work of Alex Pentland in Chapter 15 of her terrifying and amazing book, The Age of Surveillance Capitalism. She draws attention to Pentland’s “God’s Eye View” and “social physics” concepts, which relate to how information observed from a distance (such as location observations) can be used to develop detailed predictions for individuals. Pentland is seen as the “godfather of wearable computing,” and one of his lab’s “reality mining” experiments, based on a study of only 100 participants wearing mobile tracking devices, resulted in the ability to make 90%-accurate predictions of where people were likely to be at any time, and what they were likely to be doing.
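To illustrate why routine location observations lend themselves to prediction, here is a deliberately naive Python sketch. It is not Pentland’s method and says nothing about the accuracy figure above: it simply guesses that a person will be wherever they have most often been seen at the same weekday and hour. The function names and the observation history are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(observations):
    """Count how often each place was observed in each (weekday, hour) slot.

    observations: iterable of (weekday, hour, place) tuples.
    """
    counts = defaultdict(Counter)
    for weekday, hour, place in observations:
        counts[(weekday, hour)][place] += 1
    return counts


def predict(model, weekday, hour):
    """Return the most frequently observed place for the slot, or None if unseen."""
    slot = model.get((weekday, hour))
    return slot.most_common(1)[0][0] if slot else None


# Hypothetical observation history for one person (all values invented).
history = [
    ("Mon", 9, "office"), ("Mon", 9, "office"), ("Mon", 9, "cafe"),
    ("Mon", 20, "home"), ("Tue", 9, "office"),
]

model = build_model(history)
print(predict(model, "Mon", 9))   # the place seen most often on Mondays at 9
print(predict(model, "Sun", 3))   # None: no observations for that slot
```

Even this crude frequency count needs no name, address, or other “owned” data; the prediction comes entirely from observed behaviour.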
Here I think it’s clear that the property model of privacy breaks down, since data is collected en masse, theoretically-anonymized and aggregated to secure “my data.” And yet: Google’s ability to predict my preferences and future behaviour based on its unseen but internally-shared observations of my activity on Google Maps, YouTube, Gmail, Calendar, Nest, and a growing list of other products creates an almost universal reaction of “deeply creepy.” Despite the fact that “my data” is out of the picture, Google’s ability to predict my future behaviour can be seen as a violation of privacy since it impacts my solitude, my anonymity, and my autonomy.
A return to privacy as a human right
As a reminder, Cavoukian’s report was written in 1999; Westin’s famous book was published in 1967; and the original UN Universal Declaration of Human Rights entrenched its right to privacy in Article 12 in 1948. All of these examples suggest to me that an earlier view of privacy as a human right has now been subsumed by its perception as an economic right.
To address and resolve issues of data collection and surveillance as they relate to privacy, I believe a human rights-based approach is more sensible.
Interestingly, the United Nations released their Human Rights Based Approach to Data (HRBAD) last year, as part of their Sustainable Development Goals Framework. Among its six main principles, it advocates for three that seem most relevant to the ethics of data use for machine learning:
- public participation in all aspects of data collection, including analysis and dissemination
- self-identification of parameters for data collection (a society’s view of demographics may shift, and the public should be able to mandate that a new parameter, such as gender identity, may be of interest to the public good)
- transparency of data at every stage of its use
The six principles also include confidentiality related to information that can be used to identify individuals. Note the careful use of language, which bypasses data ownership to encompass situations where anonymous data is de-anonymized by cross-referencing it with other information.
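The cross-referencing described above is often called a linkage attack. A minimal Python sketch, using entirely invented records and a hypothetical `reidentify` function, shows the idea: a dataset with names removed can still be matched against a second dataset that shares a few ordinary attributes.

```python
# "Anonymized" records: names removed, but quasi-identifiers kept.
# All records below are invented for illustration.
anonymized = [
    {"postal_prefix": "T6G", "birth_year": 1980, "gender": "F", "diagnosis": "asthma"},
    {"postal_prefix": "T5K", "birth_year": 1975, "gender": "M", "diagnosis": "diabetes"},
]

# A separate, public dataset that includes names and the same attributes.
public = [
    {"name": "Person A", "postal_prefix": "T6G", "birth_year": 1980, "gender": "F"},
    {"name": "Person B", "postal_prefix": "T5K", "birth_year": 1975, "gender": "M"},
    {"name": "Person C", "postal_prefix": "T5K", "birth_year": 1975, "gender": "M"},
]

KEYS = ("postal_prefix", "birth_year", "gender")


def reidentify(anonymized, public, keys=KEYS):
    """Link an anonymized record to a name when exactly one public record shares its keys."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for record in anonymized:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches[names[0]] = record["diagnosis"]
    return matches


print(reidentify(anonymized, public))
```

In this toy data the first record is re-identified because only one person shares its attributes, while the second stays ambiguous because two people do. No single field in the “anonymized” set is a name, which is why language about information that can be used to identify individuals covers more ground than language about data ownership.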
Conclusion: suggestions for data ethics in relation to privacy
This article has taken us from small-scale, local privacy tools workshops all the way to United Nations Sustainable Development Goals, which is ambitious even for me. What summary can we take from the journey, though, as it relates to ethical data use and machine learning?
I offer the following. Given that:
- the idea that people are responsible for their own privacy is highly fallible, based on their inability to make rational choices about data collection, their non-use of tools for protection, and their inability to predict how data will be used by non-human systems in the future, and
- the “property model” of privacy is a poor frame for understanding the collection and use of data derived from human behaviour and experience,
...accountability lies with the machine learning community for taking stock of privacy issues, rather than downloading the problem to the community through unilateral “use policies” or implied consent decisions. The UN’s Human Rights Based Approach to Data is designed for governments and public institutions, but its scope can be generalized and applied to data use in the private sector as well.
My proposal is as follows:
- Where data is collected, it should be done with the full knowledge and participation of those under observation.
- In cases where existing data is being used for machine learning processes, we must ask questions about the source of the data, the nature of the consent used to obtain it, and the ability for that data to affect our rights to withhold information, to be left alone, and to be free from identification and surveillance.