The Big Brother I
This material is no longer part of the course.
I think it is quite important for understanding many issues in our everyday life. Thus, I have decided to leave it publicly available.
User tracking and user profiling
User tracking is the act of collecting the sequence of URLs visited by a user. User profiling is the act of constructing a model of a user (called a profile) based on information about that user. The information obtained by user tracking is one of the many ingredients used for user profiling. User profiling is widely used for making decisions about users. A typical decision is choosing which content to display or recommend to a user, given his/her profile.
User tracking and user profiling are pervasive phenomena on the web, with strong and deep effects on society. We consider only the technical aspects of these activities.
User tracking is implemented in many ways. One of the most pervasive mechanisms is identical to the one used for site analytics: a website that allows its users to be tracked must contain resources located on other websites and fetched automatically by browsers; returning browsers are detected by means of cookies (called third-party cookies).
There are many different interaction patterns between the actors involved in user tracking and user profiling. We consider a simple model that captures the essential aspects of those interactions, as follows.
- Tracking organizations perform user tracking.
- Profiling organizations:
  - construct a model of each user (called a profile) based on information provided by tracking organizations;
  - offer a service for displaying a given content to users that match a specified profile.
- Advertising organizations use the service offered by profiling organizations, i.e., they specify a content and their desired profile of users that should see that content.
We do not detail what a profile actually is and leave this notion defined intuitively.
Behavior and incentives of the actors involved are driven by money:
- advertising organizations pay profiling organizations (for displaying content to users matching the desired profile);
- profiling organizations pay tracking organizations (for obtaining information useful for profiling);
- tracking organizations pay websites (for allowing user tracking);
- profiling organizations pay websites (for displaying content).
The money that a website receives for allowing user tracking and displaying content selected by profiling organizations is often the only reason why a website exists. From a different point of view, users do not pay for “free” websites with money. Users pay these websites by giving them the right to track their browsing activity.
In reality, an organization may play multiple roles of the model assumed here. Google, Facebook and Twitter, for example, are both tracking organizations and profiling organizations; the tracking information that they collect is not provided to other organizations. Amazon is a profiling organization and an advertising organization: it decides which item to recommend to a user (i.e., which content to display) based on the profile of that user. Tracking, profiling and advertising activities that occur in practice follow a number of different and complex patterns. Mapping all those patterns to the model assumed here may be difficult. The same consideration applies to the flow of money and incentives.
The Big Brother(s)
Nearly all websites include resources that are located on other websites and whose sole purpose is user tracking. Furthermore, a website usually includes resources from many tracking organizations. For example, the web magazine “Vice” includes resources from 85 tracking organizations whose name begins with “A” (the interested reader may count the remaining ones on their own). The equivalent figure for “Triesteprima” is 35.
A profiling organization may obtain data from many tracking organizations. A profiling organization may cooperate with other profiling organizations in order to augment the amount of information about a user thereby increasing the accuracy of the corresponding profile.
There is a myriad of tracking, profiling, advertising organizations. In practice, every action that a user performs on any website or Internet-connected device is tracked by many different organizations and is used for profiling that user.
The term “action” must be understood in its widest possible meaning, not only as the act of visiting a URL. Search queries, videos watched, songs listened to, “likes”, comments on posts, purchases on ecommerce sites or on PayPal, vocal orders to smart loudspeakers, locations visited with a GPS-enabled smartphone, installation and usage of smartphone apps are just some of the many meanings of the term action. In this course we will consider only browsing histories, that is, visited URLs.
The General Data Protection Regulation (GDPR) in Europe dictates that each website must inform its users of the corresponding tracking organizations and of the usage made by those organizations of the corresponding information.
Tracking organizations: Browsing histories
Let T be a tracking organization. Let W be a website whose visits are to be tracked by T. The key requirements for implementing user tracking are identical to those for implementing analytics and can be implemented in the same way, that is, by including in all web pages of W a resource to be fetched from T automatically.
Functionalities of the form “share”/ “like” / “tweet” are usually implemented with resources that allow user tracking (e.g., an image on a web page of W that the browser is instructed to fetch from T).
The tracking resources of T cannot be present on all websites. Thus, any tracking organization may collect only an approximation (a subsequence) of the sequence of URLs visited by a user. The accuracy of the approximation depends on the set of websites tracked by T. This set should be as large as possible and should contain websites visited by as many users as possible.
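As a rough illustration of why T observes only a subsequence, the following Python sketch filters the full sequence of URLs visited by a user down to those hosted on websites that embed a resource of T. The function name and the example hostnames are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse


def observed_history(visited_urls, tracked_sites):
    """Return the subsequence of visited_urls that T can observe.

    tracked_sites is the set of hostnames whose pages embed a resource of T.
    """
    return [url for url in visited_urls
            if urlparse(url).hostname in tracked_sites]


# Hypothetical example: T is present on two of the three visited websites.
visited = [
    "https://news.example/a",
    "https://bank.example/login",
    "https://shop.example/cart",
    "https://news.example/b",
]
tracked = {"news.example", "shop.example"}
observed = observed_history(visited, tracked)
```

The larger the set of tracked websites, the closer the observed subsequence gets to the full sequence.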
Some tracking organizations are able to track visits to a very large portion of the WWW and are thus able to collect very good approximations of the sequences of URLs visited by users. Google and Facebook are examples of those organizations.
The information collected by a tracking organization can be modelled as a set of browsing histories. A browsing history is a sequence of URLs coupled with a cookie identifier. This identifier is not necessarily associated with a username at some website.
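The model of browsing histories described above can be sketched in Python as follows. This is a minimal toy model, not a description of any real tracker: the class and method names are hypothetical, and we assume that T learns the URL of the embedding page from the Referer request header (or an equivalent parameter) and that it creates a random cookie identifier on the first request from a browser.

```python
import uuid
from collections import defaultdict


class Tracker:
    """Toy model of a tracking organization T (all names are hypothetical)."""

    def __init__(self):
        # cookie identifier -> sequence of URLs of pages that embed T's resource
        self.histories = defaultdict(list)

    def handle_request(self, referer, cookie=None):
        """Handle a fetch of the resource of T embedded in a page of W.

        referer: URL of the page that embeds the resource (assumed to be
                 available to T, e.g. through the Referer header).
        cookie:  third-party cookie sent by the browser, if any.
        Returns the cookie value that the response would set.
        """
        if cookie is None:
            cookie = uuid.uuid4().hex  # new browser: create a new identifier
        self.histories[cookie].append(referer)
        return cookie


tracker = Tracker()
c = tracker.handle_request("https://w1.example/news")            # first visit
c = tracker.handle_request("https://w2.example/shop", cookie=c)  # returning browser
history = tracker.histories[c]
```

Note that the history is keyed only by the cookie identifier; no username is involved.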
Profiling organizations: Making decisions
Let P be a profiling organization. P receives an HTTP request and selects the specific content to include in the HTTP response based on the profile of the user that sent that request. The content may be displayed either on P or on other websites that host resources provided by P (that is, the HTTP request may be sent by a browser that is visiting either P or another website).
The content is a resource that may be located either on P or on the website of an advertising organization (in which case P will send a redirection response).
Choosing a content based on a user profile is a form of decision about the corresponding user. P constructs a user profile from all the information available about that user. The browsing history obtained from tracking organizations is just one of the many ingredients that may be used by P. The accuracy of a user profile depends, in general, on the quantity and nature of the available information. Intuitively, the more information is available, the more accurate the profile. We do not analyze how information can be actually exploited for constructing user profiles (machine learning is an important tool in this area). This is a complex and broad topic that is not part of this course. Similarly, we do not analyze how to choose a content based on a profile.
Let T be a tracking organization that provides browsing histories to P. A fundamental problem consists in linking the cookie identifier used by T to the cookie identifier used by P. We consider two cases:
- P and T are the same organization.
- P has a local copy of the browsing histories collected by T.
In case 1 (P and T are the same organization) the linking procedure is trivial because the cookie identifiers at P and at T are the same: the browser sends a request to P; if the request does not contain a cookie then T has no history available; otherwise, the cookie identifies the history at T to be used for making the decision.
In case 2 (P has a local copy of the browsing histories collected by T) the linking procedure may be realized as follows (this procedure is often called cookie syncing).
- The browser sends a request to P (this request contains cookie-p).
- P forces the browser to fetch a resource from T.
- The browser sends a request to T (this request contains cookie-t).
- T redirects the browser to a predefined URL in P (i.e., with a Location response header); T will include in the redirection URL a query string containing the value of cookie-t.
- The browser sends a request to P (because the browser has to follow the redirection); this request contains cookie-p as a request header and cookie-t in the query string of the requested URL.
As a result of the last request, P is able to match cookie-p with cookie-t (of course, if cookie-t has been just created then there will be no history available for that cookie).
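The cookie syncing steps above can be simulated with a short Python sketch. It is only an illustration under stated assumptions: the URL `https://p.example/sync`, the query parameter name `tid`, the function names and the cookie values are all hypothetical, and the browser is not modelled explicitly (its only role here is to follow the redirection and attach cookie-p to the request sent to P).

```python
from urllib.parse import urlencode, urlparse, parse_qs

P_SYNC_URL = "https://p.example/sync"  # hypothetical predefined URL in P


def t_handle_request(cookie_t):
    """T: redirect the browser to P, carrying cookie-t in the query string."""
    location = P_SYNC_URL + "?" + urlencode({"tid": cookie_t})
    return 302, {"Location": location}


def p_handle_sync(url, cookie_p, sync_table):
    """P: read cookie-t from the query string and link it to cookie-p."""
    query = parse_qs(urlparse(url).query)
    cookie_t = query["tid"][0]
    sync_table[cookie_p] = cookie_t
    return cookie_t


sync_table = {}  # cookie-p -> cookie-t, kept by P

# The browser requests the resource from T (sending cookie-t) ...
status, headers = t_handle_request(cookie_t="t-123")
# ... and follows the redirection to P (sending cookie-p as a cookie).
p_handle_sync(headers["Location"], cookie_p="p-abc", sync_table=sync_table)
```

After the last call, P can look up the browsing history that T associated with `t-123` whenever it receives a request carrying `p-abc`.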
Note that browsing histories need not be associated with the real identity of the user to be useful. All that is needed is the ability to link cookie-t (identifier of a browsing history) to cookie-p (identifier of the user about which a decision is made).
User profiling: more data
Some profiling organizations associate usernames with users and implement authenticated sessions. The username of a user may not have any relation to the real identity of the user. We denote those organizations as pseudonymous profiling organizations. Google, Apple, Facebook and Twitter are examples of those organizations.
Pseudonymous profiling organizations construct a profile of each username. The profile is constructed based on the available browsing histories of that username and on any other information that can be linked to that username. Some information is often provided by users voluntarily (for example, age, country, preferred language, email address, family composition, job, marital status and alike). Other information is collected by the profiling organization automatically.
Many pseudonymous profiling organizations collect information about usernames by offering “free” services and observing the usage of those services by each username. Google, Apple, Facebook, Twitter are all examples of those organizations. Users do not pay for these “free” services with money. Users pay these services by giving them the right to observe their usage of those services and exploit the corresponding information.
Real identity of users
Note that user profiles need not be associated with the real identity of the user to be useful. All that is needed is the ability to link the user profile used for making a decision (choosing the content to show) to the user to which that decision will be applied (which user will see that content). Whether the user is identified with a cookie, a username, a pseudonym or a real identity is irrelevant.
Some organizations are potentially able to determine the real identity of their users, because of the quantity and nature of information they have available for each username. Google and Facebook are the key examples. Note that much of such information is provided by users voluntarily, perhaps without realizing it.
The fact that an organization may potentially link usernames to real identities does not imply that it actually does. In fact, an organization usually has no reason to do so. The interest of the organization is collecting as much information about each user as possible, because this potentially improves the quality of its decisions. Real identities are not necessary and provide very little additional value, if any.
Even more data
A profiling organization may cooperate with other profiling organizations in an attempt to augment the information available for constructing user profiles. To this end, the cooperating organizations must be able to detect the respective sets of information that may be associated with the same user.
Many heuristics are used for this purpose. These heuristics are not perfect: they may fail to detect all information of the same user and may attribute pieces of information to the same user erroneously. Despite their imperfection, these merging heuristics are highly useful and widely used.
A simple, common and very powerful heuristic is based on the email address. Many websites require users to provide their email address. Users often provide their email address to many different organizations. Those organizations can match the respective information based on the email address (that is, org-1 has some information associated with email-address-x; org-2 has other information associated with the same email-address-x; should org-1 and org-2 cooperate, they would be able to merge the respective pieces of information associated with email-address-x). Facebook is one of the profiling organizations that make use of this heuristic.
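The email-based heuristic can be sketched in Python as a join on the email address. This is a minimal illustration with invented data; the function names are hypothetical, and normalizing the address by trimming whitespace and lowercasing it is an assumption made here for the example, not something a specific organization is known to do.

```python
def normalize(email):
    """Assumed normalization: ignore surrounding whitespace and letter case."""
    return email.strip().lower()


def merge_by_email(org1, org2):
    """Merge the information of two organizations for the addresses they share.

    org1 and org2 map an email address to a dict of information about the user.
    """
    a = {normalize(e): info for e, info in org1.items()}
    b = {normalize(e): info for e, info in org2.items()}
    return {email: {**a[email], **b[email]} for email in a.keys() & b.keys()}


# Invented example data for org-1 and org-2.
org1 = {"alice@example.com": {"age": 30}}
org2 = {
    "Alice@Example.com ": {"interests": ["cycling"]},
    "bob@example.com": {"country": "IT"},
}
merged = merge_by_email(org1, org2)
```

Only addresses known to both organizations are merged; like the other heuristics, this one can both miss matches and produce wrong ones (for example, when an address is shared by several people).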
Another class of heuristics is based on advertising identifiers. These identifiers are meant to identify each user uniquely across all organizations, in order to simplify matching information about the same users. The technical details by which advertising identifiers are exchanged across organizations depend on the specific scenario. Each Android smartphone has an advertising identifier. All Android apps associate each installation with the corresponding advertising identifier. Data collected by those apps, advertisements displayed by those apps, payments and purchases made by those apps are all linked by these identifiers. The same occurs in the iPhone ecosystem, with a different identifier.
User tracking: more data
Merging browser histories
A browsing history is associated with a cookie identifier. A tracking organization may have many different histories that are actually originated by the same user:
- A user may use multiple devices.
- A cookie may expire.
- A browser may discard cookies.
- Each incognito window generates its own history, different from the normal window and different from other incognito windows.
A tracking organization may attempt to detect histories originated by the same user in order to merge those histories together. The resulting history will provide much more information about the corresponding user.
Tracking organizations use many heuristics for attempting to detect histories of the same user. These heuristics are not perfect: they may fail to detect all histories of the same user and may attribute different histories to the same user erroneously. Despite their imperfection, these merging heuristics are highly useful and widely used.
The heuristics for merging browser histories are mostly based on:
- Detecting histories that originated from the same device. Histories from the same device are merged together.
- Detecting histories that can be attributed to the same username. Histories from the same username are merged together.
We describe some of these heuristics in the following sections. We do not enter into the details of whether these heuristics are legally permitted and only describe their technical feasibility.
Browser fingerprinting
Browser fingerprinting is a heuristic for detecting histories originated from the same browser.
A tracking organization T may collect several pieces of information about the browser that sent the HTTP request (browser type, preferred language, timezone, operating system version, geographical region of the IP address, screen resolution, sets of installed fonts and so on). This information may be collected from the HTTP request headers and by scripts sent by T for execution on the browser. This set of information can be synthesized in a compact form called a fingerprint.
A fingerprint often identifies a browser uniquely, in the sense that:
- the fingerprint of a browser will change very rarely;
- any two browsers that contact T will have different fingerprints.
T associates each history with the fingerprint of the corresponding browser and merges all histories with the same fingerprint.
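The merging step can be sketched in Python as follows. This is a deliberately simplified illustration: the attribute names and values are invented, the function names are hypothetical, and computing the fingerprint as an exact hash of a handful of attributes is an assumption made for brevity (real fingerprinting uses many more signals).

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(attributes):
    """Synthesize browser attributes into a compact, deterministic identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_by_fingerprint(histories):
    """Merge all histories whose browsers have the same fingerprint.

    histories is a list of (cookie_id, attributes, urls) tuples.
    """
    merged = defaultdict(list)
    for _cookie_id, attributes, urls in histories:
        merged[fingerprint(attributes)].extend(urls)
    return dict(merged)


# Invented attributes of two browsers.
desktop = {"browser": "Firefox", "language": "it-IT",
           "timezone": "Europe/Rome", "screen": "1920x1080"}
laptop = {"browser": "Chrome", "language": "en-US",
          "timezone": "Europe/Rome", "screen": "1366x768"}

histories = [
    ("cookie-1", desktop, ["https://a.example/"]),
    ("cookie-2", desktop, ["https://b.example/"]),  # same browser, new cookie
    ("cookie-3", laptop, ["https://c.example/"]),
]
merged = merge_by_fingerprint(histories)
```

The two histories of the first browser end up in a single merged history even though they carry different cookie identifiers.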
Consider histories that are not associated with any form of username (histories associated with a username are analyzed in the next section). Histories with the same fingerprint will be merged together even if the corresponding browser has been used by different users. This fact may harm the quality of the resulting history. However, a browser tends to be used by a single user most of the time, thus the overall quality of all the available histories is much improved.
Similar techniques can be used for device fingerprinting, which allows merging histories from all browsers on the same device. Device fingerprinting is technically simple to do on smartphones.
Unexpected effects
This heuristic is very powerful because it allows merging all histories from the same browser.
In particular, a user cannot prevent the merging either by discarding cookies or by using the incognito mode. Histories of an incognito tab can thus be coupled with histories of normal tabs and the user cannot prevent this fact.
Authenticated histories
Consider histories collected by a tracking organization that is also a pseudonymous profiling organization. Google, Facebook, Apple and Twitter are examples of those organizations.
Let uP denote one such organization. While a user is authenticated on uP with username U, uP will associate the corresponding history with U. uP will thus have two kinds of browsing histories:
- histories associated with a cookie identifier (anonymous histories);
- histories associated with a cookie identifier and a username (authenticated histories).
The organization will group all authenticated histories of the same username together.
Authenticated histories of the same username may be originated on different devices. Common scenarios in this respect are the following:
- PC and Android smartphone (Google)
- Mac and iPhone smartphone (Apple)
- PC and dedicated app installed on a smartphone (Facebook, Twitter, Spotify and so on).
In all these cases, uP collects a single authenticated history across multiple devices of the same user.
Unexpected Effects
When a user logs in on uP from a certain device, uP may be able to attribute to the corresponding username one or more anonymous histories of that device. In other words, uP may track even URLs visited while not logged in on uP and even in browsing sessions that have already terminated. The details are explained below.
Consider an anonymous history. If the user logs in on uP, then the history becomes authenticated and associated with a username. Note that:
- The history will include all the URLs, even those visited before authenticating.
- Even if the user logs out from uP, uP may keep the session open (the cookie may remain valid); in this case uP may continue to record the visited URLs even after logging out.
Consider a set SH of anonymous histories on the same browser (those histories could correspond to a set of navigations executed in incognito mode tabs already closed). Assume uP implements browser fingerprinting. In this case uP is able to merge all the histories in SH together. If the user logs in on uP at least once on that browser, then uP may associate all these histories with the username.
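The scenario just described can be sketched in Python with a toy model of uP. Everything here is hypothetical (class name, method names, cookie and fingerprint values); the sketch only assumes that uP records a fingerprint for each cookie identifier and that, upon login, it attributes to the username every history with the same fingerprint.

```python
class PseudonymousTracker:
    """Toy model of uP (all names are hypothetical)."""

    def __init__(self):
        self.urls = {}            # cookie -> sequence of visited URLs
        self.fingerprint_of = {}  # cookie -> browser fingerprint
        self.username_of = {}     # cookie -> username (authenticated histories)

    def visit(self, cookie, fingerprint, url):
        """Record a visited URL in the history identified by cookie."""
        self.urls.setdefault(cookie, []).append(url)
        self.fingerprint_of[cookie] = fingerprint

    def login(self, cookie, fingerprint, username):
        """Attribute to username every history with the same fingerprint,
        including histories recorded before the login."""
        self.urls.setdefault(cookie, [])
        self.fingerprint_of[cookie] = fingerprint
        for other, other_fp in self.fingerprint_of.items():
            if other_fp == fingerprint:
                self.username_of[other] = username

    def history_of(self, username):
        """Return all URLs attributed to username, across cookies."""
        return [url
                for cookie, urls in self.urls.items()
                if self.username_of.get(cookie) == username
                for url in urls]


up = PseudonymousTracker()
up.visit("incognito-1", "fp-A", "https://x.example/")   # incognito tab, closed
up.visit("incognito-2", "fp-A", "https://y.example/")   # another incognito tab
up.visit("other-browser", "fp-B", "https://z.example/")
up.login("normal-1", "fp-A", "U")                       # login on the same browser
attributed = up.history_of("U")
```

After the login, both incognito histories are attributed to U, while the history of the other browser remains anonymous.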
Other tracking techniques
Google may collect the full browsing history of Chrome users and associate those histories with the corresponding Google username. The above observations about anonymous histories obviously apply to this case.
Websites that contain links to other websites may construct browsing histories by recording clicks on those links (the links will be implemented by a redirection from the website). Facebook, Twitter, websites for URL shortening and search engines may exploit this technique.
The above techniques do not require the presence of any resource on the tracked websites.
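The link redirection technique mentioned in this section can be sketched in Python as follows. The URL `https://site.example/out`, the query parameter name `u` and the function names are hypothetical; the sketch assumes that the website rewrites each outgoing link so that it points to itself and then answers the click with a redirection to the real target.

```python
from urllib.parse import urlencode, urlparse, parse_qs

REDIRECTOR = "https://site.example/out"  # hypothetical redirection endpoint


def wrap_link(target_url):
    """Rewrite an outgoing link so that the click passes through the website."""
    return REDIRECTOR + "?" + urlencode({"u": target_url})


def handle_click(url, cookie, click_log):
    """Record the clicked target for this cookie, then redirect to it."""
    target = parse_qs(urlparse(url).query)["u"][0]
    click_log.setdefault(cookie, []).append(target)
    return 302, {"Location": target}


click_log = {}  # cookie -> sequence of clicked target URLs
wrapped = wrap_link("https://other.example/page?a=1")
status, headers = handle_click(wrapped, "c-1", click_log)
```

The target website does not need to embed any resource: the click is observed before the browser reaches it.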
Influencing user behavior (A Big Problem)
Some users may be concerned that an organization can determine their real identity. Users should be much more concerned that an organization has a lot of information that can be used to make decisions about them and try to influence their behavior.
For example, a profiling organization could know that I am particularly interested in a certain illness and decide to show me positive but unfounded articles about that illness. Or, the organization could know that I am particularly interested in a certain potential problem and decide to show me articles claiming that the problem will never occur; or that the problem will certainly occur. Or, the organization could know that I have a specific political orientation and decide to show me only content that consolidates my orientation (for example, by making me angry at specific groups of people and the like).
The profiling organization could even perform those actions as a service, requested and paid for by other advertising organizations. These could be, for instance, organizations that act before political elections and represent the interests of foreign countries.
Note that these examples are not hypothetical. Also note that these examples do not require the ability to identify users by their real identities. All that is needed is the ability to link the user profile used for making a decision (choosing the content to show) to the user to which that decision will be applied (which user will see that content). Whether the user is identified with a cookie, a username, a pseudonym or a real identity is irrelevant.
The issues briefly outlined in this section have strong and deep implications for modern societies, not widely understood yet.