Notes - Web
Documents and URLs
A URL uniquely identifies a document on the WWW (we omit the structure of URLs in this document). An HTTP request encodes a URL and the corresponding HTTP response contains the document identified by the URL.
The document identified by a URL may be either static or dynamic. A static document is one whose content is defined independently of the request; for example, it may be a file in the storage of the web server. A dynamic document is one whose content is constructed dynamically by the web server while processing the request.
Given a URL that identifies a static document, multiple requests for that URL will always deliver the same document. Given a URL that identifies a dynamic document, multiple requests for that URL may deliver different documents.
The content of a dynamic document may depend on:
- the time instant;
- the content of the request;
- the content of previous requests sent by the same browser (the implementation of this functionality is based on the notion of session, discussed later);
- the username which sent the request (the implementation of this functionality is based on the notion of authentication, discussed later).
Regarding the second and third items, any part of an HTTP request may be used for constructing a dynamic document: the URL; the query string of the URL; the values of request headers. When the URL identifies a dynamic document and contains a query string, each name-value pair of the query string usually represents a textual variable used as input by the program that constructs the response.
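As an illustration of how the name-value pairs of a query string can become textual input variables, here is a minimal Python sketch based on the standard urllib.parse module. The URL and the function name query_variables are hypothetical and chosen only for this example.

```python
from urllib.parse import urlsplit, parse_qsl


def query_variables(url: str) -> dict[str, str]:
    """Return the name-value pairs of the query string as textual variables."""
    query = urlsplit(url).query
    # parse_qsl decodes percent-encoding and '+' and yields (name, value) pairs.
    # If a name appears more than once, dict() keeps only its last value.
    return dict(parse_qsl(query, keep_blank_values=True))


# Hypothetical URL of a dynamic document with two input variables.
variables = query_variables("http://www.example.com/search?q=web+notes&page=2")
print(variables)
```

Every value is a string: a program that needs a number (such as page above) has to convert it explicitly.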
Given a URL, there is no standard way for determining whether the corresponding document is static or dynamic.
More generally, given a URL, there is no standard way for determining how the web server uses the local portion of the URL for identifying the document addressed by that URL. In this respect, a web server may be configured in a variety of ways and those configurations are independent of the HTTP protocol. Common possibilities are:
- The local portion of the URL identifies the pathname of a file to be returned as a static document;
- The local portion of the URL identifies the pathname of an executable file to be run for constructing the dynamic document to be returned;
The mapping between local portion of the URL and file pathname is server-specific. The choice between the two cases may depend on the URL itself.
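One possible server-specific mapping can be sketched as follows in Python. The document root /var/www/site and the convention that URLs under cgi-bin identify executable files are assumptions made only for this example; a real web server may be configured very differently.

```python
from pathlib import PurePosixPath

# Hypothetical configuration of one specific web server.
DOCUMENT_ROOT = PurePosixPath("/var/www/site")
PROGRAM_DIR = "cgi-bin"  # assumed convention: files under this directory are programs


def resolve(local_part: str) -> tuple[str, PurePosixPath]:
    """Map the local portion of a URL to ('static' or 'dynamic', file pathname)."""
    # Ignore empty, '.' and '..' segments so that a URL cannot escape the root.
    segments = [s for s in local_part.split("/") if s not in ("", ".", "..")]
    # The choice between the two cases depends on the URL itself.
    kind = "dynamic" if segments[:1] == [PROGRAM_DIR] else "static"
    return kind, DOCUMENT_ROOT.joinpath(*segments)


print(resolve("/img/logo.png"))
print(resolve("/cgi-bin/search"))
```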
The technical literature often uses the term resource instead of the term document. The term “document” tends to implicitly convey the idea of a piece of text, while the term “resource” is more general and may refer to information of any kind (e.g., photos). We will use these terms interchangeably (of course, we consider only documents that can be accessed through the WWW).
HTTP and TCP
An HTTP request-response pair always travels within the same TCP connection.
A browser can send a request only after receiving the response to the request that it has previously sent. A server sends exactly one response for each request and it never sends a response spontaneously. The content of this paragraph is an oversimplification and more complex behaviors are actually possible. In this course we will only consider this oversimplified description.
A browser that sends an HTTP request may observe that the connection breaks or is closed by the server before receiving the corresponding HTTP response. In this case the browser will not receive the requested document, unless it later sends a new request for the same URL (that request will have to be sent in a new TCP connection with the same server). Indeed, the browser cannot even tell whether the server received the request; this fact is a consequence of the TCP guarantees and is not analyzed in this course.
In practice, a website may be implemented by a very complex infrastructure. We do not enter into these details and assume that a website is implemented simply by a single web server. We also assume that the web server is associated with:
- Storage: a collection of files and/or databases.
- Programs: a set of programs.
Storage and programs are specific to each website.
We model each program as a procedure with an input parameter and an output parameter, the former representing the HTTP request and the latter representing the HTTP response. Each program constructs the response based on the request and on data extracted from the Storage. The programming language used by programmers for writing programs is irrelevant and independent of the HTTP protocol. The same consideration applies to the technology used by the web server for executing programs (which may or may not actually be implemented as a “procedure”).
A web server handles an HTTP request as follows:
- Determine the requested URL.
- If the URL identifies a static document D, then:
  - obtain D from the Storage;
  - construct an HTTP response containing D;
  - send the HTTP response.
- Otherwise (the URL identifies a dynamic document):
  - determine the program P to be executed;
  - invoke P with the HTTP request as input;
  - send the HTTP response returned by P.
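The request handling described above can be sketched in Python as follows. This is only a model under stated assumptions: the Request and Response types, the contents of STORAGE and PROGRAMS, the X-Name header and the 404 response for unknown URLs are all hypothetical and are not prescribed by these notes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    url: str
    headers: dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# Storage: static documents indexed by URL (hypothetical content).
STORAGE = {"/index.html": "<html>Welcome</html>"}


def hello(request: Request) -> Response:
    """A program: a procedure from HTTP request to HTTP response."""
    name = request.headers.get("X-Name", "world")  # hypothetical header
    return Response(200, f"Hello, {name}!")


# Programs: the program to run for each dynamic URL.
PROGRAMS: dict[str, Callable[[Request], Response]] = {"/hello": hello}


def handle(request: Request) -> Response:
    url = request.url                      # determine the requested URL
    if url in STORAGE:                     # static document
        return Response(200, STORAGE[url])
    program = PROGRAMS.get(url)            # dynamic document: determine the program
    if program is None:
        return Response(404, "Not Found")  # assumption: unknown URLs yield an error
    return program(request)                # invoke it and return its response
```

For example, handle(Request("/hello", {"X-Name": "Ada"})) returns a response whose body is "Hello, Ada!", while handle(Request("/index.html")) returns the stored document unchanged.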
A web server may implement multiple websites. In this case the server must have multiple domain names. Upon receiving an HTTP request, the server determines the website based on the value of the Host request header. This ability is of fundamental importance in web hosting services.
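A minimal Python sketch of the selection based on the Host request header is shown below. The domain names and website labels are hypothetical, and the sketch assumes that the Host value is a domain name optionally followed by a port.

```python
from typing import Optional

# Hypothetical websites implemented by the same web server, indexed by domain name.
WEBSITES = {
    "www.alpha.example": "site-alpha",
    "www.beta.example": "site-beta",
}


def select_website(headers: dict[str, str]) -> Optional[str]:
    """Return the website addressed by the request, or None if it is unknown."""
    host = headers.get("Host", "")
    # The Host value may include a port; domain names are case-insensitive.
    hostname = host.split(":", 1)[0].lower()
    return WEBSITES.get(hostname)


print(select_website({"Host": "www.beta.example:8080"}))
```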
A web page is the document visualized by a browser in a tab; the address bar of the browser contains the document URL.
What is visualized by a browser in a tab is typically the result of assembling many different resources fetched by the browser automatically. Each of those resources has its own URL, thus those resources may be located on a number of different servers. Those servers may be in administrative domains that have nothing to do with the server of the web page.
Specifically, the browser:
- fetches the document whose URL is in the address bar;
- analyzes the content of the document and visualizes it appropriately; if the content is HTML then it may contain directives specifying URLs of resources that have to be fetched automatically;
- fetches the resources identified at step 2.
Step 3 may be executed recursively: the fetched resources could force the browser to fetch other resources.
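The recursive fetching can be illustrated with a small Python sketch that uses the standard html.parser module. To keep it self-contained, the network is replaced by a dictionary WEB from URLs to contents; all URLs and contents are hypothetical, only absolute URLs in src attributes of img, script and iframe elements are considered, and a document is treated as HTML simply when it starts with an html tag.

```python
from html.parser import HTMLParser

# Toy "web": URL -> content (hypothetical resources on three different servers).
WEB = {
    "http://w.example/": '<html><img src="http://cdn.example/logo.png">'
                         '<iframe src="http://a.example/frame"></iframe></html>',
    "http://cdn.example/logo.png": "PNG-bytes",
    "http://a.example/frame": '<html><script src="http://a.example/t.js"></script></html>',
    "http://a.example/t.js": "console.log('hi')",
}


class ResourceLinks(HTMLParser):
    """Collect the URLs of resources that must be fetched automatically."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "script", "iframe"):
            self.urls.extend(v for name, v in attrs if name == "src" and v)


def fetch_page(url, fetched=None):
    """Return the list of URLs fetched for visualizing the page at url."""
    fetched = [] if fetched is None else fetched
    if url in fetched or url not in WEB:
        return fetched
    fetched.append(url)                      # fetch the document
    document = WEB[url]
    if document.lstrip().startswith("<html"):
        parser = ResourceLinks()
        parser.feed(document)                # analyze its content
        for resource_url in parser.urls:     # fetch embedded resources, recursively
            fetch_page(resource_url, fetched)
    return fetched


print(fetch_page("http://w.example/"))
```

The single page at http://w.example/ leads to four fetches spread over three servers, which is the situation described above.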
Thus, a web page is perceived by users as a single document with a single URL but is actually constructed with a number of resources, each with its own URL.
The set of resources fetched by the browser for visualizing a web page can be determined in several ways, for example:
- Developer tools of Chromium-derived browsers (e.g., Chrome, Opera, Edge); other browsers have similar features;
- urlquery.net (section HTTP Transactions).
In practice, most web pages are dynamic documents. Resources that compose a web page are, in most cases, static documents.
A website is a collection of web pages whose URLs have the same domain name.
Thus, a website is perceived as being associated with a single server. Being a collection of web pages, however, the resources that compose the website may be distributed across different servers. Some of those servers are usually in administrative domains that have nothing to do with the server of the website.
The term website often refers to the hardware/software infrastructure that implements the collection of web pages rather than to the web pages. In its simplest form, such an infrastructure consists of a web server. The actual meaning of the term website can be inferred from the context.
A program with a graphical user interface (GUI) is logically composed of two modules: a module that implements the GUI and interacts with the user (responds to keyboard and mouse actions, displays information) and a module called backend that implements all the logic and computation. The GUI sends commands to the backend and displays results received from the backend.
Depending on the specific technology used, GUI and backend may be executed either on the same computer or on different computers. When they are executed on different computers, GUI and backend must implement the same protocol. This protocol may be program-specific or may be a set of program-specific conventional rules for using an existing protocol.
A web app is a program where the backend is executed on a web server and the GUI is a browser. Backend and GUI communicate over HTTP. GMail and Google Docs are examples of web apps.
In terms of the web technology, a web app is a website. All pages of the website are dynamic documents.
An organization may offer services for creating and managing websites. These services are called web hosting.
A web hosting service is implemented as a web app that requires authentication. Each authenticated user may create and manage one or more websites. These websites will be located on web servers of the organization that offers the web hosting service. We are only interested in the mechanisms for managing the names of those websites.
A web hosting service that supports custom domains allows defining websites whose domain name is chosen entirely by the user. In practice, the TLD cannot be arbitrary and must be one of those supported by the web hosting service.
A web hosting service that supports non-custom domains allows defining websites whose names have a predefined structure. The most common case is:
- Each website has a unique domain name.
- The user may choose the starting part of the domain name.
- The final part of the domain name is identical for all websites and is chosen by the web hosting service.
Implementation of names
The implementation of non-custom domains does not require any action by the user and is completely performed by the web hosting service within its infrastructure.
The implementation of custom domains requires that the DNS contains a type CNAME RR mapping the website name (chosen by the user) to the name of the web server where the website is located. This RR must obviously be located in the zone of the website name.
The creation of this RR may be done in two ways:
- by the user, if the user manages the corresponding zone;
- by the web hosting service, otherwise.
In the second case, the web hosting service will have to create a zone for the name chosen by the user and then create a type CNAME RR within that zone. The name servers for the zone will usually be name servers of the web hosting service, used for all custom domain websites of that service.
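The effect of the type CNAME RR can be illustrated with a toy model of name resolution in Python. The DNS is reduced to a dictionary of RRs; all names are hypothetical and the address belongs to a range reserved for documentation. A real resolver is considerably more complex.

```python
from typing import Optional

# Toy DNS content: (name, RR type) -> value.
DNS = {
    # Custom domain chosen by the user, mapped to the hosting web server.
    ("www.my-shop.example", "CNAME"): "server7.hosting.example",
    ("server7.hosting.example", "A"): "192.0.2.10",
}


def resolve_address(name: str, max_aliases: int = 8) -> Optional[str]:
    """Follow CNAME RRs until a type A RR is found; None if resolution fails."""
    for _ in range(max_aliases + 1):
        address = DNS.get((name, "A"))
        if address is not None:
            return address
        alias_target = DNS.get((name, "CNAME"))
        if alias_target is None:
            return None
        name = alias_target  # continue the resolution with the canonical name
    return None  # too many aliases: give up instead of looping forever


print(resolve_address("www.my-shop.example"))
```

The browser thus connects to the web server of the hosting service, while the Host request header still carries the custom domain name chosen by the user.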
HTTP Session State
When a web server invokes a program for constructing a dynamic document, the program is actually invoked with two input parameters: the HTTP request and the state of the session of the request.
A session is a sequence of request-response pairs between a browser and a web server. The web server creates sessions, deletes sessions, associates HTTP requests with sessions automatically. The lifetime of a session is chosen by the web server and may range from a few minutes to several months.
Each session has an identifier chosen by the web server and stored in a cookie (we omit the description of cookies in this document). The session identifier must be unique across all the sessions of the web server.
Each session has a state. The session state is defined by programs and stored in the web server. When the web server receives a request containing a cookie, the session identifier in the cookie allows the web server to retrieve the session state and pass that state to the program that will handle the request. If the received request does not contain any cookie, the web server creates a new session.
The session state is an initially empty set of textual variables. The session state is modified only by programs, not by the web server. Programs may insert variables, remove variables, assign and modify the values of variables, and query the values of variables. The mechanism with which a program operates on the session state depends on the programming language and is irrelevant. The session state is shared by all programs in the same website, i.e., a program P1 could store information later used by either the same program P1 or another program P2.
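The interplay between web server and programs can be sketched in Python as follows. The sketch assumes that the cookie value is exactly the session identifier, that sessions are kept in memory in a dictionary, and that a cookie with an unknown identifier is treated like a missing cookie; the names SESSIONS, session_for and count_visits are hypothetical.

```python
import secrets
from typing import Optional

# Web server side: session identifier -> session state (textual variables).
SESSIONS: dict[str, dict[str, str]] = {}


def session_for(cookie: Optional[str]) -> tuple[str, dict[str, str]]:
    """Return (session identifier, session state) for a request."""
    if cookie is not None and cookie in SESSIONS:
        return cookie, SESSIONS[cookie]
    # No cookie (or unknown identifier): create a new session with empty state.
    session_id = secrets.token_hex(16)  # random identifier, unique in practice
    SESSIONS[session_id] = {}
    return session_id, SESSIONS[session_id]


def count_visits(url: str, state: dict[str, str]) -> str:
    """Program side: build a dynamic document that depends on previous requests."""
    visits = int(state.get("visits", "0")) + 1
    state["visits"] = str(visits)  # variables are textual
    return f"{url}: visit number {visits} in this session"
```

A first request without cookie obtains a new identifier; a later request carrying that identifier in its cookie is handled with the same state, so count_visits reports visit number 2.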
The session state allows a program to construct a dynamic document based on:
- the content of previous requests by the same browser;
- the username that sent the request.
To support the first item, a program that processes a request will store in the session state any information that may be necessary for processing future requests by the same browser (i.e., requests in the same session). Such information will be stored in variables of the session state with conventional names.
To support the second item, a program that receives a request containing a correct username-password pair will store in the session state a variable with a conventional name (e.g., user) whose value is the received username. Absence in the session state of a variable with that name implies that the session is not authenticated, i.e., a correct username-password pair has not been provided yet.
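The convention for authenticated sessions can be sketched in Python as follows, with the session state modeled as a dictionary. The credentials are hypothetical and stored in clear text only to keep the example short; a real website would store salted password hashes.

```python
import hmac
from typing import Optional

# Hypothetical credentials (clear text only for brevity).
CREDENTIALS = {"alice": "correct horse"}

USER_VARIABLE = "user"  # conventional name shared by all programs of the website


def login(state: dict[str, str], username: str, password: str) -> bool:
    """Bind the username to the session if the username-password pair is correct."""
    expected = CREDENTIALS.get(username)
    # compare_digest avoids leaking information through comparison time.
    if expected is not None and hmac.compare_digest(expected.encode(), password.encode()):
        state[USER_VARIABLE] = username
        return True
    return False


def current_user(state: dict[str, str]) -> Optional[str]:
    """Return the username of the session, or None if it is not authenticated."""
    return state.get(USER_VARIABLE)
```

Any other program of the website can call current_user on the session state it receives in order to construct a document that depends on the username.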
Note that content, meaning and usage of the session state are determined completely by programs. The programs that implement a website are written based on conventional rules for using the session state.
Windows, Tabs, Incognito mode
All windows of the same browser share all cookies. Different browsers do not share any cookie. Sessions are thus associated with a specific browser on a specific device.
The expiration time of a cookie is set by the website that sent the cookie. A cookie may last for a few minutes or for several weeks or even months. Cookies are thus stored on disk.
Many browsers implement a functionality called incognito. An incognito window starts without any cookie, is completely isolated from any other window, and all its cookies disappear when closing the window. In detail:
- An incognito window does not share any cookie with other windows (whether normal windows or incognito windows): it does not access cookies of other windows; it does not allow other windows to access its cookies.
- All tabs in an incognito window share cookies.
- All cookies of an incognito window are discarded upon closing that window.
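The three rules above can be captured by a very small Python model, in which each window holds a reference to a cookie store. The classes Browser and Window are hypothetical and model only cookie sharing, nothing else.

```python
class Window:
    def __init__(self, cookies: dict, incognito: bool):
        self.cookies = cookies      # shared by all tabs of this window
        self.incognito = incognito

    def close(self):
        if self.incognito:
            self.cookies.clear()    # incognito cookies are discarded on close


class Browser:
    def __init__(self):
        self._shared_cookies = {}   # one store for all normal windows

    def open_window(self) -> Window:
        return Window(self._shared_cookies, incognito=False)

    def open_incognito_window(self) -> Window:
        # Each incognito window starts with its own empty cookie store.
        return Window({}, incognito=True)


browser = Browser()
normal = browser.open_window()
private = browser.open_incognito_window()
normal.cookies["session"] = "id-of-alice"
private.cookies["session"] = "id-of-bob"
print(normal.cookies, private.cookies)
```

The two windows hold different session identifiers for the same website, which is what makes it possible to have two authenticated sessions with different usernames at the same time.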
Authenticated sessions may be created in an incognito window. This functionality may be useful, for example, for accessing a given website with multiple, different usernames: an authenticated session in an incognito window, another authenticated session with a possibly different username in another window.
Note that a website cannot distinguish whether a request was sent from a normal window or an incognito window. A website may keep track of:
- time instant, duration and IP address of each session;
- username of each authenticated session;
This information is available to the website even if the request was sent from an incognito window.
The term “incognito” is thus extremely misleading. It does not imply that web browsing occurs without leaving any traces on web servers (or on the internetwork). It only implies that no traces are left on the browser.
Authentication vs Authorization
In any application protocol, a server must be able to decide whether to execute the processing actions encoded in the received request or to refuse to do so: a server must not execute all requests indiscriminately. There are many ways for satisfying this general requirement. In all cases, it is necessary to solve two distinct problems.
- Determine the username that sent the request (authentication).
- Determine whether that username has the right to perform the requested action (authorization).
These two problems are different and are solved in different ways.
Access Control Lists
A protected resource is associated with a set of usernames called the access control list (ACL) of the resource. A protected resource may be accessed only in the context of authenticated sessions with a username in the ACL of the resource.
When a web server receives a request, it handles the request as follows:
- Determine the requested URL.
- If the URL identifies a protected resource, then:
  - Obtain the username of the session;
  - If the username is null, then:
    - Return an error page for starting the authentication protocol;
  - If the username is not in the ACL of the resource, then:
    - Return an error page specifying that the username is not authorized;
- Construct the response.
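The check described above can be sketched in Python as follows. The session state is modeled as a dictionary with the conventional variable user; the protected URL, the usernames and the use of status codes 401 and 403 for the two error pages are assumptions made for this example.

```python
# Hypothetical ACLs: protected resource -> set of allowed usernames.
ACLS = {"/reports/q1.html": {"alice", "bob"}}


def handle_protected(url: str, state: dict[str, str]) -> tuple[int, str]:
    """Return (status, body) for a request, enforcing the ACL of the resource."""
    acl = ACLS.get(url)
    if acl is not None:                 # the URL identifies a protected resource
        username = state.get("user")    # username of the session, if any
        if username is None:
            return 401, "error page: authentication required"
        if username not in acl:
            return 403, "error page: username not authorized"
    return 200, f"content of {url}"     # construct the response


print(handle_protected("/reports/q1.html", {}))
print(handle_protected("/reports/q1.html", {"user": "alice"}))
```

Note the two distinct outcomes: the first error concerns authentication (no username is bound to the session), the second concerns authorization (the username is known but not allowed).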
ACLs are the key data structure for solving the authorization problem. We do not enter into the details of how ACLs are defined and managed (of course, ACLs cannot be modified by an arbitrary username).
In practice, an ACL allows specifying different sets of allowed actions for each username (e.g., read and write vs read-only). We assume a simplified scenario in which all the usernames in an ACL are allowed to execute all the actions.
ACLs are not specified separately for each resource. Resources are grouped in named sets (realms) with a single ACL for all the resources in the realm.
A realm is associated with the specification of the authentication protocol to use for binding a username to a session. Authentication protocols may be organized in a hierarchy based on the kind of information that must be provided by the user and the threat models assumed. A session authenticated with a certain protocol may require a further authentication step when attempting to access a realm that requires a stronger authentication protocol. In other words, an authentication executed for accessing a realm may not suffice for accessing other realms. In this course we will assume a simplified scenario: once a session has been authenticated according to the authentication protocol of a given realm, no further authentication steps will be required for accessing other realms.
Resources that are not explicitly inserted in a realm are considered to be part of a default realm whose resources can be accessed without any restriction, even in sessions that are not authenticated. Thus, each resource belongs to exactly one realm.
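A minimal Python sketch of realms, under the simplified scenario assumed in these notes (a single authentication suffices for all realms), could look as follows. Realm names, URLs and usernames are hypothetical.

```python
from typing import Optional

DEFAULT_REALM = "default"

# One ACL per realm, not per resource.
REALM_ACLS = {"staff": {"alice", "bob"}, "admin": {"alice"}}

# Resources explicitly inserted in a realm; all others are in the default realm.
RESOURCE_REALM = {"/staff/news.html": "staff", "/admin/users.html": "admin"}


def realm_of(url: str) -> str:
    """Each resource belongs to exactly one realm."""
    return RESOURCE_REALM.get(url, DEFAULT_REALM)


def may_access(url: str, username: Optional[str]) -> bool:
    """username is None when the session is not authenticated."""
    realm = realm_of(url)
    if realm == DEFAULT_REALM:
        return True  # no restriction, even without authentication
    return username is not None and username in REALM_ACLS[realm]


print(may_access("/admin/users.html", "bob"))
```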
The organization associated with a website is often interested in keeping track of the accesses to the website made by browsers. Information of interest includes:
- Country of origin (obtained by geolocating the IP address of the client);
- Device type (value of the User-Agent request header);
- Language (value of the Accept-Language request header).
Such information may be useful for the website as a whole and for each individual web page.
When analyzing accesses to a website, the ability to group accesses made by the same browser is of great interest. This ability is based on the usage of HTTP sessions. Based on this ability, a number of temporal analyses are possible. To mention just a few:
- sequence of web pages;
- time spent on each page;
- whether the same browser returns to the website after some time;
- number of different browsers;
Such analyses can be focused on certain web pages, time intervals, countries of origin and so on.
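Some of these analyses can be sketched in Python on a toy access log. The log format (session identifier, timestamp in seconds, page) and its content are hypothetical; the session identifier is what allows grouping accesses made by the same browser.

```python
from collections import defaultdict

# Hypothetical access log: (session identifier, timestamp in seconds, page).
ACCESS_LOG = [
    ("s1", 0, "/home"),
    ("s2", 5, "/home"),
    ("s1", 40, "/products"),
    ("s1", 100, "/contact"),
]


def pages_by_session(log):
    """Group accesses by session, each group ordered by time."""
    sequences = defaultdict(list)
    for session_id, timestamp, page in sorted(log, key=lambda entry: entry[1]):
        sequences[session_id].append((timestamp, page))
    return dict(sequences)


def time_on_pages(visits):
    """Seconds between each page and the next one in the same session.

    The time spent on the last page cannot be determined from the log alone.
    """
    return [(page, next_ts - ts) for (ts, page), (next_ts, _) in zip(visits, visits[1:])]


sessions = pages_by_session(ACCESS_LOG)
print(len(sessions))                  # number of different browsers
print(time_on_pages(sessions["s1"]))  # time spent on each page by browser s1
```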
HTTP sessions for these analytics purposes must last for several weeks or more, to make sure that a returning browser is indeed detected as such: a browser that returns to the website after the corresponding session has expired will be considered as a new browser, i.e., one that has never visited the website before. The same consideration applies, in general, when a browser returns to the website without any cookie.
In principle, a website could collect analytics information autonomously. In this case the website should implement all the infrastructure for storing and analyzing that information. In practice, websites use analytics services provided by specialized organizations and implemented with web technology.
Let W be the website to be monitored and let A be the website of an analytics service. The key requirements are:
- Whenever a browser sends a request to W, the browser must also send a request to A;
- The request sent to A must enable A to determine the URL requested from W and to group requests from the same browser.
These requirements are implemented as follows.
- A selects an identifier for W, say w-id; this identifier must be unique across all websites monitored by A;
- W must include in all its web pages a snippet of this kind:
<IFRAME src="A-URL?id=w-id"></IFRAME>
Thus, when the browser fetches a web page p1 on W, the browser will also fetch the corresponding IFRAME from A. The HTTP request sent to A will specify the URL of p1 as the value of the Referer request header. The HTTP response sent by A will contain a cookie, which will enable A to recognize future requests from the same browser. This cookie is called a third-party cookie, because it is sent by a site different from the one that the user is visiting (W).
The attributes of the IFRAME are usually set so as to make the IFRAME invisible. Many variations of the second step are possible; for example, the snippet could be an invisible or nearly invisible image stored on A whose URL contains w-id.
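How A might process the request for the IFRAME can be sketched in Python with the standard urllib.parse and http.cookies modules. The cookie name aid, the in-memory list HITS and the function names are hypothetical; cookie attributes such as the expiration time are omitted for brevity.

```python
import secrets
from http.cookies import SimpleCookie
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Analytics information collected by A: (w-id, browser identifier, page URL).
HITS: list[tuple[str, str, str]] = []


def cookie_value(cookie_header: str, name: str) -> Optional[str]:
    """Extract one cookie from the value of the Cookie request header."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return jar[name].value if name in jar else None


def handle_beacon(url: str, headers: dict[str, str]) -> dict[str, str]:
    """Process a request for the IFRAME; return the response headers."""
    w_id = parse_qs(urlsplit(url).query).get("id", [""])[0]  # monitored website
    page_url = headers.get("Referer", "")                    # page visited on W
    browser_id = cookie_value(headers.get("Cookie", ""), "aid")
    response_headers = {}
    if browser_id is None:
        # First request from this browser: send the third-party cookie.
        browser_id = secrets.token_hex(8)
        response_headers["Set-Cookie"] = f"aid={browser_id}"
    HITS.append((w_id, browser_id, page_url))
    return response_headers
```

A browser that visits two pages of W sends two requests to A; the second one carries the cookie received with the first response, so both entries in HITS have the same browser identifier and A can reconstruct the sequence of pages.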
The key advantage of this framework is that W does not need to store any analytics information and can analyze that information with specialized web apps already developed for this purpose by A.