How a web page request makes it down to the metal

The other day I was interested in how many steps occur between sending a POST or GET request from a website and the actual processing that happens on the CPU of the server. I figured that I knew bits and pieces of the puzzle but I wanted to see the complete path, from the highest levels of abstraction all the way to the lowest, without missing anything too big in between. It turns out that in a modern web system there are a lot of steps. I found this fascinating, much like an explorer who wants to find a path from one known place to another. If you are interested in better understanding how your computer works you might find walking along this path with your own tech stack helpful.

One thing that’s always been very obvious to systems programmers is the amount of abstraction that exists between applications written in high level languages and the processing that happens on the metal. We have these fantastic frameworks and abstractions that save us from details we are not directly concerned with, and the fact that these frameworks have good APIs for their abstractions is a large part of why they make us so productive. However, if like me you have ever wondered what the abstractions are doing behind the scenes, there’s always a sense of wonder and a burning desire to know how it all really works. This post is an attempt at going from the highest level of abstraction, seeing a website with our own eyes, all the way down to the lowest level, the electrons. The whole journey is long, so if an area doesn’t interest you feel free to skip over it. If you do skip, take a moment to enjoy the wonders of the abstractions that modern software offers: you can still be productive without knowing all the details you just skipped.

To make this tangible this article uses real code! This can be found on JaggedVerge’s GitHub page; you may wish to look at the source code for this article if you would like to interact with the code directly.

Overview:

Here are all the steps from the web browser down to the electrons that are covered in this article. This index is arranged in the order in which the abstractions are layered in the real code:

The overall “product” – the highest level of abstraction

Specification: The user is supplied with a webpage where they can enter in 2 numbers and find the result of adding those numbers. The addition itself must be performed on the server and the results will be presented to the user via a results page.

To do this we have a web server running the Django web framework and the simplest code possible that lets this all happen (hopefully without hurting the educational value of this article). To keep this article manageable we are aiming for a minimal amount of conceptual complexity in order to focus on the nature of the layers of abstraction, in particular giving some insight into the less commonly thought about layers. We are avoiding some abstractions that we would use in real projects, as these would increase the number of layers in this article and the conceptual load on the reader. (The reason we use those abstractions in real projects is that they reduce the overall complexity of the implementation and make us more productive.) So infrastructure as a service, containers, microservices, and other potentially useful techniques that introduce extra layers will not be covered here. The code for this article is a minimal set that provides a window into the deep stack of abstractions we encounter with modern web apps.

Prelude – GET request

Before anything else we have to actually retrieve the page we want to view in our web browser. We navigate to the URL we want and the browser submits a GET request to the server asking for the page at that URL.

Because we are being good with our web standards, the GET request only starts the process of retrieving the data in the page; nothing that changes state is carried out at this point.

We send out a request such as this:

Host: 127.0.0.1:8000
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: csrftoken=RAV15luEfhxVAqHRjT1FAUtKzPEizBjOTuPElo3T76fnx1v5tLOWL4sy8J1RTKOj
Connection: keep-alive
Upgrade-Insecure-Requests: 1

To which we get back a response with status code 200 (indicating everything is OK) and the following headers:

Date: Mon, 11 Sep 2017 04:21:03 GMT
Server: WSGIServer/0.2 CPython/3.5.2
Vary: Cookie
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Content-Length: 349
Set-Cookie: csrftoken=RAV15luEfhxVAqHRjT1FAUtKzPEizBjOTuPElo3T76fnx1v5tLOWL4sy8J1RTKOj; expires=Mon, 10-Sep-2018 04:21:03 GMT; Max-Age=31449600; Path=/

This response has the payload of the HTML that gets rendered in the browser.

Both these raw headers can be found via the browser’s web-developer tools. The payload can be seen with the usual view-source command:

<html>
<body>
<h1> Add 2 numbers! </h1> 
<form method="post">
    <input type='hidden' name='csrfmiddlewaretoken' value='8EFi3KDGaWVcLAxwcFTQTZ3LSuTXI3HVayzVjNcV2LDEIblKmxG7492zrogw2ccq' />
    <tr><th><label for="id_number1">number1:</label></th><td><input type="number" name="number1" required id="id_number1" /></td></tr>
<tr><th><label for="id_number2">number2:</label></th><td><input type="number" name="number2" required id="id_number2" /></td></tr>
    <input type="submit" value="Submit" />
</form>
</body>
</html>

We won’t travel through all the layers on this first web request; instead we’ll wait for the POST request to travel down the layers. There are already a few hints of things to come, such as the form we see on the page and the mention of WSGI and CPython in the headers.

Frontend – processing on the users machine

At this stage we are dealing with the processing of the information on the user’s machine. When we talk about the frontend we are referring to the category of things that occur on the user’s machine, which generally speaking means the things that happen inside the user’s web browser.

Page rendering

So now that the page is actually loaded it has to get rendered from markup to a visual representation on our screen where we can enter in some inputs.

Even in the case where our code has no JavaScript we need to invoke the layout engine of the browser in order to get a rendered format of the page on our screen. With the complexities of the modern web there is a huge amount of detail within this step.

Unfortunately I’m not familiar with the internals of these, contributions to this article at this step are greatly appreciated!

Once the button to submit the form is clicked an event fires that eventually leads to a POST request being sent to the URL specified in the form’s action (here the form has no action attribute, so it posts back to the current URL).

POST request

The data in the form gets URL-encoded and placed into the body of a POST request to send to the server.

The encoded data looks something like number1=1&number2=1 (plus the CSRF token) and, because this is a POST, it travels in the request body rather than in the URL.

The web browser then hands this off to the networking stack to be sent to the server. At this point we are entering the network stack.
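Put together, the request on the wire looks roughly like the following. (This is a hand-written sketch rather than a capture, so the exact header values and lengths will differ; note the form data travelling in the body.)

POST / HTTP/1.1
Host: 127.0.0.1:8000
Content-Type: application/x-www-form-urlencoded
Content-Length: 104
Cookie: csrftoken=RAV15luEfhxVAqHRjT1FAUtKzPEizBjOTuPElo3T76fnx1v5tLOWL4sy8J1RTKOj

csrfmiddlewaretoken=8EFi3KDGaWVcLAxwcFTQTZ3LSuTXI3HVayzVjNcV2LDEIblKmxG7492zrogw2ccq&number1=1&number2=1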

Sidenote: What is the CSRF token?

You may have noticed that in this form there’s a CSRF token that is included with the Django form. This is to prevent a type of security vulnerability called Cross-site request forgery (CSRF) that exploits the trust that a site has in a user’s browser.

See: https://en.wikipedia.org/wiki/Cross-site_request_forgery

Network stack – From frontend to backend

The request has to make it from a client computer over to the server. (Even if we are running the code locally we still use the networking stack, although a bunch of layers are bypassed when we use localhost. Historically this is because localhost sockets have needed to be fast, so special support was made specifically for this case; using localhost bypasses the local network interface hardware.) As evidenced by the fact that people have entire careers in network administration and engineering, there’s quite a lot that goes on to allow network communications between computers. As this post is mostly about programming we are going to mostly gloss over this; the OSI model is a really good conceptual summary of how the networking stack works.

Intermezzo – the internet

One of the amazing things about the Internet is the way in which you can resolve domain names to IP addresses. This takes place via the Domain Name System (DNS). Once you have those IP addresses you then need to get the packets from your address to the other address, and this is where routing comes into the picture. Creating connections that span multiple networks and routers requires routing protocols; these are how the routers learn the topology of the network and decide where to send each packet next. This is a broad topic, as evidenced by the people whose main profession is dealing with these types of things.

If you are interested in understanding how routing works have a look at some of these pages:

  • https://en.wikipedia.org/wiki/Distance-vector_routing_protocols
  • https://en.wikipedia.org/wiki/Link-state_protocol

TCP/IP

Assuming this is connecting over the internet we would have to actually send the request over to the server. In this example we are connecting to localhost, so a lot of the networking stack is bypassed. We don’t need to resolve any domain names, so we can gloss over DNS, and we can gloss over routing because 127.0.0.1 (or ::1 in IPv6) is a well-known special address that always resolves to localhost.

Then we have to open a TCP/IP connection to the server’s IP address. Again, because we are using localhost we won’t be needing NAT or anything like that; we assume a direct connection can be made. We also won’t assume anything complicated is going on in the connection: no SSL, no VPNs or similar.

Once we open the socket we can start sending packets over. There’s the SYN / SYN-ACK / ACK handshake and then the connection is established; you can see all of this in something like Wireshark. At this point the data is encoded into a format that can be sent over the wire and then transmitted. Because it’s TCP/IP, from an application level there’s a lot we don’t need to concern ourselves with: data will be re-transmitted/rerouted/reordered without us needing to know or care (unless something breaks).
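To make the application-level view concrete, here is a small Python sketch (not part of the article’s project code) that opens a TCP connection to the development server on 127.0.0.1:8000 and sends a bare-bones GET request by hand; the handshake, retransmission and reordering all happen beneath these few calls:

import socket

# Open a TCP connection to the local development server (assumed to be
# listening on 127.0.0.1:8000) and speak just enough HTTP to get a response.
with socket.create_connection(("127.0.0.1", 8000)) as sock:
    sock.sendall(b"GET / HTTP/1.1\r\nHost: 127.0.0.1:8000\r\nConnection: close\r\n\r\n")
    response = b""
    while True:
        chunk = sock.recv(4096)  # read until the server closes the connection
        if not chunk:
            break
        response += chunk

print(response.decode("utf-8", errors="replace"))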

Intermezzo – missing steps from the network

We deliberately glossed over a few steps before by using localhost to keep things simpler to follow. If we were using a real interface a few things would happen between the electromagnetic signal coming across the network into the network adapter and the data being available to applications on the server. These include but are not limited to:

  • Binary data is encoded and sent across the wire then decoded at the receiving end.
  • The NIC reads incoming packets and places contents into the incoming hardware buffer.
  • Network card driver:
    • Validates incoming packet
    • Copy packet from hardware buffer to operating system
  • When a packet is complete, the driver wraps it in the appropriate data structure for the operating system (on Linux this is an sk_buff; other operating systems have their own data structures) and raises an interrupt to notify the kernel
  • Operating system services interrupt
    • The Ethernet layer checks if the packet is valid, then removes the Ethernet header
    • The IP layer also checks if the packet is valid by verifying the checksum; if the packet is to be handled locally it removes the IP header
    • The TCP layer handles all the aspects required to get a reliable stream of data across, including re-ordering and re-transmission of packets

If you are interested in more details this article has a more comprehensive treatment of the networking stack.

The individual parts of the stack are quite complicated. For example TCP, as defined by RFC 793, provides highly reliable communications over packet switched networks but comes with significant complexity. This document has more information on how TCP operates: http://www.medianet.kent.edu/techreports/TR2005-07-22-tcp-EFSM.pdf

After all these steps are done the data is passed over to the application layer. This is what is commonly referred to as the “backend”.

Backend – Processing on the server

The browser has sent the information from the client’s machine over the network and it has made it to the web server. The portion dealing with those requests is often categorized as the “back end” and we will run with that terminology to group the next few steps.

Handling the web request

Once the server has received the packets they get passed up the networking stack, and the packets destined for the TCP port our web server is listening on get routed to it. At a very basic level we can write code to handle accepting an incoming connection like this:

import socket

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind((socket.gethostname(), 80))
serversocket.listen(5)

Then we can create a main loop to handle accepting incoming connections:

while True:
    (clientsocket, address) = serversocket.accept()
    # now do something with the clientsocket

However doing this with CPython wouldn’t scale all that well (we once had to re-engineer a system that connected to IoT devices via TCP/IP and handled incoming connections in a naive way, because it had problems scaling when it got more customers). For hosting websites we would do well to use specialized software that handles HTTP requests and handles them well; implementing your own code to efficiently handle many concurrent connections is a difficult task.

In practice, production HTTP servers are often running something like Apache or NGINX. In the case of NGINX, have a look at this article that outlines how the internal architecture works: https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/

A quick overview of NGINX is:

  • NGINX creates worker processes
  • These worker processes listen for incoming connections
  • When a connection is accepted the connection is assigned to a state machine that handles the flow of data for that connection
  • When the state machine requires content to be generated it will forward to our Python code via uWSGI

Getting the web request into Python – WSGI

Given that in production we will be using a dedicated HTTP server that’s not implemented in Python we need to get the request over to Python at some point.

The WSGI PEP explains the interface for connecting the web server to the Python world; Python backend software often sits behind a WSGI implementation such as uWSGI. It’s the bridge between the server software and the Python code, which processes the request and then forwards it on to the Django URL router to be dealt with.
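To give a feel for what that interface looks like, here is a minimal sketch of a WSGI application (this is not the article’s project code, just the shape of the callable described in PEP 3333; Django’s own WSGI handler plays this role for us):

# A minimal WSGI application: the server calls `application` once per request,
# passing the request data in `environ` and a `start_response` callable for
# sending the status line and headers.
def application(environ, start_response):
    body = "You requested {}".format(environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

if __name__ == "__main__":
    # Serve it with the reference server from the standard library.
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, application).serve_forever()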

Django URL routing:

The web framework must create the needed content by first dealing with the incoming URL to decide what content needs to be generated and how errors are to be handled. The first step is parsing the incoming URL path and then dispatching to the relevant function for that URL.

The Django URL dispatcher takes the incoming path and then routes it to the appropriate view function that will handle that incoming request. In Django’s settings.py we have the following:

ROOT_URLCONF = 'ToTheMetal.urls'

This specifies the Python module where Django has the entry point for URL dispatching. This is the first place that will be searched for a variable called  urlpatterns  which should be a list containing instances of django.conf.urls.url() types.

The base urls file (ToTheMetal/ToTheMetal/urls.py) is fairly simple:

from django.conf.urls import url, include
from django.contrib import admin

urlpatterns = [
    url(r'^admin/', admin.site.urls),
    url(r'^$', include('add_two_numbers.urls'))
]

You’ll notice that urlpatterns contains url() entries pairing regular expressions that define URL paths with functions. When a regex matches, Django calls the associated function. Django’s URL dispatcher works by linearly searching this list until the first regex match occurs, at which point the associated URL handler function gets called, with any results in regex capture groups forwarded to the function as arguments. If we match ^admin/ it takes us to the admin page, and otherwise if we match ^$ (which matches an empty string, the base URL) it takes us to the URLs defined in the Python module add_two_numbers.urls. Because Python module paths follow filesystem paths, that module is found at ToTheMetal/add_two_numbers/urls.py :

# URL routing for the add_two_numbers app

from django.conf.urls import url

from . import views

urlpatterns = [
    url(r'^$', views.add_two_numbers_page, name='add-two-numbers-page'),
]

Now if we match the regex ^$ again we call the function in the second argument which is views.add_two_numbers_page. This is our view function.

If no match can be made by the end of the urlpatterns Django will call the 404 error handler and if no custom handler is defined it will use the default one to respond with a 404 status code.
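Conceptually the dispatch loop boils down to something like the following sketch (the real implementation in Django does considerably more, handling includes, namespaces and keyword arguments, so treat this purely as an illustration; resolve and page_not_found are made-up names):

import re

def page_not_found(request):
    """Stand-in for Django's default 404 handler."""
    return "404 Not Found"

def resolve(path, urlpatterns):
    """Linearly scan the patterns and return the first matching view."""
    for pattern, view in urlpatterns:
        match = re.search(pattern, path)
        if match:
            # Captured groups become positional arguments for the view.
            return view, match.groups()
    return page_not_found, ()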

Django view

The Django view is the code that deals with the incoming request and generates the HTTP response.

In this case we have a function add_two_numbers_page in ToTheMetal/add_two_numbers/views.py

from django.template import loader, RequestContext
from django.http import HttpResponse

from .forms import AddTwoNumbersForm

def add_two_numbers(first, second):
    """Add two numbers together
    :first: first number
    :second: second number
    """
    result = first + second
    return result

def add_two_numbers_page(request):
    """Add two numbers"""
    if request.method == "GET":
        form = AddTwoNumbersForm()
        template = loader.get_template("add_two_numbers/add_two_numbers.html")
        return HttpResponse(template.render({'form': form}, request))
    elif request.method == "POST":
        form = AddTwoNumbersForm(request.POST)
        if form.is_valid():
            result = add_two_numbers(
                form.cleaned_data["number1"],
                form.cleaned_data["number2"]
            )
            return HttpResponse("Answer is {}".format(result))

This takes the incoming request and deals with what’s needed to handle HTTP responses and render the page.

When a user submits a POST request we go down the “POST” branch of the conditional and then need to check that the form input is valid. It turns out that properly handling forms is a bit complex, so instead of writing a lot of form-processing code each time we have a new project we can use the well tested forms functionality found in Django. So our form code looks like this (ToTheMetal/add_two_numbers/forms.py):

from django import forms

class AddTwoNumbersForm(forms.Form):
    """A form for adding two numbers together"""
    number1 = forms.IntegerField(label="number1")
    number2 = forms.IntegerField(label="number2")

This generated the HTML for the form earlier and now generates all the form validation code that runs when we call .is_valid() as well. This saves us a lot of time (for example the CSRF protection was handled for us) but introduces another layer of abstraction.

After this we again use the Django forms functionality to clean the data before calling the add_two_numbers function.
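If you want to see this in isolation, the following snippet (run inside python manage.py shell so Django is configured) shows roughly what the POST branch of the view does with the form:

from add_two_numbers.forms import AddTwoNumbersForm

# Bind the form to data as it would arrive in request.POST (note: strings!)
form = AddTwoNumbersForm({"number1": "3", "number2": "4"})
print(form.is_valid())      # True
print(form.cleaned_data)    # {'number1': 3, 'number2': 4} - IntegerField gives us ints back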

In the next steps we will see how the add_two_numbers function creates the answer.

Language implementation/execution

From this point on we have left the realm of “web frameworks” and are now looking at the Python language implementation details that underpin many different paradigms and programs.

Python code execution

Python is a language specification which doesn’t pin down exact implementation details, so depending on the implementation of Python being used the execution of the code can vary. For the sake of this example we are using the CPython implementation, both because it’s the reference implementation and because the way in which CPython executes code makes it easier to trace exactly what’s happening. (Figuring out exactly what happens in an implementation such as PyPy would be substantially more difficult because it would require a discussion of the tracing JIT approach PyPy uses; because of that JIT strategy the exact execution details can vary from invocation to invocation of the same function.)

As the name hints, the CPython interpreter is written in C, and since the CPython repo is now on GitHub you can have a look at the code yourself! From here on we will deal with some of the internals of CPython.

Adding the numbers: Python bytecode

If you use CPython you might have noticed the presence of those *.pyc and *.pyo files; these contain the compiled Python bytecode of your programs.

One thing every production-ready Python implementation does is a pre-compilation step where the source code is first parsed from text into some intermediate representation that can run faster. In the case of CPython the source is first parsed and then compiled into a cross-platform bytecode so that no slow text parsing has to occur at run time. (It is at this step that syntax errors are raised, which is why you cannot catch a SyntaxError from within the code that contains it: this step runs before that code is executed.)
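You can see this compile-before-run behaviour directly by invoking the compilation step yourself with the built-in compile() function:

# The parse/compile step runs before any of the code does, so the error below
# is raised without "result = ..." ever being executed.
try:
    compile("result = first +* second", "<example>", "exec")
except SyntaxError as err:
    print("Caught at compile time:", err)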

When we add the 2 numbers together this is accomplished via interpreted Python bytecode. This is implementation dependent, and in the case of CPython we can see how it works in the CPython source code. CPython doesn’t take many liberties with how it executes the bytecode; the design is based on being correct and reliable, not fast. It’s a fairly simple stack based approach which, conveniently, makes it really easy to follow how the execution of the bytecode happens. (If you want faster execution check out PyPy or ways to incorporate compiled code.)

We can use the dis module to see the bytecode from the function:

def add_two_numbers(first, second):
    """Add two numbers together
    :first: first number
    :second: second number
    """
    result = first + second
    return result

import dis
print(dis.dis(add_two_numbers))

Bytecode:

11           0 LOAD_FAST                0 (first)
             3 LOAD_FAST                1 (second)
             6 BINARY_ADD
             7 STORE_FAST               2 (result)

12          10 LOAD_FAST                2 (result)
            13 RETURN_VALUE

So we see that for source line 11 we load the two variables with LOAD_FAST, then perform a BINARY_ADD, then store the result into the variable on the left hand side of the assignment (result) with STORE_FAST. Then for line 12 we fetch result with LOAD_FAST and RETURN_VALUE returns it from the function.
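As a purely illustrative toy (nothing like CPython’s real evaluation loop), the same sequence of operations can be mimicked with an explicit stack:

def run_add(first, second):
    """Mimic the bytecode above with an explicit value stack."""
    stack = []
    stack.append(first)           # LOAD_FAST 0 (first)
    stack.append(second)          # LOAD_FAST 1 (second)
    right = stack.pop()           # BINARY_ADD pops both operands...
    left = stack.pop()
    stack.append(left + right)    # ...and pushes their sum
    result = stack.pop()          # STORE_FAST 2 (result)
    stack.append(result)          # LOAD_FAST 2 (result)
    return stack.pop()            # RETURN_VALUE

print(run_add(1, 2))  # 3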

Execution of the CPython code:

When the CPython interpreter runs, the machine code instructions that ultimately carry out the addition belong to the interpreter itself, a compiled C program that constitutes the core of the Python implementation. That machine code was produced by a C compiler from the CPython sources via the compiler’s internal representations. In the case of GCC the source code is first lowered into GCC’s intermediate representations (GIMPLE and then RTL) and the appropriate backend then emits the needed machine instructions. Here we are on x86, so the backend used by GCC is the x86 one and the CPython interpreter was compiled such that we are running x86 machine code.

Specifically on the first run through the code the CPython interpreter compiles the source code down to bytecode. This bytecode is then executed via the interpreter at runtime. Understanding how the bytecode is interpreted is therefore the next step in our journey.

Going into the CPython 3.7 source we find the execution of bytecode occurs in ceval.c within which there’s a switch statement that starts with:

switch (opcode) {

The opcode in this case is the opcode of the Python bytecode instruction currently being executed. In our case we are dealing with the bytecode that adds the two entities on the top of the value stack. The switch dispatches us to the case handled by BINARY_ADD (note that TARGET is a macro containing the case statement), which we enter:

TARGET(BINARY_ADD) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    /* NOTE(haypo): Please don't try to micro-optimize int+int on
       CPython using bytecode, it is simply worthless.
       See http://bugs.python.org/issue21955 and
       http://bugs.python.org/issue10044 for the discussion. In short,
       no patch shown any impact on a realistic benchmark, only a minor
       speedup on microbenchmarks. */
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    else {
        sum = PyNumber_Add(left, right);
        Py_DECREF(left);
    }
    Py_DECREF(right);
    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}

First we retrieve the operands off the stack; note that these are PyObject pointers and not plain integers, so some more work must be done before the addition can be evaluated. We first hit the check for whether the types are strings, since the plus operator is overloaded. Because we are not dealing with unicode strings we follow the else branch:

sum = PyNumber_Add(left, right);
Py_DECREF(left);

This calls PyNumber_Add to perform the actual addition. Once PyNumber_Add has finished we decrement the reference counts of left and right with:

    Py_DECREF(left);
}
Py_DECREF(right);

Finally we set the top of the bytecode execution stack with SET_TOP, which is a macro defined as:

#define SET_TOP(v) (stack_pointer[-1] = (v))

At this point the sum of the numbers is on the top of the stack ready to be used.

How we get from Python objects to adding the underlying integers (PyNumber_Add and down)

Warning, rabbit-hole alert!!! Skip past this if you aren’t interested in gory Python implementation details.

In abstract.c we see the definition of PyNumber_Add:

PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
    PyObject *result = binary_op1(v, w, NB_SLOT(nb_add));
    if (result == Py_NotImplemented) {
        PySequenceMethods *m = v->ob_type->tp_as_sequence;
        Py_DECREF(result);
        if (m && m->sq_concat) {
            return (*m->sq_concat)(v, w);
        }
        result = binop_type_error(v, w, "+");
    }
    return result;
}

So in this case we called this with 2 parameters of Python type int. (Note that this is Python’s type system, not C’s, so a Python int is stored in a C PyObject.)

We have to then apply binary_op1 (which is also found in abstract.c) to evaluate:

/*
  Calling scheme used for binary operations:
  Order operations are tried until either a valid result or error:
    w.op(v,w)[*], v.op(v,w), w.op(v,w)
  [*] only when v->ob_type != w->ob_type && w->ob_type is a subclass of
      v->ob_type
 */
static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
    PyObject *x;
    binaryfunc slotv = NULL;
    binaryfunc slotw = NULL;

    if (v->ob_type->tp_as_number != NULL)
        slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL) {
        slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot);
        if (slotw == slotv)
            slotw = NULL;
    }
    if (slotv) {
        if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}

At first glance I thought this was over-complicated, but then I remembered that Python has the NotImplemented language construct, which allows an expression where the left hand side doesn’t implement the operation to avoid failing outright (see the documentation page for NotImplemented). If the left hand side doesn’t support the operation, the operation is then tried with what the right hand side of the expression defines in __radd__. The comment above the implementation in the C source directly references this. This is how Python can handle a binary operation (in this case addition) where only one side implements it. In the case of adding two integers, however, we just add them directly. (Note how if the operation can’t be done at all, Py_RETURN_NOTIMPLEMENTED is returned for us.)
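You can see the same protocol from the Python side with a made-up class (purely illustrative): returning NotImplemented from __add__ makes Python try the reflected __radd__ on the other operand.

class Metres:
    """A toy wrapper type used only to illustrate the NotImplemented protocol."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, (int, float)):
            return Metres(self.value + other)
        return NotImplemented  # let the other operand have a go

    def __radd__(self, other):
        # Called for `other + self` when other.__add__ returned NotImplemented.
        return self.__add__(other)

print((Metres(2) + 3).value)  # 5, via Metres.__add__
print((3 + Metres(2)).value)  # 5, via Metres.__radd__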

So to see how this flows through the code we have to see what if (v->ob_type->tp_as_number != NULL) evaluates to, which requires seeing how the type object is defined. ob_type points to a PyTypeObject, which is a typedef of the _typeobject struct found in cpython/Include/object.h.

Since this is a rather large struct we will just show the parts most relevant to what we are doing here:

typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */

    destructor tp_dealloc;
    printfunc tp_print;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
                                    or tp_reserved (Python 3) */
    reprfunc tp_repr;

    /* Method suites for standard classes */

    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

    
    /* more code */
} PyTypeObject;

We see some of the standard things, like a type needing to define its name and its size for allocation purposes. Then we see some “method suites” which define what methods this type supports. So we see that tp_as_number is a pointer to PyNumberMethods. PyNumberMethods is a struct that contains a whole bunch of things like binary functions and unary functions; essentially it’s a table of which functions the type supports. If it’s NULL then the type doesn’t support number operations at all, and hence doing something like trying to add a number to it will fail and cause a type error. But more about that in a moment.

So in our case when we go to evaluate:

    if (v->ob_type->tp_as_number != NULL)

We find that v is of type int, and the int type defines tp_as_number, so we enter that branch.

If tp_as_number is not NULL we know that the type actually implements the numeric methods. The PyNumberMethods struct it points to looks like this:

typedef struct {
    /* Number implementations must check *both*
       arguments for proper type and implement the necessary conversions
       in the slot functions themselves. */

    binaryfunc nb_add;
    binaryfunc nb_subtract;
    binaryfunc nb_multiply;
    binaryfunc nb_remainder;
    binaryfunc nb_divmod;
    ternaryfunc nb_power;
    unaryfunc nb_negative;
    unaryfunc nb_positive;
    unaryfunc nb_absolute;
    inquiry nb_bool;
    unaryfunc nb_invert;
    binaryfunc nb_lshift;
    binaryfunc nb_rshift;
    binaryfunc nb_and;
    binaryfunc nb_xor;
    binaryfunc nb_or;
    unaryfunc nb_int;
    void *nb_reserved;  /* the slot formerly known as nb_long */
    unaryfunc nb_float;

    binaryfunc nb_inplace_add;
    binaryfunc nb_inplace_subtract;
    binaryfunc nb_inplace_multiply;
    binaryfunc nb_inplace_remainder;
    ternaryfunc nb_inplace_power;
    binaryfunc nb_inplace_lshift;
    binaryfunc nb_inplace_rshift;
    binaryfunc nb_inplace_and;
    binaryfunc nb_inplace_xor;
    binaryfunc nb_inplace_or;

    binaryfunc nb_floor_divide;
    binaryfunc nb_true_divide;
    binaryfunc nb_inplace_floor_divide;
    binaryfunc nb_inplace_true_divide;

    unaryfunc nb_index;

    binaryfunc nb_matrix_multiply;
    binaryfunc nb_inplace_matrix_multiply;
} PyNumberMethods;

We are interested in nb_add, which is of type binaryfunc, so let’s look at the typedef for that:

typedef PyObject * (*binaryfunc)(PyObject *, PyObject *);

If you find this a bit cryptic don’t despair, I had to re-read it a bunch of times too (I tried cdecl first!). Basically this defines a pointer to a function that takes two PyObject pointers as parameters and returns a PyObject pointer. Going back to binary_op1, the next line of interest is:

slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);

To figure out what this does we have yet another macro to look up:

/* Binary operators */

#define NB_SLOT(x) offsetof(PyNumberMethods, x)
#define NB_BINOP(nb_methods, slot) \
(*(binaryfunc*)(& ((char*)nb_methods)[slot]))

NB_BINOP here is doing something a bit nasty, but it essentially lets us select the function pointer we want using a byte offset into the struct. In other words, because the memory layout of a C struct is deterministic, it just picks the function out of the struct at the offset where it lives. It hides all the implementation nastiness from us, so NB_BINOP is essentially a lookup into the tp_as_number table that picks out the relevant function at the offset passed in as the op_slot parameter. (There is, as far as I know, no language feature in C that will tell you in which order the members of a struct are defined; if I’m wrong about this please post a comment to correct me.)

So if op_slot == 0 we take the member at offset 0 of the PyNumberMethods struct, which sets slotv to the binary function stored in the first slot (nb_add) for the integer type.

Moving along to the next line we find this check:

    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL)

Because the operand types are the same on each side we skip this branch.

if (slotv) {

Since slotv was successfully fetched above (and is the nb_add for the numbers now) we enter this branch:

    if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
        slotw = NULL;
    }
    x = slotv(v, w);
    if (x != Py_NotImplemented)
        return x;
    Py_DECREF(x); /* can't do it */

However slotw was not set earlier, so we skip that if statement, leaving us with:

    x = slotv(v, w);
    if (x != Py_NotImplemented)
        return x;
    Py_DECREF(x); /* can't do it */

Now we call the function that the slotv function pointer refers to, which in our case is whatever was stored in nb_add for integers. Given we are using Python 3 here (thankfully!) all integral numbers are unified under the long type (see PEP 237 for more information). This is extremely convenient for the developer, who doesn’t need to worry about integer overflow and similar considerations when coding (anyone who has dealt with code that overflowed or silently failed because the size of long integers changed from under them on another system will know what I mean).
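A quick demonstration of that unification: the same + operator works regardless of magnitude, with no wraparound at machine word boundaries.

print(2 ** 63 + 1)           # 9223372036854775809, past the 64-bit signed range, no overflow
print(2 ** 100 + 2 ** 100)   # 2535301200456458802993406410752, still exact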

Going to Objects/longobject.c we can find the appropriate definition for the PyNumberMethods for that type:

static PyNumberMethods long_as_number = {
    (binaryfunc)long_add,       /*nb_add*/
    (binaryfunc)long_sub,       /*nb_subtract*/
    (binaryfunc)long_mul,       /*nb_multiply*/
    long_mod,                   /*nb_remainder*/
    long_divmod,                /*nb_divmod*/
    long_pow,                   /*nb_power*/
    (unaryfunc)long_neg,        /*nb_negative*/
    (unaryfunc)long_long,       /*tp_positive*/
    (unaryfunc)long_abs,        /*tp_absolute*/
    (inquiry)long_bool,         /*tp_bool*/
    (unaryfunc)long_invert,     /*nb_invert*/
    long_lshift,                /*nb_lshift*/
    (binaryfunc)long_rshift,    /*nb_rshift*/
    long_and,                   /*nb_and*/
    long_xor,                   /*nb_xor*/
    long_or,                    /*nb_or*/
    long_long,                  /*nb_int*/
    0,                          /*nb_reserved*/
    long_float,                 /*nb_float*/
    0,                          /* nb_inplace_add */
    0,                          /* nb_inplace_subtract */
    0,                          /* nb_inplace_multiply */
    0,                          /* nb_inplace_remainder */
    0,                          /* nb_inplace_power */
    0,                          /* nb_inplace_lshift */
    0,                          /* nb_inplace_rshift */
    0,                          /* nb_inplace_and */
    0,                          /* nb_inplace_xor */
    0,                          /* nb_inplace_or */
    long_div,                   /* nb_floor_divide */
    long_true_divide,           /* nb_true_divide */
    0,                          /* nb_inplace_floor_divide */
    0,                          /* nb_inplace_true_divide */
    long_long,                  /* nb_index */
};

Since op_slot == 0 we are using the first function that appears in the struct,  long_add , to actually perform the addition. Making our code effectively:

    return long_add(v, w);
    

This simplification holds because long_add is implemented for this type, so its result is what gets returned. The long_add function has now been chosen to perform the actual addition of the longs; its definition is also found in Objects/longobject.c:

static PyObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
    }
    if (Py_SIZE(a) < 0) {
        if (Py_SIZE(b) < 0) {
            z = x_add(a, b);
            if (z != NULL) {
                /* x_add received at least one multiple-digit int,
                   and thus z must be a multiple-digit int.
                   That also means z is not an element of
                   small_ints, so negating it in-place is safe. */
                assert(Py_REFCNT(z) == 1);
                Py_SIZE(z) = -(Py_SIZE(z));
            }
        }
        else
            z = x_sub(b, a);
    }
    else {
        if (Py_SIZE(b) < 0)
            z = x_sub(a, b);
        else
            z = x_add(a, b);
    }
    return (PyObject *)z;
}

We are first creating a  PyLongObject *z  to store the result of the calculation.

CHECK_BINOP is a macro that does some sanity checking of inputs:

#define CHECK_BINOP(v,w)                                \
    do {                                                \
        if (!PyLong_Check(v) || !PyLong_Check(w))       \
            Py_RETURN_NOTIMPLEMENTED;                   \
} while(0)

So if a non-long type somehow ended up here it would bail out by returning NotImplemented. But seeing as the correct types were passed in, we pass this check and continue on.

Since all integers are unified under the long type in Python 3, arithmetic on arbitrary length integers can be performed, so the implementation has to handle the cases where arbitrary precision arithmetic is needed. However, doing every addition with the arbitrary precision machinery would be a massive pessimization, so there is first a check for whether the numbers are small enough to just add directly. If they are, we can do a fixed precision operation, which is much faster on most hardware. (You can see the code that runs the arbitrary precision arithmetic in Objects/longobject.c; it’s rather complex, and it’s the sort of overhead you don’t want when, say, iterating over a small list of numbers.)

if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
    return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
}

There are a few things going on in this snippet, so let’s break it down. First we have to check the size of the parameters with Py_SIZE.

The structure and Macro for Py_SIZE are found in Include/object.h:

typedef struct {
 PyObject ob_base;
 Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;


#define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)

This is basically accessing the ob_size member of the parameter. (Py_ssize_t is defined in PEP 353; if, like me, you see this code and ask “why didn’t they just use size_t?” you would probably enjoy reading that PEP.) For a Python long, ob_size holds the number of digits in the variable-length part of the object, and its sign matches the sign of the number.

Py_ABS is just an absolute value macro, see Include/pymacro.h for the definition. It is needed here because a negative Python int stores a negative ob_size, so taking the absolute value gives the number of digits regardless of sign. The check that both sizes are at most 1 is therefore asking whether each operand fits in a single digit, which is the fast path for small integers.

So now we have to apply the MEDIUM_VALUE macro to the parameter, this macro is found in Objects/longobject.c:

/* convert a PyLong of size 1, 0 or -1 to an sdigit */
#define MEDIUM_VALUE(x) (assert(-1 <= Py_SIZE(x) && Py_SIZE(x) <= 1),   \
         Py_SIZE(x) < 0 ? -(sdigit)(x)->ob_digit[0] :                   \
             (Py_SIZE(x) == 0 ? (sdigit)0 :                             \
             (sdigit)(x)->ob_digit[0]))

sdigit is a typedef to a platform specific type that has the width of a signed digit see Include/longintrepr.h:

#if PYLONG_BITS_IN_DIGIT == 30
typedef uint32_t digit;
typedef int32_t sdigit; /* signed variant of digit */
/* code snipped */
#elif PYLONG_BITS_IN_DIGIT == 15
typedef unsigned short digit;
typedef short sdigit; /* signed variant of digit */

So in this case, on my machine sdigit is int32_t, and these casts let us add the values as plain int32_t’s.

When the result is obtained, PyLong_FromLong wraps it in a new Python long object and we start going back up the abstraction layers. However we aren’t done just yet, there’s more digging down the abstraction layers to come…


The addition is performed!

If you followed above you will have seen all the intermediary steps that need to be taken to properly deal with dynamic types and dispatching the correct functions.

At this point we are performing the C code equivalent of:

int32_t a = 1;
int32_t b = 2;
return a + b;

and it is effectively this because in this code path we have already passed all the checks for bad types and sizes, so we know that a + b will fit within 32 bits.

A really useful tool for investigating how code maps to assembly is the compiler-explorer project. It has a super useful web interface at https://gcc.godbolt.org/ which we are using here.

Let’s see how the following C++ code disassembles (I know this isn’t C, but compiler-explorer did not support C on the website at the time of writing; the assembly output is the same anyway):

#include <cstdint>
int32_t add_two_numbers(int32_t a, int32_t b) {
    return a + b;
}

Assembly in Intel syntax from the above code:

add_two_numbers(int, int):
  push rbp
  mov rbp, rsp
  mov DWORD PTR [rbp-4], edi
  mov DWORD PTR [rbp-8], esi
  mov edx, DWORD PTR [rbp-4]
  mov eax, DWORD PTR [rbp-8]
  add eax, edx
  pop rbp
  ret

(source https://godbolt.org/g/5aQTAJ)

We are dealing with a stack here (much like how the Python bytecode works), managed through the stack and base pointer registers.

We first push rbp to save the caller’s base pointer and mov rbp, rsp to set up our own stack frame. The two arguments arrive in the edi and esi registers, get spilled into the stack frame, and are then loaded back into edx and eax. The add instruction adds edx into eax, and since eax is the register used for return values the result is already where it needs to be; finally pop rbp restores the caller’s frame pointer before ret returns. (This is unoptimized output; with optimization enabled the whole function compiles down to little more than lea eax, [rdi+rsi] followed by ret.)

So how is this addition with the add actually performed?

How the machine code is executed:

In an abstract sense, to perform an addition we have to fill one register with the first number, fill another register with the second number, add the contents of the two registers, and then return the result. How this is represented in opcodes depends on the hardware; from here on we are assuming the x86 instruction set architecture. Unlike Cao I’ve never written any x86 directly to hex files, and honestly I don’t plan to any time soon 🙂 but here’s a window into the type of information you would need to know in order to be able to do that.

Once again using compiler-explorer: https://godbolt.org/g/QZ4aVT

Binary instructions for this code

Mapping from assembly to the x86_64 hex codes

Note that on the x86 platform we have a few ways of performing an addition, each with its own opcode; see http://x86.renejeschke.de/html/file_module_x86_id_5.html for information on the various addition opcodes.

Here we see that the add instruction is encoded in hexadecimal as 01 d0: 01 is the opcode for “Add r32 to r/m32” in the documentation linked above, and d0 is the ModRM byte that selects edx as the source register and eax as the destination.

If you are interested in learning more about machine code with x86-64 I’d recommend having a look at this course handout.

So how does this actually get executed on the hardware?

Hardware implementation details

From here on, what happens can potentially be very different depending on what machine we are running on. This is exactly what “high level programming languages” seek to shield us from when we develop applications that have no intrinsic connection with the hardware they run on. (The “high” in “high level programming language” refers to a high level of abstraction from the hardware implementation. Contrast this with, say, writing a sound card driver, where your code is intrinsically tied to specific hardware and cannot be written at such a remove from it.)

Outside some systems programming, modern software development very rarely has to deal with this level. One of the towering achievements of many of the operating systems we currently use is just how little we need to deal with this level when we are writing our software.

Let’s gloss over the way in which the data is retrieved from RAM, we just will assume that it’s in the CPU registers and can be directly accessed. (The mechanisms behind loading items from memory to the CPU cache to registers are interesting but can be complicated)

As we saw above, we fetch the numbers we need from the registers, add them together, and leave the result in a register ready to be returned. We instruct the CPU to do these operations by sending binary encoded information, the opcodes specified in the instruction set architecture, and the CPU executes these commands accordingly.

But how does the CPU actually execute these instructions?

Microcode:

We are now at the hardware level. There’s an interesting step here that isn’t all that well known (it’s been a while since the FDIV bug hit the headlines): to save silicon space on the chips, some of the less common instructions are actually implemented in software on the chip, which is called microcode.

While implementing microcode might initially seem like a step that would make chips more expensive, it’s actually quite economical. By approaching chip design this way we get more powerful chips: saving hardware resources on the less common operations leaves more silicon available to dedicate to the more common operations and to fast caches, which overall is a win for performance (with the high clock speeds of modern CPUs the time to fetch data from RAM, including the electrical transmission time, matters, and caches implemented on the same die can be substantially faster). We also get things like the ability to update the microcode on the chip to fix bugs caused by errata in the chip designs.

Going back to the days when there was a clearer divide between RISC and CISC explains a few things. At one point, when people were writing more code in assembly, it helped programmer productivity to have more opcodes available that did more complicated operations. For example you might have an opcode for performing a modulo, which is a specific but not especially common task. When I was programming firmware on ATmega chips, some did not have hardware support for a modulo operation, so the C compiler would build that operation out of simpler ones (this was really slow though); a fast, dedicated modulo opcode would have been very convenient for me. But having an opcode for every possible instruction is not the most efficient approach once compiled high level languages proliferate: the compiler can construct the modulo anyway, so taking up valuable silicon for that opcode takes power away from something else (or adds to the cost of the chip).

The x86 family is definitely a CISC architecture, as the x86 instruction set is large, and the original chips implemented these opcodes directly in dedicated silicon. As time has gone on, more code is written in high level languages and less in assembly, and compared with the compilers that existed at the birth of x86, current optimizing compilers are really good at their job and happen to emit machine code that is mostly just basic operations. Hardware design has followed these developments in the software world, and RISC-style designs dominate as a result, even inside most modern desktop and server chips. However the x86 family is designed with a very large amount of care put into maintaining backwards compatibility with older processors: you can run really old x86 code on a modern Intel i7 or whatever, which is truly remarkable given how much has happened in the intervening years. If you send any x86 opcode to the chip it will get processed; the most common opcodes are implemented directly in the fastest paths of the silicon, while the slower, more complex opcodes are translated into simpler operations for execution inside the CPU itself. Essentially there’s a RISC-like core in the new x86 chips which maintains support for all the CISC operations via a translation step on the chip itself.

Some operations, like adding 2 registers, can be computed directly. This is a very common operation and is implemented directly in dedicated silicon. Less common operations, such as taking the inverse tangent of a number, might have their own opcode (depending on the hardware) but might not have their own dedicated silicon. Essentially the microcode is an interpreter in the chip that determines how the opcodes will be executed. It’s cheaper and more efficient from a silicon point of view not to dedicate too much chip surface area to uncommonly used operations, because that surface area takes away from having more power for the more common operations, cache, etc. In this case we have a simple addition, so let’s say the whole thing is executed directly on the silicon via fast circuitry.

Because the x86 microcode isn’t open source we are forced to gloss over this step here. If you are interested in it you might want to look at some of the attempts people have made at reverse engineering it.

Processor pipeline

Most modern CPUs execute code with instruction pipelining and modern x86 processors are no exception. Having such a pipeline lets the processor silicon be used more efficiently by increasing the average amount of time it is utilized instead of sitting idle. This does add yet another layer of complexity, one that many people never have to think about. However, not thinking about the pipeline can lead to some strange results if you assumed the CPU only executed one instruction at a time; the famous branch prediction question on Stack Overflow is a great example of such a case of potential astonishment.

For more about pipelining see these links:

  • https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
  • http://archive.arstechnica.com/paedia/p/pipelining-1/m-pipelining-1-1.html

Once the processor is ready to execute the instruction then it will get sent to the appropriate area on the silicon to be executed.

Implementation of addition in silicon

Once the pipeline gets to the appropriate addition instruction we can fairly safely say that addition of integers (at least narrow width ones) will be implemented directly in dedicated circuitry somewhere, but how does this actually happen?

At this point we now have those numbers stored in some registers on the CPU and we have to add them together. This is where the contents of the 2 registers are sent to the adder unit and the result is stored in another register. Specifically we have to get from dealing with integers to dealing with boolean logic operations.

To store numbers in a computer we need some format to encode them in, and because we are dealing with binary computers that format has to be built from bits. Integers are usually stored in two’s complement, due to the simplifications this representation allows at the silicon level. Specifically, two’s complement is useful because the adder unit described below works for negative numbers without needing any additional modification.
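A quick way to convince yourself of this is to play with the bit patterns in Python: masking with 0xFF gives the 8-bit two’s complement pattern, and the very same unsigned addition produces the right signed answer.

minus_five = -5 & 0xFF               # 0b11111011, the 8-bit two's complement pattern for -5
print(format(minus_five, "08b"))     # 11111011
print((minus_five + 7) & 0xFF)       # 2, i.e. -5 + 7, using plain unsigned addition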

Implementation of the adder unit

The adder unit logically looks like this:

Circuit diagram for the addition of two 4-bit integers. A0 through A3 are the bits comprising the first integer and B0 through B3 the bits of the second. By en:User:Cburnett, own work, created with Inkscape, CC BY-SA 3.0, Link.

Conceptually speaking this is perhaps the simplest way to add multi-bit integers: we chain together a series of adder circuits with carry.

The addition of the 2 integers is carried out in an adder, which takes in a series of bits and performs the addition operation on them. Each of those squares looks like this in terms of logic gates:

Circuit diagram of a full adder with carry.

Note that in real implementations this is a bit of a slow way to add integers and the CPU in your computer likely does something a bit more sophisticated like the carry-lookahead adder.
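As a sketch of what the diagrams above compute, here is the ripple-carry scheme written in terms of the Boolean operations the gates implement (illustrative only; as noted, real hardware is more sophisticated):

def full_adder(a, b, carry_in):
    """One of the square blocks in the diagram: add three bits."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out

def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length little-endian bit lists, e.g. 5 == [1, 0, 1, 0]."""
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        bit, carry = full_adder(a, b, carry)
        result.append(bit)
    result.append(carry)
    return result

print(ripple_carry_add([1, 0, 1, 0], [1, 1, 0, 0]))  # 5 + 3 -> [0, 0, 0, 1, 0] == 8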

Implementation of AND gate

A logical AND gate combines 2 TTL inputs and produces an output that is high only when both inputs are high.

Transistor-transistor logic just happens to be conceptually simple, though other implementations of this exist too:

An implementation of a logical AND gate using transistors.

As you can see for the current to flow from V to Out both A and B inputs must be high.

If we represent high voltages as 1s and low voltage levels as 0s we see that this circuit will create the logical-AND.

This is one way Boolean logic can be implemented in electronics. There are many other ways in which these logical operations can be implemented in circuitry; they can even be implemented mechanically, although that approach doesn’t scale. (Unfortunately the economics of the poor scaling of mechanical computation probably stopped many inventions in the era of Babbage’s Analytical Engine, by raising the cost of development beyond what a society unaware of the power of Turing-complete computing was willing to pay.)

Transistors

Essentially the transistor is a switch that allows a current to flow between two points if and only if a voltage is applied to its control terminal.

This lets us control one current path with another, which as you can see in the logic gate step before lets you construct the Boolean logic primitive operations required for computations.

(Note that your CPU of 201x era is likely using MOSFETs because of the properties of such components, including their ability to be miniaturized effectively.)