POD Types

Written by Alex Guyer | guyera@oregonstate.edu

This lecture is about plain-old-data (POD) types. We'll cover the following:

POD types ("structure types") via classes
Type-annotating class instances
Example: Loading POD types via text I/O

POD types ("structure types") via classes

In our File I/O lecture, we created a program that loads information about cities from a text file, particularly their names and population counts. Suppose you want to store all that information in a list. That sounds easy, but there's a problem: you know how to create a list of strings, and you know how to create a list of integers, but how do you create a list of cities?

There are a few solutions. A naive solution would be to have two lists: a list of strings to store the cities' names, and a list of integers to store their corresponding population counts. For example, perhaps city_names[0] represents the name of the first city read from the text file, and city_populations[0] represents the population of that same city. This solution technically works, but it has a host of problems. For one, you (the programmer) have to be very careful to make sure that these two lists are always in alignment. If you ever insert a new city name into city_names but forget to insert the population of that city into city_populations (or if you remove or change a value in one list but not the other), then the two lists will no longer be in alignment with one another; city_names[4] might represent the name of one city while city_populations[4] represents the population of a different city altogether. At that point, there might be no way of knowing which populations correspond to which cities. You'll also probably run into an IndexError when iterating through the lists by index if one list is longer than the other.
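To make the alignment problem concrete, here is a small sketch using a few of the cities from the data file shown at the end of this lecture. The list names come from the paragraph above; the specific mistake (inserting a name without its population) is just one illustrative way for the lists to drift apart.

```python
# Two parallel lists: index i in one list is meant to describe the same
# city as index i in the other
city_names = ['Corvallis', 'Eugene', 'Salem']
city_populations = [61993, 178786, 180406]

# Insert a new city name at index 1, but forget to insert its population
city_names.insert(1, 'Portland')

# The lists are now out of alignment: index 1 pairs Portland's name with
# Eugene's population
print(f'{city_names[1]}: {city_populations[1]}')

# Iterating over the longer list's indices eventually runs off the end
# of the shorter list
hit_index_error = False
try:
    for i in range(len(city_names)):
        print(f'{city_names[i]}: {city_populations[i]}')
except IndexError:
    hit_index_error = True
    print('IndexError: the lists have different lengths')
```

Nothing in the language ties the two lists together, so neither the interpreter nor Mypy can warn you when they fall out of step.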

There are other issues with this solution as well, but let's move on to a better solution. Rather than having a list of strings for the cities' names and a separate list of integers for the cities' populations, wouldn't it be nice if you could just have a single list of cities? Hypothetically, in order to do that, your program would need to know what a "city" is. It already knows what a string is, and it already knows what an integer is. These data types are built into the Python language. But it doesn't know what a city is. If we could somehow tell our program what a city is—that it's some sort of thing that has both a name, which is a string, and a population, which is an integer—then perhaps we could create a single list of cities themselves.

And, indeed, we can do exactly that by making use of classes. A class is a custom, programmer-defined data type. That's right—beyond the built-in types of integers, strings, floats, booleans, and so on, we can create our own types of data by defining classes for them. Once a class has been defined, you can then create variables whose types are those classes. These are often referred to as instances of the given class. For example, once you have defined the City class, you can then proceed to create variables of type City (i.e., instances of the City class), just as you can create int variables, str variables, float variables, and so on. That also implies that we could create a list of cities, just as we can create a list of integers, a list of strings, a list of booleans, and so on. This idea is much more appealing than having two separate lists to store all the information about the cities.

Classes are extremely powerful. They support all sorts of complex features like attributes (member variables), methods (member functions), inheritance, upcasting, polymorphism, and various metaprogramming techniques, as well as philosophies like encapsulation, information hiding, and so on. We'll talk about all these things throughout this course, but we'll keep it simple to start. For now, we'll only use classes in their most limited capacity: we'll use them to define so-called plain-old-data (POD) types. In other programming languages, these are often specifically referred to as structure types.

A POD type is simply a class that declares attributes (member variables) and nothing else. An attribute is essentially a variable that exists inside another variable. Attributes are often said to establish "has-a" relationships. For example, in the program that we're trying to write, every city has a name, and every city has a population. So, when we define our City class in a moment, we should give it two attributes: 1) a name, and 2) a population. Then, whenever we create a variable of type City, it will automatically have two other variables inside it: 1) a name, and 2) a population. If we then create a list of cities, that single list will be capable of storing all the information that we need it to store; we'll no longer need two separate lists.

To define a class that represents a simple POD type, use the following syntax:

class <name>:
    <attribute 1 name>: <attribute 1 type>
    <attribute 2 name>: <attribute 2 type>
    ...
    <attribute N name>: <attribute N type>

Replace <name> with the name of the class (i.e., the name of the data type itself), replace each <attribute X name> with the name of an attribute that you want instances of the class to have, and replace the corresponding <attribute X type> with the type of that attribute.

Let's put this information to use and start writing our program:

classes.py

# Define the City class
class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer

def main() -> None:
    # ... (nothing here yet)
    pass # Placeholder so that the function has a valid body

if __name__ == '__main__':
    main()

The above code defines the City class such that every city, in the context of our program, has a name and a population. Cities' names are represented as strings, and cities' populations are represented as integers.

It's important that you understand that the above code does not actually create any instances of the City class. That's to say, we have not yet created any cities themselves. We have only defined what a city is: a city is simply a kind of variable that has two other (smaller) variables inside it—one representing the city's name and the other representing its population.

Notice that the class's name, City, is capitalized. Indeed, PEP 8, the official style guide for Python code, recommends that all classes be named using PascalCase, meaning that a class's name shouldn't contain underscores, and each word in the class's name should start with an uppercase letter (e.g., MyCoolClass). This naming convention makes it easy to distinguish the name of a class from the name of a variable since, as per PEP 8, variables are generally named using snake_case. For example, City might be the name of a class, but city might be the name of a variable (possibly even an instance of the City class).

Also notice that the City class is defined in the module (global) scope rather than being defined inside some particular function or another. While it's legal in Python to define a class inside a function, doing so would result in that class only being usable within said function's scope. We want to use the City class throughout our entire program—not just in a single function—so we define it in the module scope.
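As a minimal sketch of that scoping rule (the function and class names here are hypothetical, chosen only for illustration), a class defined inside a function can be used within that function but not outside it:

```python
def describe_local_city() -> str:
    # LocalCity is defined inside this function, so its name only exists
    # within this function's scope
    class LocalCity:
        name: str

    city = LocalCity()
    city.name = 'Salem'
    return city.name


# Works: the class is used inside the function that defines it
name_from_function = describe_local_city()
print(name_from_function)

# Fails: the name LocalCity does not exist out here. (A static type
# checker will flag this line as well; it's included on purpose.)
try:
    LocalCity()
    visible_outside_function = True
except NameError:
    visible_outside_function = False

print(visible_outside_function)
```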

Now that we've defined the City class, we can proceed to create instances of it (i.e., variables whose type is City). To create an instance of a class, sometimes referred to as instantiating the class, simply write out the name of the class followed by an empty pair of parentheses (almost as if you're calling a function). For example:

classes.py

# Define the City class
class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer

def main() -> None:
    my_cool_city = City()
    

if __name__ == '__main__':
    main()

This syntax works for simple POD types, anyway. When we introduce constructors (__init__() methods) in a future lecture, the syntax for instantiating our classes will get slightly more complicated.

Important detail: you'll often run into problems if you try to use a class before it has been defined (especially when type-annotating class instances). That's to say, the definition of a given class should usually appear before (i.e., above) any and all lines of code that use it in any way. This is why I defined the City class at the top of the program.
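Here is a small sketch of what goes wrong at runtime when a class is used above its definition. Town is a hypothetical class used only for this illustration, and the example instantiates the class rather than type-annotating it, but the underlying problem is the same: the name doesn't exist yet.

```python
# At this point, the interpreter has not yet reached the definition of
# Town, so the name does not exist
try:
    early_town = Town()
    created_early = True
except NameError as e:
    created_early = False
    print(f'NameError: {e}')


class Town:
    name: str


# Below the definition, the same kind of line works
late_town = Town()
late_town.name = 'Corvallis'
print(late_town.name)
```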

So, City is a class, which is a custom, programmer-defined data type, and my_cool_city is an instance of the City class. The City class declares two attributes: 1) name, and 2) population. Attributes establish a has-a relationship. Putting all that together, my_cool_city is a variable that has two other smaller variables inside it: 1) name, and 2) population. Question: how do we access those smaller, internal variables?

Enter the dot operator. To access a variable that is contained within another variable, write out the name of the outer / larger variable, followed by a dot (.), followed by the name of the inner / smaller variable:

classes.py

# Define the City class
class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer

def main() -> None:
    my_cool_city = City()
    
    # Print the value of the name variable, which is an attribute
    # that's contained inside the my_cool_city variable (every City
    # instance has a 'name' attribute, which is a string, as declared
    # in the City class definition above).
    print(my_cool_city.name)

if __name__ == '__main__':
    main()

Another question: What's printed by the above program? Well, as it currently stands, it actually raises an AttributeError and crashes when we try to print my_cool_city.name:

(env) $ python classes.py 
Traceback (most recent call last):
  File "/home/alex/instructor/static-content/guyera.github.io/code-samples/pod-types/classes.py", line 16, in <module>
    main()
    ~~~~^^
  File "/home/alex/instructor/static-content/guyera.github.io/code-samples/pod-types/classes.py", line 13, in main
    print(my_cool_city.name)
          ^^^^^^^^^^^^^^^^^
AttributeError: 'City' object has no attribute 'name'

The error might seem a bit confusing: 'City' object has no attribute 'name'. How can this be? After all, didn't we define the City class to have a name attribute? Moreover, if we run the above program through Mypy, it reports no errors whatsoever. Well, the reason for the wording of this runtime error is a bit complicated, and it has to do with the discrepancy between Python's extremely dynamic type system and Mypy's static type system. When we define the City class to have an attribute called name of type str, that's mostly just for Mypy's sake—it's telling Mypy that instances of the City class should, for the purposes of static analysis, be treated as having an attribute called name of type str. However, when we actually run the program, Mypy's static type system is irrelevant; all that matters is the Python interpreter's type system, which is entirely dynamic. The interpreter's type system does not care about attribute declarations within class definitions. It only cares about what variables are actually created at runtime, and what types they actually have.

Although the name: str attribute declaration says that all City instances have a name attribute, notice that we never actually assigned a value to my_cool_city.name. And yet, we try to print it anyway. In other words, we're trying to print a variable that, from the perspective of the interpreter, has not yet been defined (even if it does exist from Mypy's perspective).

That's all to say, you cannot use a variable if you have not yet defined it (i.e., given it a value at runtime), even if you have declared it (statically) as an attribute of a class. Attempting to do so results in a runtime error.
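One way to see the difference between declaring and defining is to inspect an instance at runtime. This sketch uses the built-in hasattr() function, which isn't otherwise covered in this lecture, to check whether an attribute actually exists on an object:

```python
class City:
    name: str # Declared (for Mypy's sake), but not defined at runtime
    population: int


my_cool_city = City()

# The declaration alone does not create the attribute at runtime
exists_before = hasattr(my_cool_city, 'name')
print(exists_before)

# Assigning a value is what actually defines the attribute
my_cool_city.name = 'Chicago'
exists_after = hasattr(my_cool_city, 'name')
print(exists_after)
```

The first check reports that the attribute is missing and the second reports that it exists, even though the class definition never changed in between.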

So, before we can print my_cool_city.name, we have to give it a value. We can do this using the assignment operator, just like when assigning a value to any other kind of variable (remember: attributes are just variables that exist inside other variables). We should do the same thing for my_cool_city.population as well while we're at it:

classes.py

# Define the City class
class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer

def main() -> None:
    my_cool_city = City()

    # Give our city a name and a population
    my_cool_city.name = "Chicago"
    my_cool_city.population = 2721000
    
    # Now that my_cool_city.name and my_cool_city.population have been
    # defined at runtime (not just statically declared), we can proceed
    # to use these variables however we'd like (e.g., print them to the
    # terminal, or do anything else that you might want to do with
    # a string or integer)
    print(my_cool_city.name)
    print(my_cool_city.population)

    # my_cool_city.population is an integer, so we can even use it
    # in mathematical operations if we'd like (again, it's just like
    # any other integer---it's just inside another, larger variable
    # called my_cool_city)
    print(f"Half of chicago's population is: "
        f"{my_cool_city.population / 2}")
          
if __name__ == '__main__':
    main()

Running the above program produces the following output:

(env) $ python classes.py 
Chicago
2721000
Half of chicago's population is: 1360500.0

Because attributes do not exist from the interpreter's perspective until they're actually defined (assigned values at runtime), and because using an undefined attribute is a bug that results in an AttributeError being raised, it's a good idea to always define your attributes as early as possible. For example, in the above program, we create my_cool_city and then immediately proceed to assign values to my_cool_city.name and my_cool_city.population. If we didn't do this immediately (if we created my_cool_city but didn't define its attributes until several lines of code later, or perhaps even several functions later), then there'd be a potentially large section of code in which my_cool_city is defined, but its attributes are not. If we accidentally attempted to use any of its attributes within that section of code, we'd get an AttributeError. Defining attributes early mitigates these sorts of mistakes, so it's generally a good idea.
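One way to follow this advice consistently is to wrap creation and initialization in a small helper function, so that no other code ever sees a half-initialized city. The make_city() helper below is hypothetical (it isn't part of this lecture's program), and the constructors covered in a future lecture offer a more standard way to achieve the same thing:

```python
class City:
    name: str
    population: int


# Hypothetical helper: creates a City and defines all of its attributes
# immediately, before anything else gets a chance to use it
def make_city(name: str, population: int) -> City:
    city = City()
    city.name = name
    city.population = population
    return city


def main() -> None:
    cities = [
        make_city('Chicago', 2721000),
        make_city('Corvallis', 61993)
    ]

    # Each instance has its own name and its own population
    for city in cities:
        print(f'{city.name}: {city.population}')


if __name__ == '__main__':
    main()
```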

So, declaring an attribute but failing to define it results in an AttributeError when the attribute is used at runtime. But what about the other way around? What if you attempt to define an attribute (assign it a value) at runtime but forget to declare it (statically) within the class definition? Well, this also results in an error, but a completely different kind. It results in a static type error, as reported by Mypy:

define_undeclared_attribute.py

# Define the City class
class City:
    # Notice: I've now omitted the name attribute declaration
    population: int # Every city has a population, which is an integer

def main() -> None:
    my_cool_city = City()

    # Give our city a name and a population
    my_cool_city.name = "Chicago" # Static type error here
    my_cool_city.population = 2721000
    
    # Now that my_cool_city.name and my_cool_city.population have been
    # defined at runtime (not just statically declared), we can proceed
    # to use these variables however we'd like (e.g., print them to the
    # terminal, or do anything else that you might want to do with
    # a string or integer)
    print(my_cool_city.name)
    print(my_cool_city.population)

    # my_cool_city.population is an integer, so we can even use it
    # in mathematical operations if we'd like (again, it's just like
    # any other integer---it's just inside another, larger variable
    # called my_cool_city)
    print(f"Half of chicago's population is: "
        f"{my_cool_city.population / 2}")
          
if __name__ == '__main__':
    main()

And here's the error reported by Mypy:

(env) $ mypy define_undeclared_attribute.py 
define_undeclared_attribute.py:10: error: "City" has no attribute "name"  [attr-defined]
define_undeclared_attribute.py:18: error: "City" has no attribute "name"  [attr-defined]
Found 2 errors in 1 file (checked 1 source file)

Funnily enough, though, the above program runs just fine (and it produces the same output as it did before). Again, this is because of discrepancies between the interpreter's dynamic type system and Mypy's static type system. This may lead you to think that Mypy is being picky or pedantic. But don't be fooled: Mypy is right. Even though the above program runs and works, the code is flawed: as defined, cities do not have names, yet we're trying to assign a value to my_cool_city.name anyway. If Mypy allowed us to do this (as the Python interpreter does), then nothing would stop us from creating arbitrary attributes within arbitrary objects whenever we want. Imagine: if any object could have any attribute at any time, you'd have no way of knowing which objects have which attributes without scouring the entire codebase to figure out where those attributes might have been defined (or might not have, depending on, say, the conditions of various if statements at runtime). That would make Python programs incredibly hard to maintain. Mypy at least gives us some confidence about what variables and attributes might exist at any given point in time.
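To see why this strictness is useful in practice, consider a misspelled attribute name. In this sketch (the typo is deliberate), the interpreter creates a brand-new attribute rather than reporting a mistake. The built-in vars() function, which isn't otherwise covered in this lecture, returns the attributes that an object actually has at runtime:

```python
class City:
    name: str
    population: int


my_cool_city = City()
my_cool_city.population = 2721000

# Deliberate typo: 'nmae' instead of 'name'. The interpreter does not
# complain; it simply creates a new attribute called 'nmae'. Mypy, on
# the other hand, reports an attr-defined error for this line.
my_cool_city.nmae = 'Chicago'

# Show which attributes actually exist at runtime
runtime_attributes = vars(my_cool_city)
print(runtime_attributes)

# The attribute we meant to define still doesn't exist
print(hasattr(my_cool_city, 'name'))
```

Without a static check, this bug would only surface later, when some other part of the program tries to read my_cool_city.name and gets an AttributeError.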

And, as you know, the code you submit for your labs and assignments must pass through Mypy in strict mode without any errors being reported, or else you'll be penalized.

(Before Python static analysis tools like Mypy saw widespread adoption in industry, many large-scale Python codebases were extremely hard to maintain. For example, they would frequently crash unexpectedly due to type errors that are now easily caught by Mypy and similar tools. Functions were often littered with dynamic type assertions as a messy, last-ditch effort to mitigate these sorts of issues.)

Type-annotating class instances

Suppose you want to pass an instance of a class as an argument to a function, or you want to return an instance of a class from a function. Then you'll need to know how to type-annotate such variables. As it turns out, it's trivial: just write the name of the class. And hopefully that makes sense. If you want a parameter to store an integer value, you type-annotate it as an int. If you want it to store a string value, you type-annotate it as a str. If you want it to store a city, you type-annotate it as a City.

For example:

print_city.py

class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer


# Given a city, print all of its information to the terminal
def print_city(city: City) -> None:
    print(f'City name: {city.name}')
    print(f'City population: {city.population}')


def main() -> None:
    my_cool_city = City()

    my_cool_city.name = "Chicago"
    my_cool_city.population = 2721000
    
    # Use the print_city() function to print the information about
    # my_cool_city
    print_city(my_cool_city)

          
if __name__ == '__main__':
    main()

Running the above program produces the following output:

(env) $ python print_city.py 
City name: Chicago
City population: 2721000

Again, you'll run into issues if you try to type-annotate a class instance before the class itself has been defined. To avoid such issues, prefer to define classes near the top of your program / module, above any of the functions that use them.
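The same annotation works for return types, and it can appear inside list annotations as well. As a sketch, the hypothetical largest_city() function below (not part of this lecture's program) accepts a list of cities and returns one of them; it assumes that the list is non-empty:

```python
class City:
    name: str
    population: int


# Hypothetical example: given a non-empty list of cities, return the
# one with the largest population
def largest_city(cities: list[City]) -> City:
    largest = cities[0]
    for city in cities:
        if city.population > largest.population:
            largest = city
    return largest


def main() -> None:
    eugene = City()
    eugene.name = 'Eugene'
    eugene.population = 178786

    salem = City()
    salem.name = 'Salem'
    salem.population = 180406

    biggest = largest_city([eugene, salem])
    print(f'Largest city: {biggest.name}')


if __name__ == '__main__':
    main()
```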

Example: Loading POD types via text I/O

At the start of this lecture, I said that our goal was to write a program that reads a list of cities and their information from a file, storing it all in a single list. Let's combine what we've learned in this lecture with what we learned in the File I/O lecture to do just that. And for the sake of completeness, I'll even add a simple user interface so that the user can choose from a couple of options to explore the data in a couple of different ways (perhaps that's a bit overboard, but I like to throw in some larger examples every now and then):

from typing import TextIO

class City:
    name: str # Every city has a name, which is a string
    population: int # Every city has a population, which is an integer


# Given a city, print all of its information to the terminal
def print_city(city: City) -> None:
    print(f'City: {city.name}')
    print(f'    Population: {city.population}')


# Given a file, read all of the cities' data from the file, and return
# a list of cities storing all that data.
def read_city_file(file: TextIO) -> list[City]:
    i = 1
    cities = []
    for line in file:
        if i >= 2: # Skip the first line in the file
            # Strip whitespace
            line = line.strip()

            # Extract tokens
            tokens = line.split(',')

            # Create a City variable, storing the name and population
            # values from the line inside its respective attributes
            city = City()
            city.name = tokens[0]
            city.population = int(tokens[1])

            # Append the City variable to our list of cities
            cities.append(city)

        i += 1 # Increment i by 1 (equivalent to 'i = i + 1')

    # The for loop is done. Return the list of cities parsed from the
    # file.
    return cities


# Prompts the user for an integer until they enter one that's valid,
# according to the given list of valid integers. See the Exceptions
# lecture notes for more information.
def prompt_for_integer_in_list(
        prompt: str, # Text to print when prompting the user
        valid_choices: list[int], # List of valid choices
        error_message: str # Text to print when given an invalid input
        ) -> int: # Returns the user's final, valid input
    supplied_valid_input = False
    while not supplied_valid_input:
        try:
            chosen_integer = int(input(prompt))
            if chosen_integer in valid_choices:
                supplied_valid_input = True
            else:
                print(error_message)
        except ValueError as e:
            print(error_message)
       
        print() # Print an empty line to make things easier to read

    return chosen_integer


def option_display_all_cities(cities: list[City]) -> None:
    for city in cities:
        print_city(city)
    print() # Print an empty line to make things easier to read


def option_search_city_by_name(cities: list[City]) -> None:
    chosen_name = input("Enter the city's name: ")
    
    found = False
    for city in cities:
        if city.name == chosen_name:
            # Found the city with the specified name. Print its
            # information
            print_city(city)
            found = True
            break # End the for loop

    if not found:
        print(f'Sorry, I don\'t know anything about the city named '
            f'"{chosen_name}"')

    print() # Print an empty line to make things easier to read


def main() -> None:
    cities_file_name = input('Enter the name of the cities data '
        'file: ')

    quit_program = False

    try:
        with open(cities_file_name, 'r') as cities_file:
            cities = read_city_file(cities_file)
    except FileNotFoundError as e:
        print('Error: File not found')
        quit_program = True
    except OSError as e:
        quit_program = True
        print('Error: Failed to read the file')

    # If we failed to read the file, quit_program will already be True.
    # Otherwise, it'll be False until the user chooses to quit.
    while not quit_program:
        menu_text = ('What would you like to do?\n'
            '    1. Display all cities\n'
            '    2. Search cities by name\n'
            '    3. Quit\n'
            'Enter your choice: ')
        valid_choices = [1, 2, 3]
        input_error_message = 'Error: Invalid choice'
        
        users_choice = prompt_for_integer_in_list(
            menu_text,
            valid_choices,
            input_error_message
        )

        if users_choice == 1:
            option_display_all_cities(cities)
        elif users_choice == 2:
            option_search_city_by_name(cities)
        else:
            quit_program = True


if __name__ == '__main__':
    main()

Suppose there's a text file in our working directory named data.txt with the following content:

City,Population
Corvallis,61993
Eugene,178786
Salem,180406
Portland,635749

Then here's an example run of the above program:

Enter the name of the cities data file: data.txt
What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: 4
Error: Invalid choice

What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: jfda
Error: Invalid choice

What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: 1

City: Corvallis
    Population: 61993
City: Eugene
    Population: 178786
City: Salem
    Population: 180406
City: Portland
    Population: 635749

What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: 2

Enter the city's name: jfdsa
Sorry, I don't know anything about the city named "jfdsa"

What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: 2

Enter the city's name: Corvallis
City: Corvallis
    Population: 61993

What would you like to do?
    1. Display all cities
    2. Search cities by name
    3. Quit
Enter your choice: 3

The read_city_file() function reads (loads) a list of cities from the given text file using file input techniques. If we wanted to, we could create another function, say write_city_file(), that writes (saves) a given list of cities to a given text file (perhaps in the same comma-separated values format) using file output techniques. But the above example is already fairly long, so I'll leave this as an exercise for the reader.