Software development and beyond

Artificial Intelligence in Python

The article presents the main building blocks in programming AI in Python, and it is loosely based on my notes from Harvard’s course CS50’s Introduction to Artificial Intelligence with Python 2020 and extended to include other interesting concepts and Python libraries that were not mentioned in the course.

AI in Python

There is a debate about what should be considered AI. Some people suggest that the term should be reserved for general artificial intelligence. In contrast, others believe anything smart to be AI, including things that are not that advanced from today’s perspective. Harvard and other universities prefer to use AI as an encompassing term for any building blocks that were or are used today in building intelligent systems, even if they are far from thinking like humans.

The course was meant as an introduction to the most essential and foundational techniques in building AI systems. It is not a comprehensive list of research areas or algorithms but a very nice overview of the main areas we can study and leverage in our applications. Personally, I really liked the course, both the contents and presentation.

Following the course, I divided artificial intelligence into the following sections:


Search refers to the techniques that are used for finding a solution to a problem given some initial state and possible actions that can be taken. These kinds of problems are typically modeled using various graphs like decision trees and solved with graph theory. Finding the solution (the final state) for the purpose of AI is then done using search algorithms like A* that are based on the basic breadth-first and depth-first search algorithms and their variations. These problems can be further extended to ones where we also have an opponent, an adversary where two players try to win against one another. This type of problem is called adversarial search and we can use it to implement AI for playing some games like Tic Tac Toe.

A* search algorithm is a variation of greedy best-first search using heuristic of a “distance”, combined with the number of steps that are being taken to get to the next node. The next node to be explored is the node that has the lower number of steps+distance in the frontier, where the frontier is a data structure to store examined nodes. Heuristic h(n) has to be consistent and admissible. A* is typically associated with path-finding problems in computer games, but it has more applications.

Minimax is an adversarial search algorithm that works for two opponents, one trying to optimize for the minimum and the other for the maximum value where this value is a final state of the game, e.g. (1, 0). The opponents look at the next possible states recursively playing as MAX or MIN players, creating a decision tree until they compute which move to make. It was originally formulated for two-player zero-sum game theory but can be extended to other problems. Alpha-beta pruning is an optimized variation of Minimax where the computation of some nodes is skipped if it is clear that their value won’t be more favorable.

graph-tool and NetworkX are two Python libraries that can be used for searching and manipulating graphs. Both A* and Minimax can also be implemented from scratch with not so much effort.


We want to represent knowledge for computers to allow computers to use logic to derive conclusions about that knowledge. This is where we can use logic programming, which is based on different kinds of formal logic, like propositional logic and first-order logic. Once this knowledge called knowledge base is defined, we are able to ask logical queries and get answers based on it. Typically, applications using formal logic in this way are called expert systems.

In propositional logic, we work with propositional symbols that can be either true or false and a set of truth-functional logical operators like AND, OR, ¬ negation, → implication, or ⇔ equivalence.

In first-order logic, also called predicate logic, we work with constant symbols, predicate symbols (P(x), Q(x, y)) defining relationships, functional symbols (f(x)), and quantifiers (∀ for all, ∃ there exists). This enables us to define relationships like Person(Minerva), BelongsTo(Minerva, Gryffindor) and then ask logical queries like ∀x BelongsTo(x, Gryffindor).

Going through all models in our knowledge base to see if the knowledge is true (all models are true) and if our query is also true is called model checking.

While there are specialized languages for logical programming like Prolog, we can also use Python libraries such as pytholog, pyDatalog, kanren or SymPy. Alternatively, there are also packages that let us interact with Prolog using the Python interface, like pyswip.


Until now, we talked about certain things, but the real world is full of uncertainty about things, and if we want to build a good AI, the AI has to understand that as well. The tools of the trade to do that are Bayesian networks, Markov models, and Hidden Markov Models.

To understand all of this it is best to revise Bayes’s rule P(b|a) = P(b)P(a|b) / P(a), probability distributions, joint probability distributions which are distributions for two or more random variables and conditional probability distributions. Bayesian network is then a data structure that represents the dependencies among random variables by using a directed acyclic graph. In the graph, a random variable X is a parent to Y if there is an edge X → Y. Each node X has a probability distribution P(X | Parents(X)), meaning its probability distribution depends on the parent. Bayesian networks are typically used to update beliefs about states of certain variables when other variables were observed, find the most probable configurations of random variables, or support decision-making under uncertainty.

Markov models are models for randomly changing systems which are systems where we have probability distributions in time. Markov assumption is that the current state depends on a finite fixed number of previous states. We can define the initial state and conditional transition distribution and then sample values from a Markov model (e.g., changing sunny and rainy days).

Hidden Markov models are for situations where we need to infer probability from observation because the original state is hidden to us. For example, we want to predict whether it is sunny or raining based on the observation of an activity (shopping, going for a trip, et cetera) on a particular day.

Pomegranate is a Python package that implements fast and flexible probabilistic models ranging from individual probability distributions to models such as Bayesian networks and Hidden Markov Models. PyMC3 is a library for probabilistic programming.


Optimization is choosing the best option from a set of options. In the applications for Artificial Intelligence, we are typically trying to find the best state in a State space where we can move from a state configuration to another state configuration. There are different types of algorithms for optimizations: Hill climbing, Simulated annealing, Linear programming, Backtracking, and others.

Hill climbing is a technique for finding the local minimum or local maximum. It is an iterative algorithm that starts with a random solution to a problem and then makes incremental changes to find the real solution (at least the local one), basically moving from the initial state to a better state and again to a better state until no better state is found among neighboring states. An example application would be placing hospitals on a grid in such a way that they are close to houses or finding the optimal solution to Travelling salesman problem. Hill climbing has different variations like Stochastic hill climbing or Random-restart hill climbing that will consider different directions to look for the solution.

The difference between Hill climbing and Simulated annealing is that Simulated annealing is based on probability and allows us to accept neighbors that are worse than the current state, at least at the beginning of the algorithm. It doesn’t guarantee that a global minimum or maximum is found, but it has a chance not to get stuck in a local minimum or maximum.

Optimization algorithm Simplex represents what we call Linear programming. It is for problems where decision variables have a linear relationship. In other words, we need to express constraints as linear equations. If we can do that, Simplex will find the optimal solution. An example application can be finding the efficient use of different machines in a factory, where each machine has some output and cost, and we need to combine different machines together in the most effective way.

Constraint Satisfaction Problems are situations where constraints are imposed on variables and where we are trying to find a solution that satisfies those constraints. Type inference in programming or Sudoku classifies as CSPs. The solutions can be solved using Backtracking and other methods.

Hill climbing and Simulated annealing implementation can be found in the Python library Solid. Linear programming solutions can be implemented using SciPy or PuLP.


Machine learning is all about the ability of a computer algorithm to learn from data. We can divide machine learning into supervised, unsupervised, semi-supervised, or reinforcement learning.

Supervised learning is machine learning where we use a human-labeled training set to tell the algorithm the meaning of the data. Once the algorithm is trained on a training set, it will be able to perform tasks on new data as well. In other words, we give a set of inputs and outputs to a computer and let it find a function to match inputs to outputs. Classification is a typical supervised learning problem of classifying items into discreet classes, while regression is used for finding a mapping function from inputs to continuous values (not discreet categories). The most popular type of regression is Linear regression. It can be used to predict or assign a price of a product based on the product’s parameters and data of other products if we have the products’ prices and values of the parameters.

K-nearest neighbors algorithm can be used for both classification and regression. It considers k nearest data points from inputs to derive output. For instance, classification will assign a class based on the plurality vote of the k-nearest other items.

The idea behind Support vector machines is to find a hyperplane (in 2D that means finding a line) that can separate different groups of items. We are usually trying to find the one that represents the largest separation between the groups. Again SVM can also be used for classification as well as regression, and it is especially useful when the data is not linearly-separable.

Loss function also called a cost function is an important concept in machine learning. It is used for evaluating classification solutions where it is the penalty for an incorrect classification of an example. Regularization is a technique to change a learning algorithm by modifying the loss function to favor simpler prediction rules to avoid overfitting. Overfitting is a situation when our trained model works too well on the data that it was trained on but fails on new data.

Another technique to avoid overfitting is Cross validation. In Hold out cross-validation, we divide our data set into train and test data so that we can evaluate our model on test data after it is trained on train data. That way, we can be sure that our model also works with “new” data that the algorithm didn’t see before. K-fold cross-validation is repeating the Holdout process k-times by generating k different sets of train and test data.

Unsupervised learning is a type of machine learning where a computer looks for patterns in data without or with minimal human assistance. A classic example of unsupervised learning is clustering, e.g., K-means clustering. In clustering, items are grouped together into groups called clusters. The practical application could be, for instance, in marketing by grouping website visitors based on their behavior and showing them different products.

Semi-supervised learning is a situation where only a part of the training set is labeled. For instance, we have people tagged on some family photos, and we want to identify people on the other pictures.

Reinforcement learning is learning with an environment where an agent performs actions, gets rewards or penalties, and tries to achieve a cumulative reward. This way, the agent learns by trial and error. An example of this would be a robot trying to learn how to walk. Markov Decision Process is a model for decision-making, representing states, actions, and their rewards. Q learning is an example of a reinforcement learning algorithm.

The most popular machine learning Python library is, without a doubt scikit-learn, but there are more options. For example, XGBoost can be used for regression and classification problems. When using Spark for distributed computations, we can also use MLlib.

Machine learning libraries for Neural networks, Time series, and Natural language processing are mentioned in their respective sections.

Neural networks

Neural networks are computing systems that map a number of inputs to outputs based on the structure and parameters of the network, which is based on highly connected processing elements called neurons. Neural networks are typically organized into layers. They can be used to model complex relationships between inputs and outputs or to find patterns in data. The most important neural networks are Deep networks, Convolutional neural networks (CNN), and Recurrent neural networks (RNN).

Gradient descent is an algorithm for minimizing loss when training neural networks. The gradient is basically a direction that will lead to decreasing loss. In gradient descent, we first pick some random weights for each of the inputs and then repeatedly calculate the gradient of the loss function on all data points, updating the chosen weights. There are some variations of gradient descent such as Stochastic gradient descent or Mini-batch gradient descent.

Multi-layer neural networks are networks with at least one hidden layer. Deep networks are networks with multiple hidden layers. Deep learning is about training deep networks.

Backpropagation in deep learning is a technique to train a neural network with hidden layers. As before, we first pick random weights for each of the inputs. Then we repeatedly calculate the error for the output layer, and for every layer (starting with output layer), we propagate the error back one layer and update the weights.

Convolutional neural networks are networks that use convolution which means that we are looking at neighboring data to reduce the size of the network (say from an image 100×100 pixels we create several feature maps 10×10, one for the shape, one for the red color, one for the blue color, etc.). The classic application of CNNs is in analyzing images, but there is a wide application for them, from recommender systems to natural language processing.

Recurrent neural networks are networks where outputs of neurons are fed as inputs into themselves. RNNs find their application in speech recognition, machine translation, or handwriting recognition.

The classic Python deep learning libraries are Tensorflow and PyTorch.

Keras is an alternative API for Tensorflow, is an alternative API for PyTorch and skorch is a scikit-learn compatible alternative API for PyTorch.

Time series

In machine learning, time series are used for clustering, classification, regression, anomaly detection, and forecasting.

Darts, sktime and tslearn are general purpose time-series ML libraries. There is also Facebook’s forecasting library prophet.


The ability to understand and use human languages is crucial in accepting AI to be intelligent. The area of AI dedicated to this study is called Natural Language Processing (NLP). In more practical terms, we are usually interested in machine translation, information retrieval, information extraction, text classification, text search, and other similar tasks.

The study of languages begins with syntax. Context-free grammars are most often used to represent the syntax of human languages.

In NLP, we use data structures such as n-grams (sequence of n items from a text, usually words) and bag of words (a collection of words representing a larger text such as a sentence, paragraph or article).

Markov model, the same model as discussed in Uncertainty, has its application also in natural language processing and can be used to understand the probabilities of words following other words since in Markov chains, we model a probability of events based on previous events. Additive smoothing is a technique of adding α to all values so that we avoid multiplying by 0 in Naive Bayes).

TD-IDF, in full words term frequency-inverse document frequency, is a way to know how important a word is in a text, given a collection of documents. The principle is simple: the word is important if it is repeated frequently in the text but not in other documents.

Semantics represents meaning in texts, and it is crucial in understanding natural languages.

word2vec is a technique of representing all words by a vector so that by comparing the vectors of two words, we can deduce what is the relationship between them, e.g., if they are synonyms.

WordNet is a widely used and available lexical database of the English language.

The most popular Python libraries for NLP are spaCy and ntlk. fastText is a library for efficient learning of word representations and sentence classification. Transformers provides NLP models for deep-learning libraries Tensorflow and PyTorch.

More libraries and final words

Artificial Intelligence is a topic too big for one article, and I couldn’t possibly cover or explain everything. I chose to include topics mentioned in Harvard’s course because I think they represent the most important concepts and algorithms for starting in AI.

For each topic, I tried to add pointers to Python libraries where some of them were, and some weren’t mentioned in the course. I also added a section on time series machine learning because it is a very hot topic today, and there are many libraries in Python for it.

Clearly, the Python ecosystem is packed with libraries for Artificial Intelligence, and we have already seen plenty of libraries for graphs, logical programming, probabilistic models, machine learning, neural networks, deep learning, and natural language processing. Is there more?

There is! There are also some modern wrappers around existing libraries that focus on high-level APIs and simplicity. So check out Pycaret and thinc!

I hope that the article was useful and gave you a good overview of the AI landscape in Python!

Last updated on 19.3.2022.

ai data popular python