Software development and beyond

Artificial Intelligence in Python

This article presents the main building blocks of programming AI in Python. It is loosely based on my notes from Harvard’s course CS50’s Introduction to Artificial Intelligence with Python (2020), extended to include other interesting concepts and Python libraries that were not mentioned in the course.

There is a debate about what should really be considered AI. Some people suggest that the term should be reserved for general artificial intelligence, while others consider anything smart to be AI, including things that are not really that advanced from today’s perspective. Harvard and other universities prefer to use AI as an encompassing term for any building blocks that were or are used in building intelligent systems, even if those systems are far from thinking like humans.

The course is meant as an introduction to the most important and foundational techniques for building AI systems. It is not a comprehensive list of research areas or algorithms, but a very nice overview of the main areas that we can study and leverage in our applications. Personally, I really liked the course, both its contents and its presentation.

Following the course, I divided artificial intelligence into the following sections.

Search

Search refers to the techniques used for finding a solution to a problem given some initial state and the possible actions that can be taken. These kinds of problems are typically modeled using graphs such as decision trees and solved with graph theory. Finding the solution (the final state) is then done using search algorithms like A*, which build on the basic breadth-first and depth-first search algorithms and their variations. This can be further extended to problems where we also have an opponent, an adversary, with two players trying to win against one another. This is called adversarial search, and we can use it to implement AI for playing games like Tic-Tac-Toe.

The A* search algorithm is a variation of greedy best-first search that uses a heuristic “distance” combined with the number of steps taken to reach a node. The next node to be explored is the node in the frontier with the lowest steps + distance value, where the frontier is a data structure storing the nodes discovered but not yet explored. The heuristic h(n) has to be consistent and admissible. A* is typically associated with path-finding problems in computer games, but it has many more applications.
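
The steps + distance ordering above can be sketched in pure Python with a priority queue. This is a minimal illustration, not a production implementation: it assumes a uniform step cost of 1 and uses Manhattan distance as the admissible heuristic on a small grid (the grid and helper names are my own).

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* search: explore nodes in order of g(n) + h(n), where g is the
    number of steps taken so far and h is the heuristic distance."""
    frontier = [(h(start), 0, start, [start])]  # (priority, steps, node, path)
    explored = set()
    while frontier:
        _, steps, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in explored:
            continue
        explored.add(node)
        for nxt in neighbors(node):
            if nxt not in explored:
                heapq.heappush(frontier, (steps + 1 + h(nxt), steps + 1, nxt, path + [nxt]))
    return None

# Find a path on a 4x4 grid, moving one cell at a time.
def grid_neighbors(pos):
    x, y = pos
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < 4 and 0 <= y + dy < 4]

goal = (3, 3)
path = a_star((0, 0), goal, grid_neighbors,
              lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]))
print(len(path) - 1)  # shortest path from (0, 0) to (3, 3) takes 6 steps
```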

Minimax is an adversarial search algorithm for two opponents, one trying to optimize for the minimum and the other for the maximum value, where this value is the utility of a final state of the game (e.g. 1 for a win, 0 for a draw). The opponents look at the next possible states recursively, playing as the MAX or MIN player, effectively building a decision tree until they can compute which move to make. It was originally formulated for two-player zero-sum games in game theory, but can be extended to other problems. Alpha-beta pruning is an optimized variation of Minimax in which the computation of some nodes is skipped when it is clear that their value cannot influence the final decision.

graph-tool and NetworkX are two Python libraries that can be used for searching and manipulating graphs. Both A* and Minimax can also be implemented from scratch without much effort.

Knowledge

We want to represent knowledge for computers in a way that allows them to use logic to derive conclusions from that knowledge. This is where we can use logical programming, which is based on different kinds of formal logic, like propositional logic and first-order logic. Once this knowledge, called a knowledge base, is defined, we can ask logical queries and get answers based on it. Applications using formal logic in this way are typically called expert systems.

In propositional logic we work with propositional symbols that can be either true or false and a set of truth-functional logical operators such as ∧ (AND), ∨ (OR), ¬ (negation), → (implication) and ⇔ (equivalence).

In first-order logic, also called predicate logic, we work with constant symbols, predicate symbols (P(x), Q(x, y)) defining relationships, function symbols (f(x)) and quantifiers (∀, for all; ∃, there exists). This enables us to define relationships like Person(Minerva) and BelongsTo(Minerva, Gryffindor) and then ask logical queries like ∀x BelongsTo(x, Gryffindor).

The process of enumerating all possible models and checking that our query holds in every model in which the knowledge base holds is called model checking.
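
A tiny illustration of model checking for propositional logic, enumerating every truth assignment; the symbols and knowledge base are invented for the example:

```python
from itertools import product

def model_check(symbols, kb, query):
    """The KB entails the query if the query is true in every model
    (truth assignment) in which the KB is true."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False  # counterexample found
    return True

# KB: "if it rains, the street is wet" and "it rains".
symbols = ["rain", "wet"]
kb = lambda m: (not m["rain"] or m["wet"]) and m["rain"]

print(model_check(symbols, kb, lambda m: m["wet"]))      # True: KB entails wet
print(model_check(symbols, kb, lambda m: not m["wet"]))  # False
```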

While there are specialized languages for logical programming like Prolog, we can also use Python libraries such as pytholog, pyDatalog, kanren or SymPy. Alternatively, there are also packages that let us interact with Prolog using Python interface, like pyswip.

Uncertainty

Up until now, we talked about things that are certain, but the real world is full of uncertainty, and if we want to build a good AI, it has to deal with that as well. The tools of the trade are Bayesian networks, Markov models and hidden Markov models.

To understand all of this, it is best to revise Bayes’ rule P(b|a) = P(b)P(a|b) / P(a), probability distributions, joint probability distributions (distributions over two or more random variables) and conditional probability distributions. A Bayesian network is then a data structure that represents the dependencies among random variables using a directed acyclic graph. In the graph, a random variable X is a parent of Y if there is an edge X → Y. Each node X has a probability distribution P(X | Parents(X)), meaning its probability distribution depends on its parents. Bayesian networks are typically used to update beliefs about the states of certain variables when other variables are observed, to find the most probable configurations of random variables, or to support decision-making under uncertainty.
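
A quick worked example of Bayes’ rule with made-up numbers:

```python
# Made-up prior and likelihoods:
# P(rain) = 0.2, P(clouds | rain) = 0.8, P(clouds) = 0.4
p_rain, p_clouds_given_rain, p_clouds = 0.2, 0.8, 0.4

# Bayes' rule: P(rain | clouds) = P(rain) * P(clouds | rain) / P(clouds)
p_rain_given_clouds = p_rain * p_clouds_given_rain / p_clouds
print(round(p_rain_given_clouds, 2))  # 0.4
```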

Markov models are models for randomly changing systems, i.e. systems where we have probability distributions over time. The Markov assumption is that the current state depends only on a finite, fixed number of previous states. We can define an initial state and a conditional transition distribution and then sample values from the Markov model (e.g. alternating sunny and rainy days).
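
The sunny/rainy example can be sketched with the standard library; the transition probabilities below are invented:

```python
import random

# P(tomorrow's weather | today's weather)
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}

def sample_chain(start, days, seed=0):
    """Sample a sequence of states from the Markov chain: each day
    depends only on the previous day (the Markov assumption)."""
    rng = random.Random(seed)
    state, sequence = start, [start]
    for _ in range(days - 1):
        nxt = transitions[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        sequence.append(state)
    return sequence

print(sample_chain("sunny", 10))
```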

Hidden Markov models are for situations where we need to infer probabilities from observations, because the actual state is hidden from us. For example, we may want to predict whether it is sunny or rainy based on observing an activity (shopping, going for a trip, …) on a particular day.

Pomegranate is a Python package that implements fast and flexible probabilistic models ranging from individual probability distributions to models such as Bayesian networks and Hidden Markov models. PyMC3 is a library for probabilistic programming.

Optimization

Optimization is the activity of choosing the best option from a set of options. In AI applications we are typically trying to find the best state in a state space, where we can move from one state configuration to another. There are different types of optimization algorithms: hill climbing, simulated annealing, linear programming, backtracking and others.

Hill climbing is a technique for finding a local minimum or maximum. It is an iterative algorithm that starts with a random solution to a problem and then makes incremental changes to improve it, moving from the initial state to a better neighboring state again and again until no better state is found among the neighbors. An example application would be placing hospitals on a grid so that they are close to houses, or finding a good solution to the travelling salesman problem. Hill climbing has variations like stochastic hill climbing or random-restart hill climbing that consider different directions or starting points when looking for a solution.
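
A minimal sketch of that loop, maximizing a toy one-dimensional function (the function and helper names are my own):

```python
def hill_climb(start, neighbors, score):
    """Repeatedly move to the best neighboring state until no neighbor
    improves on the current state (a local optimum)."""
    state = start
    while True:
        best = max(neighbors(state), key=score)
        if score(best) <= score(state):
            return state  # local optimum reached
        state = best

# Maximize f(x) = -(x - 3)^2 over the integers, moving +-1 at each step.
score = lambda x: -(x - 3) ** 2
print(hill_climb(0, lambda x: [x - 1, x + 1], score))  # climbs to x = 3
```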

The difference between hill climbing and simulated annealing is that simulated annealing is probabilistic and allows us to accept neighbors that are worse than the current state, especially at the beginning of the run. It doesn’t guarantee that a global minimum or maximum is found, but it has a better chance of not getting stuck in a local one.

The Simplex algorithm represents what we call linear programming: solving problems where the decision variables have linear relationships, in other words where we can express the objective and constraints as linear equations and inequalities. If we can do that, Simplex will find the optimal solution. An example application is finding the most efficient use of different machines in a factory, where each machine has some output and cost and we need to combine the machines in the most effective way.
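
With SciPy, the factory example might look roughly like this; the machine outputs, costs and limits below are all made up for illustration:

```python
from scipy.optimize import linprog

# Machine 1 costs 50/hour and makes 5 units/hour; machine 2 costs 80/hour
# and makes 10 units/hour. We need at least 10 units and each machine can
# run at most 4 hours. Minimize total cost under these linear constraints.
result = linprog(
    c=[50, 80],               # minimize 50*x1 + 80*x2
    A_ub=[[-5, -10]],         # -5*x1 - 10*x2 <= -10, i.e. output >= 10
    b_ub=[-10],
    bounds=[(0, 4), (0, 4)],  # hours per machine
    method="highs",
)
print(result.x, result.fun)  # machine 2 is cheaper per unit, so it does the work
```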

Constraint satisfaction problems (CSPs) are situations where constraints are imposed on variables and we are trying to find a solution that satisfies those constraints. Type inference in programming languages or Sudoku classify as CSPs. They can be solved using backtracking and other methods.

Implementations of hill climbing and simulated annealing can be found in the Python library Solid. Linear programming problems can be solved using SciPy or PuLP.

Machine learning

Machine learning is all about the ability of a computer algorithm to learn from data. We can divide machine learning into supervised, unsupervised, semi-supervised and reinforcement learning.

Supervised learning is machine learning where we use a human-labeled training set to tell the algorithm the meaning of the data. Once the algorithm is trained on the training set, it can perform the same task on new data. In other words, we give a set of inputs and outputs to a computer and let it find a function that maps the inputs to the outputs. Classification is a typical supervised learning problem of sorting items into discrete classes, while regression finds a mapping from inputs to continuous values rather than discrete categories. The most popular type of regression is linear regression. It can be used, for example, to predict the price of a product based on the product’s parameters, given the prices and parameter values of other products.

The k-nearest neighbors algorithm can be used for both classification and regression. It considers the k nearest data points to the input to derive the output. For instance, in classification it assigns the class chosen by a plurality vote of the k nearest items.
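
The plurality vote can be sketched from scratch in a few lines; the training points below are invented:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by plurality vote of its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# 2D points labelled by the group they sit in.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(knn_classify(train, (1, 1)))  # "a": its 3 nearest neighbors are all "a"
print(knn_classify(train, (5, 4)))  # "b": its 3 nearest neighbors are all "b"
```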

The idea behind support vector machines is to find a hyperplane (in 2D, a line) that separates different groups of items, usually the one with the largest margin between the groups. Again, SVMs can be used for classification as well as regression, and they are especially useful when the data is not linearly separable and we cannot use a simpler method.

A loss function, also called a cost function, is an important concept in machine learning. It evaluates a solution by assigning a penalty, for example for an incorrect classification of an example. Regularization is a technique that modifies the loss function to favor simpler prediction rules and so avoid overfitting. Overfitting is a situation where our trained model works very well on the data it was trained on but fails on new data.

Another technique to avoid overfitting is cross-validation. In hold-out cross-validation we divide our data set into training and test data, so that we can evaluate the model on the test data after training it on the training data. That way we can check that the model also works on “new” data the algorithm hasn’t seen before. K-fold cross-validation repeats the hold-out process k times by generating k different train/test splits.

Unsupervised learning is a type of machine learning where a computer looks for patterns in data with minimal or no human assistance. A classic example of unsupervised learning is clustering, e.g. k-means clustering. In clustering, items are grouped together into groups called clusters. A practical application could be in marketing: grouping website visitors based on their behavior and showing different products to each group.
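
A from-scratch sketch of k-means (Lloyd’s algorithm) on made-up 2D points: assign each point to the nearest centroid, move each centroid to the mean of its cluster, and repeat.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest centroid and recomputing each centroid as its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))  # one centroid per group
```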

Semi-supervised learning is a situation where only a part of the training set is labeled. For instance, we have people tagged in some family photos and we want to identify the same people in the other pictures.

Reinforcement learning is learning within an environment where an agent performs actions, receives rewards or penalties, and tries to maximize the cumulative reward. This way the agent learns by trial and error. An example would be a robot learning how to walk. A Markov decision process is a model for decision-making that represents states, actions and their rewards, and Q-learning is an example of a reinforcement learning algorithm.

The most popular machine learning Python library is without a doubt scikit-learn, but there are more options. For example, XGBoost can be used for regression and classification problems. When using Spark for distributed computations we can also use MLlib.

Machine learning libraries for Neural networks, Time series and Natural language processing are mentioned in their respective sections.

Neural networks

Neural networks are computing systems that map a number of inputs to outputs based on the structure and parameters of the network, which is built from highly connected processing elements called neurons. Neural networks are typically organized into layers. They can be used to model complex relationships between inputs and outputs or to find patterns in data. The most important kinds of neural networks are deep networks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

Gradient descent is an algorithm for minimizing loss when training neural networks. The gradient points in the direction of steepest increase of the loss, so we repeatedly step in the opposite direction. In gradient descent we first pick random weights for each of the inputs and then repeatedly calculate the gradient of the loss function over all data points, updating the weights accordingly. There are variations of gradient descent such as stochastic gradient descent and mini-batch gradient descent.
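
A minimal sketch of that loop: fitting a line to data by repeatedly stepping against the gradient of the mean squared error. The data, learning rate and iteration count are made up for the example.

```python
# Fit y = w * x + b by gradient descent on the mean squared error loss.
data = [(x, 2 * x + 1) for x in range(-5, 6)]  # true relationship: y = 2x + 1
w, b, lr = 0.0, 0.0, 0.01

for _ in range(2000):
    # Gradient of the loss with respect to w and b over all data points.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Step against the gradient to decrease the loss.
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # converges towards w = 2, b = 1
```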

Multi-layer neural networks are networks with at least one hidden layer. Deep networks are networks with multiple hidden layers. Deep learning is about training deep networks.

Backpropagation is the technique used in deep learning to train a neural network with hidden layers. As before, we first pick random weights for each of the inputs. Then we repeatedly calculate the error at the output layer and, for every layer starting with the output layer, propagate the error back one layer and update the weights.

Convolutional neural networks use convolution, which basically means looking at neighboring data in order to reduce the size of the network (say, from a 100×100 pixel image we create several 10×10 feature maps: one for shapes, one for red color, one for blue color, etc.). The classic application of CNNs is analyzing images, but they are widely used elsewhere, from recommender systems to natural language processing.

Recurrent neural networks are networks where outputs of neurons are fed back into the network as inputs. RNNs find their application in speech recognition, machine translation and handwriting recognition.

The classic Python deep learning libraries are TensorFlow and PyTorch.

Keras is an alternative high-level API for TensorFlow, and skorch is a scikit-learn compatible alternative API for PyTorch.

Time series

In machine learning, time series are used for clustering, classification, regression, anomaly detection and forecasting.

Darts, sktime and tslearn are general purpose time-series ML libraries. There is also Facebook’s forecasting library prophet.

Natural language processing

The ability to understand and use human languages is crucial for accepting AI as really intelligent. The area of AI dedicated to this study is called natural language processing (NLP). In more practical terms, we are usually interested in tasks like machine translation, information retrieval, information extraction, text classification and text search.

The study of languages begins with syntax. Context-free grammars are most often used to represent the syntax of human languages.

In NLP we use data structures such as n-grams (sequences of n items from a text, usually words) and bags of words (collections of words representing a larger text such as a sentence, paragraph or article).
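
Both structures are simple to build by hand:

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous sequences of n words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "the quick brown fox jumps over the lazy dog".split()
print(ngrams(text, 2)[:3])  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

# A bag of words discards word order and keeps only the counts.
bag = Counter(text)
print(bag["the"])  # 2
```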

The Markov model, the same model discussed under Uncertainty, also has applications in natural language processing: since Markov chains model the probability of events based on previous events, they can be used to model the probabilities of words following other words. Additive smoothing is a technique of adding a value α to all counts so that we avoid multiplying by 0 (e.g. in Naive Bayes).

TF-IDF, in full term frequency–inverse document frequency, is a way to measure how important a word is in a document, given a collection of documents. The principle is simple: a word is important if it is repeated frequently in the document but not in the other documents.
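
A from-scratch sketch of the formula; the documents are made up, and a real implementation would also handle terms that appear in no document at all:

```python
import math

def tf_idf(term, document, documents):
    """Term frequency in `document` times inverse document frequency
    across the whole collection."""
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in documents if term in doc)
    idf = math.log(len(documents) / containing)
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum computing is the future".split(),
]

# "the" appears in every document, so its idf (and tf-idf) is 0;
# "quantum" is rare, so it scores high in the document that contains it.
print(tf_idf("the", docs[0], docs))      # 0.0
print(tf_idf("quantum", docs[2], docs))  # log(3) / 5, roughly 0.22
```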

Semantics is the study of meaning in texts, and it is crucial for understanding natural languages.

word2vec is a technique of representing words as vectors so that by comparing the vectors of two words we can deduce the relationship between them, e.g. whether they are synonyms.

WordNet is a widely used and freely available lexical database of the English language.

The most popular Python libraries for NLP are spaCy and NLTK. fastText is a library for efficient learning of word representations and sentence classification. Transformers provides NLP models for the deep learning libraries TensorFlow and PyTorch.

More libraries and final words

Artificial intelligence is a topic too big for one article, and I couldn’t possibly cover or explain everything. I chose to include the topics mentioned in Harvard’s course because I think they represent the most important concepts and algorithms for getting started in AI.

For each topic I tried to add pointers to Python libraries, some of which were mentioned in the course and some of which weren’t. I also added a section on time series machine learning because it is a very hot topic today and there are many Python libraries for it.

Clearly, the Python ecosystem is packed with libraries for artificial intelligence, and we have already seen plenty of them for graphs, logical programming, probabilistic models, machine learning, neural networks, deep learning and natural language processing. Is there more?

There is! There are also modern wrappers around existing libraries that focus on high-level APIs and simplicity, so check out PyCaret and Thinc!

This is the end! I hope the article was useful to some of you. Feel free to contact me on Twitter or elsewhere with suggestions or comments.

Last updated on 3.10.2020.
