1 Introduction

Welcome to AE 4803 AIM: Foundations of Scientific Machine Learning. This is a new course offered in the Spring 2025 term at Georgia Tech and is intended to be a “foundations” level course in the new College of Engineering AI for Engineering minor.

1.1 What’s in a name? Artificial intelligence, machine learning, data science, and the scope of this course.

I googled “artificial intelligence” (often abbreviated AI) and the first result comes from Google’s new “AI overview” feature¹: Artificial intelligence (AI) is a field of study that focuses on developing machines that can learn, reason, and act in ways that are usually associated with human intelligence.

Because I closed the tab with my first search and wanted to go back to the webpage to copy the text for the image alt-text above, I repeated my search, and got a somewhat different answer: Artificial intelligence (AI) is the ability of a machine to perform tasks that are usually associated with human intelligence.

These are probably both reasonable definitions for casual conversation, but that’s not what we’re here for. Instead, we’re here to really learn deeply about what AI is, and for that we’re going to need precision, and in particular precise definitions.

Exercise

Consider the two different definitions of “AI” above carefully. In what (if any) senses are they (a) exactly the same, (b) similar, (c) somewhat different, (d) very different? What consequences might these differences have for (a) developing AI algorithms, (b) evaluating AI impacts, (c) creating AI policies?

I’m not going to provide a definition of AI for now and instead I’m going to throw two more terms into the mix that you may have heard. The first is “machine learning” (often abbreviated ML). Rather than get an AI definition for this too, I decided to go to the dictionary Merriam-Webster, which provides the following primary definition (Merriam-Webster Dictionary 2024):

a computational method that is a subfield of artificial intelligence and that enables a computer to learn to perform tasks by analyzing a large dataset without being explicitly programmed

And finally I’ll add the term “data science” (curiously, rarely abbreviated DS). I wanted to give you a Merriam-Webster definition here, but the term isn’t in their dictionary as of writing this on October 28, 2024. So instead I’m going to use Cambridge Dictionary’s definition (Cambridge Dictionary 2024):

the use of scientific methods to obtain useful information from computer data, especially large amounts of data

Exercise

For both the terms “machine learning” and “data science”, find an alternative definition from a source that is not a generative AI. What differences exist between the new definitions you’ve found and the ones I’ve cited above?

Clearly, different people/entities may have different ideas of what AI, ML, and data science may mean. Some people may use these terms to describe generative tools like ChatGPT, GitHub Copilot, and DALL-E (or similar products developed by other entities). Others use these terms to refer to more purpose-built algorithms like AlphaGo (for playing Go) or GraphCast (for weather prediction).² Many academics use these terms to describe the study of the underlying mathematical and programming ideas on which such products are built.

My goal is not to give you a single definition of any of these terms and then to argue that my definition is more correct than any other definition. The point I want to make is that it’s worth being clear about how we define these terms in any given context, whether it be in a textbook, a news article, or perhaps a spirited discussion between friends. To that end, I now want to make clear what it is that we will and will not cover in this class, which is a “foundations”-level course in the College of Engineering’s AI minor.

The focus of this class will be the mathematical and programming foundations of scientific machine learning, which I define as the study of algorithms which use scientific data to define computational tools that perform useful scientific tasks. The main scientific task that we will focus on in this course is the task of predictive simulation, which seeks to predict the behavior or outcomes of scientific or engineering systems. Predictive simulation is a key task in engineering disciplines like design and control, where we seek to predict design outcomes and control responses in order to make decisions about design parameters and control inputs. The class of machine learning methods that we will focus on in this class will therefore be regression methods, which use data to define relationships between inputs and outputs that allow us to take a specified input and issue a prediction for the output.

1.2 Regression methods: an overview

1.2.1 Motivation and problem description

We use the notation \(z\in \mathbb{R}^d\) to denote a real-valued vector of \(d\) input variables, and the notation \(y \in\mathbb{R}\) to denote a real-valued scalar output variable. Our goal is to be able to predict the value of \(y\) if we know the value(s) of the input variable \(z\). Some examples:

in aerodynamic modeling, \(z\) could contain airfoil geometry parameters like camber and thickness, and \(y\) could represent the lift coefficient. Being able to predict \(y\) from \(z\) enables engineers to choose more aerodynamically efficient designs.
in orbital dynamics, \(z\) could represent orbital parameters like altitude and inclination, and \(y\) could represent the total time a satellite spends outside the sun’s shadow in a given time period. Being able to predict \(y\) from \(z\) enables engineers to determine if a satellite’s solar panels will generate enough power to support the satellite’s mission.
in chemical process engineering, \(z\) could represent conditions within a reactor like temperature and chemical mixture properties and \(y\) could represent the reaction rate. Being able to predict \(y\) from \(z\) enables engineers to design more efficient reactors.

Exercise

Propose your own example scenario where it would be useful to predict real-valued outputs from inputs, drawing on your own experience, e.g. in personal projects, previous coursework, work/internship/co-op experiences, or extracurriculars. What would \(z\) and \(y\) represent? What does predicting \(y\) from \(z\) enable in your scenario?

Mathematically, predicting \(y\) from \(z\) amounts to defining a function \(f\) that takes in a \(d\)-dimensional input and outputs a scalar. Mathematical notational shorthand for this is \(f:\mathbb{R}^d\to\mathbb{R}\). This function \(f\) is often called a model. The question at hand is: how do we choose \(f\)? There are many ways to do so:

At a baseline, you could just make up a model: let’s just say \(y = f(z) = \|z\|^2\). This is mathematically valid, but it’s probably a bad model for the three example scenarios above, because this model probably issues predictions that are very different from the true outputs.
Alternatively, you could develop a model based on physical principles (a “physics-based” model) – if you have taken classes in (aero/thermo)dynamics, then you would have learned some ways to calculate \(y\) from \(z\) in the above examples, e.g. based on potential flow theory, rigid body dynamics, or the Arrhenius equation. These models are likely to be more accurate than our made-up model above, although they are often imperfectly accurate because they make simplifying assumptions. One drawback of physics-based models is that they may be computationally expensive (some computational fluid dynamics simulations require many thousands of hours of supercomputing time). Additionally, fully physics-based models may not even exist in some applications (e.g., there are many aspects of plasma physics for which we currently lack a complete theoretical understanding).
Finally, our focus will be on using data to define a model (a “data-driven” model or a “learned” model). We assume that we have a data set consisting of \(N\) pairs of input and output data, \((z_i,y_i)\) for \(i = 1,\ldots,N\). To define our model, we want to choose an \(f\) so that the predicted output \(f(z_i)\) is close to the output data \(y_i\) for all the data in our data set. This is sometimes called “fitting” the model to data. The advantages of data-driven models are that they may be significantly cheaper than physics-based models, and they can be fit to experimental data even when we lack scientific theory for developing physics-based models. The disadvantages are that data-driven models require data – and the amount of data that would be “enough” to ensure that the learned model is accurate or useful is highly dependent on the application.
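To make the idea of a model as a function concrete, here is a minimal sketch of the made-up model \(f(z) = \|z\|^2\) evaluated on a tiny data set. The data values are hypothetical and invented purely for illustration, and the names `made_up_model` and `data` are my own choices rather than anything standard.

```python
# A model is just a function f: R^d -> R.
# Here is the made-up model f(z) = ||z||^2 (squared Euclidean norm).
def made_up_model(z):
    """Return the squared Euclidean norm of the input vector z."""
    return sum(z_k ** 2 for z_k in z)

# Hypothetical data set of N = 3 input/output pairs (z_i, y_i) with d = 2.
# These numbers are invented for illustration only.
data = [
    ([1.0, 2.0], 0.5),
    ([0.0, 3.0], 1.2),
    ([2.0, 2.0], 0.9),
]

# Compare each prediction f(z_i) against the recorded output y_i.
errors = [made_up_model(z_i) - y_i for z_i, y_i in data]
for (z_i, y_i), e_i in zip(data, errors):
    print(f"z = {z_i}, y = {y_i}, f(z) = {made_up_model(z_i)}, error = {e_i}")
```

For this invented data the predictions are far from the outputs, which is the sense in which a made-up model can be mathematically valid and still a bad model.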

Learning how to fit models to data and assess their accuracy and usefulness is the focus of this course.

1.2.2 Mathematical setting and problem formulation

Hot take: Machine learning methods, at their core, are simply very complicated calculators.

Our goal is to understand the calculations that these methods are carrying out, which is going to enable us to understand and assess both the successes and failures of the methods. This means we want to be precise about the mathematical problem formulation, that is, the characterization of the math problem that is being solved by the methods.

Formulating a regression problem requires three ingredients:

Paired input and output data: let \(N\) be the number of pairs, and let \(z_1,\ldots,z_N\) denote the \(N\) input data and \(y_1,\ldots,y_N\) denote the \(N\) corresponding outputs. It is common to notate the set of all these data pairs as \(\{(z_i,y_i)\}_{i=1}^N\).
The definition of a parametrized model class – more details on this in a moment, but you should think of this as a set of possible functions that map \(z\) to \(y\).
A method for choosing a model from within the chosen model class – in regression problems this is most frequently done by solving an appropriate optimization problem to pick the “best” possible model from within the class.

Let’s focus on the second ingredient, the parametrized model class. We use the mathematical term “class” in a similar way to the mathematical term “set”. For example, perhaps you will have heard before that the numbers \(1,2,3,\ldots\) all the way to infinity form the “set of natural numbers” (often denoted \(\mathbb{N}\)). As another example, \(\{H,T\}\) can denote the set of possible outcomes of flipping a two-sided coin (H for heads, T for tails). Note that it is common to use curly braces \(\{\cdot\}\) to enclose a set (notice where it appears in our shorthand notation for the data set!). Similarly, we can define a set of possible functions that map \(z\) to \(y\), for example (assuming \(d=1\) so that \(z\) is a scalar for the moment):

\[ \mathcal{F}:= \{z^2, \sin(z), \exp(z)\} \]

In the above expression, the notation \(:=\) indicates that \(\mathcal{F}\) is being defined as the expression that follows \(:=\), that is, \(\mathcal{F}\) is the set of three possible functions: \(f(z)=z^2\), \(f(z)=\sin(z)\), and \(f(z)=\exp(z)\). You see that it might be cumbersome to write the “\(f(z)=\)” part of every function, so that part often gets dropped.
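If it helps to see this in code, the set \(\mathcal{F}\) can be sketched as a collection of three Python functions. This is only an illustration: the variable names `F` and `predictions` are my own, and a Python list stands in for the mathematical set so that the order is fixed.

```python
import math

# The set F = {z^2, sin(z), exp(z)} written as a collection of Python functions.
# A list (rather than a Python set) is used only to keep a fixed order.
F = [
    lambda z: z ** 2,  # f(z) = z^2
    math.sin,          # f(z) = sin(z)
    math.exp,          # f(z) = exp(z)
]

# Each element of F is a candidate model mapping a scalar z to a scalar output.
z = 0.0
predictions = [f(z) for f in F]
print(predictions)
```

Each entry of `predictions` is what one of the three candidate models would predict for the same input.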

But so far I have still only been saying “set” — what distinguishes a “class” of potential models from a “set” of potential models? Usually we use the term “class” to denote a set of functions that share certain properties or functional forms, for example (still assuming \(z\) is scalar for the moment) the class of all quadratic functions:

\[ \{c_0 + c_1 z + c_2 z^2 : c_0,c_1,c_2\in\mathbb{R}\} \qquad(1.1)\]

The way to read this notation above is as the set of all functions of the form \(f(z) = c_0 + c_1 z + c_2 z^2\) that can be obtained when \(c_0,c_1,c_2\) are allowed to take on any real values.³ We call this the class of all quadratic functions because by varying \(c_0,c_1,c_2\) we always get a quadratic function (noting that with our mathematics hats on, linear functions are just special cases of quadratic functions where the second-order coefficient is zero). Thus, functions within the class share a functional form, unlike the set \(\mathcal{F}\) we defined above. However, we could use the functions in the set \(\mathcal{F}\) to construct an alternative model class:

\[ \{a z^2 + b\sin(z) + c\exp(z) : a,b,c\in\mathbb{R}\} \qquad(1.2)\]

or, allowing \(z\) to be a \(d\)-dimensional vector again, we could consider the following model class:

\[ \{\max(0,a^\top (Wz + b)): W\in\mathbb{R}^{d\times d}, \,a,b\in\mathbb{R}^d\} \qquad(1.3)\]

Exercise

How would you read and understand the two classes of functions notated above?

Notice that the notation we have been using to define function classes follows a pattern: within the braces, the form of the function comes before the colon, and after the colon comes a list of the variables that appear in the function definition, together with a specification of what values those variables are allowed to take on. We call the post-colon variables parameters, and the set of all values that the parameters are allowed to be chosen from is called the parameter space. Any of the classes of functions we have defined above can be described as a parametrized model class, because they define sets of functions with a shared form, where the individual functions within the set arise from varying the values of the parameters.

Let me describe some common notational conventions that I want you to be familiar with.⁴ ⁵ It is common to use a single lowercase letter (usually a Greek one) to denote the set of all parameters – e.g., \(\beta = \{c_0,c_1,c_2\}\) or \(\theta = \{a,b,W\}\), and the corresponding capital letter to denote the corresponding parameter space, i.e., \(B = \mathbb{R}^3\) and \(\Theta = \mathbb{R}^d\times \mathbb{R}^d\times \mathbb{R}^{d\times d}\). Where we might write \(y = f(z)\) for a general abstract model \(f\), we will typically write \(y = f(z;\beta)\) or \(y = f(z;\theta)\) to indicate a model parametrized by \(\beta\) or \(\theta\) (note that we separate inputs from parameters using a semicolon).
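The notation \(f(z;\theta)\) maps naturally onto a function that takes the input and the parameters as separate arguments. Below is a minimal sketch of classes (1.1) and (1.3) written this way, assuming NumPy is available. The function names `quadratic_model` and `relu_model` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def quadratic_model(z, beta):
    """Class (1.1): f(z; beta) = c0 + c1*z + c2*z^2, with beta = (c0, c1, c2)."""
    c0, c1, c2 = beta
    return c0 + c1 * z + c2 * z ** 2

def relu_model(z, theta):
    """Class (1.3): f(z; theta) = max(0, a^T (W z + b)), with theta = (a, b, W)."""
    a, b, W = theta
    return max(0.0, float(a @ (W @ z + b)))

# Two different members of the quadratic class, from two parameter choices.
print(quadratic_model(2.0, (1.0, 0.0, 0.0)))  # the constant function f(z) = 1
print(quadratic_model(2.0, (0.0, 1.0, 3.0)))  # the function f(z) = z + 3 z^2

# One member of class (1.3) with d = 2; these parameter values are arbitrary.
d = 2
theta = (np.array([1.0, -1.0]), np.zeros(d), np.eye(d))
print(relu_model(np.array([3.0, 1.0]), theta))
```

The code for each function fixes the shared functional form, and each choice of the parameter argument picks out one individual model from the class.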

If you’ve heard the term “model architecture” used to describe a machine learning model, what this means is the specification of the parametrized model class. As the term “architecture” suggests, choosing the model class is a design choice, and different model classes have varying advantages and disadvantages in terms of best possible performance, difficulty to train (more on that shortly), computational cost, and more. We’ll cover many possible choices of model class, and explore their pros and cons, throughout the course.

Now we will turn our attention to the third ingredient, the method of selecting a model from within the chosen class. Because parametrized models are defined by their parameters, the choice of model amounts to choosing the parameter values. Again, you could just make up the parameter values (choose your favorite numbers), but this is unlikely to match the data well and unlikely to yield accurate predictions. In some cases you could also derive the parameter values based on physical principles (this is possible when models in the chosen class have some physical interpretation). But our focus will be on choosing parameters that define a model that “best matches” the available data. A basic and standard version of this parameter selection task is to solve the following minimization problem:

\[ \theta^* = \arg\min_{\theta\in\Theta}\frac 1N \sum_{i=1}^N (f(z_i;\theta)-y_i)^2 \]

Note that for the above expression to be well-defined, we need our first two ingredients – that is, we need our data set, and we need to have already chosen the parametrized model class, which gives us an expression for the parametrized function \(f(z;\theta)\). Once we have those ingredients, this expression defines \(\theta^*\) as the parameter that minimizes the average squared error of the learned model over the available data. In this sense, this parameter \(\theta^*\) defines a learned model \(f(z;\theta^*)\) that “best fits” the data (no other model in its class achieves a lower average squared error on this data set).
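As a concrete illustration of the minimization problem, here is a minimal sketch for the quadratic class (1.1), assuming NumPy is available. The data are hypothetical and generated without noise from a known quadratic so that the answer can be checked. Solving with `np.linalg.lstsq` is my choice for this sketch and works here because the quadratic class is linear in its parameters; other model classes generally need other solution methods.

```python
import numpy as np

def quadratic_model(z, theta):
    """f(z; theta) = c0 + c1*z + c2*z^2, with theta = (c0, c1, c2)."""
    c0, c1, c2 = theta
    return c0 + c1 * z + c2 * z ** 2

def mean_squared_error(theta, z_data, y_data):
    """The objective: (1/N) * sum_i (f(z_i; theta) - y_i)^2."""
    return np.mean((quadratic_model(z_data, theta) - y_data) ** 2)

# Hypothetical noise-free data generated from y = 1 + 2 z + 3 z^2.
z_data = np.linspace(-1.0, 1.0, 11)
y_data = 1.0 + 2.0 * z_data + 3.0 * z_data ** 2

# The model is linear in (c0, c1, c2), so the minimizer solves a linear
# least-squares problem whose matrix has columns [1, z, z^2].
A = np.column_stack([np.ones_like(z_data), z_data, z_data ** 2])
theta_star, *_ = np.linalg.lstsq(A, y_data, rcond=None)

print("theta* =", theta_star)
print("error at theta*:", mean_squared_error(theta_star, z_data, y_data))
print("error at a made-up theta:", mean_squared_error((0.0, 0.0, 1.0), z_data, y_data))
```

Because the data were generated from a member of the model class, the recovered parameters should match the generating coefficients up to floating-point error, and any made-up parameter choice gives a larger error.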

As we progress through this course, we will learn more about how we can solve the above optimization problem (the solution method differs across model architectures), and about alternative ways to select the parameter based on alternative minimization problems.

1.3 A foundation for learning about ML beyond parametrized regression

I want to emphasize that this class is not and indeed cannot (realistically) be an exhaustive introduction to machine learning, artificial intelligence, or data science. Even the community of professionals who describe themselves as working in “scientific machine learning” is quite broad in scope and would include folks working on topics that I do not intend to cover in this class. There are many topics within AI/ML/data science we will not cover (beyond perhaps at most a very surface level), including generative modeling, decision trees, unsupervised learning methods like clustering, and many more. The philosophy of this course is to empower you to learn about these other methods through future coursework or independent study, by providing a foundation in what I term the “three pillars of artificial intelligence”:

Mathematics: all machine learning methods are fundamentally solving mathematical problems at their core. The first pillar of AI is to be able to mathematically define the problem one seeks to solve.
Computation: once a mathematical problem is defined, an ML method must define an algorithm that carries out the calculations to solve the problem. The second pillar of AI is to define an algorithm that (approximately) solves the mathematical problem.
Human intelligence: this is the crucial third pillar of AI. Human intelligence is vital for defining mathematical problem formulations, designing algorithms, and assessing the results and iteratively updating the mathematical problem formulation and computational algorithm to achieve desired outcomes.

We will seek to develop a firm foundation in these three pillars of AI throughout the course.

¹ It seems apt to use AI to define AI, and I’m quoting it for illustrative purposes here, but I do want to point out that Google’s AI Overview is not at all transparent about how these responses are generated and thus does not meet the standards for being cited as a source in most publishing venues.
² This paragraph lists examples that came to me most quickly and thus reflects to some extent my own cognitive biases based on the media I’ve consumed. I welcome suggestions of other examples of AI/ML/data science to include that are less well-known – email me!
³ Following the same notational convention, we could write the data set as \(\{(z_i,y_i) : i =1,\ldots,N\}\), which we would read as the set of all pairs \((z_i,y_i)\) for \(i = 1,\ldots,N\), but the notation \(\{(z_i,y_i)\}_{i=1}^N\) is more compact and more common. I don’t make the rules. 🤷‍♀️
⁴ These are common conventions, but the ML community is made up of a diverse group of scientists and engineers across disciplines, and it’s very common to read things that use entirely different notation. This is unfortunately the reality of things, and one of the reasons I am going to push you in this class to be comfortable describing the underlying concepts precisely using different notations. That is, you are going to develop flexibility in re-assigning meaning to different symbols as we go along (I hope).
⁵ I’m making an effort to explicitly define notation when I introduce it, and to give examples of how to read and understand it, but it’s easy for me to miss things, so please speak up or write us a message if you have a question.