Ditch the command-line arguments from your machine learning code: use a config file

Jonny Jackson
10 min read · Jan 27, 2021


Real examples that you can incorporate into your code base

Sometimes, it’s the simple, unsung heroes that can make all the difference. Machine learning configuration (that is, the network architecture, the dataset root folder, the learning rate, and so on) is probably not something that’s going to get a lot of airtime when you’re rushing to get your next data analytics software to production. But, with all the work it’s doing, it’s only fair to call your config a hero.

Photo by Eun-Kwang Bae on Unsplash

In this article, I’ll show you ready-to-go examples of how moving your hyper-parameter selections, command-line arguments, and default options to a YAML or JSON config file will hugely simplify your life while developing and experimenting with your machine learning pipelines.

We’ll look at an exemplary specimen of a config file and a simple wrapper to set up a helpful syntax for accessing configuration options. Then, we’ll cover a couple of strategies for handling default values. Finally, I’ll show you a real-world example of how a config file can really help make your experiments dynamic with very little overhead: specifying an entire data augmentation pipeline.

Contents:

  1. Why bother with a configuration file?
  2. Loading your config into your code
  3. A simple wrapper for your config dictionary
  4. Setting default values
  5. Structuring your config files
  6. Bonus! Dynamically specify your data augmentation pipeline

I’ll be using Python to demonstrate code snippets, given its gold-standard status for machine learning, but all of the information in this article can be applied to your language (and framework) of choice.

1. Why bother with a configuration file?

Here’s a quick heads-up: if you’ve already got a code base, some of these steps are likely to require some refactoring. It won’t take a crazy amount of time — especially if you’re starting from a relatively blank slate — but it’s good to know what the benefit will be before we start.

Here are a few unpleasant situations I found myself in while working on my PhD (in AI for medical imaging):

  1. Returning to a highly ‘command-line-argumented’ script after a few weeks and having to find or figure out which arguments make the thing work properly.
  2. Realising, after running several experiments, that my output folders are now mixed up, in some cases requiring me to re-run entire experiments just for peace-of-mind that I’m looking at the right thing.
  3. Writing and re-writing a sensible ‘schema’ for my output folder names, usually ending up with something like resnet151_nodataaug_10000epochs_batchsize256 and so on until I start to lose my sanity.
  4. Filling my read-me files or emails to colleagues with hand-written instructions on what parameters to use to achieve replication — and banking on people reading them carefully.
  5. Having a trail of manual flags littering my code while doing micro-experiments or during development, turning on or off code features like the one below:
useFocalLoss = True
loss = focalLoss if useFocalLoss else binaryCE

Of course, all of these things might be solved by planning and writing code more carefully. But if you’re working on a data project, chances are your pipeline is unpredictable (the next step depends on the outcome of the current one) and under time pressure (beautiful, frameable code comes later, once you’ve actually demonstrated the results you need).

Ok, so how can a configuration file help?

Let’s look at points 1–5 and consider how they might be resolved if all of the settings and options are instead listed in a configuration file, such as a JSON or YAML file. Here’s an illustrative YAML example (all names and values here are hypothetical):
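```yaml
# configs/experiment1.yaml
data:
  dataset_root: /path/to/dataset
experiment:
  name: resnet-baseline
  training:
    batch_size: 256
    learning_rate: 0.001
    epochs: 10000
```

With a file like this in place, here’s how each of the five pain points is resolved: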

  1. Returning to code after a few weeks, months, or years is no problem any more — your settings are now an explicit part of your code base (and not just a part of the read-me).
  2. Each time you run an experiment, it’s following your config file. All you need to do is make sure the config file is (automatically) copied into the output folder each time your experiments run and you now have a full, written record of what the experiment was.
  3. Since your config tracks every single tiny detail of the modifications you’ve made, you no longer depend on your folder names. Just date them, or give them a readable reference name that you can tabulate elsewhere.
  4. Publishing your code or sharing it with a colleague? They now have your config files as well. No questions needed.
  5. Here’s one of the most powerful reasons to use a config file. With the process we’ll look at in the next steps of this article, you can almost immediately extract any code switches or test logic, even temporary ones, to your config file. No more hard-coded hacks!
Photo by Nghia Le on Unsplash

2. Loading your config into your code

Ideally, your config will be in the form of a mildly-nested dictionary (we’ll look a bit later at choosing a config structure). That means the first task is getting the data from your YAML, JSON, or other static file into your application. In Python this is extremely simple, using the pyyaml and json packages, respectively.

json is a built-in library, pyyaml can be installed with pip install pyyaml.

Loading the config in is as simple as anything (the file paths here are illustrative):
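```python
import json
import yaml  # third-party: pip install pyyaml

# For a YAML config...
with open('configs/experiment1.yaml') as f:
    config = yaml.safe_load(f)

# ...or, equivalently, for a JSON config
with open('configs/experiment1.json') as f:
    config = json.load(f)
```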

Once the file is loaded, there is no difference between config file types: whether YAML, JSON, or any other, the config now exists inside your application as a plain Python dictionary. For the rest of this article I’ll be using YAML for demo purposes, since I feel it is closer in style to Python than JSON. However, it is completely your choice!

FYI: All code in this article is written and tested in Python 3.9. Apart from pyyaml (and torchvision in the bonus section), we’ll only be using the Python standard library, so you’re unlikely to run into compatibility issues, even in much later versions of Python.

3. A simple wrapper for your config dictionary

In the above snippets, config is now a (probably mildly-nested) Python dictionary containing the settings defined in your .yaml or .json file. So, let’s say I wanted to get the batch size for this experiment. One option would be to use the square bracket operator repeatedly:
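```python
# Throws a KeyError if any of the keys is missing
batch_size = config['experiment']['training']['batch_size']
```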

This is acceptable if you don’t mind the exceptions that might be thrown if any of the keys don’t exist in your config file. A possible alternative might be the dict.get(key[, default]) method of Python dictionaries:
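```python
# Falls back to 128 only if 'batch_size' itself is the missing key
batch_size = config.get('experiment').get('training').get('batch_size', 128)
```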

Clearly the intention in that snippet is to fall back to a default value of 128 rather than throwing a KeyError. However, as you can probably see, this doesn’t deal well with the situation where one of the intermediate keys such as ‘training’ is non-existent: the first .get returns None, and the chained call then raises an AttributeError. You can guard against that by passing an empty dictionary as the fallback at every level.

Even then, it’s a little bit… expansive.

So, here’s an example of a class that you can wrap your raw config dictionary in, which achieves three things:

  1. It controls how config data is accessed within your code
  2. It allows a default option to be easily set if the value is missing from the config file
  3. It exposes a convenient and readable syntax to access nested data: in the above example, we’ll allow the batch size to be retrieved using this command: config.get('experiment/training/batch_size', 128)

I’m going to split this up slightly, to explain what each method of the class is doing.

First, load the config in from the file as part of the constructor (sketched here for the YAML case):
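```python
import yaml

class Config:
    """A light wrapper around a nested configuration dictionary."""

    def __init__(self, config_path):
        # Load the raw nested dictionary straight from the YAML file
        with open(config_path) as f:
            self._config = yaml.safe_load(f)
```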

Second, add a get method that parses the 'experiment/training/batch_size' string by splitting it into a list of keys and walking down the nested sub-dictionaries (this method continues the Config class sketched above):
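```python
    def get(self, path, default=None):
        # 'experiment/training/batch_size' -> ['experiment', 'training', 'batch_size']
        keys = path.split('/')
        value = self._config
        # Walk down the nested sub-dictionaries one key at a time,
        # returning the default as soon as a key is missing
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                return default
            value = value[key]
        return value
```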

Now this can be used throughout your code to access config options:
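```python
config = Config('configs/experiment1.yaml')

# Returns the configured value, or 128 if the key is absent
batch_size = config.get('experiment/training/batch_size', 128)
```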

This code demonstrates one of the ways in which default values can be specified explicitly throughout your code. In the next section I’ll show you an alternative — and for me, preferred — way of setting default values.

First, however, hopefully you can now see how it would be possible to set up a much simpler command-line interface for your application where the only argument is the path of the configuration file:

$ python train.py configs/experiment1.yaml

**Style choice alert**

You’ll need to decide whether you want the functions throughout your application to take the whole config object as an input (sketched below with a hypothetical train_model function):
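```python
# The whole config object travels down through the call stack
def train_model(config):
    batch_size = config.get('experiment/training/batch_size', 128)
    learning_rate = config.get('experiment/training/learning_rate', 1e-3)
    ...
```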

…or to split out the individual parameters when defining each function:
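```python
# The same hypothetical function, now with explicit parameters
def train_model(batch_size=128, learning_rate=1e-3):
    ...

# The caller unpacks the config at the boundary
train_model(
    batch_size=config.get('experiment/training/batch_size', 128),
    learning_rate=config.get('experiment/training/learning_rate', 1e-3),
)
```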

The benefit of the first approach is that you will find it much easier to propagate your config object downwards if you are calling subroutines and sub-subroutines. This is particularly relevant if you are in a hackathon-style phase of development and don’t have a clearly defined structure in your code yet.

On the other hand, the second approach is more explicit about what information it needs and is also more SOLID (in a nutshell: more independent of how the rest of your code is written).

In my experience, I’ve found the first approach to work well since it allows me to perform the “Step 5” mentioned above:

you can almost immediately extract any code switches or test logic, even temporary ones, to your config file.

…by giving me immediate access to any new entries I add to my config file.

Photo by Ed Robertson on Unsplash

4. Setting default values

So, we’ve seen one way to explicitly set default values throughout your code by using the default argument of the config.get method. This approach is good at being explicit in a way that keeps information local to the area in which it is relevant.

However, there are two ways in which this approach leaves something to be desired:

  1. There is no single-source-of-truth to the default values being used. If you need to access the config item in two or more places in your code, there is no way to enforce consistency.
  2. There is no clearly defined list of all the parameters being used throughout your code base, unless your example config file is totally comprehensive or you meticulously document the parameters in your read-me. This means the full power and parameterisation may not be apparent to users of your code.

So, here’s a second way to set default parameters which is my personal go-to when it comes to configuration — the default config file.

What we want is a second YAML or JSON file that contains all of the default values of all of the parameters in the code. This solves the above two issues nicely, by setting the single-source-of-truth and also serving almost as the documentation for the parameters available.
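For illustration, a default.yaml might look something like this, listing every parameter the code knows about (names and values are hypothetical):

```yaml
# configs/default.yaml
data:
  dataset_root: /path/to/dataset
experiment:
  name: default
  training:
    batch_size: 128
    learning_rate: 0.001
    epochs: 1000
```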

The behaviour we want is that:

  1. If the main config file (which I’ll now refer to as custom.yaml) is missing a parameter, the default value should be used
  2. If the parameter is in custom.yaml, that should overwrite the default value.

First, here’s a helper function that will (recursively) merge two nested dictionaries, overwriting the values of dict1 whenever there’s an overlap (a minimal sketch):
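```python
def merge_dictionaries(dict1, dict2):
    """Recursively merge dict2 into dict1, with dict2 winning any overlap."""
    for key, value in dict2.items():
        if isinstance(value, dict) and isinstance(dict1.get(key), dict):
            # Both sides hold a dictionary here: merge one level deeper
            merge_dictionaries(dict1[key], value)
        else:
            # Otherwise dict2's value overwrites (or adds to) dict1's
            dict1[key] = value
    return dict1
```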

To make use of a default config file, now all that’s needed is to update the constructor for the Config class (again a sketch; the default path shown is an illustrative choice):
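```python
class Config:
    def __init__(self, config_path, default_path='configs/default.yaml'):
        # Start from the defaults...
        with open(default_path) as f:
            self._config = yaml.safe_load(f)
        # ...then overlay the custom config, which wins on any overlap
        with open(config_path) as f:
            custom_config = yaml.safe_load(f)
        merge_dictionaries(self._config, custom_config)
```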

Now you can ship your default.yaml with the rest of your code. Then, for each new experiment create a new custom config file that will set the specifics of that experiment:
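```yaml
# custom.yaml: an illustrative experiment config, holding only
# the values that differ from default.yaml
experiment:
  name: bigger-batch-trial
  training:
    batch_size: 256
```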

This structure is actually very powerful — at the end of this article, I’ll show you how you can specify your entire data augmentation pipeline directly in your config file. But before we get to that, let’s quickly talk about structuring your config files.

5. Structuring your config files

This won’t be a long section, as the right structure will generally be pretty clear from your use case. The main decision is how nested to make your config.

A flatter config will clearly be less structured. On the other hand, an overly nested config file may make life difficult when, later down the line, you realise that a particular parameter is needed in two different contexts. For example, the number of output features of your encoder network is likely to be the same as the number of input features of your decoder network, so choose a structure that’s general enough for each context.

6. Bonus!: Dynamically specify your data augmentation (or other pipeline structures)

Finally, to really show you how to unlock the power of your config files, we’ll look at an example of specifying a highly-configurable and often highly-modified section of a machine learning workflow: data augmentation.

Data augmentation is generally treated as a pipeline, with raw, unedited training or testing data going in at the beginning, and augmented (noisified, blurred, rotated, flipped, etc.) data coming out the other end. It has been shown to improve the performance of machine learning algorithms substantially by helping to prevent overfitting.

Many machine learning libraries such as torchvision for PyTorch have their own API for data augmentation that can be “composed” (joined into a sequence), by specifying an ordered list of transformations:
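```python
from torchvision import transforms

# An ordered list of transformations, composed into a single callable
# (the specific transforms chosen here are just an example)
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])
```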

In the above example, this would be used later on in a PyTorch DataLoader to automatically preprocess image data before they reach the machine learning algorithm.

What we’re aiming for is to be able to specify any particular augmentation pipeline in the config file, such as:
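```yaml
# One possible structure: each entry names a transform
# plus the keyword arguments to construct it with
data:
  augmentation:
    - name: RandomHorizontalFlip
      p: 0.5
    - name: RandomRotation
      degrees: 10
    - name: ToTensor
```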

…and to then load that in with a single line:
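```python
# get_augmentation_pipeline is the helper defined in the next snippet
transform = get_augmentation_pipeline(config.get('data/augmentation', []))
```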

Since the augmentation options are now set in the configuration file, it is much easier to reproducibly experiment and find the optimal data augmentation pipeline for the task at hand.

To achieve this, I wrote a definition along the lines of the sketch below, which uses getattr to look each transform class up by name; feel free to adapt it in your own code:
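```python
from torchvision import transforms

def get_augmentation_pipeline(augmentation_config):
    """Build a transforms.Compose from a list of config entries.

    Each entry is expected to look like {'name': 'RandomRotation', 'degrees': 10}:
    'name' is looked up on the torchvision.transforms module, and the remaining
    keys are passed through as keyword arguments to the transform's constructor.
    """
    augmentations = []
    for entry in augmentation_config:
        entry = dict(entry)  # copy, so pop doesn't mutate the loaded config
        transform_class = getattr(transforms, entry.pop('name'))
        augmentations.append(transform_class(**entry))
    return transforms.Compose(augmentations)
```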

This code is based on the final method described in my other article; feel free to check it out for more information about how it works!

Conclusion

In this article we looked at the advantages of using configuration files over command line arguments and hard-coded options.

By using a simple wrapper class for the configuration file’s data, we were able to craft a handy syntax that improves the developer experience of accessing configuration data. This class also allowed default parameters to be set in a second configuration file. Finally, we saw an example of how to dynamically build part of the pipeline from the configuration file, reaping the rewards already gained: reproducibility, transparency, and a single source of truth.

Thanks for reading! When it comes to configuration for machine learning, a bit of legwork can really benefit your development workflow and make life easier when it comes to experimentation.

I’d love to hear your thoughts about this article — please leave your comments below!


Jonny Jackson

PhD student in Artificial Intelligence and Medicine. Teacher of coding and machine learning to children and adults alike. -> jonny.jxn.co