Generating MIDI music with Recurrent Neural Networks


Generating MIDI music with Machine Learning

One of the best-known machine learning techniques is the recurrent neural network, whose roots go back to the network John Hopfield introduced in 1982. Next we will explain its mathematical foundations and develop an example that generates music in MIDI format.

Recurrent neural networks are used to predict the next element in a series: for example a character, a musical note, a temperature reading or the price of a stock. But to predict that element we first need to train our system, so that, given a series, it learns which element is most likely to come next.

The structure of these networks is similar to that of feed-forward networks, with the difference that recurrent networks have a hidden state whose activation at each time step depends on the state of the previous one.

Next we will look at the mathematical foundations of these networks through an example that generates text with Python.

Generate text with Recurrent Neural Networks

The formula of an RNN (Recurrent Neural Network) defines the current hidden state h(t) as a function of the previous hidden state h(t-1) and the current input x(t). Theta (θ) represents the parameters of the function f.

 \Large \boxed{h(t) = f[h(t-1), x(t); \Theta]}
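As a minimal sketch of this recurrence (the scalar weights and the tanh nonlinearity here are illustrative assumptions, not the model's actual f):

```python
import numpy as np

# illustrative scalar parameters (theta); the real model uses matrices
W_h, W_x = 0.5, 1.0

def f(h_prev, x):
    # one step of the recurrence: the new state depends on the old state and the input
    return np.tanh(W_h * h_prev + W_x * x)

h = 0.0  # initial hidden state h(0)
for x in [1.0, 0.0, 1.0]:  # a short input series
    h = f(h, x)
print(h)
```

Each step mixes the current input with the previous hidden state, which is how the network carries information forward through the series.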

The loss is computed over the sequence of inputs x paired with their respective targets y, summing the per-step losses for all time steps up to t.

 \Large \boxed{L(\{x(1),\dots,x(t)\},\{y(1),\dots,y(t)\}) = \sum_{\tau=1}^{t} L(\tau) = -\sum_{\tau=1}^{t} \log y(\tau)}

But to understand it better, let's walk through the code in Python.

Load the training file

First we have to load a text file large enough for our system to learn. In this case we will load the shakespeare.txt file as follows:

data = open('shakespeare.txt', 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)

Where data_size is the size of our training text and vocab_size is the number of unique characters the text contains.

Neural networks operate with vectors, so we need to create a way to encode our characters into vectors and vice versa.

char_to_ix = { ch:i for i,ch in enumerate(chars)}
ix_to_char = { i:ch for i, ch in enumerate(chars)}

These dictionaries allow us to create a vector of the size of vocab_size where all the values will be zero except the position of the character that will have a one. Let’s look at an example with the character ‘a’.

import numpy as np

vector_for_char_a = np.zeros((vocab_size, 1))
vector_for_char_a[char_to_ix['a']] = 1
print vector_for_char_a.ravel()


[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


Define the Neural Network

The neural network consists of three layers: the input layer, the hidden layer, and the output layer. Each layer is connected to the next, so each node in one layer is connected to all the nodes in the following layer, as we can see in the image.

Neural Network

In order to train the neural network we have to define its hyperparameters:

  • seq_length: Length of the input sequence.
  • hidden_size: Hidden layer size.
  • learning_rate: Learning rate.


We also have to define the parameters of the model:

  • Wxh: Parameters that connect the input vector to the hidden layer.
  • Whh: Parameters that connect the hidden layer to itself. This is the key to recurrent neural networks, since it feeds the previous output of the hidden state back into the hidden layer at the next time step.
  • Why: Parameters that connect the hidden layer to the output layer.
  • bh: The hidden bias.
  • by: The output bias.


All this is defined in python as follows:


hidden_size = 100
seq_length = 25
learning_rate = 1e-1

#model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01 
Whh = np.random.randn(hidden_size, hidden_size) * 0.01
Why = np.random.randn(vocab_size, hidden_size) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((vocab_size, 1))

Define the loss function

The loss is fundamental to evaluating the neural network: the lower the loss, the better the predictions of our model.

During the training phase our goal will be to minimize the loss.
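As a sketch of how that loss is computed for a single character: the model's raw scores are turned into probabilities with a softmax, and the loss is the negative log of the probability assigned to the correct character (the scores below are made-up values):

```python
import numpy as np

y = np.array([2.0, 1.0, -1.0])     # raw scores for a 3-character vocabulary
p = np.exp(y) / np.sum(np.exp(y))  # softmax turns scores into probabilities
target = 0                         # index of the correct next character
loss = -np.log(p[target])          # low loss when p[target] is close to 1
print(loss)
```

If the model assigned probability 1.0 to the correct character the loss would be 0; as that probability shrinks, the loss grows without bound.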

The loss function calculates: the next character given a character of the training data, the loss obtained by comparing the predicted character with the character that actually follows in the training data, and also the gradients.
The input data of the function are:

  • List of input characters.
  • List of target characters (the expected outputs, taken from the training data).
  • The previous hidden state.

The output data of the function are:

  • The loss.
  • The gradients for each parameter between layers.
  • The last hidden state.

Forward pass

The forward pass uses the model parameters (Wxh, Whh, Why, bh, by) to calculate the next character of a series obtained from the training data.

  • xs[t] : Vector that encodes the character at position t.
  • ps[t] : Probabilities for the next character.

 \Large \boxed{h_t = \tanh(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h)}

Backward pass

The simplest way to calculate all the gradients would be to recompute the loss under a small variation of each parameter, but that would waste a lot of time.

By backpropagating we can calculate all the gradients for all the parameters at once. The gradients are computed in the direction opposite to the forward pass.
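The two approaches can be compared on a toy example: the numerical gradient recomputes the loss under small variations of a parameter, while the analytic gradient (what backprop gives us) comes from a formula. A minimal sketch:

```python
def loss(w):
    # toy loss depending on a single parameter
    return (w - 3.0) ** 2

w = 1.0
analytic = 2 * (w - 3.0)  # gradient from calculus: d/dw (w-3)^2 = 2(w-3)
eps = 1e-5
# central finite difference: vary the parameter slightly in both directions
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(analytic, numeric)
```

The two values agree to several decimal places; this kind of check is commonly used to verify a backprop implementation, but doing it for every parameter on every step would be far too slow.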

This is the definition of the loss function in python.

def lossFun(inputs, targets, hprev):
  xs, hs, ys, ps = {}, {}, {}, {} # empty dicts for the forward pass
  hs[-1] = np.copy(hprev)
  loss = 0
  for t in xrange(len(inputs)):
    xs[t] = np.zeros((vocab_size,1))                                                                                                                     
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(, xs[t]) +, hs[t-1]) + bh) # hidden state
    ys[t] =, hs[t]) + by # unnormalized scores for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars                                                                                                              
    loss += -np.log(ps[t][targets[t],0])                                                                                            
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(xrange(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y  
    dWhy +=, hs[t].T)
    dby += dy
    dh =, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh +=, xs[t].T)
    dWhh +=, hs[t-1].T)
    dhnext =, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam)                                                                                                                 
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]


Generate samples based on the model

Once we have the model, the function that generates samples from it is the following:

def sample(h, seed_ix, n):
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in xrange(n):
    h = np.tanh(, x) +, h) + bh)
    y =, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix) # keep the sampled character index
  txt = ''.join(ix_to_char[ix] for ix in ixes)
  print '----\n %s \n----' % txt

Train the model and generate text

In this way we will train the model, generating samples every 1000 iterations.

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad                                                                                                                
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0                                                                                                                        
while n <= 1000*100:
  if p+seq_length+1 >= len(data) or n == 0:
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go back to the start of the data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 1000 == 0:
    print 'iter %d, loss: %f' % (n, smooth_loss) # print progress
    sample(hprev, inputs[0], 200)
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by],
                                [dWxh, dWhh, dWhy, dbh, dby],
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)                                                                                                                   
  p += seq_length # move data pointer                                                                                                                                                         
  n += 1 # iteration counter 
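The parameter update inside the loop is Adagrad: each parameter accumulates the squares of its own gradients (the mem variables) and divides its learning rate by the square root of that accumulator, so frequently updated parameters take smaller steps. A scalar sketch with made-up gradients:

```python
import numpy as np

learning_rate = 1e-1
param, mem = 0.0, 0.0
for grad in [1.0, 0.5, 0.25]:  # made-up gradients for one parameter
    mem += grad * grad          # accumulate squared gradients
    # per-parameter adaptive step; 1e-8 avoids division by zero
    param += -learning_rate * grad / np.sqrt(mem + 1e-8)
print(param)
```

The first step moves the parameter by roughly the full learning rate; later steps shrink as mem grows, which is what makes the update stable without hand-tuning a schedule.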

Save the model

Once we have our model trained (which may take hours), if we want to save it to use it at another time we can do it as follows:

outfile = open('model.dat', 'wb') # binary mode: np.savez writes binary data
np.savez(outfile, chars=chars, hprev=hprev, Wxh=Wxh, Whh=Whh, bh=bh, by=by, Why=Why, vocab_size=np.array([vocab_size]))

Load the trained model

When we have the trained model we can load it into another program as follows.

model_file = open('model.dat', 'rb') # binary mode
model = np.load(model_file)

chars= model['chars']
#hprev= model['hprev']
hprev = np.zeros((100,1))

char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

Once loaded, we add the sample function and we can now make use of our recurrent neural network.

Generate MIDI Music with Recurrent Neural Networks

Once we have seen the functions and parameters necessary to create a Recurrent Neural Network we can create a model that generates music in MIDI format.
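The same character-level model can be applied to music once MIDI events are serialized to text. As a hypothetical encoding (not necessarily the one used by the project), each MIDI note number can be mapped to a printable character and back:

```python
# map MIDI note numbers to characters and back (illustrative encoding)
OFFSET = 33  # '!' — shifts small note numbers into printable characters

def notes_to_text(notes):
    return ''.join(chr(n + OFFSET) for n in notes)

def text_to_notes(text):
    return [ord(c) - OFFSET for c in text]

melody = [60, 62, 64, 65, 67]   # C D E F G as MIDI note numbers
encoded = notes_to_text(melody)
print(encoded)                   # a short string the character RNN can train on
assert text_to_notes(encoded) == melody  # the encoding is reversible
```

With songs encoded as strings like this, the training and sampling code from the text example applies unchanged; the sampled string is then decoded back into note numbers and written out as a MIDI file.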

If you want to download the project and try it, the code is available on GitHub.

The project consists of the following files:

  • More than 600 songs in text format converted from MIDI format.
  • Generator of the recurrent neural network model based on
  • Program that will load the model generated by and generate a MIDI file.
  • rnn_midi_25_100_50000.dat : Model saved with 50,000 iterations.
  • rnn_midi_25_100_200000.dat : Model saved with 200,000 iterations.
  • rnn_midi_25_100_250000.dat : Model saved with 250,000 iterations.


The requirements to use this project are:

  • Python 2.7 or higher.
  • Numpy: mathematical library for Python.
  • Mido: MIDI file library for Python.

Model Training

Although the project contains several models already trained with different numbers of iterations, we can also train it ourselves.


This generates the trained model in a file with .dat extension.

Generate MIDI Files with the trained model


This generates a file with .mid extension.

Play MIDI files

To play MIDI files we have several options; one of the simplest is timidity, which in addition to playing MIDI files can convert them to WAV.

timidity song1.mid

Generate the MIDI file, convert it to WAV and play it, all in a single line:

python; timidity --output-24bit --output-mono -A120 song1.mid -Ow -o song1.wav; aplay song1.wav

Convert MIDI files to MP3

ffmpeg -i song1.wav -acodec libmp3lame song1.mp3

Tests performed

These are several of the tests generated by our Artificial Intelligence.

Trumpet song created by Artificial Intelligence:

Harp song created by Artificial Intelligence:

Violin song created by Artificial Intelligence:
