In this MP, you'll learn to implement the short-time Fourier transform (STFT), its inverse (ISTFT), and the Griffin-Lim vocoder, and to use them for time stretching and frequency stretching of audio.
In order to make sure everything works, you might want to go to the command line and run
pip install -r requirements.txt
This will install the modules that are used on the autograder, including numpy, h5py, and the gradescope utilities.
pip install -r requirements.txt
Below are the functions used for visualization purposes only.
# These functions use librosa for display purposes only;
# do not import librosa or use it in submitted.py, as that will break the autograder.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

def display_spectrogram(spec, sr):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 4), sharex=True)
    dB = librosa.amplitude_to_db(np.abs(spec.T), ref=np.max)
    librosa.display.specshow(dB, y_axis='linear', x_axis='time', sr=sr, ax=ax)
    ax.set(title='spectrogram')
    ax.label_outer()

def display_waveform(waveform):
    # relies on the global samplerate loaded in the cell below
    fig, ax = plt.subplots(1, 1, figsize=(14, 2))
    timestamp = np.arange(len(waveform)) / samplerate
    ax.plot(timestamp, waveform)
Let's load in an audio file for testing and visualize it in the time domain.
import IPython.display
import matplotlib.pyplot as plt
import importlib, submitted
import soundfile as sf
data, samplerate = sf.read('visible.wav')
print(data.shape)
IPython.display.Audio(data, rate=samplerate)
(320026,)
display_waveform(data)
In this part, you will complete the function stft in submitted.py. Note that you are only allowed to use the numpy library to complete the function. You might find np.fft.fft, np.fft.rfft, and np.hanning useful. You are welcome to read the stft implementations in other packages, such as librosa or scipy, but you should not copy them. To avoid dimension errors when we grade your work, we explicitly ask you to make the output dimension of the stft matrix ((sig_len - window_size) // hop_size + 1, window_size // 2 + 1).

The high-level idea of implementing an STFT is to multiply each window of the signal by a Hanning window, and then compute the FFT of each windowed frame. Each window is window_size samples long, and consecutive windows are hop_size samples apart. Note that we require you to output only the non-negative frequency terms of the stft matrix, so if you are using np.fft.fft, you should discard half of the values.
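Here is a minimal sketch of this frame-window-FFT recipe, assuming np.fft.rfft (which returns only the non-negative frequency bins directly). The name stft_sketch is hypothetical; your submitted.stft must follow the spec above.

import numpy as np

def stft_sketch(signal, window_size=2048, hop_size=512):
    window = np.hanning(window_size)
    num_windows = (len(signal) - window_size) // hop_size + 1
    spec = np.empty((num_windows, window_size // 2 + 1), dtype=complex)
    for n in range(num_windows):
        # slice out one frame, apply the Hanning window, take the one-sided FFT
        frame = signal[n * hop_size : n * hop_size + window_size]
        spec[n] = np.fft.rfft(frame * window)
    return spec

On the test file above, this yields shape ((320026 - 2048) // 512 + 1, 2048 // 2 + 1) = (622, 1025), matching the output printed below.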
importlib.reload(submitted)
help(submitted.stft)
Help on function stft in module submitted: stft(signal, window_size=2048, hop_size=512)
spec = submitted.stft(data)
print('The shape of the spectrogram is', spec.shape)
display_spectrogram(spec, samplerate)
The shape of the spectrogram is (622, 1025)
In this part, you will complete the function istft in submitted.py. Note that you are only allowed to use the numpy library to complete the function. You might find np.fft.ifft and np.fft.irfft useful. You are welcome to read the istft implementations in other packages, such as librosa or scipy, but you should not copy them. To avoid dimension errors when we grade your work, we explicitly ask you to make the output dimension of the reconstructed signal (num_windows - 1) * hop_size + window_size.

The high-level idea of implementing an ISTFT is to compute the IFFT of each window in the frequency domain, and then add each reconstructed time-domain frame into its corresponding window position in the time domain (overlap-add).
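A minimal sketch of this overlap-add recipe follows; the name istft_sketch is hypothetical, and whether the overlapping analysis windows need to be compensated for is an assumption to check against the MP's grading convention.

import numpy as np

def istft_sketch(stft_matrix, hop_size=512):
    num_windows, num_bins = stft_matrix.shape
    window_size = (num_bins - 1) * 2  # inverts num_bins = window_size // 2 + 1
    signal = np.zeros((num_windows - 1) * hop_size + window_size)
    for n in range(num_windows):
        # irfft turns the one-sided spectrum back into a real time-domain frame
        frame = np.fft.irfft(stft_matrix[n], n=window_size)
        # overlap-add the frame into its window position
        signal[n * hop_size : n * hop_size + window_size] += frame
    return signal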
importlib.reload(submitted)
help(submitted.istft)
Help on function istft in module submitted: istft(stft_matrix, hop_size=512)
istft_sig = submitted.istft(spec, 512)
print('The shape of the reconstructed signal is', istft_sig.shape)
The shape of the reconstructed signal is (320000,)
display_waveform(istft_sig)
IPython.display.Audio(istft_sig, rate=samplerate)
When you have the complete complex spectrogram, you can use istft to convert the signal back to the time domain. But what if the phase information is missing and you only have the magnitude spectrogram?

In this part you are going to implement the Griffin-Lim vocoder, a well-known algorithm for reconstructing a waveform when only the magnitude spectrogram is available. The high-level steps of Griffin-Lim are (a sketch follows the list):

1. Initialize the phase estimate (randomly, if no initial phase is given).
2. Combine the known magnitude with the current phase estimate to form a complex spectrogram.
3. Apply istft to get a time-domain signal, then apply stft to that signal.
4. Keep only the phase of the resulting spectrogram as the new phase estimate; repeat steps 2-4 for n_iter iterations, then return the final istft reconstruction.
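A minimal sketch of this loop, reusing the hypothetical stft_sketch and istft_sketch helpers from above (the random-phase initialization and fixed seed are assumptions, not a required convention):

import numpy as np

def griffin_lim_sketch(magnitude_spectrogram, phase=None, n_iter=100,
                       window_size=2048, hop_size=512):
    if phase is None:
        # start from a random phase estimate in [0, 2*pi)
        rng = np.random.default_rng(0)
        phase = 2 * np.pi * rng.random(magnitude_spectrogram.shape)
    for _ in range(n_iter):
        # impose the known magnitude on the current phase estimate,
        # then go to the time domain and back, keeping only the new phase
        spec = magnitude_spectrogram * np.exp(1j * phase)
        signal = istft_sketch(spec, hop_size)
        phase = np.angle(stft_sketch(signal, window_size, hop_size))
    return istft_sketch(magnitude_spectrogram * np.exp(1j * phase), hop_size)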
importlib.reload(submitted)
help(submitted.griffin_lim)
Help on function griffin_lim in module submitted: griffin_lim(magnitude_spectrogram, phase=None, n_iter=100, window_size=2048, hop_size=512)
Griffin-Lim is an iterative algorithm that refines the phase information gradually. Try listening to the reconstructed audio at different iteration counts. Initially the reconstructed audio will have a lot of artifacts, since the phase information is completely random, but the artifacts are alleviated as the number of iterations increases.
reconstructed_iter1 = submitted.griffin_lim(np.abs(spec), n_iter=1)
print('Reconstructed signal has a shape of', reconstructed_iter1.shape)
IPython.display.Audio(reconstructed_iter1, rate=samplerate)
(622, 1025) (622, 1025) (622, 1025) Reconstructed signal has a shape of (320000,)
reconstructed_iter10 = submitted.griffin_lim(np.abs(spec), n_iter=10)
print('Reconstructed signal has a shape of', reconstructed_iter10.shape)
IPython.display.Audio(reconstructed_iter10, rate=samplerate)
(622, 1025) (622, 1025) (622, 1025) Reconstructed signal has a shape of (320000,)
reconstructed_iter100 = submitted.griffin_lim(np.abs(spec), n_iter=100)
print('Reconstructed signal has a shape of', reconstructed_iter100.shape)
IPython.display.Audio(reconstructed_iter100, rate=samplerate)
(622, 1025) (622, 1025) (622, 1025) Reconstructed signal has a shape of (320000,)
You can change the speed of the audio signal by time-stretching or time-shrinking your spectrogram before applying Griffin-Lim. In this part, you should complete the function time_stretch_griffin_lim in the file submitted.py.

You just need to interpolate existing frames to form a new spectrogram before feeding it to the Griffin-Lim algorithm. The new spectrogram should have shape (int(stretch_factor * num_frames), num_freq_bins). You might find the function np.interp helpful for interpolating frames. Specifically, the nth frame in the stretched spectrogram maps to frame n / stretch_factor of the original spectrogram; n / stretch_factor might not be an integer, but np.interp takes care of the interpolation at non-integer indices for you. (For example, if the stretch factor is 2, the 3rd frame of the stretched spectrogram maps to the 1.5th frame of the original spectrogram.)
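A minimal sketch of just this interpolation step (the name time_stretch_sketch is hypothetical; the full time_stretch_griffin_lim would feed the stretched magnitude into Griffin-Lim afterwards):

import numpy as np

def time_stretch_sketch(magnitude_spectrogram, stretch_factor):
    num_frames, num_bins = magnitude_spectrogram.shape
    new_num_frames = int(stretch_factor * num_frames)
    # frame n of the stretched spectrogram samples frame n / stretch_factor of the original
    src = np.arange(new_num_frames) / stretch_factor
    stretched = np.empty((new_num_frames, num_bins))
    for k in range(num_bins):
        # interpolate each frequency bin's trajectory at the non-integer frame indices
        stretched[:, k] = np.interp(src, np.arange(num_frames), magnitude_spectrogram[:, k])
    return stretched

With stretch_factor=2 on the (622, 1025) spectrogram above, this gives int(2 * 622) = 1244 frames, matching the shapes printed below.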
importlib.reload(submitted)
help(submitted.time_stretch_griffin_lim)
Help on function time_stretch_griffin_lim in module submitted: time_stretch_griffin_lim(magnitude_spectrogram, stretch_factor, phase=None, n_iter=100, window_size=2048, hop_size=512)
reconstructed_stretch = submitted.time_stretch_griffin_lim(np.abs(spec), stretch_factor=2, n_iter=100)
print('Reconstructed signal has a shape of', reconstructed_stretch.shape)
IPython.display.Audio(reconstructed_stretch, rate=samplerate)
(1244, 1025) (1244, 1025) (1244, 1025) Reconstructed signal has a shape of (638464,)
display_waveform(reconstructed_stretch)
reconstructed_shrink = submitted.time_stretch_griffin_lim(np.abs(spec), stretch_factor=0.5, n_iter=100)
print('Reconstructed signal has a shape of', reconstructed_shrink.shape)
IPython.display.Audio(reconstructed_shrink, rate=samplerate)
(311, 1025) (311, 1025) (311, 1025) Reconstructed signal has a shape of (160768,)
display_waveform(reconstructed_shrink)
In the previous part, you stretched the time dimension of the spectrogram to control the speed of the audio. What happens if you stretch the frequency dimension instead? In this part, you should complete the function freq_stretch_griffin_lim in the file submitted.py.

Stretching the frequency axis of the spectrogram is also known as vocal tract length normalization, a method to account for the variation in vocal tract size across speakers. Just by stretching the frequency axis, basic voice conversion can be done by manipulating the fundamental frequency of the speech signal.

Similar to the previous part, you just need to interpolate along the frequency axis instead of the time axis to form a new spectrogram before feeding it to the Griffin-Lim algorithm. Note that you should not change the dimensions of the spectrogram (the new spectrogram should have the same shape as the original); crop or pad the stretched spectrogram to get the appropriate size. You might find the function np.interp helpful for interpolating frequency bins.
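A minimal sketch of the frequency-axis interpolation; zero-padding out-of-range bins via np.interp's right argument is an assumption, so check the MP's crop/pad convention.

import numpy as np

def freq_stretch_sketch(magnitude_spectrogram, stretch_factor):
    num_frames, num_bins = magnitude_spectrogram.shape
    # bin k of the stretched spectrum samples bin k / stretch_factor of the original,
    # so stretch_factor > 1 shifts spectral content upward in frequency
    src = np.arange(num_bins) / stretch_factor
    stretched = np.empty_like(magnitude_spectrogram)
    for n in range(num_frames):
        # right=0.0 zero-pads bins whose source index falls past the last original bin;
        # the output keeps num_bins entries, so the shape is unchanged
        stretched[n] = np.interp(src, np.arange(num_bins), magnitude_spectrogram[n], right=0.0)
    return stretched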
importlib.reload(submitted)
help(submitted.freq_stretch_griffin_lim)
Help on function freq_stretch_griffin_lim in module submitted: freq_stretch_griffin_lim(magnitude_spectrogram, stretch_factor, phase=None, n_iter=100, window_size=2048, hop_size=512)
vtln_wave = submitted.freq_stretch_griffin_lim(np.abs(spec), stretch_factor=1.5, n_iter=100, hop_size=512)
IPython.display.Audio(vtln_wave, rate=samplerate)
(622, 1025) (622, 1025) (622, 1025)
D = submitted.stft(vtln_wave)
display_spectrogram(D, 32000)
vtln_wave = submitted.freq_stretch_griffin_lim(np.abs(spec), stretch_factor=0.75, n_iter=100, hop_size=512)
IPython.display.Audio(vtln_wave, rate=samplerate)
(622, 1025) (622, 1025) (622, 1025)
D = submitted.stft(vtln_wave)
display_spectrogram(D, 32000)
As you can see, the solutions file contains a lot of objects. These are named more or less as in the Jupyter Notebook above, so you can look at each object in detail to figure out why your code differs from the solution.