Have you ever wondered about the name of the bird you just heard singing? There are more than 10,000 bird species in the world, found in nearly every environment, from untouched rainforests to suburbs and cities. Wilder Sensing is an organisation with the bold ambition to enable the regeneration of environments at scale, creating thriving ecosystems, enhanced biodiversity and sustainable communities for nature and people. They have not only considered the issue but also set out to build a workable solution: detecting bird species from the sounds they make.
The Wilder Sensing project is designed to integrate nature and technology, helping make informed, sustainable choices in land use and ecological conservation. Wilder Sensing brought in three Data Science MSc students from the University of Salford to work on this Birdsong Recognition and Analysis using Artificial Intelligence project as part of its Nurture Programme. I was fortunate to be one of those students, and I was determined to make the most of the opportunity.
First, I would like to give a special acknowledgement to Wilder Sensing for giving me the opportunity to apply my skills and gain hands-on experience with their firm. I also want to express my gratitude to Geoff Carss, founder of Ethos Wilder, for his guidance and instruction throughout the study. The project followed a typical data science structure: data research and analysis, data preparation, model creation, and analysis of results with model improvement, followed by a final presentation. It was intended as a collaboration on a real-world problem that machine learning can help to solve. After working on a solution for several weeks, I was able to predict the correct bird species on the test samples with an accuracy of 89%.
Here’s how I did it, step by step.
The analysis and classification of bird songs is a genuinely intriguing problem to work on. Birds have a variety of vocalisations, and each type serves a different purpose. Machine learning algorithms have previously been used for bird sound classification and identification, as well as sound emotion recognition. To meet the objective of this work, developing a machine learning model to distinguish bird species by their sounds, data was collected from Xeno-canto, a popular collaborative database of bird sounds, and analysed for use. As of its 2015 edition, the Xeno-canto database held about 1,000 species with a huge range of different songs and calls, and a growing number of contributors and recordings.
While the bird species on Xeno-canto span several continents (Africa, the Americas, Asia, Australia and Europe), this research focused on European birds. Within Europe, the project narrowed further to five species with distinctive vocal properties. For each species, 1,000 recordings were collected from the Xeno-canto database, giving a total of 5,000 extracted audio files. Most recordings lasted a few minutes, with individual file sizes ranging from 5 to 20 megabytes (MB). The collected data, 5,000 files across 5 species classes, was then analysed to understand its properties.
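As a rough sketch of how such a collection step can be scripted: the snippet below only builds a query URL for Xeno-canto's public v2 API. The endpoint, the species name and the `area:` search tag are my assumptions for illustration, not details from the original project.

```python
from urllib.parse import quote

# Assumed public endpoint of the Xeno-canto v2 API (illustrative).
BASE = "https://xeno-canto.org/api/2/recordings?query="

def build_query(species, area="europe"):
    """Return the API URL listing recordings of one species in an area."""
    return BASE + quote(f"{species} area:{area}")

url = build_query("common chaffinch")  # hypothetical example species
# The JSON response could then be fetched (e.g. with requests.get(url))
# and each recording's file URL downloaded in turn.
```

Looping such a query over the five chosen species, and paging through the results, would yield the 1,000 recordings per class described above.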
For feature analysis, the audio recordings (in .mp3 format) were loaded with librosa, a Python package for audio and music signal processing. In librosa, an audio signal is represented as a one-dimensional NumPy array, denoted y, accompanied by its sampling rate (sr). To convert the recorded signals into their frequency components, I used the Fast Fourier Transform (FFT), which takes discrete-time signals as input, unlike the Fourier Transform (FT), which takes continuous-time signals. The FFT was implemented using the SciPy library and returns a list of complex-valued amplitudes for the frequencies found in the signal.
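The loading-and-FFT step can be sketched as follows. To keep the example self-contained, a synthetic 440 Hz tone stands in for a recording that would normally come from `librosa.load`:

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# In the real pipeline the signal would come from librosa:
#   y, sr = librosa.load("recording.mp3")
# Here a synthetic 440 Hz sine tone keeps the example self-contained.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)

spectrum = np.abs(rfft(y))          # magnitudes of the complex amplitudes
freqs = rfftfreq(len(y), d=1 / sr)  # frequency of each bin in Hz

peak_hz = freqs[np.argmax(spectrum)]
print(peak_hz)  # → 440.0, the dominant frequency of the tone
```

On a real birdsong clip the spectrum would of course show a band of frequencies rather than a single spike, which is exactly what the Mel spectrograms later capture over time.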
In this work, data cleaning was the most taxing and frustrating part of the entire project. The crux of the problem was the lack of quality recordings. Based on the analysis of the recordings, I made the following decisions:
Chunking each recording into 5-second segments. Every audio file was split into 5-second chunks, and the first chunk of each file was discarded, because analysis of randomly selected samples showed that the amplitude is usually zero during the first few seconds of a recording. Each remaining 5-second chunk was then converted to a Mel spectrogram using the librosa Python library, exported as a .png image, and stored under its species class. The exported Mel spectrograms were then split into train, validation and test portions.
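The chunking decision above can be sketched like this; a synthetic NumPy signal stands in for a loaded recording, and the Mel spectrogram conversion (which needs librosa) is indicated in comments:

```python
import numpy as np

sr = 22050
chunk_len = 5 * sr  # 5-second chunks at 22,050 samples per second

# Stand-in for a loaded recording of about 32 seconds;
# in the real pipeline: y, sr = librosa.load(path)
y = np.random.default_rng(0).standard_normal(32 * sr)

# Split into full 5-second chunks, then drop the first (often near-silent).
n_chunks = len(y) // chunk_len
chunks = [y[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
chunks = chunks[1:]

# Each chunk would then be rendered as a Mel spectrogram, e.g.:
#   S = librosa.feature.melspectrogram(y=chunk, sr=sr)
#   S_db = librosa.power_to_db(S, ref=np.max)
# and saved as a .png into its species-class folder.
print(len(chunks))  # → 5 (32 s gives 6 full chunks, minus the first)
```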
To use the data as input to a machine learning algorithm, it had to be converted into a format the algorithm can interpret: machines understand numbers. The images were converted to arrays using the NumPy Python library. This, however, did not produce a uniform array shape across images, so I fixed the width and height at 224 × 224 pixels. After pre-processing, every image array had a fixed shape of 224 × 224 × 3. These arrays were fed into the neural network to learn the values that represented each bird species.
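A minimal sketch of forcing that uniform shape: here a simple nearest-neighbour resize written in plain NumPy stands in for whichever library resize was actually used in the project (e.g. PIL's `Image.resize` would do the same job):

```python
import numpy as np

TARGET = 224  # fixed width/height in pixels

def to_uniform_array(img, size=TARGET):
    """Nearest-neighbour resize of an H x W x 3 image array to size x size x 3."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows][:, cols]

# A dummy spectrogram image with a non-uniform shape:
img = np.zeros((300, 500, 3), dtype=np.uint8)
print(to_uniform_array(img).shape)  # → (224, 224, 3)
```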
With data cleaning out of the way came the simpler part: designing a Convolutional Neural Network (CNN). Most winning solutions to similar audio classification challenges have used CNNs, so I decided to follow suit. The dataset was split into 80% train, 10% validation and 10% test. This work adopted the EfficientNet convolutional model, which can scale a CNN in a balanced way by considering depth, width and resolution together. For this project, I implemented EfficientNet B4, one of the variants of the original EfficientNet family. To give an overview of EfficientNet B4, which was generated using compound scaling and an AutoML approach, I grouped the model architecture into three modules.
The first module (Module 1), which was the starting point for the sub-blocks, consisted of the mobile inverted bottleneck (MBConv) with depthwise Conv2D, batch normalisation and an activation function. Module 2 was a combination of two Module 1 networks separated by zero padding, while Module 3 served as a skip connection for all the sub-blocks; it consisted of global average pooling, rescaling factors and two Conv2D layers.
The model architecture had an input shape of (224 × 224 × 3), representing the width, height and number of channels of each input image. The pre-trained model was trained on ImageNet, a dataset of over 14 million images across 1,000 classes. I created data batches for the train, validation and test data, initialised the model with pre-trained EfficientNet B4 weights, and fine-tuned it on my own dataset via transfer learning. The model took batches of Mel spectrograms as its input.
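A transfer-learning setup of this kind can be sketched in Keras as below. The head layers, dropout rate and optimizer are illustrative choices of mine, not the project's exact configuration; the sketch builds the backbone without downloading weights, and `weights="imagenet"` would be passed for the real fine-tuning run:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # the five European species in this project

def build_model(num_classes=NUM_CLASSES, weights=None):
    """EfficientNetB4 backbone with a fresh softmax classification head.
    Pass weights="imagenet" to start from the pre-trained ImageNet weights."""
    base = tf.keras.applications.EfficientNetB4(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False  # freeze the backbone; only the head is trained
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# Fine-tuning would then be:
#   model.fit(train_batches, validation_data=val_batches, epochs=...)
```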
This experiment can be viewed as pretty successful given the paucity of data. Of course, there is still a lot of room for improvement. First, more time could be devoted to sourcing recordings; the model's performance would improve significantly with more data. Second, we could try data augmentation strategies. Last but not least, we might also consider smaller, more subtle upgrades, such as tweaks to the model-fitting configuration.
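On the augmentation point, one common strategy for spectrogram classifiers is SpecAugment-style masking, which hides random frequency bands and time spans so the model cannot rely on any single region of the spectrogram. The sketch below is illustrative only and was not part of the original project:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(spec, max_mask=20):
    """SpecAugment-style masking on a Mel spectrogram array (freq x time).
    Parameters are illustrative, not tuned for this dataset."""
    out = spec.copy()
    f0 = rng.integers(0, out.shape[0] - max_mask)  # start of frequency mask
    t0 = rng.integers(0, out.shape[1] - max_mask)  # start of time mask
    out[f0:f0 + rng.integers(1, max_mask), :] = 0  # zero a band of Mel bins
    out[:, t0:t0 + rng.integers(1, max_mask)] = 0  # zero a span of frames
    return out

spec = rng.standard_normal((128, 216))  # e.g. 128 Mel bands x 216 frames
aug = augment(spec)
print(aug.shape)  # same shape as the input, with masked regions zeroed
```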
Wishing you all a great project, and I can’t wait to see where Wilder Sensing ends up!