Sound (Signal Processing)

42 papers

2,933 followers

About this topic

Sound signal processing is the analysis, manipulation, and synthesis of sound signals using algorithms and mathematical techniques. It encompasses various methods for enhancing, transforming, and interpreting audio data, facilitating applications in fields such as telecommunications, music production, and acoustic engineering.

Key research themes

1. How do physical modeling and wave equation solutions enhance spatial sound synthesis and instrument acoustics?

This research area explores the application of physical acoustics modeling, particularly solutions to the wave equation and physical vibration models, to understand and synthesize the spatial characteristics of sound produced by musical instruments. It is significant because accurately modeling sound radiation and propagation enables high-fidelity audio synthesis and spatial sound reproduction, which enhances realism in musical applications and psychoacoustic sound field synthesis.

Spatial Sound of Musical Instruments

by Tim Ziemer

2019, Psychoacoustic Music Sound Field Synthesis

Key finding: This work provides a comprehensive theoretical foundation by deriving the homogeneous wave equation, the Helmholtz equation, and plane wave solutions, and it introduces the complex point source model as a simplification of sound...
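For orientation, the equations named in this key finding can be written in their standard textbook form. This is a sketch, not the book's own notation: the symbols p (sound pressure), P (its spatial amplitude), c (speed of sound), ω (angular frequency), k (wavenumber), and A (amplitude) are assumptions chosen here.

```latex
\begin{align}
  % Homogeneous wave equation for the sound pressure p(x, t)
  \nabla^2 p(\mathbf{x}, t) - \frac{1}{c^2}\,\frac{\partial^2 p(\mathbf{x}, t)}{\partial t^2} &= 0 \\
  % Assuming harmonic time dependence p = P(x) e^{i \omega t} gives the Helmholtz equation
  \nabla^2 P(\mathbf{x}) + k^2 P(\mathbf{x}) &= 0, \qquad k = \frac{\omega}{c} \\
  % A plane wave satisfies both equations when the wave vector has length k
  p(\mathbf{x}, t) &= A\, e^{i(\omega t - \mathbf{k}\cdot\mathbf{x})}, \qquad \lVert \mathbf{k} \rVert = k
\end{align}
```

Substituting the plane wave into the first equation gives (-k^2 + ω^2/c^2) p = 0, which holds exactly when k = ω/c.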

Digital sound synthesis by physical modelling

by Rudolf Rabenstein

2024, ISPA 2001. Proceedings of the 2nd International Symposium on Image and Signal Processing and Analysis. In conjunction with 23rd International Conference on Information Technology Interfaces (IEEE Cat. No.01EX480)

Key finding: This article surveys digital sound synthesis approaches, emphasizing physical modeling methods that simulate vibrating structures via partial differential equations. It discusses discrete-time implementations suitable for...

Integrated Modal and Granular Synthesis of Haptic Tapping and Scratching Sounds

by Stephen Barrass

2021

Key finding: The study integrates modal synthesis (based on modal frequencies of objects) with granular synthesis to produce continuous, interactive tapping and scratching sounds that vary dynamically with user input such as force and...
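As a rough illustration of the modal part of this idea, a tap can be approximated as a sum of exponentially decaying sinusoids, one per mode. The sketch below is not the paper's implementation: the function name `modal_tap`, the mode values, and the `force` scaling are hypothetical, and the granular layer is omitted.

```python
import math

def modal_tap(modes, duration=0.5, sample_rate=8000, force=1.0):
    """Sum exponentially decaying sinusoids, one per mode.

    modes: list of (frequency_hz, decay_per_second, gain) tuples (illustrative values).
    force: hypothetical excitation strength that scales the whole sound.
    """
    n_samples = int(duration * sample_rate)
    out = []
    for i in range(n_samples):
        t = i / sample_rate
        sample = sum(
            gain * math.exp(-decay * t) * math.sin(2.0 * math.pi * freq * t)
            for freq, decay, gain in modes
        )
        out.append(force * sample)
    return out

# Made-up modal data for a small struck object (assumption, not measured values).
EXAMPLE_MODES = [(440.0, 8.0, 1.0), (1130.0, 12.0, 0.5), (2210.0, 20.0, 0.25)]
tap = modal_tap(EXAMPLE_MODES, duration=0.25, force=0.8)
```

A harder tap would be modeled here simply by a larger `force`; an interactive system would more plausibly also change the relative mode gains.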

2. What are effective machine learning and signal processing approaches for automatic sound classification and speech emotion recognition using spectral decomposition?

This theme focuses on the application of machine learning techniques, especially deep neural networks, combined with advanced signal processing strategies such as variational mode decomposition and acoustic feature extraction, to automatically classify natural and environmental sounds and to perform speech emotion recognition. The importance lies in developing robust automated systems that extract meaningful features from complex and nonstationary audio signals, enabling applications in human-computer interaction, surveillance, and multimedia retrieval.

Sound Classification Using Python

by Siuli Das

2022, ITM Web of Conferences

Key finding: This work evaluates environmental sound classification through Mel Frequency Cepstral Coefficients (MFCCs) and neural network classifiers. It employs spectrogram-derived feature extraction mimicking human auditory frequency...
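The "human auditory frequency" aspect of MFCCs comes from the mel scale, which spaces filter bands densely at low frequencies and sparsely at high ones. The sketch below uses one common mel formula (2595 log10(1 + f/700)); whether the paper uses this exact variant is an assumption, and the helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (one widely used convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of n_filters bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]

# Ten illustrative bands between 0 Hz and 8 kHz.
centers = mel_center_frequencies(n_filters=10, f_min=0.0, f_max=8000.0)
```

A full MFCC pipeline would then apply triangular filters at these centers to a power spectrum, take logarithms, and apply a discrete cosine transform.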

An Extended Variational Mode Decomposition Algorithm Developed Speech Emotion Recognition Performance

by David HASON RUDD and

2023, The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)

Key finding: The study proposes VGG-optiVMD, an enhanced variational mode decomposition technique that autonomously optimizes the number of modes and balancing parameters to extract informative signal components relevant for speech...

Multichannel Singing Voice Separation by Deep Neural Network Informed DOA Constrained CNMF

by Konstantinos Drossos

2020, IEEE 22nd International Workshop on Multimedia Signal Processing

Key finding: This paper combines deep-learning monophonic spectral separation with multichannel complex NMF source separation informed by direction-of-arrival (DOA) constraints. The Masker-Denoiser Twin Network (MaD TwinNet) estimates...

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

by Konstantinos Drossos

2020, International Conference on Machine Learning (ICML), Workshop on Self-supervised learning in Audio and Speech

Key finding: The paper presents a contrastive loss-based framework aligning latent representations of audio spectrograms and associated tags via co-aligned autoencoders. This multimodal embedding approach captures both semantic and...

3. How can sonification techniques and auditory display theories be advanced through nonlinear sound propagation models and interdisciplinary aesthetics?

This research area investigates the enhancement of sonification methods (data-to-sound transformations) by incorporating nonlinear acoustics models, such as solutions based on Burgers equation, and by examining the aesthetic, musical, and interdisciplinary aspects of auditory display. The focus includes improving inverse problems in sonification, expanding the theoretical foundations, and addressing how sound as a medium conveys information across scientific and artistic domains, which is crucial for applications in medical imaging, scientific data analysis, education, and electroacoustic music.

On the sonification technique

by Veturia Chiroiu

2024, Journal of Engineering Sciences and Innovation

Key finding: Introducing a novel sonification operator grounded in the nonlinear Burgers equation rather than traditional linear sound propagation, this paper demonstrates improved inverse sonification capable of enhancing medical images...
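For reference, the viscous Burgers equation in its standard one-dimensional form is given below. The symbols u (wave amplitude), x, t, and ν (diffusivity) are assumptions made here; the paper's actual sonification operator is not reproduced in this listing and may differ.

```latex
\begin{equation}
  % Viscous Burgers equation: nonlinear steepening (u u_x) balanced by diffusion (nu u_xx)
  \frac{\partial u}{\partial t} + u\,\frac{\partial u}{\partial x}
    = \nu\,\frac{\partial^2 u}{\partial x^2}
\end{equation}
```

The product term u ∂u/∂x is what makes the model nonlinear: larger amplitudes travel faster, so waveforms steepen as they propagate, which a linear propagation model cannot represent.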

Sonification (Editorial, Organised Sound)

by Margaret Schedel

2024, Organised Sound

Key finding: This editorial synthesizes diverse perspectives on sonification, emphasizing the continuum between faithful auditory display and artistic composition. It foregrounds the necessity of aesthetic decision-making in transforming...

Revisiting the Canon of Sound Theory (Michel Chion, Audio-Vision: Sound on Screen)

by Sara Pinheiro

2023, Iluminace

Key finding: The paper revisits Michel Chion's seminal work theorizing the complex interplay between sound and image in film, introducing foundational terminology such as the 'audiovisual contract' and 'added value' of sound. By framing...

All papers in Sound (Signal Processing)

Blind audio source counting and separation of anechoic mixtures using the multichannel complex NMF framework

by Yaser Norouzi

2026, Signal Processing

In this paper, we address the tasks of audio source counting and separation for a stereo anechoic mixture of audio signals. This will be achieved in two stages. In the first stage, a novel approach is introduced for estimating the number...

Long-Range Combustion Engine UAV Detection and Tracking Using IoT Acoustic Sensors and Multi-Hop VHF/UHF Mesh Networks

by Traian Nicula

2026, Self-published on LinkedIn

This project explores the feasibility of detecting and tracking long-range combustion-engine UAVs using a distributed network of acoustic IoT sensors and multi-hop wireless communications. The idea is inspired by the increasing use of...

Increasing Access to Sound-based Music – www.eEMS

by Leigh Landy

2026

This paper introduces the ongoing ElectroAcoustic Resource Site Pedagogical Project, or EARS II, in some detail. EARS II is to become an online educational resource for two groups of users: children of ca. 11-14 years of age as well as...

Machine learning for music genre: multifaceted review and experimentation with audioset

by Jaime Salido Ramírez

2026, Journal of Intelligent Information Systems

Music genre classification is one of the sub-disciplines of music information retrieval (MIR) with growing popularity among researchers, mainly due to the already open challenges. Although research has been prolific in terms of number of...

Differentiable Consistency Constraints for Improved Deep Speech Enhancement

by alva rif

2026, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network...

Web-Based Recommendation Strategy in a Cadastre Information System

by Dariusz Król

2025, Lecture Notes in Computer Science

Web-based recommendation strategy implemented in a cadastre information system is presented in the paper. This method forms the list of page profiles recommended to a given user. The idea of page recommendation uses the concept of a page...

Objective Assessment of Covid-19 Severity Affecting the Vocal and Respiratory System Using a Wearable, Autonomous Sound Collar

by Hamza Aziz

2025, Cellular and Molecular Bioengineering

Introduction: Since the outbreak began in January 2020, Covid-19 has affected more than 161 million people worldwide and resulted in about 3.3 million deaths. Despite efforts to detect human infection with the virus as early as possible, the confirmatory test still requires the analysis of sputum or blood with estimated results available within approximately 30 minutes; this may potentially be followed by clinical referral if the patient shows signs of aggravated pneumonia. This work aims to implement a soft collar as a sound device dedicated to the objective evaluation of the pathophysiological state resulting from dysphonia of laryngeal origin or respiratory failure of inflammatory origin, in particular caused by Covid-19. Methods: In this study, we exploit the vibrations of waves generated by the vocal and respiratory system of 30 people. A biocompatible acoustic sensor embedded in a soft collar around the neck collects these waves. The collar is also equipped with thermal sensors and a cross-data analysis module in both the temporal and frequency domains (STFT). The optimal coupling conditions and the electrical and dimensional characteristics of the sensors were defined based on a mathematical approach using a matrix formalism. Results: The characteristics of the signals in the time domain combined with the quantities obtained from the STFT offer multidimensional information and a decision support tool for determining a pathophysiological state representative of the symptoms explored. The device, tested on 30 people, was able to differentiate patients with mild symptoms from those who had developed acute signs of respiratory failure on a severity scale of 1 to 10. Conclusion: With the health constraints imposed by the effects of Covid-19, the heavy organization to be implemented resulting from the flow of diagnostics, tests and clinical management, it was urgent to develop innovative and safe biomedical technologies. This passive listening technique will contribute to the non-invasive assessment and dynamic observation of lesions. Moreover, it merits further examination to provide support for medical operators to improve clinical management.
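The frequency-domain analysis mentioned in the abstract relies on the short-time Fourier transform (STFT). The following is a minimal sketch of that transform, assuming a Hann window and a naive DFT; the function name, frame length, hop size, and the synthetic test tone are illustrative choices, not details from the paper.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len=64, hop=32):
    """Return one magnitude spectrum (bins 0..frame_len/2) per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of the windowed frame at bin k (an FFT would be used in practice).
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Synthetic 125 Hz tone sampled at 1 kHz (made-up data standing in for a sensor signal).
sample_rate = 1000
tone = [math.sin(2.0 * math.pi * 125.0 * n / sample_rate) for n in range(256)]
spec = stft_magnitudes(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` describes the frequency content of one short time slice, which is what lets time-domain and frequency-domain features be examined side by side.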

Comparison and Analysis of Deep Audio Embeddings for Music Emotion Recognition

by Shlomo Dubnov

2025, National Conference on Artificial Intelligence

Emotion is a complicated notion present in music that is hard to capture even with fine-tuned feature engineering. In this paper, we investigate the utility of state-of-the-art pre-trained deep audio embedding methods to be used in the...

Naturalistic Music Decoding from EEG Data via Latent Diffusion Models

by Natalia Polouliakh

2025, arXiv (Cornell University)

In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with...

Transfer learning for music classification and regression tasks

by Mark Sandler

2024, arXiv (Cornell University)

Acoustic Pseudospectrum Based Fault Localization in Motorcycles

by Veerappa Pagi

2024

Vehicles generate dissimilar sound patterns under different health conditions. The sound generated by the vehicles gives a clue of some of the faults. Automotive experts diagnose the faults in vehicles based on the produced sound. This...

Multi-stage Acoustic Fault Diagnosis of Motorcycles using Wavelet Packet Energy Distribution and ANN

by Veerappa Pagi

2024

Motorcycles generate different sound patterns under dissimilar working conditions. The generated sound pattern gives a clue of the fault. Mainly the parts of the engine that lead to change in sound are cylinder kit, crank, timing chain,...

Fault Detection of Motorcycles Using the Slopes of the Estimated Pseudospectrum of the Produced Sounds

by Veerappa Pagi

2024, International Journal on Computational Science & Applications

In this paper we propose two generic mechanisms implemented in a cadastre internet information system. The first one is the list of last queries submitted by a given user and the second one is the list of page profiles recommended to a...

Objective Assessment of Covid-19 Severity Affecting the Vocal and Respiratory System Using a Wearable, Autonomous Sound Collar

by Georges Nassar

2024, Cellular and Molecular Bioengineering

MuSLCAT: Multi-Scale Multi-Level Convolutional Attention Transformer for Discriminative Music Modeling on Raw Waveforms

by Kai Middlebrook

2024, ArXiv

In this work, we aim to improve the expressive capacity of waveform-based discriminative music networks by modeling both sequential (temporal) and hierarchical information in an efficient end-to-end architecture. We present MuSLCAT, or...

Table 2: Performance of MuSLCAT compared to state-of-the-art waveform-based models. (* denotes that the model used an ensemble of nine pretrained CNNs.)

Figure 1: Music tagging performance (PR-AUC) on the MTG-Jamendo large dataset as a function of model complexity for state-of-the-art models (grey) and our proposed MuSLCAT/MuSLCAN architecture (orange). All models learn directly from raw waveform input. MuSLCAT improves performance relative to the top waveform-based network (SampleCNN + SE) yet requires fewer parameters (a reduction of 34.2%), while MuSLCAN yields competitive results with approximately 86.8% fewer parameters than SampleCNN + SE.

Table 1: Music auto-tagging and genre recognition datasets used in our experiments. Footnotes: (1) Numbers listed in parentheses show the total number of tags used in experiments; in all cases, the top-n most frequent tags were selected. (2) The original MuMu set contains ~147k clips/tracks, but we were only able to retrieve audio previews for ~66k clips/tracks. (3) Denotes the Amazon 4-level genre taxonomy. (4) 30-second audio previews for some tracks can be downloaded from streaming services.

Figure 2: Center: The MuSLCAN/MuSLCAT architecture. Left: The small filter and stride in the first layer encourage highCAN to model high frequencies. Right: The relatively large filter and stride in the first layer encourage lowCAN to model low to mid frequencies. The high and low CANs use SE with max-pooling to recalibrate (via channel-wise statistics) and downsample feature maps, while AAC is used to recalibrate features and model long-term interactions by jointly attending to both temporal and channel subspaces. The multi-level output embeddings from both CANs are channel-wise concatenated, creating multi-scale and multi-level feature maps, which are then recalibrated by either AAC or BERT.

Table 3: Performance of MuSLCAT’s different architectural components on MTAT.

Table 4: Performance of MuSLCAT compared to state-of-the-art spectrogram-based models. Footnotes: (1) Denotes that the model used an ensemble of 3 pretrained CNNs. We include Spec-SampleCNNs (x3) because it incorporates multi-scale and multi-level features and was shown to perform well on a large-scale dataset [30]. (**) Denotes Short-Chunk CNN with residual connections (Res) [48].

Hit song classification with audio descriptors and lyrics

by Ishank Sharma

2024

Hit Song Science aims to predict a song's popularity based on song structure and external features. To help provide an efficient and accurate tool for Annual Top-100 Billboard Song Classification, we apply fine-tuned BERT transformer and a...

FIGURE 2.1: Illustration of a Deep Neural Network. Adapted from [1].

FIGURE 2.5: Joint learning of lyrics embedding and audio descriptor features.

FIGURE 3.2: Total label samples, where Label 1 denotes hit songs and Label 0 denotes non-hit songs.

FIGURE 4.2: Hit Song classification model with audio descriptors input.

FIGURE 4.3: Hit Song classification model with BERT-based lyrics embedding input.

Sequential initialization of multichannel nonnegative matrix factorization for sound source separation

by Yuuki Tachioka

2024

This paper proposes an effective sequential initialization for multichannel nonnegative matrix factorization to address the difficulty of initial value dependency of the conventional method. The proposed method sets initial values of...

Fig. 1. Example of a matrix decomposed using MNMF (gray denotes complex values).

The activation matrix V consists of the activations of each basis. The spatial correlation matrix H indicates the spatial information of the sound sources, and the latent variable matrix Z associates the spatial information of the sound sources with each basis. Similar to X, the matrix H is a hierarchical Hermitian positive semi-definite matrix whose elements are M × M complex matrices.

Design of All-digital Phase-locked Loop

by Rustam Khalirbaginov

2024, Problemy razrabotki perspektivnyh mikro- i nanoèlektronnyh sistem

A design methodology of an all-digital phase-locked loop based on standard library cells is presented. The design route includes development of a scalable architecture to enable migration to various technology libraries. The design...

Марк Райс: Functionality in the Music of Timbres

by Марк Райс

2023

In the twentieth century, alongside the traditional musical textures (monodic, polyphonic, and homophonic-harmonic), three new ones emerged: sonoristics, electroacoustic music, and multimedia. Music written in the first of these,...

Anomalous Sound Detection using unsupervised and semi-supervised autoencoders and gammatone audio representation

by Pedro Zuccarello

2023, arXiv (Cornell University)

Anomalous sound detection (ASD) is, nowadays, one of the topical subjects in machine listening discipline. Unsupervised detection is attracting a lot of interest due to its immediate applicability in many fields. For example, related to...

CNN-based Segmentation and Classification of Sound Streams under realistic conditions

by Eleni Tsalera

2023

Audio datasets support the training and validation of Machine Learning algorithms in audio classification problems. Such datasets include different, arbitrarily chosen audio classes. We initially investigate a unifying approach, based on...

Autoregressive Moving Average Jointly-Diagonalizable Spatial Covariance Analysis for Joint Source Separation and Dereverberation

by Aditya Nugraha

2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing

This article describes a computationally-efficient statistical approach to joint (semi-)blind source separation and dereverberation for multichannel noisy reverberant mixture signals. A standard approach to source separation is to...

Authors: Kouhei Sekiguchi, Member, IEEE, Yoshiaki Bando, Member, IEEE, Aditya Arie Nugraha, Member, IEEE, Mathieu Fontaine, Member, IEEE, Kazuyoshi Yoshii, Member, IEEE, and Tatsuya Kawahara, Fellow, IEEE

Fig. 1. The probabilistic generative model of a multichannel reverberant mixture spectrogram based on an arbitrary source model, a jointly-diagonalizable full-rank spatial model, and autoregressive (AR) and moving average (MA) reverberation models for joint source separation and dereverberation.

Fig. 2. The evolutions of average SDRs. The dotted lines indicate the rank-constrained versions.

Fig. 3. The average SDRs of AR-FastMNMF with ISS2.

Fig. 4. The average SDRs of ARMA-FastMNMF with IP.

Fig. 5. The average SDRs of ARMA-FastMNMF with ISS1.

Fig. 6. The average SDRs of ARMA-FastMNMF with ISS2.

Table: Comparison of joint source separation and dereverberation methods (semi-blind methods are indicated by “*”).

Table III: The elapsed time [ms] per iteration of ARMA-FastMNMF for processing a 9.2-second signal on GPU.

Table: The p values obtained by the dependent one-sided t-tests for the SDRs of 40 samples obtained by IP, ISS1, and ISS2 with the best configurations.

Table: The separation and dereverberation performances of the conventional and proposed methods for M = 3.

Table: The separation and dereverberation performances of the conventional and proposed methods for M = 8.

Excerpt (related work): Fixing the parameters of the speech model, the latent variables and a spatial model are adaptively estimated in an unsupervised manner at run-time. This approach was originally proposed for single-channel speech enhancement [15], and then used for multichannel speech enhancement and separation based on a full-rank spatial model [16]-[18], a rank-1 spatial model (called MVAE) [17], [19], and a JD spatial model [12]. [...] assumed to follow a degenerate multivariate complex Gaussian distribution whose covariance matrix is given by the product of a TF-varying PSD and a frequency-dependent rank-1 SCM. ICA [42] is the most basic method that estimates a demixing matrix in each frequency bin such that the separated sources are made independent. To avoid the permutation problem, independent vector analysis (IVA) [4], [5] jointly considers all frequency bins. Assuming the low-rankness of source PSDs, ILRMA [6] introduces an NMF-based source model. These BSS methods, however, are applicable to only a determined condition because as many source images with rank-1 SCMs as microphones should be added up to yield a mixture with a full-rank SCM. In an overdetermined condition, OverIVA [43], [44] and OverILRMA [36] internally recover a determined condition by padding additional sources of no interest.

Excerpt (model definitions): The accompanying definitions introduce index sets for the MA and AR models, the positive delay of the late reverberation [24], the steering vector of source n with delay l at frequency f, and the AR coefficient matrices.

Excerpt (maximum likelihood estimation): 1) Updating A: Using the current estimates of Q, G, and B, we update A. Since A is involved only in the first and second terms of (33), the maximization of (33) with respect to A is equivalent to the minimization of an Itakura-Saito (IS) divergence. a) Frequency-Invariant Source Model: In the same way as the NMF-based source model (explained later), we use a convergence-guaranteed minorization-maximization (MM) algorithm for deriving multiplicative update (MU) rules. Since the latent variables Z are hard to optimize such that (33) is maximized, we use a stochastic gradient descent method based on backpropagation as in [12]. 2) Updating G: We also use an MM algorithm for deriving MU rules of G. 3) Updating Q and B: Instead of alternately updating Q and B as proposed in [23], we jointly update Q and B with IP or ISS for significantly better time-space complexity. IP and ISS were originally used for jointly updating demixing and dereverberation matrices (corresponding to Q and B) in AR-ILRMA based on the rank-1 spatial model [28], [29].

Excerpt (progressive update): For ARMA-FastMNMF, we can use a modified version of a progressive update technique proposed for FastMNMF [13]. First, the NMF parameters W and H are initialized randomly, and the spatial parameters Q and G and the reverberation parameters B are initialized.

Excerpt (ISS1 versus ISS2): AR- and ARMA-FastMNMF with ISS1 are more likely to get stuck at bad local optima. This was because ISS1 updates only a small number of elements with (48) for each m (> M), whereas IP and ISS2 update larger blocks of elements at once. We also found that there were strong correlations between the SDRs of ISS1 and ISS2, resulting in the small p values.

Excerpt (experiments): The mixture was obtained by superimposing a reverberant speech signal randomly taken from the development subset, another one from the evaluation subset, and a real diffuse noise signal (mainly caused by air conditioners) from the development or evaluation subset. The SNR of the clean speech mixture was set to 0 dB. 2) Experimental Results: We first validate the effectiveness of the combination of the AR and MA reverberation models. Tables IV and V show the SDRs, PESQs, FWSegSNRs, and CDs averaged over all conditions. In most cases, ARMA-FastMNMF outperformed AR-FastMNMF in terms of all measures. However, ARMA-FastMNMF attained only a marginal gain (0.4 dB when K = 4) over AR-FastMNMF for M = 8, in which N = 8 sources were estimated. These extra sources were exploited by AR-FastMNMF to represent the early reflection and the residual late reverberation that was not represented with the AR model. In ARMA-FastMNMF, these reflection and reverberation components were represented by the MA model. Therefore, the dereverberation performances of ARMA-FastMNMF and AR-FastMNMF were not so different. Nonetheless, ARMA-FastMNMF is still considered to be advantageous in estimating the actual number of speech sources from the separated signals under a noisy reverberant condition thanks to the little leakage of speech components to noise components. Note that ARMA-FastMNMF has a clear performance advantage under a noise-free condition (Section V-C).
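The excerpts above mention minimizing an Itakura-Saito (IS) divergence. As a point of reference, the sketch below computes the standard scalar IS divergence, d(x | y) = x/y - log(x/y) - 1, summed over elements. The function name is hypothetical, and the specific quantities the paper compares are garbled in this listing, so they are not reproduced.

```python
import math

def itakura_saito(x, y):
    """Sum of element-wise IS divergences between two positive sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for xi, yi in zip(x, y):
        if xi <= 0 or yi <= 0:
            raise ValueError("IS divergence requires strictly positive values")
        ratio = xi / yi
        total += ratio - math.log(ratio) - 1.0
    return total

# Made-up power values for illustration.
observed = [1.0, 4.0, 0.25]
model = [2.0, 2.0, 0.5]
d_small = itakura_saito(observed, model)
d_scaled = itakura_saito([100.0 * v for v in observed], [100.0 * v for v in model])
```

The divergence depends only on the ratio x/y, so `d_small` and `d_scaled` agree; this scale invariance is a common reason for preferring it when comparing power spectra with a large dynamic range.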

Semi-Supervised Multichannel Speech Enhancement With a Deep Speech Prior

by Aditya Nugraha

2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing

This paper describes a semi-supervised multichannel speech enhancement method that uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent...

An effective analysis of deep learning based approaches for audio based feature extraction and its visualization

by Rohit Biswas

2023, Multimedia Tools and Applications

Visualizations help decipher latent patterns in music and garner a deep understanding of a song's characteristics. This paper offers a critical analysis of the effectiveness of various state-of-the-art Deep Neural Networks in visualizing...

Fig. 7 Features across all tracks obtained from NET-AE-SHARED scaled across all the components using Standard Normalization

Fig. 4 The raw encodings across all tracks obtained from NET-AE-SHARED

[...] as well as scaling the features across all the encoding components. Min-Max Normalization can be easily employed to scale data to a given range and it was thus our first choice. Figure 6 shows that scaling the raw encodings also highlights the differences between each encoding component which are otherwise very subtle in the raw encoding data. One drawback of Min-Max Normalization is that it is extremely susceptible to outliers and the standard deviation of the scaled encodings is still very low on account of being suppressed. We can improve the standard deviation of the encodings by using Standard Normalization. The aforementioned technique is typically used to standardize data to a mean of 0 and a standard deviation of 1. But we have instead used a mean of 0.5 and a standard deviation of 0.4 so that the encodings lie in the desired range from 0 to 1 and the standard deviation is scaled up to a reasonable value. Any values scaled beyond the desired range have been clipped. An example is illustrated in Fig. 7. Both the aforementioned methodologies have been used in different parts of our study. In Section 5.3, we have illustrated several such examples.
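The two scaling schemes described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function names and the sample values are made up, while the target mean of 0.5, target standard deviation of 0.4, and clipping to the range 0 to 1 come from the passage.

```python
import statistics

def minmax_scale(values, lo=0.0, hi=1.0):
    """Linearly map values so the minimum becomes lo and the maximum becomes hi."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - v_min) / span for v in values]

def standard_scale_clipped(values, target_mean=0.5, target_std=0.4, lo=0.0, hi=1.0):
    """Standardize to the target mean and std, then clip anything outside [lo, hi]."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [target_mean for _ in values]
    scaled = [target_mean + target_std * (v - mean) / std for v in values]
    return [min(hi, max(lo, s)) for s in scaled]

# Made-up encodings with one outlier.
encodings = [0.10, 0.11, 0.12, 0.13, 0.95]
mm = minmax_scale(encodings)
st = standard_scale_clipped(encodings)
```

With the outlier present, min-max scaling squeezes the four ordinary values close to 0, while the standardized version clips the outlier at 1 instead of letting it set the scale for everything else.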

Fig. 22 Shows the visualizations of tracks from 2 different genres, Electronic - ‘ELT’ by ‘Borful Tang’ (Left) and Rock - ‘Prototype’ by ‘The Modern Airline’ (Right)

Rohit Biswas has received a bachelor’s degree in Computer Science from Birla Institute of Technology and Science, Pilani. He has worked at Opera Solutions for two years as Full-Stack Developer. His research interests lie in Machine Learning, Computer Vision, Embedded Systems and Big Data Technologies.

Fig. 8 Compares the features across all tracks obtained from the 3 autoencoder architectures after scaling them across all the components using Minmax Normalization. NET-AE (Left) shows varying means across the components but the standard deviation of the components is still very low. NET-AE-SHARED (Middle) shows a lot of variation in the standard deviation of the components. NET-AE-SKIP (Right) shows the least variation in both mean and standard deviation across the components.

The most straightforward way to map the raw audio features to the visual parameters is to perform a deterministic one-to-one mapping between them. The mapping is chosen arbitrarily but does not change throughout the analysis. However, the features need to be scaled before mapping them. We have experimented with scaling by features and across features, both yielding interesting results. Note that direct mapping requires the feature space to be identical

Fig. 15 Compares the means of the PCA transformed encodings of two musical genres, Rock (Left) and Electronic (Right), using features obtained from Layer 10 of NET-VGG13, scaled across the components using Standard Normalization

Nischay Ghattamaraju has received a bachelor’s degree in Computer Science from Birla Institute of Technology and Science, Pilani. He has worked at Gwynniebee for three years as Data Engineer for Predictive Machine Learning Systems. His research interests lie in Machine Learning, Artificial Intelligence and Distributed Computing.

Fig. 13 Shows the visualization of the same segment of the track ‘Max Awe’ by ‘Pot-C’ on VIZ-BAR using Random Mapping

One of the limitations highlighted in the previous experiment with Deterministic Mapping was related to the lack of control over the visual parameters. Typically, in a visualization, certain visual elements need to be very dynamic, for example the motion of a bouncing ball, while others, such as the color or size of the ball, can be subtler. In this experiment, we incorporated the underlying variance of the data while performing the mapping. We employed Principal Component Analysis (PCA) to transform the encodings of all the tracks in the test set to their principal components. This resulted in a new feature space in which the encoding components were ordered by variance. Figure 14 shows the mean and variance of the transformed encodings. The visual parameters were also ordered by relevance, such that the visual elements which were required to be more dynamic were mapped to the initial principal components, and then a one-to-one mapping was performed deterministically between them.
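The ordering-and-mapping idea can be sketched as below. This sketch deliberately skips the PCA step and only ranks untransformed components by their variance; with real PCA output the components would already arrive in that order. The visual parameter names and the sample encodings are hypothetical.

```python
from statistics import pvariance


def rank_components_by_variance(encodings):
    """Return component indices sorted from most to least variable.

    encodings: list of rows (one per frame), each a list of component values.
    """
    columns = list(zip(*encodings))
    variances = [pvariance(col) for col in columns]
    return sorted(range(len(columns)), key=lambda i: variances[i], reverse=True)


def map_to_visual_parameters(encodings, visual_params):
    """Pair visual parameters (most dynamic first) with components (most variable first)."""
    order = rank_components_by_variance(encodings)
    return dict(zip(visual_params, order))


# Hypothetical visual parameters, ordered by how dynamic they should be.
visual_params = ["ball_motion", "ball_size", "ball_color"]
# Hypothetical encodings: component 1 varies most, component 0 least.
enc = [[0.5, 0.1, 0.9], [0.5, 0.9, 0.8], [0.6, 0.2, 0.9], [0.5, 0.8, 0.8]]
print(map_to_visual_parameters(enc, visual_params))
```

The mapping stays deterministic and one-to-one; only the pairing order is driven by variance.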

Fig. 6 Features across all tracks obtained from NET-AE scaled across all the components using Minimax Normalization

... features will result in similar looking visualizations which are not very dynamic. As compared to the raw encodings, illustrated in Fig. 4, the means are well distributed between 0 and 1. Moreover, the standard deviations have also reasonably scaled up. These features will result in dynamic visualizations. Note that scaling across the components has resulted in suppression of the mean of component 1 and the standard deviations of components 2 and 5.

Since the inputs to the feature extractors were normalized, and considering the symmetry of the deep learning architectures, it is reasonable to expect that the output features are of the same scale. The feature statistics further support our assumption, as all the encoding components have means and standard deviations of approximately the same scale. Considering this, we could scale the feature values across all the encoding components as illustrated in Figs. 6 and 7. This has the additional effect of preserving the relative means and deviations between the encoding components. Figure 8 compares the scaled encodings obtained from all the autoencoders.

A downside of scaling across all the components is that if the distribution of a component differs from the average distribution across the components, then it can get suppressed. Such a component can potentially exhibit characteristic features which we might lose by adopting this method of scaling. To address this problem, we can scale each feature independently to the desired range as shown in Fig. 9. This method ignores the inter-feature means and deviations. However, the deviations within each feature are highlighted and each component is expressed well.

This section discussed scaling techniques applied to the raw encodings, but in the subsequent sections we will see that the same techniques can be used on transformed encodings.
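The difference between scaling across all components and scaling each component independently can be illustrated with a small sketch. It uses min-max scaling to the range 0 to 1 for both variants; the function names and the toy data are hypothetical and are not taken from the paper.

```python
def _minmax(values, v_min, v_max):
    """Min-max scale values to [0, 1] given the bounds to use."""
    span = v_max - v_min
    return [0.5 if span == 0 else (v - v_min) / span for v in values]


def scale_across_components(encodings):
    """Use one global min/max, preserving relative levels between components."""
    flat = [v for row in encodings for v in row]
    g_min, g_max = min(flat), max(flat)
    return [_minmax(row, g_min, g_max) for row in encodings]


def scale_per_component(encodings):
    """Use a separate min/max for each component (column)."""
    columns = list(zip(*encodings))
    scaled_cols = [_minmax(col, min(col), max(col)) for col in columns]
    return [list(row) for row in zip(*scaled_cols)]


# Hypothetical encodings: component 0 lives on a much smaller scale than component 1.
enc = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
print(scale_across_components(enc))  # component 0 is squeezed near 0 (suppressed)
print(scale_per_component(enc))      # both components span the full range
```

The first variant keeps the relationship between components but can suppress one whose distribution differs from the rest; the second expresses every component fully at the cost of the inter-component relationship.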

Fig. 5 The raw encodings across all tracks obtained from the final layer of NET-ALEX

Fig. 18 Shows the means of the components of the encodings transformed with K-Means across two genres, Electronic (Left) and Rock (Right), obtained from NET-AE and scaled by features using Standard Normalization

Moreover, the K-Means approach works well with the different combinations of scaling methodologies previously discussed. Figure 20 exemplifies the observation that the visualizations produced with this method are especially contrasting when using the features obtained from the genre classifiers. It should, however, be noted that information is stored in the encoding data in a non-linear way, and hence a non-linear clustering strategy might yield better results than K-Means Clustering; we intend to revisit this idea in the future. The variation in the means of a few of the components is apparent. An autoencoder will typically

Fig. 2. Shows the architecture of Alexnet (NET-ALEX) adapted for our study

Fig. 3 The raw encodings across all tracks obtained from NET-AE

Fig. 9 The encodings across all tracks obtained from NET-AE scaled independently for each component using Standard Normalization

Fig. 10 Shows the visualizations of two different segments of the track ‘Homemade Rap’ by ‘Pot-C’ on VIZ-REAL

The statistics are extremely similar for the two musical genres, suggesting that the autoencoder architecture did not capture genre-specific features and that the resulting visualizations of tracks from these genres will be rather similar. We observed a few drawbacks of mapping the autoencoder features using Deterministic Mapping. It was illustrated in Fig. 7 that when scaling across features, the mean and standard deviation of some components are suppressed. Correspondingly, the mapped visual parameters are not expressed well. This can be handled by scaling each feature independently, though at the cost of variations between the encoding components. Moreover, Fig. 12 also shows that the statistics of entire musical genres can be

Fig. 11 Shows the visualizations of two consecutive segments of the track ‘L’s’ by ‘Fleslit’

Fig. 12 Shows the features across two different musical genres, Electronic (Left) and Folk (Right), obtained from NET-AE-SKIP and scaled by features using Standard Normalization

... very similar. Even our track level analysis revealed that the means and deviations are not too different for different tracks. Though the intrinsic variation of the features yields dynamic visualizations, these visualizations tend to be similar across the tracks irrespective of the scaling methodology used. In Section 5.3.4, we will explore methods for creating different visual experiences for different tracks. Finally, the approach of deterministic mapping does not allow us to control the extent to which a visual parameter is expressed. We could sort the components by their mean or variance and selectively map them to visual parameters, but in Section 5.3.3 we will look at a more sophisticated technique for prioritizing the visual parameters.

Fig. 14 Shows the mean and standard deviation of the encodings across all the tracks after being PCA transformed

An additional advantage of PCA is that it enables us to bring down the dimensionality of the feature space. This is extremely useful in the context of the features extracted from the genre classifiers, which are very large, averaging around 4000 features. After the PCA transformation, we have reduced the number of components to 10, which is equal to the number of visual parameters. This results in the loss of up to 60% of the variance of the data, but our experiments show that the retained components can easily capture the nuances between genre-specific features. Figure 15 shows that some genre level differences are captured by the PCA transformations on the features extracted from the genre classifiers. However, these differences are subtle and can be attributed to the feature extractor rather than the mapping methodology, as such differences are absent in the case of the features extracted from the autoencoders. Moreover, the visualizations produced by the features obtained from different classifiers also yielded slightly different visual experiences, as shown in Fig. 16. The features have been obtained from NET-AE-SHARED and scaled across all the components using Standard Normalization after the PCA transformation. The mean of the of

Fig. 16 Compares the PCA transformed features across the genre Folk, obtained from three different classifier layers, Layer 6 of NET-ALEX (Left), Layer 13 of NET-VGG16 (Middle) and Layer 15 of NET-VGG16 (Right)

Fig. 17 Shows the visualizations of two segments of the same track ‘It Was Me’ by ‘Derek Clegg’ on VIZ-REAL created using the PCA transformed features obtained from NET-AE-SHARED and scaled across the components using Standard Normalization

Fig. 19 Shows the means of the components of the encodings transformed with K-Means across the two genres, Electronic (Left) and Rock (Right), obtained from Layer 15 of NET-VGG16 and scaled by features using Standard Normalization

Fig. 20 Shows the visualizations of tracks from 2 different genres, Electronic - ‘ELT’ by ‘Borful Tang’ (Left) and Rock - ‘Prototype’ by ‘The Modern Airline’ (Right)

The PCA transformations allowed us to control the expression of visual parameters, but the differences between the visualizations of different tracks were not very prominent. The K-Means transformations allowed us to produce different visual experiences for different tracks, but they do not come with the aforementioned control over the expression of visual parameters.

Fig. 21 Shows the statistics of the encoding components transformed with K-Means followed by PCA across two genres, Electronic (Left) and Rock (Right), obtained from Layer 15 of NET-VGG16 and scaled across the encoding components using Standard Normalization


Écoute Réduite – a wrong turn in the history of electroacoustic music?

by Leigh Landy

2023

In many ways, all non-representational arts have distanced themselves to a greater or lesser extent from their potential public over the centuries due to the fact that art and life have been largely separated. For example, those who have supported the notion of art for art's sake for over two hundred years have been rather explicit about this separation. Nevertheless, most human beings still enjoy and find it natural to make links between the artistic and lived experience. The inclusion of the sound as potential musical material has not only led to new and radical forms of sound-based music making, but also to the opportunity for life to become part of music. This talk focuses on the impact, perhaps unintended, Pierre Schaeffer had when he coined the term écoute réduite and considered it to be of importance in terms of the success of what is known today as acousmatic music. An opposing view is presented, namely that of the use of real-life sounds across the innovative sound-based musical spectrum, primarily those genres employing electroacoustic or related new media approaches. It will be suggested that sampling is one case where musical experimentation may actually lead towards increased appreciation and artistic participation in new forms of music making. Regardless of this suggestion, the talk's aim as evidenced in its conclusion is one of synthesis, not opposition.

Preface

One of the idées fixes throughout my career has been my fairly lonely attack against today's reality of many contemporary artists working their way into a corner due to a lack of connection with a public larger than that of their peers. It is almost as if artists are actively working towards various art musics' own demise. Things need not be so gloomy, however. This talk will take advantage of my view that through sampling some forms of musical experimentation may find greater access than has been achieved by a good deal of contemporary 'academic music' around the globe.
The reason for this is sample-based music's ability to connect with human experience, that is, in the sense of using 'recycled' samples from the real world. The discussion will be structured as follows. It will commence with some thoughts regarding the relationship between art and life, both historically and in terms of today's art making to set the context. Then I shall investigate some perceived tensions between reduced and what I call heightened listening. The section that follows concerns music as 'organised samples' discussing issues related to sample-based work which may lead some of you to believe that I am taking sides, but this is only a step towards the talk's conclusion which is one of synthesis. The key focus of this talk, as it is in my scholarly and artistic work, is that of 'sound-based music'. This term is defined as follows: sound-based music typically designates the art form in which the sound, that is, not the musical note, is its basic unit (Landy 2007a, 17). One may query why notes are being excluded here and, also, why the more common terms electroacoustic music and sonic art are being ignored. To cut two long stories short, it was proposed in this book and its successor (Landy 2007b) that sound-based music possesses its own paradigm, as does note-based music. This paradigm is highly associated with that of other new media arts and takes into account both poietic and esthesic aspects. The reason to avoid the two more widely used terms, discussed in my two recent books at length, has to do with the term electroacoustic music being used inconsistently and also its inclusion of certain note-based works. The issue with sonic arts is that the term can be used as an excuse for works to be considered not to be music, a view that leaves me feeling uncomfortable. The discussion will privilege the reception of sample-based works above the often-discussed areas of construction, tools and channels of dissemination. The reason for


An SVM—ANN Hybrid Classifier for Diagnosis of Gear Fault

by Sunil Tyagi

2023, Applied Artificial Intelligence

A hybrid classifier obtained by hybridizing Support Vector Machines (SVM) and Artificial Neural Network (ANN) classifiers is presented here for diagnosis of gear faults. The distinctive features obtained from vibration signals of a... more

Test success by SVM and hybrid classifier

Table 1. Success and training time of SVM with RBF (σ = 4.0).

Figure 12. Performance of hybrid classifier with raw signal.

Figure 9. SVM with polynomial kernel train and test success.

Figure 7. Training success and training time by ANN classifier.

Finally, by introducing Lagrange multipliers and employing the optimal constraints, Equation (7) is denoted in the explicit form as

Figure 5. Experimental setup.

Figure 8. Effect of signal type on training time.

The network is trained iteratively. In each iteration, Mean Square Error (MSE) between target and network output is calculated. The MSE at the kth iteration is given by:
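The equation itself did not survive extraction. As a rough illustration only, the sketch below assumes the standard mean square error, the average of squared differences between targets and network outputs; the paper's exact form (for example, a possible factor of 1/2) is not known from this excerpt, and the function name and values are hypothetical.

```python
def mean_squared_error(targets, outputs):
    """Average of squared differences between target and network output values."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("targets and outputs must be non-empty and of equal length")
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical targets and network outputs at one training iteration.
print(mean_squared_error([1.0, 0.0], [0.5, 0.5]))
```

In iterative training, this value would be recomputed after each iteration and training stopped once it falls below a chosen threshold or stops improving.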

Figure 6. Test success by ANN classifier.

The results of successes obtained by feeding these fifteen trained ANNs by test vectors created from raw signal, preprocessed signal, and noise-added signals are presented in Figure 6. The test success obtained by raw signal (i.e. unprocessed and without addition of noise) is fairly good. 77% of 96 test vectors were correctly classified when the ANN with only one node in hidden layer was used as classifier. The test success increased as the nodes in hidden layer were increased. Network achieved 92.7% success in case of ANN with 15 nodes in hidden layer. It can be seen from Figure 6 that the test success of ANN reduced significantly with increase in magnitude of the added noise. In case of signal with 20% added noise, the test success varied from 57% (in case of 1 node in hidden layer) to 72% (15 nodes in hidden layer). The preprocessing with DWT has improved the test performance of the network. It can be seen that the success achieved with DWT preprocessed signal was higher than that with raw signal. Figure 6 shows that when nodes in hidden layers are varied from 1 to 15, the DWT preprocessed signal achieved better success than raw signal in 13 of 15 cases. Figure 7 presents the success achieved by the classifier when it was input

where $L_\varepsilon(y, f(x))$ is the loss function that measures the approximation errors between the expected output $y_i$ and the calculated output $f(x_i)$, and $C$ is a regularization constant that determines the trade-off between the training error and the generalization performance. The second term in Equation (8), $\tfrac{1}{2}\|w\|^2$, is used as a measure of flatness of the function. If we introduce relaxation factors $\xi$, $\xi^*$, it transforms Equation (8) to the following constrained function:

Figure 10. Test and train success by SVM with RBF kernel.


Passive Shallow Water Automated Target Recognition using Deep Convolutional Bi directional Long Short Term Memory

by SURAJ KAMAL

2023, Defence Science Journal

The extremely challenging nature of passive acoustic surveillance makes it a key area of research in Naval Non-Co-operative Target Recognition especially in Anti-Submarine Warfare systems. In shallow waters, the complex acoustics due to... more


The Sound Paradigm in Horror Video Games (Звуковая парадигма в видеоиграх жанра Хоррор)

by Vladislav Kirichenko

2023

The current work is devoted to the analysis of the sound paradigm in video games of the horror genre. The sound in computer games is an important component, since the game is a syncretic medium. The aim of the work is an attempt to... more


Show Me the Instruments: Musical Instrument Retrieval From Mixture Audio

by Seonghyeon Go

2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

As digital music production has become mainstream, the selection of appropriate virtual instruments plays a crucial role in determining the quality of music. To search the musical instrument samples or virtual instruments that make one's... more

Table 1: Comparison with other datasets.

Fig. 2: The overall process of the suggested method. (a) Single-Instrument Encoder is trained to classify which instrument played the input audio. We take the penultimate layer’s activation of the trained network as instrument embedding. (b) Multi-Instrument Encoder extracts multiple instrument embeddings from the mixture audio. The Single-Instrument Encoder provides the set of target embeddings. (c) At inference time, we first extract the instrument embeddings of each instrument in the instrument library for a single time. Then we extract the multiple embeddings from the mixture audio query and retrieve the most similar instruments from the instrument library.

Fig. 4: The t-SNE results of Single-Instrument Encoder on Nlakh-single (a) training and (b) validation dataset.

Fig. 1: Comparison between musical instrument recognition and retrieval task.

Affiliations: (1) Department of Intelligence and Information, Seoul National University; (2) Interdisciplinary Program in Artificial Intelligence, Seoul National University; (3) Artificial Intelligence Institute, Seoul National University

Fig. 3: The process of rendering a sample of (a) Nlakh-single and (b) Nlakh-multi

Table 2: Performance of the Multi-Instrument Encoder. Small/Large indicates the size of the model. Nlakh/Random indicates which dataset is used for training.


Show Me the Instruments: Musical Instrument Retrieval from Mixture Audio

by Seonghyeon Go

2023, arXiv (Cornell University)


Research on Fault Diagnosis System of a Diesel Engine Based on Wavelet Analysis and LabVIEW Software

by Ahmed Elbashier

2023, Research Journal of Applied Sciences, Engineering and Technology

Experiment presented in this study, used vibration data obtained from a four-stroke, 295 diesel engine. Fault of the internal-combustion engine was detected by using the vibration signals of the cylinder head. The fault diagnosis system... more


Robust Binaural Localization of a Target Sound Source by Combining Spectral Source Models and Deep Neural Networks

by Jose antonio Gonzalez

2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing

Despite there being clear evidence for top-down (e.g., attentional) effects in biological spatial hearing, relatively few machine hearing systems exploit top-down model-based knowledge in sound localisation. This paper addresses this... more

Fig. 2. Ratemap representations of various masker sounds used in this study.

Fig. 3. Schematic diagram of the virtual listener configuration, showing a typical arrangement of target source (here at +30°) and masker (at -15°). Target source positions were limited to the range [-90°,+90°] as indicated by the gray arrows. Potentially, a target source in front of the head could be incorrectly attributed to a location behind the head (a front-back error), as shown by the open circle.

THE NUMBER OF GAUSSIAN MIXTURE COMPONENTS USED FOR EACH SOURCE MODEL.

Fig. 1. Schematic diagram of the proposed system.

Fig. 4. Localisation error rates for localising the target source in the presence of various maskers, at a target-to-masker ratio (TMR) of 0 dB. The proportions of front-back errors are indicated as white bars.

Fig. 5. Localisation error rates for localising the target source in the presence of various maskers, at a target-to-masker ratio (TMR) of -6 dB. The proportions of front-back errors are indicated as white bars.

... given less weight, and the estimated mask is closer to the oracle mask.

Fig. 6. Localisation weights estimated for target speech mixed with alarm sound at a target-to-masker ratio (TMR) of -6dB using the UBM with and without adaptation. The ‘oracle’ mask was shown to indicate the spectro-temporal regions dominated by the target speech (blue regions). The localisation weights larger than 0.5 were shown in blue in the bottom 2 panels.

TABLE I. ROOM CHARACTERISTICS OF THE SURREY BRIR DATABASE [29].

B. Target and masker signals

TABLE II. DESCRIPTIONS OF MASKER SOUNDS USED IN NOISE SET A.

TABLE III


Tensorflow Audio Models in Essentia

by Dmitry Bogdanov

2023, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Essentia is a reference open-source C++/Python library for audio and music analysis. In this work, we present a set of algorithms that employ TensorFlow in Essentia, allow predictions with pre-trained deep learning models, and are... more

Table 4: Balanced accuracies for 5-fold cross-validation and evaluation on a manually annotated subset of MTG-Jamendo-test. Statistically significant improvements over the SVMs according to an independent samples t-test (P > 0.05) are marked in bold.

Table 3: Cross-collection evaluation results. The best balanced accuracies are marked in bold.


Musical Audio Similarity with Self-supervised Convolutional Neural Networks

by Carl Thomé

2023, arXiv (Cornell University)

We have built a music similarity search engine that lets video producers search by listenable music excerpts, as a complement to traditional full-text search. Our system suggests similar sounding track segments in a large music catalog by... more

Figure 1. An overview of our self-supervised similarity learning approach for musical audio. We construct training data by transforming audio clips with a randomized audio effects chain of musical transformations.

We have computed mean average precision (mAP) as in [7] in Table 1 for our encoder, a random encoder, and a baseline encoder. The random encoder draws an embedding from a uniform distribution per clip and uses that as its representation. The baseline encoder computes 20 MFCC coefficients from 128 band Mel spectrograms and decorrelates the MFCC dimensions with incremental PCA with librosa and scikit-learn [8, 13].
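For readers unfamiliar with the metric, the sketch below shows one common way to compute mean average precision over ranked retrieval results. It is an assumption that this matches the variant used in [7]; the function names and the relevance lists are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average of precision@k taken at each rank k where a relevant item appears.

    ranked_relevance: list of booleans (or 0/1), best-ranked result first.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(average_precision(q) for q in queries) / len(queries)


# Hypothetical relevance judgments for two queries.
queries = [[1, 0, 1], [0, 1]]
print(mean_average_precision(queries))
```

Under this definition, an encoder that draws random embeddings would be expected to score close to the base rate of relevant items, which is why it serves as a lower reference point next to the MFCC baseline.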


An Extended Variational Mode Decomposition Algorithm Developed Speech Emotion Recognition Performance

by David HASON RUDD and

2023, The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)

Emotion recognition (ER) from speech signals is a robust approach since it cannot be imitated like facial expression or text based sentiment analysis. Valuable information underlying the emotions are significant for human-computer... more


EEG correlates of perception of tonal modulation in musical fragments

by A. Fedotchev

2023, International Journal of Psychophysiology

In this study, the event-related potential (ERP) technique was used to investigate the neurocognitive processes involved in processing the distance of a tonal modulation. Twenty volunteers took part in the study (6 men; mean age 19.7 ± 2.3 years). All participants were right-handed, and none had professional musical training. Participants listened to a series of harmonic progressions with tonal modulation to the subdominant (close modulation, 1 tone replaced relative to the initial scale), to the minor sixth (distant modulation, 4 tones replaced relative to the initial scale), and to the tritone (distant modulation, 6 tones replaced relative to the initial scale). A progression without modulation was used to assess the effect of modulation. A decrease in the amplitude of the N200 wave was observed when listening to the harmonic progressions, regardless of the degree of modulation. An increase in P600 amplitude was found in response to increasing tonal distance between the initial and final tonics, i.e., in response to an increasing degree of modulation. This suggests that P600 amplitude is related to the level of violation of tonal expectations, which in turn is associated with the growing difficulty of mentally reorienting the tonal schema from the initial tonal center to the new one.


Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or

by Prateek Verma

2023, arXiv (Cornell University)

This paper presents a way of doing large-scale audio understanding without traditional state-of-the-art neural architectures. Ever since the introduction of deep learning for understanding audio signals in the past decade, convolutional... more


Development of a Fault Detection Approach Based on SVM Applied to Industrial Data

by Mohieddine Jelali

2023

In existing production plants, sensor systems and other sources provide information about the plant condition. This paper presents methods for how data can be conveniently summarized, treated, and evaluated to retain characteristic... more


Adaptive Loudness Compensation in Music Listening

by Vesa Valimaki

2023, Proceedings of the SMC Conferences

The need for loudness compensation is a well known fact arising from the nonlinear behavior of human sound perception. Music and other sounds are mixed and mastered at a certain loudness level, usually louder than the level at which they... more


Federated Self-Supervised Learning of Multisensor Representations for Embedded Intelligence

by Johan Lukkien

2023, IEEE Internet of Things Journal

Smartphones, wearables, and Internet of Things (IoT) devices produce a wealth of data that cannot be accumulated in a centralized repository for learning supervised models due to privacy, bandwidth limitations, and the prohibitive cost of... more


Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes

by Timo Gerkmann

2023, Interspeech 2022

The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of... more


Between Sound and Noise: Features of the Formation of Urban Soundscapes (Между звуком и шумом: особенности формирования городских cаундскейпов)

by Daria Vasileva Lessing

2023, Between Sound and Noise: Features of the Formation of Urban Soundscapes

Urban soundscapes, as cultural and symbolic spaces that mediate our perception of sounds and noises, are dynamic, changeable, and heterogeneous. The specifics of everyday life, cultural experience, and the audiovisual texts people consume set out different trajectories of interaction with the auditory world for city dwellers and make the boundary between sound and noise rather fluid. This article presents ethnographic data, important for understanding how urban soundscapes are formed, obtained in a study of how city dwellers interact with the sounds and noises of St. Petersburg. The authors conducted the study in 2016–2021. It demonstrates the potential of methods such as the "sound diary" and the "soundwalk". The article presents an analysis of about 380 hours of observations made by 38 city dwellers and materials from 19 walking interviews with St. Petersburg residents living in 15 different neighborhoods. An important result was the identification of how sounds and noises are categorized in urban soundscapes, as well as the role of auditory experience in shaping the modes of interaction between St. Petersburg residents and the urban environment. The study provides tools for analyzing the social and cultural superstructure of the perception of urban sounds and noises, which may be of interest to sociologists, urbanists, urban anthropologists, ecologists, and researchers of local communities.


A Hardware-Software System for Generating and Studying Random and Pseudorandom Numbers (Программно-аппаратный комплекс формирования и исследования случайных и псевдослучайных чисел)

by Александр Потий

2023

Considers the task of creating software and hardware components for generators of random and pseudorandom numbers built into information security systems, and describes the component generator models of random and pseudorandom... more


Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

by Noboru Harada

2023, Cornell University - arXiv

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example,...

Deep Learning Framework Applied For Predicting Anomaly of Respiratory Sounds

by Khoa Tran

2023, 2021 International Symposium on Electrical and Electronics Engineering (ISEE)

This paper proposes a robust deep learning framework for classifying anomalies in respiratory cycles. Our framework starts with a front-end feature extraction step. This step aims to transform the respiratory input sound into...

A deep representation for invariance and music classification

by Tomaso Poggio

2023, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Representations in the auditory cortex might be based on mechanisms similar to those of the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we...

Representation Learning Using Artist labels for Audio Classification Tasks

by Jung-Woo Ha

2023

In this work, we use a deep convolutional neural network (DCNN) trained on a public dataset, the Million Song Dataset, as a feature extractor. We trained the network on audio mel-spectrograms using artist labels in a discriminative...

Cross-modal Embeddings for Video and Audio Retrieval

by Amanda Duarte

2022, Lecture Notes in Computer Science

The growing number of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in...
