Direct link to my publications (outdated - only up to mid-2010).

Direct link to my talks (outdated).

Direct link to my source code (outdated).

I once began to study computer science because I was fascinated by artificial intelligence, that wonderful combination of technology and creativity. The thrill has remained. So for my Ph.D. I worked on a project where all the interesting topics from AI, machine learning, pattern recognition, statistics, signal processing, psychoacoustics, linguistics, phonetics and biology (I'm sure I forgot some) converge at one point:

The Mediana Project. Our goal is to build software that assists media scientists in their work by providing algorithms for robust video and audio analysis.

My domain has been the audio algorithms, especially building a speaker-indexing system that is able to operate on arbitrary videos (movies, drama, news, documentaries, ...). To achieve this, algorithms have to be built that are robust under varying conditions: background noise (think of the music and sound effects in movies), speaking rate and style, number of speakers, utterance length. Besides, they should be fast (because of the large amount of data in a full-length movie).

Doing all of this is (at least as of 2010) still utopian. But heading toward it comprises research in the following areas (my general research interests):

  • Signal processing: How to convert an audio file into good feature vectors? How to separate speech from noise?
  • Blind source separation: How to clean a noise-corrupted speech signal?
  • Pattern recognition: How to compare two audio stream parts? Are they from the same speaker? Is it speech?
  • Statistics: How to model a sequence of feature vectors of a speaker in case of limited sample size?
  • Linguistics/phonetics: What are the specific attributes of speech? And how can they be transformed into good features in the technical sense?
  • AI and data mining: How to extract new knowledge? How to reason?
  • Experimental computer science: I admit I dislike experiments, in agreement with T. H. Huxley, who said: "The great tragedy of Science - the slaying of a beautiful hypothesis by an ugly fact." But as a matter of fact, they're part of my work.
  • ...
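To make the first bullet concrete, here is a minimal sketch (plain NumPy; all function names and parameter values are my own, not from our actual system) of turning a signal into simple per-frame feature vectors - windowed log band energies, a stripped-down cousin of the MFCCs used in speaker recognition:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_band_features(signal, frame_len=400, hop=160, n_bins=20):
    """Per frame: Hamming window -> power spectrum -> log of n_bins band energies."""
    frames = frame_signal(signal, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Pool FFT bins into coarse bands (linearly spaced here; MFCCs use a mel scale
    # followed by a DCT, which this sketch deliberately omits).
    bands = np.array_split(np.arange(power.shape[1]), n_bins)
    energies = np.stack([power[:, b].sum(axis=1) for b in bands], axis=1)
    return np.log(energies + 1e-10)

# One second of a noisy 440 Hz tone at 16 kHz as toy input:
t = np.linspace(0, 1, 16000, endpoint=False)
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(16000)
feats = log_band_features(x)
print(feats.shape)  # one 20-dimensional feature vector per frame
```

The point is only the shape of the pipeline - frame, window, transform, compress - on which all the robustness questions above then operate.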
I have been puzzled by the divergence between increasingly complex techniques (especially in speech processing, but it applies to data analysis in general) and the lack of potent tools to perceptually grasp what they convey... haven't you noticed that? You use a technique, say MFCC features and GMM models (to name a common, not-so-complex example from speaker recognition), and as long as you get expected (good) results, everything is fine. But if the results are awful, you don't know what went wrong - because you have no idea what is really going on inside (sure, you can define it mathematically, and you can reconstruct it - but that's not the point, because it takes too long to be employed on a regular basis). What you wish for is perceptual insight into your techniques - to see (visualization) and hear (resynthesis) what is going on, and what is going wrong. You wish for tools that make your techniques vivid and perceptible - that allow you to do learning, debugging and experimentation (parameter setting, eh?) intuitively... really interesting. These thoughts actually led to the concept of Eidetic Design.
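A toy illustration of what "looking inside" a model means, using scikit-learn's GaussianMixture rather than our actual code: fit a GMM to two well-separated 2-D clouds (standing in for two speakers' feature vectors) and inspect the learned parameters - the textual counterpart of plotting them. The data and all settings here are my own assumptions for the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy "feature vectors": two speakers as two well-separated 2-D Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[4.0, 4.0], scale=0.5, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(X)

# Inspecting weights and means is the first step toward "seeing" the model;
# they should land near 0.5 and near the two cluster centers.
order = np.argsort(gmm.means_[:, 0])
for k in order:
    print(f"weight={gmm.weights_[k]:.2f}  mean=({gmm.means_[k, 0]:.1f}, {gmm.means_[k, 1]:.1f})")
```

When the components do not line up with anything perceptually meaningful, that mismatch is exactly the debugging signal such tools should surface.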

My Marburgian publications can be found among those of my group and here (PDFs). Additionally, here's a textual list up to mid-2010:

  • Christian Beecks, Thilo Stadelmann, Bernd Freisleben, and Thomas Seidl. Visual Speaker Model Exploration. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME'10), pages 727-728, Singapore, July 19-23, 2010. IEEE.
  • Thilo Stadelmann, Yinghui Wang, Matthew Smith, Ralph Ewerth, and Bernd Freisleben. Rethinking Algorithm Development and Design in Speech Processing. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), pages 4476-4479, Istanbul, Turkey, August 2010. IAPR.
  • Thilo Stadelmann and Bernd Freisleben. On the MixMax Model and Cepstral Features for Noise-Robust Voice Recognition. Technical Report, Marburg University, July 2010.
  • Thilo Stadelmann and Bernd Freisleben. Dimension-Decoupled Gaussian Mixture Model for Short Utterance Speaker Recognition. In Proceedings of the 20th International Conference on Pattern Recognition (ICPR'10), pages 1602-1605, Istanbul, Turkey, August 2010. IAPR.
  • Markus Mühling, Ralph Ewerth, Thilo Stadelmann, Bing Shi, and Bernd Freisleben. University of Marburg at TRECVID 2009: High-Level Feature Extraction. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'09). Available online, 2009. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
  • Ernst Juhnke, Dominik Seiler, Thilo Stadelmann, Tim Dörnemann, and Bernd Freisleben. LCDL: An Extensible Framework for Wrapping Legacy Code. In Proceedings of International Workshop on @WAS Emerging Research Projects, Applications and Services (ERPAS'09), pages 638-642, Kuala Lumpur, Malaysia, December 2009.
  • Dominik Seiler, Ralph Ewerth, Steffen Heinzl, Thilo Stadelmann, Markus Mühling, Bernd Freisleben, and Manfred Grauer. Eine Service-Orientierte Grid-Infrastruktur zur Unterstützung Medienwissenschaftlicher Filmanalyse. In Proceedings of the Workshop on Gemeinschaften in Neuen Medien (GeNeMe'09), pages 79-89, Dresden, Germany, September 2009.
  • Thilo Stadelmann and Bernd Freisleben. Unfolding Speaker Clustering Potential: A Biomimetic Approach. In Proceedings of the ACM International Conference on Multimedia (ACMMM'09), pages 185-194, Beijing, China, October 2009. ACM.
  • Thilo Stadelmann, Steffen Heinzl, Markus Unterberger, and Bernd Freisleben. WebVoice: A Toolkit for Perceptual Insights into Speech Processing. In Proceedings of the 2nd International Congress on Image and Signal Processing (CISP'09), pages 4358-4362, Tianjin, China, October 2009.
  • Steffen Heinzl, Markus Mathes, Thilo Stadelmann, Dominik Seiler, Marcel Diegelmann, Helmut Dohmann, and Bernd Freisleben. The Web Service Browser: Automatic Client Generation and Efficient Data Transfer for Web Services. In Proceedings of the 7th IEEE International Conference on Web Services (ICWS'09), pages 743-750, Los Angeles, CA, USA, July 2009a. IEEE Press.
  • Steffen Heinzl, Dominik Seiler, Ernst Juhnke, Thilo Stadelmann, Ralph Ewerth, Manfred Grauer, and Bernd Freisleben. A Scalable Service-Oriented Architecture for Multimedia Analysis, Synthesis, and Consumption. International Journal of Web and Grid Services, 5(3):219-260, 2009b. Inderscience Publishers.
  • Markus Mühling, Ralph Ewerth, Thilo Stadelmann, Bing Shi, and Bernd Freisleben. University of Marburg at TRECVID 2008: High-Level Feature Extraction. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'08). Available online, 2008. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
  • Markus Mühling, Ralph Ewerth, Thilo Stadelmann, Bing Shi, Christian Zöfel, and Bernd Freisleben. University of Marburg at TRECVID 2007: Shot Boundary Detection and High-Level Feature Extraction. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'07). Available online, 2007a. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
  • Ralph Ewerth, Markus Mühling, Thilo Stadelmann, Julinda Gllavata, Manfred Grauer, and Bernd Freisleben. Videana: A Software Toolkit for Scientific Film Studies. In Proceedings of the International Workshop on Digital Tools in Film Studies, pages 1-16, Siegen, Germany, 2007. Transcript Verlag.
  • Markus Mühling, Ralph Ewerth, Thilo Stadelmann, Bernd Freisleben, Rene Weber, and Klaus Mathiak. Semantic Video Analysis for Psychological Research on Violence in Computer Games. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR'07), pages 611-618, Amsterdam, The Netherlands, July 2007b. ACM.
  • Ralph Ewerth, Markus Mühling, Thilo Stadelmann, Ermir Qeli, Björn Agel, Dominik Seiler, and Bernd Freisleben. University of Marburg at TRECVID 2006: Shot Boundary Detection and Rushes Task Results. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'06). Available online, 2006. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.
  • Thilo Stadelmann and Bernd Freisleben. Fast and Robust Speaker Clustering Using the Earth Mover's Distance and MixMax Models. In Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'06), volume 1, pages 989-992, Toulouse, France, April 2006. IEEE.
  • Ralph Ewerth, Christian Behringer, Tobias Kopp, Michael Niebergall, Thilo Stadelmann, and Bernd Freisleben. University of Marburg at TRECVID 2005: Shot Boundary Detection and Camera Motion Estimation Results. In Proceedings of TREC Video Retrieval Evaluation Workshop (TRECVid'05). Available online, 2005. URL http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.htm.

Some talks I gave during my Ph.D. time:

Some code of public interest may be found here. There is...

  • PlotGMM: some Matlab® routines for plotting GMMs - very helpful for gaining insight and assisting debugging
  • WebVoice: a web service and accompanying browser-based GUIs for re-synthesizing speech from acoustic models and features
  • sclib: the sclib class library for all aspects of speaker recognition is available upon request; just email me.
  • LaTeX version of my thesis
 
