You are cordially invited by the IEEE Benelux Information Theory Chapter and the Centre for Telematics and Information Technology of the University of Twente to attend the
Fifth Van der Meulen Seminar:
Data science at the intersection of information theory, statistics, and machine learning
Thursday, 27 November, 2014
13:00 – 17:00
University of Twente, Building: Horst, Room: Horstring N109
12:30 – 13:00
13:00 – 13:45
Peter Grünwald (CWI Amsterdam & Leiden University)
13:45 – 14:15
Thijs Laarhoven (Eindhoven University of Technology)
“Collusion-resistant fingerprinting and group testing”
14:15 – 14:30
14:30 – 15:00
Arnoud den Boer (University of Twente)
15:00 – 15:30
Sicco Verwer (Delft University of Technology)
15:30 – 16:00
16:00 – 16:30
Farzad Farhadzadeh (Eindhoven University of Technology)
16:30 – 17:00
Robin Aly (University of Twente)
“Big Data Programming Models”
17:00 – …
Abstracts of talks and biographies of speakers are available at
Attendance is free of charge, but registration is required. Please indicate your attendance by sending a message to firstname.lastname@example.org. The deadline for registration is 20 November.
Dr. ir. Jasper Goseling
Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente
+31 53 489 33 69
Abstracts and biographies
CWI Amsterdam & Leiden University
Bayesian and MDL (Minimum Description Length) inference can behave badly if the model under consideration is wrong yet useful: the posterior may fail to concentrate even for large samples, leading to extreme overfitting in practice. We demonstrate this on a simple regression problem. We introduce a test that can tell from the data whether we are heading for such a situation. The test is based on the idea of hypercompression: if we can compress data more than our model predicts, then the model must be wrong and there is danger of overfitting. In this case we can adjust the learning rate (equivalently: make the prior lighter-tailed, or penalize the likelihood more). The resulting "safe" Bayesian/MDL estimator behaves as well as standard Bayes/MDL if the model is correct but continues to achieve good rates with wrong models. In classification problems, it learns faster in easy settings, i.e. when a Tsybakov condition holds, effectively solving the old problem of 'how to learn the learning rate'.
* For an informal introduction to the idea, see Larry Wasserman's blog.
Peter Grünwald (1970, VIDI, VICI) heads the information-theoretic learning group at CWI <http://www.cwi.nl>, the Dutch national research institute for mathematics and computer science, located in Amsterdam. He is also professor of statistical learning at Leiden University. His research interests lie where statistics, computer science and information theory meet: theories of learning from data. In 2010 he was co-awarded the Van Dantzig prize, the highest Dutch award in statistics and OR. He is mostly known for his work on MDL (he is the author of The Minimum Description Length Principle, MIT Press, 2007, which is becoming the standard reference in the field) and for his active involvement in the ultimately successful attempt by a number of scientists to reopen the court case against Lucia de Berk, a nurse who had been wrongfully convicted of murdering seven patients.
Eindhoven University of Technology
To prevent unauthorized redistribution of copyrighted content, owners of the content commonly embed fingerprints in the content, allowing illegal copies to be traced to the responsible users. When several malicious users conspire and mix their fingerprinted copies of the content to form a new copy with a different fingerprint, tracing such a copy to the responsible parties may be much harder. For this we need collusion-resistant fingerprinting codes.
A different problem, which seems unrelated at first, is the group testing problem: given a large population of possibly infected individuals, find the (small) subset of infected individuals. For this we could simply test each person individually, but with the large majority of the population not being infected, many tests would be wasted. Instead, it may be advantageous to test a batch of people with a single test, with a positive test result if and only if the virus is present in (at least) one of the tested persons. With such group tests, it is possible to identify the infected members with fewer tests.
In this talk, we will look at the relation between these two problems, and solutions for both problems using tools from various areas of statistics and information theory.
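The group testing idea described in the abstract above can be illustrated with a minimal sketch. It uses a simple two-stage pooling scheme (test pools first, then retest the members of positive pools individually), which is chosen here only for illustration and is not necessarily the method presented in the talk. The function names, the population of 100 people, the infected set, and the pool size are all hypothetical.

```python
from typing import List, Set, Tuple


def group_test(pool: List[int], infected: Set[int]) -> bool:
    """A single test: positive if and only if at least one pool member is infected."""
    return any(person in infected for person in pool)


def two_stage_testing(
    population: List[int], infected: Set[int], pool_size: int
) -> Tuple[Set[int], int]:
    """Identify infected individuals with pooled tests; return (found, number of tests)."""
    found: Set[int] = set()
    tests = 0
    for start in range(0, len(population), pool_size):
        pool = population[start:start + pool_size]
        tests += 1  # stage 1: one test for the whole pool
        if group_test(pool, infected):
            # Stage 2: the pool is positive, so test each member individually.
            for person in pool:
                tests += 1
                if group_test([person], infected):
                    found.add(person)
    return found, tests


# Hypothetical example: 100 people, 2 of them infected, pools of 10.
population = list(range(100))
infected = {7, 42}
found, tests = two_stage_testing(population, infected, pool_size=10)
print(found == infected, tests)  # True 30
```

In this example, 10 pool tests plus 20 individual retests identify both infected individuals with 30 tests instead of the 100 needed when everyone is tested separately.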
University of Twente
In optimization problems, simple mathematical models that discard important factors may sometimes be preferred over more realistic models. This can happen if the parameters of the simpler models are easier to estimate than the parameters of the complex model, or the simple model can be optimized exactly while only approximate solutions are available for the complex model. There thus is a trade-off between modeling errors, statistical errors, and optimization errors. We propose a data-driven method to decide when misspecified models give better results, and illustrate its properties and performance with a dynamic pricing problem.
Arnoud den Boer studied Mathematics (2006, cum laude) at Utrecht University and obtained a post-master degree in Mathematics for Industry (2008) at Eindhoven University of Technology. From 2008 until 2012 he wrote a PhD thesis titled 'Dynamic pricing and learning' at the CWI (Centrum voor Wiskunde en Informatica) Amsterdam. After subsequent postdoc positions at Eindhoven University of Technology and the University of Amsterdam, he is currently assistant professor in the research group Stochastic Operational Research of the University of Twente. His research centers around joint estimation-and-optimization problems in various application areas.
Delft University of Technology
State machines are key models for the design and analysis of computer systems. Learning finite state machines automatically from trace data enjoys a lot of interest from the software engineering and formal methods communities because it can provide insight into complex (black-box) software systems. In the literature, this approach has been used for learning and analyzing models for different types of complex software systems such as web services, X11 windowing programs, communication protocols, the biometric passport, and Java programs.
Formally, state machine learning can be seen as a grammatical inference (or process mining) problem in which the traces are modeled as the words of a language, and the goal is to find a model for this language, i.e., a state machine (automaton). In this talk, I will explain an algorithm for learning a timed automaton (a type of state machine) from timed traces using statistical tests. The entire structure, including transitions, probabilities, and temporal bounds, is learned entirely from data in an unsupervised manner. The algorithm has been adapted for several interesting use-cases, e.g.: learning the yeast-cell cycle, learning models for ATM-fraud, learning models for driving behaviour. In current work, we are applying the method to automatically reverse-engineer malware communication protocols from IP-traffic.
Eindhoven University of Technology
Content fingerprinting and digital watermarking are techniques that are used for content protection and distribution monitoring and, more recently, for interaction with physical objects. Over the past few years, both techniques have been well studied and their shortcomings understood. In this talk, we introduce a new framework called active content fingerprinting, which takes the best from these two worlds, i.e. the world of content fingerprinting and that of digital watermarking, in order to overcome some of the fundamental restrictions of these techniques in terms of performance and complexity. The proposed framework extends the encoding process of conventional content fingerprinting such that it becomes possible to extract more robust fingerprints from the modified data. We consider different encoding strategies and examine the performance of the proposed schemes in terms of content identification rate.
Farzad Farhadzadeh received the B.Sc. degree in electrical engineering from the Isfahan University of Technology, Isfahan, Iran, the M.Sc. degree in signal processing and communication systems from the Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran, and the Ph.D. degree from the University of Geneva, Geneva, Switzerland, in January 2014. In March 2014, he was awarded the Swiss National Science Foundation Post Doctoral fellowship to join the SPS group at the Electrical Engineering Department, Eindhoven University of Technology. His main research interests include information forensics and multimedia security, multiuser information theory, and source coding (particularly human genome compression).
University of Twente
The Big Data trend shows that growing data sizes make valuable patterns visible. As a result of its size, data has to be processed in large computer centers with thousands of computers. In traditional settings, using such clusters requires advanced skills in distributed computing from the data scientists wanting to mine these data sets. To guarantee scalability to growing data sizes and to lower the burden for data scientists, industry and academia develop frameworks to support data processing. This talk will give insight into the architecture and design principles of current data management and programming frameworks developed in connection with large Internet companies, such as Google, Bing and Twitter.
Robin Aly is Assistant Professor for the area of Big Data at the University of Twente, the Netherlands. He studied database management and information systems in Singapore and Mannheim. In 2010, he received his PhD from the University of Twente. His thesis proposed new models for concept-based multimedia retrieval. During his PhD he visited Dublin City University. After obtaining his PhD, he led a work package on structured search and hyperlinking of videos within the EU Project AXES. He contributed to over 25 peer-reviewed research papers in influential venues, reviewed for international conferences and journals, and co-organized four editions of the Multimedia Evaluation Workshop MediaEval. His current research interests include formal models of information retrieval, multimedia retrieval, and Big Data.