
Michael Mueller GSOC 2015 Application: Implementing Dynamic Time Warping Algorithm


Google Summer of Code Application for 2015

Background

I am currently a first-year student at the University of Massachusetts Amherst, majoring in mathematics and computer science. I am also interested in physics and enjoy reading about physics (especially the Feynman Lectures). This semester, I am working on a research project to study the melting of "active" crystals, whose constituent elements have a biased rather than purely random pattern of motion. The melting is simulated by small (6 mm wide) plastic particles vibrated on a shaker at ~50 Hz, with particular geometries that cause biased motion. I have found it really interesting to see how programming work can be useful for physics experiments, as a major part of my project has involved image processing in Python using scikit-image and scipy.ndimage (for convolutions and so forth).

Last year, I participated in GSOC with Astropy and completed a project to write an optimized C/Cython engine for io.ascii based on the approach used by Pandas for read_csv. My blog from the summer is viewable at http://muellergsoc.blogspot.com/ and the primary pull request involved is at https://github.com/astropy/astropy/pull/2716. My mentors were Tom Aldcroft, Erik Bray, and Michael Droettboom.

In addition to programming, I enjoy running, reading, and playing or listening to music in my spare time.

Programming Information

I run Ubuntu Linux on my laptop, and my preferred editors generally vary by language. For standard text editing and programming in C++, I use emacs. However, I use Eclipse for Java projects and in Python will use either emacs or occasionally IDLE. Although emacs has a bit of a learning curve, I find it very useful as an editor because its numerous commands and macros allow for faster and more powerful editing.

My programming background extends back to 7th grade, when I discovered MIT’s educational programming tool Scratch. After playing around with Scratch and reading more about programming, I used online tutorials to teach myself Java. From making small computer games with friends to trying out programming challenges like Project Euler and Code Golf, I continued to immerse myself in the world of programming and soon picked up experience with C++ and Python. Since then, I have continued to enjoy programming recreationally. While I haven’t often worked on large programming projects, one project I particularly enjoyed was the creation of an OpenGL-based 3D engine in C++ (viewable at https://github.com/amras1/opengl-engine). This project was exciting to work on because it involved learning about graphics programming, which rests on interesting mathematical underpinnings (such as matrix transformations and quaternions). I also incorporated simple models of mathematical and physical phenomena in the engine, such as Lindenmayer systems (or L-systems), which allow for the rendering of fractal patterns that imitate natural objects such as plants, and particle systems, which can be used to create effects like fireworks or the flow of a water fountain. Although I didn’t implement very advanced versions of these effects, I enjoyed discovering new intersections between math, physics, and programming. These intricate relationships continue to excite me, and I hope to explore the applications of computer science to other fields in the future. I have also worked on an extensible zombie apocalypse text adventure game, which may be viewed at https://github.com/amras1/zombie-text-adventure.

I have been using Python in particular for about three years, and I find it very useful as a language whenever high-level programming is appropriate. Although there is something of a performance hit in using Python compared to more mid-level languages like C or C++, its ease of use and natural syntax have allowed me to program more quickly and with less propensity for error. I particularly enjoy the most distinctive, or “Pythonic”, aspects of Python, such as list comprehensions, generators, and lambda expressions. In fact, I’ve often found that when I return to C++ or Java after using Python for a while, I become annoyed at having to translate one of these features into a more cumbersome syntax. In my opinion, Python’s most useful language feature is the existence of iterables and the functions relating to them. for i, elem in enumerate(elem_set): foo(elem, i) is far clearer and easier to use than the C++ equivalent for (std::set<int>::iterator it = elem_set.begin(); it != elem_set.end(); ++it) { foo(*it, std::distance(elem_set.begin(), it)); }, and standard library functions like map() and zip() make it much simpler to operate on elements of a container.

From working on my GSOC project last year, I am fairly comfortable with Cython and git, although I could still use some experience with more advanced commands, complicated rebases, etc. As a sample of previous work, see my primary PR from last summer: https://github.com/astropy/astropy/pull/2716. I also implemented an HTML reader and writer in astropy.io.ascii, which may be viewed at https://github.com/astropy/astropy/pull/2160. For scikit-image, I currently have two pull requests under review. One involves adding Python wrappers for public, pure Cython functions for the purposes of documentation, autocomplete, etc. (https://github.com/scikit-image/scikit-image/pull/1432) and the other expands the test coverage of functions in skimage.feature that compute the Hessian of an image by convolution with Gaussian derivatives (https://github.com/scikit-image/scikit-image/pull/1442).

Project Details

Abstract

One difficulty in comparing two time series is that a simple point-to-point comparison may not be valid, particularly if the series have different time scales. Dynamic Time Warping, or DTW, is a popular algorithm designed to determine the similarity between time series with different time scales by computing an alignment curve, or a path joining elements from the two series. In addition to certain basic constraint conditions on the alignment curve, an important feature of the DTW algorithm is the ability to impose customized constraints to ensure that the alignment curve fits a general pattern expected from the data. Despite the huge search space of possible solution curves, the DTW algorithm can be implemented in O(NM) time, where N and M are the lengths of the respective series (that is, in time linear in the size of the cost matrix). Currently, there exists a DTW package in the statistical language R providing a DTW solver, various forms of customization, and methods to represent aspects of the algorithm visually. My proposal is to implement a DTW library for scikit-image based on the DTW library in R, offering the same features as well as other possible developments such as Adaptive Feature Based DTW.

Detailed Description

The Dynamic Time Warping (DTW) algorithm can be formulated in terms of two series--an index series and a query series--whose elements can be compared using some given distance metric. For example, if each series element is a 2D vector (x, y), then one possible metric might be the Euclidean distance sqrt((x0 - x1)^2 + (y0 - y1)^2). By associating such a "cost" with each pair (a_i, b_j), where a_i is the ith element of the index series and b_j is the jth element of the query series, one can compute an M x N cost matrix C, where M is the length of the index series and N is the length of the query series. The problem is then to determine a path P of minimal total cost through the cells of the matrix C.
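
To make this setup concrete, here is a minimal sketch of how such a cost matrix could be built from two series and an element-wise metric (the function name, example series, and default metric are illustrative assumptions, not part of any proposed API):

```python
import numpy as np

def cost_matrix(index_series, query_series, metric=None):
    """Compute the M x N matrix of pairwise costs C[i, j] = d(a_i, b_j)."""
    if metric is None:
        # Default to Euclidean distance between (possibly vector-valued) elements.
        metric = lambda a, b: np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b))
    M, N = len(index_series), len(query_series)
    C = np.empty((M, N))
    for i, a in enumerate(index_series):
        for j, b in enumerate(query_series):
            C[i, j] = metric(a, b)
    return C

# Example: two short 1D series of different lengths.
C = cost_matrix([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 3.0])   # shape (4, 3)
```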

In all forms of DTW, there are two general constraints on the path P:
  • It must include the boundary points, i.e. P_initial = (1, 1) and P_final = (M, N)
  • It must be nondecreasing, i.e. if (x, y) is a point in P and (x', y') is a later point in P, then x <= x' and y <= y'.

There are many other possible constraints to impose on the path P. For example, a continuity constraint might say that the x-values and y-values of P cannot jump by more than 1 in a single step, and a slope constraint might say that the slope of the path cannot be too small or too large. In addition, global constraints can impose restrictions on the overall structure of the path, such as requiring the path to lie within a given region: for example, the Sakoe-Chiba band is a diagonal strip along the line y = x with some width T0 (so that no x-value can correspond to a y-value whose distance from the x-value is greater than T0), and the Itakura parallelogram is a geometric parallelogram allowing for more deviation from the line y = x near the middle of the matrix.
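
As a rough illustration of how such a global window could be represented (only a sketch under the definition given above; the function name is a placeholder of my own), the Sakoe-Chiba band can be expressed as a boolean mask over the cost matrix, with cells outside the band simply excluded from the search:

```python
import numpy as np

def sakoe_chiba_mask(M, N, T0):
    """Allowed-cell mask for a Sakoe-Chiba band of half-width T0.

    A cell (i, j) is allowed when |i - j| <= T0, i.e. the path must stay
    within a diagonal strip of width T0 around the line y = x.
    """
    i = np.arange(M)[:, None]
    j = np.arange(N)[None, :]
    return np.abs(i - j) <= T0

mask = sakoe_chiba_mask(100, 80, T0=10)   # True inside the band, False outside
```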

Beyond absolute constraints, the DTW algorithm involves a weighting factor at each step that can impose an additional cost for certain kinds of steps. That is, the increase in the total cost as the path P moves from (x, y) to (x', y') is given by w(x'-x, y'-y) * C[x', y'], where w is some transition (step-pattern) weighting function. Common transitions are the so-called "symmetric-1" transition, in which w = 1 for any horizontal, vertical, or diagonal move; the "symmetric-2" transition, which differs from "symmetric-1" in that w = 2 for diagonal moves; and various asymmetric transitions that might depend only on the change in one direction, or might be some more complicated function. Other features of the DTW algorithm are partial matches, in which the boundary constraint is relaxed, and the Minimum Variance Matching algorithm, which requires matched points in the reference series to be in strictly increasing order.
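
The accumulated-cost computation that results from these weights can be sketched in pure Python as follows (a minimal sketch assuming only the symmetric-1 and symmetric-2 patterns described above; names are placeholders, and no windowing or boundary relaxation is included):

```python
import numpy as np

def accumulated_cost(C, symmetric2=False):
    """Fill the accumulated-cost matrix D for the symmetric-1/symmetric-2 patterns.

    D[i, j] is the minimal total cost of a path from (0, 0) to (i, j) using
    horizontal, vertical, and diagonal steps; under symmetric-2 the diagonal
    step weights the local cost by 2 instead of 1.
    """
    M, N = C.shape
    D = np.full((M, N), np.inf)
    D[0, 0] = C[0, 0]
    diag_w = 2.0 if symmetric2 else 1.0
    for i in range(M):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(D[i - 1, j] + C[i, j])               # vertical step
            if j > 0:
                candidates.append(D[i, j - 1] + C[i, j])               # horizontal step
            if i > 0 and j > 0:
                candidates.append(D[i - 1, j - 1] + diag_w * C[i, j])  # diagonal step
            D[i, j] = min(candidates)
    return D    # D[-1, -1] is the total alignment cost
```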

My proposal will aim to replicate the functionality of the DTW library in R (http://dtw.r-forge.r-project.org/), including a variety of possible constraints, an O(NM) dynamic-programming solution, and more. In addition, I will add more modern features that are absent in the R DTW package, such as Adaptive Feature Based Dynamic Time Warping (AFBDTW), which includes the surrounding neighbors of a time series point as well as the global shape of the cost matrix C as part of the algorithm. I plan to use Python for initial versions of the algorithm; then, after profiling with cProfile, I will optimize the algorithm and rewrite portions in Cython.

Timeline

Community bonding period (April 27 — May 24)

Become more familiar with the DTW algorithm in R, paying attention to various possible transition functions, window bounds, and other constraints. Read papers outlining the algorithm itself as well as implementation details.

May 25 — May 31 (1 week)

Implement a basic version of the DTW algorithm, using minimal constraints and no special transition function. This will involve the use of dynamic programming to avoid scaling exponentially in time with the input. Write basic tests for the core algorithm. Allow for a variety of possible "cost" metrics, such as Euclidean distance, Manhattan distance, etc.
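
As an aside on the cost metrics mentioned here, scipy already provides pairwise-distance machinery, so supporting several metrics for the cost matrix may be as simple as the following sketch (the reshaping convention for 1D series is my own assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy series shaped (length, n_features); 1D series become column vectors.
a = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
b = np.array([0.0, 2.0, 3.0]).reshape(-1, 1)

# The same cost-matrix construction supports many metrics out of the box.
C_euclidean = cdist(a, b, metric="euclidean")
C_manhattan = cdist(a, b, metric="cityblock")   # Manhattan / city-block distance
```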

June 1 — June 14 (2 weeks)

Adapt the basic algorithm to a variety of additional constraints, particularly special transition functions and global restrictions (e.g. the Sakoe-Chiba band). Test extensively throughout, and employ abstraction so as to make it easier to add new constraints later. At this stage, focus on mimicking the functionality offered by the R DTW package and include this package's possible transition functions, such as symmetric/asymmetric functions, Rabiner's smoothed transition functions, etc.
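
One way the abstraction mentioned above might look (purely an illustrative design sketch of my own, not a committed implementation) is to describe each step pattern as data, so that new transition functions can be added without touching the core loop:

```python
# Each allowed step is (di, dj, weight): the offset to the predecessor cell and
# the factor applied to the local cost C[i, j] when arriving via that step.
SYMMETRIC1 = [(1, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0)]
SYMMETRIC2 = [(1, 0, 1.0), (0, 1, 1.0), (1, 1, 2.0)]

def relax(D, C, i, j, pattern):
    """Return the updated accumulated cost of cell (i, j) under a step pattern."""
    best = float("inf")
    for di, dj, w in pattern:
        if i - di >= 0 and j - dj >= 0:
            best = min(best, D[i - di, j - dj] + w * C[i, j])
    return best
```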

June 15 — June 21 (1 week)

Work on allowing partial matches in the DTW algorithm by relaxing the boundary constraint that P_initial = (1, 1) and P_final = (M, N). In particular, allow for solution curves that either (1) start after the beginning, (2) finish before the end, or (3) both, i.e. substring matches.
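
For the open-end case, relaxing the final boundary constraint amounts to taking the minimum of the accumulated cost along the last row (or column) rather than reading only the corner cell; a rough sketch, assuming the accumulated-cost matrix D from the earlier sketches:

```python
import numpy as np

def open_end_cost(D):
    """Cost of the best alignment that starts at (0, 0) but may stop anywhere
    along the last row of the accumulated-cost matrix D."""
    j_end = int(np.argmin(D[-1, :]))
    return D[-1, j_end], j_end

# Open-begin matches can be treated analogously, e.g. by letting the path
# enter anywhere along the first row at zero accumulated cost.
```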

June 22 — July 5 (2 weeks)

Implement the Minimum Variance Matching (MVM) algorithm, using a large step pattern to reproduce the conditions for the algorithm. Optimize all previous code by finding performance bottlenecks with cProfile and rewriting portions of the algorithm in Cython. Tie up all loose ends in reproducing the features of the R DTW library: include all supported constraints, ensure that normalization is done correctly, and handle any other necessary details.
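
For the profiling step, the standard-library cProfile and pstats modules are enough to locate hot spots before deciding what to move into Cython; a small self-contained example (the workload here is a stand-in, not a real DTW call):

```python
import cProfile
import pstats

def toy_workload():
    # Stand-in for a DTW solver run; profiling the real code would look the same.
    return sum(i * i for i in range(200000))

profiler = cProfile.Profile()
profiler.enable()
toy_workload()
profiler.disable()

# Print the ten most expensive calls, sorted by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```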

July 6 — July 12 (1 week)

Reproduce the graphical features in the R DTW package, such as the ability to plot alignment curves, transition functions, etc. Write documentation for the library and create a plot gallery illustrating all functionality of the DTW algorithm.
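
A first version of the alignment-curve plot could be as simple as drawing the warping path over an image of the cost matrix with matplotlib (the matrix and path below are made up purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy cost matrix and a made-up monotone warping path, for illustration only.
C = np.random.rand(40, 30)
path = [(k, min(k, 29)) for k in range(40)]
rows, cols = zip(*path)

plt.imshow(C.T, origin="lower", cmap="gray", aspect="auto")
plt.plot(rows, cols, color="red")        # alignment curve over the cost matrix
plt.xlabel("index series")
plt.ylabel("query series")
plt.title("Alignment path over the cost matrix")
plt.show()
```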

July 13 — July 26 (2 weeks)

Implement Adaptive Feature Based Dynamic Time Warping (AFBDTW), starting with an initial implementation of Feature Based Dynamic Time Warping (FBDTW). Test the new functionality extensively, write documentation, and include plots in the image gallery.
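
To give a flavor of the feature-based variants (this is only a loose sketch of the general idea; the exact feature definitions in the FBDTW/AFBDTW literature differ in their details), each point can be replaced by a small feature vector capturing both its immediate neighborhood and its position relative to the rest of the series, with the cost matrix then built from distances between these feature vectors:

```python
import numpy as np

def local_and_global_features(x):
    """Toy feature extraction in the spirit of feature-based DTW.

    The "local" feature records differences with the immediate neighbors, and
    the "global" feature records differences with the means of the subsequences
    to the left and right; both are loose approximations, not the published
    formulas.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    local = np.zeros((n, 2))
    global_ = np.zeros((n, 2))
    for i in range(n):
        local[i, 0] = x[i] - x[i - 1] if i > 0 else 0.0
        local[i, 1] = x[i] - x[i + 1] if i < n - 1 else 0.0
        left_mean = x[:i].mean() if i > 0 else x[i]
        right_mean = x[i + 1:].mean() if i < n - 1 else x[i]
        global_[i] = (x[i] - left_mean, x[i] - right_mean)
    return local, global_
```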

July 27 — August 2 (1 week)

Optimize the new AFBDTW algorithm (possibly by providing certain default constraints) with the goal of reaching linear-time complexity. Use cProfile again (with a visualization tool such as SnakeViz) to identify performance bottlenecks and rewrite time-critical portions in Cython or C.

August 3 — August 16 (2 weeks)

These two weeks will serve as a buffer period in case earlier steps of the proposal require more time, unforeseen difficulties arise, etc. If everything runs smoothly and the buffer period is unnecessary, then work on implementing additional features for the new DTW library that my mentor might suggest.

August 17 (suggested pencils down date) — August 21 (final evaluation deadline)

Write documentation for all previous code and add any tests I may have missed during the main coding period. Look for potential bugs and fix any that arise.

Additional Information

I have a blog from last summer (http://muellergsoc.blogspot.com/), which I intend to use again weekly this summer for progress reports. I will also be available for regular contact with my project mentor over the course of the summer through IRC, email, phone calls, or any other mutually convenient form of communication.