Theory and implementation of pinch gesture image manipulation
... With help from Joël Spaltenstein
Have you ever written an app that does photo capture, and wanted to implement the classic pinch-to-zoom that is now ubiquitous? Easy, right? Pretty much a no-brainer. Then you want to get fancier and have scaling and rotation so you can crop your photo. No problem, right? The basic operations are translate, scale, and rotate. You should be able to just combine those to get what you want. In practice, getting a pleasing solution working is not obvious.
What is pleasing?
First, we have to figure out what we mean by a "pleasing solution" - what our brains will feel is the most natural response when our finger and thumb pinch and rotate on the screen. It is kind of funky to think about generating a natural response when we are doing something that can't actually happen in reality. I guess that means our perceptions and our interactions with our own inventions have transcended the natural world, yet not disassociated from it.
After playing with some examples and mulling it over, it seems that the most rigorous description of what we want is to have the points that we're touching "locked" under our fingertips. That is, when we move our fingers around, we want the things that initially fell under our fingertips at the first touch to remain under our fingertips. If you had a bowl of fruit and put your thumb on the orange and your forefinger on the kiwi, you want the orange always under your thumb and the kiwi always under your forefinger no matter how you move them.
Theory
OK, here is where we start switching to math to help translate our intentions into a form that the computer can understand. Don't panic. It will be gentle.
First off, through our experience with using graphics API, we know that that the thing we must ultimately come up with is a transformation matrix that is used to draw the image the way we want. We're also going to guess that what we want to do can be achieved via a single translate, scale, and rotate, in that order. (There's no real proof for this, but it sounds right. A previous, more hand-wavy solution to this problem bears this out.)
Surprisingly, the best explanation of 2D transformation matrices is Apple's own documentation.
Here is what a transformation matrix that just does translation looks like:
$$\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
Here is what a transformation matrix that just does scaling looks like:
$$\begin{bmatrix}
S_x & 0 & 0 \\
0 & S_y & 0 \\
0 & 0 & 1
\end{bmatrix}$$
Here is what a transformation matrix that just does rotation looks like:
$$\begin{bmatrix}
\cos\theta & \sin\theta & 0 \\
-\sin\theta & \cos\theta & 0 \\
0 & 0 & 1
\end{bmatrix}$$
Learn about matrix multiplication at Math is Fun!
This is what you get when you do a translation, then a scale: (Order matters, and the second operation comes first in the matrix multiplication. I didn't make the rules.)
$$\begin{bmatrix}
S_x & 0 & 0 \\
0 & S_y & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
T_x & T_y & 1
\end{bmatrix}
=
\begin{bmatrix}
S_x & 0 & 0 \\
0 & S_y & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
This is what you get when you multiply in the rotation:
$$\begin{bmatrix}
\cos\theta & \sin\theta & 0 \\
-\sin\theta & \cos\theta & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
S_x & 0 & 0 \\
0 & S_y & 0 \\
T_x & T_y & 1
\end{bmatrix}
=
\begin{bmatrix}
S_x\cos\theta & S_y\sin\theta & 0 \\
-S_x\sin\theta & S_y\cos\theta & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
Since we are doing only proportional scaling, we can replace \(S_x\) and \(S_y\) with one value, just \(S\):
$$\begin{bmatrix}
S\cos\theta & S\sin\theta & 0 \\
-S\sin\theta & S\cos\theta & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
It is the nature of the transformation matrix that we use it to compute a new point \((x^\prime,y^\prime)\) by multiplying it with an original point \((x,y)\):
$$\begin{bmatrix}
x^\prime & y^\prime & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
x & y & 1 \\
\end{bmatrix}
\begin{bmatrix}
S\cos\theta & S\sin\theta & 0 \\
-S\sin\theta & S\cos\theta & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
So what is this actually saying here? Let's multiply it out and see:
$$x^\prime = x S \cos\theta - y S \sin\theta + T_x \\
y^\prime = x S \sin\theta + y S \cos\theta + T_y$$
So we can calculate a new location \((x^\prime,y^\prime)\) by plugging in an old location \((x,y)\) and the translation, scale, and rotation values. We can actually do this twice - once for your thumb, which we'll call \((x_1,y_1)\), and once for your forefinger, which we'll call \((x_2,y_2)\). Now we've got:
$${x_1}^\prime = x_1 S \cos\theta - y_1 S \sin\theta + T_x \\
{y_1}^\prime = x_1 S \sin\theta + y_1 S \cos\theta + T_y \\
{x_2}^\prime = x_2 S \cos\theta - y_2 S \sin\theta + T_x \\
{y_2}^\prime = x_2 S \sin\theta + y_2 S \cos\theta + T_y$$
We know what all the \(x\)'s and \(y\)'s are, so we have four equations with four unknowns: \(T_x, T_y, S,\) and \(\theta\). Now we can solve for these values, plug them in to the transformation matrix and... CRAP! There's no way we are going to be able to solve this with trig functions and two unknowns multiplied together!
This is where we decide it's not doable and back-burner the problem for five years...
...aaand we're back! Going back to look at the original transformation matrix that we were shooting for:
$$\begin{bmatrix}
S\cos\theta & S\sin\theta & 0 \\
-S\sin\theta & S\cos\theta & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
We see that we don't really HAVE to know what \(S\) and \(\theta\) actually are. In order to fill the box, we just have to know what \(S\cos\theta\) and \(S\sin\theta\) are! Let's make the substitution
$$S\cos\theta = A \\
S\sin\theta = B$$
So now we have:
$$\begin{bmatrix}
S\cos\theta & S\sin\theta & 0 \\
-S\sin\theta & S\cos\theta & 0 \\
T_x & T_y & 1
\end{bmatrix}
=
\begin{bmatrix}
A & B & 0 \\
-B & A & 0 \\
T_x & T_y & 1
\end{bmatrix}$$
Well that looks a lot simpler. Better yet, our ugly system of equations now looks like this:
$${x_1}^\prime = x_1 A - y_1 B + T_x \\
{y_1}^\prime = x_1 B + y_1 A + T_y \\
{x_2}^\prime = x_2 A - y_2 B + T_x \\
{y_2}^\prime = x_2 B + y_2 A + T_y$$
We can rewrite it to look like this:
$$x_1 A - y_1 B + T_x + 0 = {x_1}^\prime \\
y_1 A + x_1 B + 0 + T_y = {y_1}^\prime \\
x_2 A - y_2 B + T_x + 0 = {x_2}^\prime \\
y_2 A + x_2 B + 0 + T_y = {y_2}^\prime$$
Not only is this solvable, it's matrix solvable!
$$\begin{bmatrix}
x_1 & -y_1 & 1 & 0 \\
y_1 & x_1 & 0 & 1 \\
x_2 & -y_2 & 1 & 0 \\
y_2 & x_2 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
A \\
B \\
T_x \\
T_y
\end{bmatrix}
=
\begin{bmatrix}
{x_1}^\prime \\
{y_1}^\prime \\
{x_2}^\prime \\
{y_2}^\prime
\end{bmatrix}$$
Let's call the matrix M to make it easier to refer to:
$$\begin{bmatrix}
x_1 & -y_1 & 1 & 0 \\
y_1 & x_1 & 0 & 1 \\
x_2 & -y_2 & 1 & 0 \\
y_2 & x_2 & 0 & 1
\end{bmatrix}
=
M$$
Then we have:
$$M
\begin{bmatrix}
A \\
B \\
T_x \\
T_y
\end{bmatrix}
=
\begin{bmatrix}
{x_1}^\prime \\
{y_1}^\prime \\
{x_2}^\prime \\
{y_2}^\prime
\end{bmatrix}$$
And the solution to an equation like this is to pre-multiply both sides by the inversion of the matrix:
$$M^{-1}
M
\begin{bmatrix}
A \\
B \\
T_x \\
T_y
\end{bmatrix}
=
M^{-1}
\begin{bmatrix}
{x_1}^\prime \\
{y_1}^\prime \\
{x_2}^\prime \\
{y_2}^\prime
\end{bmatrix}$$
\(M\) and \(M^{-1}\) cancel out, leaving us with:
$$\begin{bmatrix}
A \\
B \\
T_x \\
T_y
\end{bmatrix}
=
M^{-1}
\begin{bmatrix}
{x_1}^\prime \\
{y_1}^\prime \\
{x_2}^\prime \\
{y_2}^\prime
\end{bmatrix}$$
The amazing part here is that M only contains \((x_1, y_1, x_2, y_2, 0, 1)\) - only the original coordinates of where you put your thumb and forefinger. So even the moderately expensive part - inverting M - only needs to be done when you put your fingers down. As you move your fingers around, we can figure out what the transform matrix should be just by doing a matrix multiplication!
Sample Code
The Story Behind the Article
There was this feature in an app I worked on that allowed you to position and resize an overlay on top of another image in order to align the two. Because you were using a mouse and couldn't do both at the same time, you had to use one point to move and the other to rotate around the first control point. It seemed natural to do this in a translate-then-scale operation using NSAffineTransform, and it worked, but there was always some mystery about it. (Read: I hacked on it until it worked, but didn't really know how it worked.) There was always a desire to understand it and make it better.
When iOS and the pinch gesture came along, it tickled this brain-itch again. I was sure there ought to be a way to solve it rigorously, and in one step. My old partner Joël Spaltenstein is pretty good at math (his dad is a math prof), and he helped me work out a solution 90% of the way on a sheet of 8.5x11 paper. I kept that sheet of paper in my folio, but the next time I thought to work on it, the paper was gone, and Joël had gone on to bigger things. I thought, "it shouldn't be that hard to work out the solution again". That's where this article begain. It was literally years before I remembered Joël's trick on how to solve the system of equations to get the final answer.