Scaling to cloud computing has been a game changer for us. Now we can really accelerate our data pipeline, avoid training bottlenecks, and run as many experiments as our wallet allows.
This post is a small departure from the typical biomechanics content I write here, but it’s an interesting topic nonetheless: how do we train our algorithms? This could get ridiculously dense, so I’m hoping to keep it to a high-level description of the process and a couple of things we have done to make our lives a bit easier.
Before we get into how we train these algorithms, it may help to describe the data workflow in general terms. Effectively, we take a picture and identify a keypoint on that image by eye. For instance, if we are looking to add a knee joint, we would click on the knee’s center of rotation (I’ll cover this process more comprehensively in another post, including how we handle different body shapes and sizes, since it’s too much for here), and we would do this across hundreds of thousands of images.
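To make that concrete, each label ultimately boils down to something very simple: which image it came from, which joint it is, and where the human clicked. The sketch below is purely illustrative (the file path and field names are made up), but it captures the idea:

```python
# Hypothetical example of what a single annotation reduces to: the image
# it came from and the pixel the labeler clicked for the knee's center
# of rotation. (The path and field names are illustrative, not our real format.)
annotation = {
    "image": "captures/session_0421/frame_000137.png",
    "keypoint": "right_knee",
    "x_px": 642,   # horizontal pixel coordinate of the click
    "y_px": 418,   # vertical pixel coordinate of the click
}
```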
With these data, we then train an algorithm to learn where that knee joint is, without a human needing to click on it! The learning process works much like it does for humans. Consider a child learning to speak. The child tries to say a word and, initially, might get it wrong. The parents correct them, and the child tries again. This repeats over and over until the child says the word correctly. Our algorithms train the same way: the algorithm guesses where the knee is (incorrectly at first), and we provide feedback until it gets it right!
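If you like to see things in code, here is a minimal sketch of that guess-and-correct loop. It assumes a PyTorch-style setup with a toy network and random stand-in data (the post doesn’t describe our actual model or pipeline), but the shape of the loop is the point: guess, measure how wrong the guess was against the clicked label, and nudge the model to do better.

```python
import torch
from torch import nn

# Toy stand-in for a real pose-estimation network: it takes an image
# and guesses an (x, y) knee location.
class KeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)  # predict (x, y)

    def forward(self, images):
        return self.head(self.backbone(images))

model = KeypointNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Stand-in batch: 8 random "images" and 8 clicked (x, y) labels,
# normalized to [0, 1] so the numbers are scale-free.
images = torch.rand(8, 3, 128, 128)
clicked_xy = torch.rand(8, 2)

for step in range(100):
    guess = model(images)              # the algorithm's guess
    loss = loss_fn(guess, clicked_xy)  # how wrong the guess was
    optimizer.zero_grad()
    loss.backward()                    # the "correction" signal
    optimizer.step()                   # try again, a bit better
```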
In the early days, we trained these in-house on some very nice computers. Our data throughput was so slow that it really didn’t matter; we could train for two weeks per joint and never fall behind. Times have changed for the better, and now we have a large data team that adds and corrects labels constantly, all with the goal of continually improving our tracking. Moreover, we are constantly expanding that team, so we expect the “problem” to get worse.
I promise I’m not sponsored by a cloud service (Google, if you are listening, please send some compute credits!!), but really the only way we could keep up with our growing data team was to switch to a cloud compute paradigm, where we lease supercomputers and train our algorithms on the cloud. To put this in perspective, some of the models we run now would historically have taken 24 days to train, which, in practice, means they would never finish (i.e., something would always happen, like a power outage). After adjusting the code a little to run on the cloud, we can complete these same training sessions in about 12 hours.
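I’m glossing over what “adjusting the code” actually involved, and the details vary by framework and cloud provider, but one typical change of this kind is making a single training script spread its work across every GPU on the leased machine. Purely as an illustration, and assuming a PyTorch setup launched with torchrun (not necessarily what we use), it looks roughly like this:

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun supplies the rendezvous info and LOCAL_RANK.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Stand-in for the real network, wrapped so gradients are averaged
    # across all GPUs and the run behaves like one big training job.
    model = nn.Linear(32, 2).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Each process trains on its own shard of the data (random here).
    for step in range(100):
        x = torch.rand(64, 32, device=device)
        y = torch.rand(64, 2, device=device)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You would launch that with something like `torchrun --nproc_per_node=8 train.py`, one process per GPU on the leased machine.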
Now we are keeping up with our data team without breaking a sweat. The other benefit is that we can run as many experiments as our wallet will allow, so we never have to shelve our scientific curiosity just because more pressing problems are being tackled; we can run them all at the same time. It has also changed our relationship with our data team and developers. We are far more connected now, and because of this increase in compute speed, we have come up with other strategies that have further improved both the speed and the accuracy of our data input.
For us, the biggest benefit is that we can better keep up with customers' needs, keep improving the accuracy of our product, and avoid having older hardware limit our ability to innovate. As a word of caution, I don’t sleep quite the same when these sessions are running, because if you forget that one is running, or it times out, the bill will be the stuff of nightmares. Happy computing!