By the end of the first decade of the 21st century, we humans were already contributing to machine learning without knowing it. You and I, along with millions of unsuspecting users worldwide, reportedly helped digitize the entire Google Books archive by the start of 2012. I assure you, I am not fibbing. If you remember, during the 2000s, internet users worldwide were shown an image of text containing one or two words, which had to be typed out correctly in order to proceed. This was a way of confirming that the user was, in fact, human, and not a machine trying to extract sensitive information from the website. That was only part of the story, however. Only one of the two “Captcha” words shown was actually part of the test. The other was an image of a word taken from a book that had not yet been transcribed. In effect, we have all unknowingly contributed to preparing data for machine learning (ML).
Do We Need Millions of People to Build AI?
Making machines intelligent, that is, teaching them to find answers to complex problems without being explicitly programmed or spoon-fed for every single query, requires accurate training data. To help algorithms (read: machines) understand patterns, sequences, or outcomes, humans first have to interpret images, videos, text and audio and differentiate between similar and dissimilar objects. Merely feeding the machine inputs about objects is, however, not enough. Model validation checks a trained model against labelled data it did not see during training, to measure how accurate its predictions are. In practice this means cross-checking the model’s outputs against the labels to see whether it has detected each object correctly. Hence, for AI and ML models to give accurate results, training data is very important: it is the foundation on which any machine learning model is built. Although we need hundreds, thousands, or even millions of data points to build efficient machine learning models, we would not need more than a few hundred people, or fewer, to develop AI to suit each requirement.
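To make the idea of validating against labelled data concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a prescribed method: the two-feature toy data, the "cat" and "dog" labels, the 75/25 split, and the very simple nearest-centroid classifier standing in for a real model. The point is only the workflow: train on one part of the labelled data, then count how often predictions match the labels on the part that was held back.

```python
import random


def split_labelled_data(examples, validation_fraction=0.25, seed=0):
    """Shuffle labelled examples and hold back a fraction for validation."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_centroids(train):
    """'Train' by averaging the feature vectors of each label (hypothetical stand-in model)."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Predict the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def validation_accuracy(centroids, validation):
    """Fraction of held-out examples whose predicted label matches the human label."""
    correct = sum(1 for features, label in validation
                  if predict(centroids, features) == label)
    return correct / len(validation)


# Toy labelled data (an assumption for illustration): two well-separated groups.
rng = random.Random(42)
examples = (
    [([rng.gauss(0, 0.5), rng.gauss(0, 0.5)], "cat") for _ in range(40)]
    + [([rng.gauss(5, 0.5), rng.gauss(5, 0.5)], "dog") for _ in range(40)]
)

train, validation = split_labelled_data(examples)
centroids = train_centroids(train)
accuracy = validation_accuracy(centroids, validation)
print(f"validation accuracy: {accuracy:.2f}")
```

If the labels in the held-out set were wrong, the accuracy figure would be misleading no matter how good the model is, which is why the quality of the labelled data matters as much as its quantity.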
Why is training data important?
AI model training is comparable to teaching kindergarten students. One has to describe every minute detail of the object: size, colour, use, and so on. All the learning, whether supervised or reinforcement, happens on this training data, so it is of utmost importance that the training data is rich and of the highest quality and accuracy. The other crucial aspect of building training data is the quality of data labelling, or data annotation. Be it images, text, speech, video or geo data, every point of the object must be annotated accurately. It is also important to remember that high-quality training data makes model training more accurate and makes model validation and model testing faster.
The quality, accuracy, relevance and availability of training data directly affect whether an AI model achieves its goals. As pointed out before, it is best to think of training data as similar to learning material. A student with an outdated textbook and half the pages missing may not pass an examination. Similarly, without quality training data, the AI will do its job haphazardly, if at all.
Coming back to Google. If you are wondering why images have replaced captcha words, the answer is that web users are now annotating images to identify patterns and symbols. All the traffic lights, road crossings, motorcycles and cars that you are identifying today are quietly building training data for the next breakthrough in AI.