What is ‘Metadata’ and why does it matter?
In the information technology world, metadata is a term you’ll often hear thrown around in many contexts – but what does it mean, and why does it matter? If you search for a definition of metadata, you’ll see things like ‘data that provides information about other data’. OK, that might be correct, but it’s not very helpful for understanding.
Here’s an example instead. In the olden days you may have gone to an actual physical store and bought your music on CDs or vinyl records. The packages were sealed, so it usually wasn’t very easy to listen to them in the store, but there was information (metadata) on the back of the package to tell you more about what was inside: information about the musicians, songwriters, titles and lyrics. You could use this metadata to help you decide if you wanted to buy the data (the songs) inside. Perhaps in your brain you could correlate the metadata with other data to inform your decision making (e.g. you remember hearing one of the titles on the radio, and you liked it).
Realizing that many readers of this blog may have never bought music this way, let’s move on to a more modern example. Social media and online search make extensive use of metadata. For example, suppose you enjoy funny animal videos. Using your favorite online video browsing service, you could look for animal videos – but would these be funny videos, or something very unfunny like a surgery tutorial for veterinarians? You might find a funny dog video, but what if you are more of a cat person? It could take a lot of manual viewing of data (the videos) to get to what you want. Metadata embedded in online video files can make it easier to find what you are looking for. It could include information such as the type of animal and the category of video (e.g. humor, education, clinical), which the computerized brains correlate with human-friendly search language such as ‘funny cat videos’.
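To make the video example a little more concrete, here is a minimal sketch in Python of how a search might use metadata instead of watching every video. Everything in it is made up for illustration: the `Video` record, the `animal` and `category` fields, the sample catalog and the `search` function are assumptions, not how any real video service works.

```python
from dataclasses import dataclass


@dataclass
class Video:
    """Hypothetical record: the title plus two pieces of metadata."""
    title: str
    animal: str    # metadata: type of animal
    category: str  # metadata: humor, education, clinical, ...


# Made-up catalog; a real service would hold far more videos and tags.
CATALOG = [
    Video("Puppy slips on ice", "dog", "humor"),
    Video("Kitten vs. cucumber", "cat", "humor"),
    Video("Feline dental surgery tutorial", "cat", "clinical"),
    Video("How cats hunt", "cat", "education"),
]


def search(catalog, animal, category):
    """Return the videos whose metadata matches both tags."""
    return [v for v in catalog if v.animal == animal and v.category == category]


# 'funny cat videos' translated into metadata terms
funny_cats = search(CATALOG, animal="cat", category="humor")
print([v.title for v in funny_cats])
```

The search never opens a video file. It only compares the small metadata fields, which is why it can skip the surgery tutorial without anyone having to watch it.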
Now pivoting back to the information technology world, why does metadata matter? Cyber security breaches are constantly in the news these days, so let’s use a very simplified example there. At the lowest level, computerized data is represented as a series of 1s and 0s (binary data), which can be written in a format ever so slightly more manageable, like hexadecimal. Eventually a bunch of these bytes are combined to form a computer file (e.g. a PDF), which is attached to an email and transmitted over the internet in packets. In the process the raw data would have metadata attributed to it, e.g. this is a PDF, it is an email attachment, it was sent from a computer running this type of OS, and the transmission originated from this internet protocol (IP) address (oh, and by the way, a hacker has embedded malware into this file). Suppose the malware was identifiable from metadata contained in a list of known threats. Would you want to inspect every single 1 and 0, or hex byte, or every packet of every file on your network to find it? Not if you could avoid it, you wouldn’t. The malware might be known to infect PDF attachments, so metadata about the file type would narrow down what to look at. But there are lots of legitimate PDFs, and looking at all of them might take too long, so correlating with certain IP addresses known to be popular with hackers could further narrow the scope of investigation. You could then look at the actual source data in a controlled manner and, if necessary, block the malware.
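The narrowing-down described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real security tool: the `Attachment` record, the `SUSPICIOUS_IPS` list and the `needs_inspection` function are hypothetical, and the IP addresses come from ranges reserved for documentation examples.

```python
from dataclasses import dataclass


@dataclass
class Attachment:
    """Hypothetical metadata recorded for an email attachment."""
    filename: str
    file_type: str  # e.g. "pdf", "docx"
    source_ip: str  # address the transmission originated from


# Made-up threat list, using addresses reserved for documentation.
SUSPICIOUS_IPS = {"203.0.113.7", "198.51.100.23"}


def needs_inspection(att, risky_type="pdf", suspicious_ips=SUSPICIOUS_IPS):
    """Flag an attachment only if both metadata checks match."""
    return att.file_type == risky_type and att.source_ip in suspicious_ips


attachments = [
    Attachment("report.pdf", "pdf", "192.0.2.10"),     # PDF, ordinary sender
    Attachment("invoice.pdf", "pdf", "203.0.113.7"),   # PDF, suspicious sender
    Attachment("notes.docx", "docx", "203.0.113.7"),   # suspicious sender, not a PDF
]

# Only the flagged files go on to a closer look at the actual contents.
to_inspect = [a for a in attachments if needs_inspection(a)]
print([a.filename for a in to_inspect])
```

Two cheap metadata checks shrink three files down to one candidate, and only that one needs its underlying data examined.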
Much like the CD cover or the video search application, information security metadata narrows down the data to actually look at. But much like those earlier examples, metadata alone isn’t sufficient (what fun would it be just looking at a CD cover, or reading an explanation of a cat video?). Metadata becomes most powerful when it is combined with the source data itself.