Solve the Simple Data Problem First

AI keeps reaching for the complex answer when the easy one is sitting right there in the file.

I recently asked an AI assistant to identify the location of a photo. It did something impressive: it analyzed the vegetation and the architecture, and it looked for text in the image. It built a detailed case for where the photo was taken. And it got the answer wrong.

The GPS coordinates were embedded in the file the entire time.

A photo of me with my daughter Jahya at Fatehpur Sikri (not the Agra Fort)

The AI skipped the metadata — the simple, deterministic, already-there answer — and went straight to visual inference. It performed an impressive feat of analysis to arrive at a confident wrong answer, when the right answer was sitting in the file properties.
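Reading that answer is cheap. Libraries such as Pillow expose a photo's GPS IFD directly, and the only real work left is converting EXIF's degrees/minutes/seconds rationals into signed decimal degrees. A minimal sketch of that conversion (the coordinates below are approximate and purely illustrative):

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert EXIF-style degrees/minutes/seconds plus a hemisphere
    reference ("N"/"S"/"E"/"W") into signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

# Fatehpur Sikri sits near 27.094 N, 77.661 E (approximate values).
lat = dms_to_decimal(27, 5, 38.4, "N")   # -> 27.094
lon = dms_to_decimal(77, 39, 39.6, "E")  # -> 77.661
```

That is the whole "analysis": three divisions and a sign flip, versus a vision model's guess.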

This isn’t a one-off glitch. I’m starting to see a pattern: AI systems often reach for the complex solution when simpler, more reliable data is already available. And the problem isn’t limited to AI; the products we’re building around AI are making the same mistake.

The Barcode Test

I tried another experiment. I used a visual product search tool — the kind built into shopping apps that lets you point your camera at something to identify it. I aimed the camera at the front cover of a board game. It worked well. The visual recognition identified the product almost every time.

Then I flipped the box over to the back, where the barcode was clearly visible, and tried again. It failed every time. The tool actively tried to interpret the image visually and ignored the barcode entirely.

Think about that for a moment. Barcodes have been around since the mid-1970s. They are one of the most reliable, standardized identifiers in commercial history. A barcode gives you an exact product match; no inference, no confidence score, no guessing. And the AI looked right past it.

The front-of-box visual recognition is genuinely impressive technology. But the back-of-box failure reveals a design blind spot: the system wasn’t built to check for the simple identifier first.
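That reliability claim is concrete, not hand-waving: an EAN-13 barcode carries its own check digit, so a reader can verify a scan arithmetically before trusting it. A sketch of the standard check-digit rule (the sample number below is a commonly cited EAN-13 test value):

```python
def ean13_is_valid(code: str) -> bool:
    """Validate an EAN-13 barcode via its built-in check digit.
    Digits at even 0-based positions are weighted 1, odd positions
    weighted 3; the weighted sum must be divisible by 10."""
    if len(code) != 13 or not code.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(code))
    return total % 10 == 0

ean13_is_valid("4006381333931")  # -> True
ean13_is_valid("4006381333932")  # -> False (corrupted last digit)
```

No confidence score, no inference: the scan is either self-consistent or it isn't.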

It’s Everywhere

Once you start looking, this pattern shows up all over the place.

AI tools routinely run optical character recognition on PDF files that already contain perfectly good selectable text, introducing errors to extract data that was already there in a cleaner form. Price comparison tools scrape web pages and report the standard price they find, returning list prices instead of actual selling prices, even when the page’s structured markup clearly distinguishes between the two. Audio analysis tools attempt to identify songs by processing the sound when the file’s metadata tags already contain the artist, title, album, and year. Photo services that could pull location data straight from GPS metadata instead try to infer it from the image content or skip location data entirely.

In every one of these cases, the simple data source is more reliable than the complex analysis. When it's right there, why skip it? Are we designing AI systems to show off inference rather than to solve the core retrieval problem?
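These checks are usually trivial. Take the audio example: whether an MP3 even has an ID3v2 tag is answered by the first ten bytes of the file, with no audio processing at all. A sketch of that header check, run here on a synthetic header rather than a real file:

```python
def read_id3v2_header(data: bytes):
    """Parse the 10-byte ID3v2 header at the start of an MP3 file.
    Returns (version, tag_size_bytes) if a tag is present, else None.
    The size field is a 28-bit "synchsafe" integer: four bytes, each
    contributing 7 bits, with the high bit of every byte cleared."""
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    version = (data[3], data[4])  # (major, revision), e.g. (3, 0) for v2.3
    size = 0
    for b in data[6:10]:
        if b & 0x80:  # high bit must be 0 in a synchsafe byte
            return None
        size = (size << 7) | b
    return version, size

# Minimal synthetic header: ID3v2.3, tag payload of 257 bytes.
read_id3v2_header(b"ID3\x03\x00\x00\x00\x00\x02\x01")  # -> ((3, 0), 257)
```

Ten bytes of parsing decides whether the artist and title are already sitting there, before a single second of audio is fingerprinted.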

Check the Label Before You Read the Book

One lens I keep returning to is that the best retrieval systems check the label before they read the book. The metadata layer — file properties, structured tags, embedded identifiers — is fast, deterministic, and everywhere. It doesn’t hallucinate. It’s either there or it isn’t.

But metadata isn’t always trustworthy. EXIF data can be stripped or wrong. MP3 tags can be mislabeled. A PDF’s embedded text layer can be garbled from a bad scan. So, the real discipline isn’t “always trust the metadata.” It’s “check the metadata, assess its quality, and use it when it’s good.”

That points to something I’m noticing more broadly in AI product design: the confidence in a piece of data matters almost as much as the data itself. A GPS coordinate with a valid timestamp and a tight accuracy reading is worth far more than one floating in isolation. A clean, searchable PDF text layer is worth trusting; a garbled one isn’t. Building the judgment about when to trust the simple source is the actual engineering problem — and it’s usually cheaper than re-deriving everything from scratch with a model.
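In practice, that judgment is mundane: a handful of sanity checks before a reading is trusted. A sketch with illustrative thresholds of my own choosing, not an authoritative rule:

```python
def gps_fix_is_trustworthy(lat, lon, accuracy_m=None, timestamp=None,
                           max_accuracy_m=50.0):
    """Heuristic quality gate for a GPS reading.
    Rejects out-of-range coordinates, the common (0, 0) "null island"
    default, fixes with no timestamp, and fixes whose reported
    accuracy radius is looser than max_accuracy_m."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    if lat == 0 and lon == 0:   # classic missing-data sentinel
        return False
    if timestamp is None:       # a fix floating in isolation
        return False
    if accuracy_m is not None and accuracy_m > max_accuracy_m:
        return False
    return True
```

A gate like this costs a few comparisons; re-deriving the location with a vision model costs a model.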

The deeper analysis layers — image recognition, natural language processing, audio analysis — are powerful and necessary for cases where metadata doesn’t exist or isn’t sufficient. But they should be the second pass, not the first. The complex analysis should fill gaps, not ignore what’s already known. Better still, metadata can feed deeper analysis and make it smarter. Imagine scene recognition that already knows the GPS coordinates — identifying a specific mountain peak, beach, or museum gets much easier.
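A sketch of that idea: use the GPS fix to shrink the candidate set before any vision model runs. The landmark list and coordinates below are approximate and purely illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # Earth mean radius ~6371 km

# Hypothetical candidate list; coordinates are approximate.
LANDMARKS = {
    "Fatehpur Sikri": (27.094, 77.661),
    "Agra Fort": (27.180, 78.021),
    "Taj Mahal": (27.175, 78.042),
}

def nearby_candidates(lat, lon, radius_km=5.0):
    """Shrink the recognition problem to landmarks within radius_km."""
    return [name for name, (la, lo) in LANDMARKS.items()
            if haversine_km(lat, lon, la, lo) <= radius_km]

nearby_candidates(27.094, 77.661)  # -> ["Fatehpur Sikri"]
```

The model no longer has to pick from every fort in India; the metadata already ruled out all but one candidate.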

This isn’t a new idea. It’s how good database design has always worked: check the index before you scan the table. But somewhere in the rush to build AI-powered everything, we started skipping the index.

A Note on Privacy

Some metadata gets stripped for privacy reasons, and that’s legitimate in shared or published contexts. GPS coordinates attached to public photo posts have led to real-world stalking cases, and platforms strip that data by default for good reason.

But plenty of applications have direct access to the original file — personal photo libraries, on-device assistants, enterprise document systems, an app working on your own device. In those contexts, there’s no PII excuse for ignoring what’s right there. Stripping GPS because “metadata is PII” is sometimes a real privacy decision and sometimes a lazy default that throws away useful signal. It’s worth knowing which one you’re making.

Why This Matters for Product Teams

If you’re a PM building a product that touches data, the temptation is always to lead with the most sophisticated technology. The AI-powered visual search. The deep learning model. Those features demo well and look great on a roadmap slide.

But simple data is often more reliable than complex analysis. GPS coordinates don’t hallucinate. Barcodes don’t guess. Structured metadata doesn’t need a confidence score — though checking its quality still matters. When simple data is available and trustworthy, it should be the foundation, not an afterthought.

The products that get this right will use AI to augment simple data, not replace it. They’ll check the metadata first, assess whether it’s good, and use deep analysis to fill in what’s missing or unreliable. That’s not a less ambitious approach. It’s a more reliable one.

Solve the simple data problem first. The cool stuff works better when you do.

What metadata are you throwing away?


Brad Hinkel is a product leader with 25+ years across Microsoft, Amazon, Disney, and Google, currently focused on AI product management and human-AI workflow design.

