Can AI estimate food weight from a photo?

We weighed real meals, then measured how close the models' gram estimates came.

A meal-photo tracker rests on one number: how many grams of each food are on the plate. Get the grams wrong and the calories are wrong before you touch anything else.

This is a writeup of an internal, still-running benchmark of Pensum's photo feature. The sample is small and the numbers move as we add plates, so read everything here as a direction rather than a settled result. The picture is unflattering in places, and we are showing it as measured.

Why the grams are the hard part

Identifying the food in a photo is largely a solved problem. Modern vision models will tell you a plate holds fish, couscous and spinach, and they are usually right. The value, and nearly all of the error, comes one step later: turning that plate into portions. A model that confidently says "130 g of cottage cheese" when it is really 56 g has produced a roughly 2.3x calorie error, and correct labelling does not save it. So we score on how far each estimate lands from the weighed grams.

How we measured it

The setup is deliberately plain:

Real ground truth. Each item on each plate was weighed on a kitchen scale before eating. That weight is the answer key.
The real engine. We replay the exact prompt the shipping app uses, so the results reflect what the product would actually return.
Several models. Each plate is run through a range of current vision models, from lightweight "flash" tiers up to premium frontier models.
Different amounts of context. Photo alone; photo plus the kind of notes a user would type (ingredients, plate size, cooking method); and photo plus a weighed total.
One metric. For each item, we take the gap between the estimate and the true weight as a percentage of the true weight, then average the absolute gaps across a plate. Lower is better. Zero would be perfect.
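
As a rough illustration of that metric, here is a minimal Python sketch of a per-plate mean absolute percentage error. The function name and the example plate are ours for illustration, not the benchmark's actual code; the weighed values come from the table in Finding 1, and the estimates are made up apart from the 130 g cottage cheese reading mentioned above.

```python
def plate_error(estimates_g, weighed_g):
    """Mean absolute percentage gap between estimated and weighed grams.

    Both arguments map item name -> grams. The gap for each item is
    taken relative to the weighed (true) value.
    """
    if estimates_g.keys() != weighed_g.keys():
        raise ValueError("estimates and weighed items must match")
    gaps = [
        abs(estimates_g[item] - weighed_g[item]) / weighed_g[item] * 100
        for item in weighed_g
    ]
    return sum(gaps) / len(gaps)


# Weighed values from the Finding 1 table; estimates are hypothetical.
weighed = {"turkey": 65, "sourdough": 65, "cottage cheese": 56}
estimated = {"turkey": 70, "sourdough": 50, "cottage cheese": 130}

print(f"plate error: {plate_error(estimated, weighed):.1f}%")
```

Because the gaps are absolute, an over-read on one item and an under-read on another do not cancel out. A single badly over-read item, such as the cottage cheese at about 132% here, dominates the plate average.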

Finding 1: the food sets the difficulty

The clearest pattern is that the shape of the food sets the error, more than the choice of model does. Flat, legible foods are read well by almost everything. Heaped, soft, or small lone items get over-estimated by almost everything, because a single flat photo does not capture height, and for a heaped food most of the weight is in the height.

Photo-only estimates against the scale. Same models, very different difficulty by food.
Food (weighed truth)	What the models did
Folded turkey cold cuts, 65 g	Read well. Most models within about 10 to 15% with no help at all.
Sourdough slice, 65 g	Consistently under-read. Models assume bread is lighter than a dense slice really is.
Heaped cottage cheese, 56 g	Consistently over-read, often 80 to 130 g. Up to roughly 2.3x the real weight.
Lone sweet-potato wedge, 50 g	Over-read by 50 to 200%. Small items alone on a plate read far bigger than they are.

A heaped soft food and a thin folded one are not the same task. A model that reads one well can badly miss the other.

Finding 2: a bigger, smarter model does not fix it

The obvious next move is to throw a more capable model at the problem. We did. Premium frontier models, including ones with extended "thinking" turned up, did not beat the cheap lightweight ones on gram accuracy. If anything, the reasoning models did worse, talking themselves into confident but wrong volume math. On the plates shared in this writeup, the lighter models averaged around 20% error and the premium tier landed closer to 27 to 29%.

The big models are not bad. The ceiling is set by the input. One flat photo from one angle does not carry the depth needed to recover volume, and no model can invent information that is not in the pixels. The limit is the photo itself.

Finding 3: the thing that actually helps is a number you already have

If the photo cannot supply scale, the user can. The largest single improvement we measured came from telling the engine the total weight of the food, and letting it decide only how that total splits across the items. Weigh the plate, and the model no longer has to guess how big everything is; it only has to guess the proportions, which it is much better at.

Average gram error as we add context. More signal, less guessing.
What the model was given	Average gram error
Photo alone (better models)	~20 to 30%
Photo plus typed context (ingredients, plate size, method)	a modest improvement
Photo plus the weighed total	~13 to 17%

Across our cached set, pinning each estimate to the true total cut the average error from about 29% to 17%, and on the hardest plate from about 50% to 25%. In a live re-run on a fresh plate, giving the model the total weight took a single-photo estimate from roughly 27% error down to about 13%, roughly halving it. Early readers pointed out the irony: the biggest AI accuracy win is a kitchen scale.
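
One simple way to picture "pinning to the total" is a proportional rescale: keep the model's split between items, but force the sum to match the scale. The sketch below is a simplified illustration under that assumption. It is not the shipping engine, which is given the total and decides the split itself; the function name and the estimates are hypothetical.

```python
def pin_to_total(estimates_g, total_g):
    """Rescale per-item estimates so they sum to a weighed total.

    The relative proportions between items are left unchanged; only
    the overall size is corrected.
    """
    estimated_sum = sum(estimates_g.values())
    if estimated_sum <= 0:
        raise ValueError("estimates must sum to a positive weight")
    scale = total_g / estimated_sum
    return {item: grams * scale for item, grams in estimates_g.items()}


# Hypothetical photo-only estimates; the total is the sum of the
# weighed values from the Finding 1 table (65 + 65 + 56 g).
estimated = {"turkey": 70, "sourdough": 50, "cottage cheese": 120}
pinned = pin_to_total(estimated, total_g=186)

for item, grams in pinned.items():
    print(f"{item}: {grams:.1f} g")
```

The rescale only fixes the overall size. An item that took too large a share of the original estimate still takes too large a share afterwards.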

Finding 4: the honest floor, and a second kind of error

Even with the total handed to it, a floor remains: around 17% on our cached plates. What is left is a segmentation problem: small or lone items keep getting more than their share of the total. Weighing the plate fixes the overall size; how the model divides that total is a separate problem.

There is also a second error that has nothing to do with grams. A perfect weight matched to the wrong database entry still mislogs. Ask for "couscous" and you might land on the dry entry rather than the cooked one, and dry couscous carries more than three times the calories per 100 g. The word still has to resolve to the right food. That is a real and sometimes larger source of error, and we treat it as its own axis to fix.
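
To put a number on the couscous example, here is a small sketch of what a wrong database match does to a perfectly weighed portion. The calorie densities are approximate, typical reference values that we are assuming for illustration, not Pensum's database entries, and the 150 g portion is hypothetical.

```python
# Approximate kcal per 100 g; assumed typical values, for illustration only.
KCAL_PER_100G = {
    "couscous, cooked": 112,
    "couscous, dry": 376,
}


def logged_kcal(entry, grams):
    """Calories logged for a given database entry and weight."""
    return KCAL_PER_100G[entry] * grams / 100


portion_g = 150  # hypothetical, correctly weighed cooked portion

right = logged_kcal("couscous, cooked", portion_g)
wrong = logged_kcal("couscous, dry", portion_g)

print(f"cooked entry: {right:.0f} kcal")
print(f"dry entry:    {wrong:.0f} kcal ({wrong / right:.1f}x)")
```

The grams are identical in both lines. The whole gap comes from which entry the word "couscous" resolved to.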

What this means for how Pensum's photo tool works

The measurements pushed us to a clear stance: photo logging is an on-ramp you correct. We built the tool around that.

You lead with what you know. Any weight you type is held fixed and never re-estimated, and a single known number pulls the whole plate into line. A total weight, when you have it, is the strongest input you can give.
It returns an editable breakdown. The result is a list of ingredients and grams you can adjust, because the raw estimate will be off on exactly the foods this research flags.
Photo entries are flagged as estimated. You always know which numbers were weighed and which were read from a picture, and an estimate is never shown as more precise than it is.
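
As a sketch of how typed weights and a total can work together, the example below holds any user-typed weight fixed and splits the rest of a known total across the remaining items in proportion to their photo estimates. This is one plausible reading of the behaviour described above, written for illustration. The function name, the splitting rule and the numbers are assumptions, not Pensum's implementation.

```python
def apply_known_weights(estimates_g, typed_g, total_g):
    """Hold typed weights fixed; split the remainder of the total.

    estimates_g: item -> grams read from the photo
    typed_g:     item -> grams the user typed (never re-estimated)
    total_g:     weighed total for the whole plate
    """
    remainder = total_g - sum(typed_g.values())
    if remainder < 0:
        raise ValueError("typed weights exceed the total")
    free = {k: v for k, v in estimates_g.items() if k not in typed_g}
    free_sum = sum(free.values())
    if free and free_sum <= 0:
        raise ValueError("remaining estimates must be positive")
    result = dict(typed_g)
    for item, grams in free.items():
        # Each untyped item keeps its estimated share of what is left.
        result[item] = remainder * grams / free_sum
    return result


# Hypothetical estimates; the user typed the cottage cheese weight.
estimated = {"turkey": 70, "sourdough": 50, "cottage cheese": 120}
fixed = apply_known_weights(estimated, {"cottage cheese": 56}, total_g=186)

for item, grams in fixed.items():
    print(f"{item}: {grams:.1f} g")
```

In this made-up case, typing the weight of the one item the photo reads worst removes its error entirely, and the remaining items only have to share what is left.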

The bottom line: if the food is in front of you and a scale is within reach, weigh it. The photo tool is for when weighing is not an option.

Caveats

This is a living benchmark on a small number of labelled plates. Absolute numbers will shift as we add cases, and run-to-run noise is real, so we lean on the directions rather than any single figure. We are sharing it because "we actually weighed this" is rarer than it should be in this space, and because being wrong in public is how the numbers get better.

Pensum is a fast, private macro tracker built on real European food data, with photo logging designed around exactly this research.