Automated Image Captioning PoC

I was responsible for the entire technical concept, feasibility study, and client communication. Over the course of the project, I developed a deep learning-based system for automated image captioning.

The main challenge was to combine multiple data sources, including the image itself, into a single grammatically correct German sentence describing what a person was doing in the picture. Since the target group consisted of politicians, we quickly discovered that 98% of all images depicted one of only seven distinct actions.
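To illustrate how a finding like this can be checked, here is a minimal sketch that counts how many of the most frequent action labels are needed to reach a given share of an annotated image set. The function name `classes_needed` and the toy labels are made up for illustration; they are not project code or project data.

```python
from collections import Counter


def classes_needed(labels, target=0.98):
    """Return the most frequent labels that together cover `target` of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    kept = []
    # Walk through the labels from most to least frequent until the
    # cumulative share reaches the requested coverage.
    for label, n in counts.most_common():
        kept.append(label)
        covered += n
        if covered / total >= target:
            break
    return kept


# Toy annotation set (hypothetical labels, not project data).
toy_labels = ["speech"] * 6 + ["handshake"] * 3 + ["other"] * 1
print(classes_needed(toy_labels, target=0.85))
```

When a handful of classes covers almost all images, the visual part of the task can be treated as a small classification problem rather than open-ended caption generation.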

By applying transfer learning from an existing computer vision model originally trained to classify sports actions, we achieved excellent results. A substantial amount of NLP processing was then required to generate the final sentence, which had to describe not only the action but also the politician’s name, location, and occasion, all extracted from various external data sources.
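The sentence-assembly step can be pictured with a small template-based sketch. It assumes the classifier has already produced an action label and that name, occasion, and location have been looked up elsewhere. The labels, the German phrases, and the names `CaptionFacts` and `build_caption` are hypothetical, and the real system needed considerably more NLP processing than this.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action labels mapped to (finite verb, object) in German.
# Keeping the two apart lets adverbials sit between them, which is the
# usual word order in a German main clause.
ACTION_PHRASES = {
    "speech": ("hält", "eine Rede"),
    "handshake": ("schüttelt", "Hände"),
    "interview": ("gibt", "ein Interview"),
}


@dataclass
class CaptionFacts:
    name: str                       # person shown in the image
    action: str                     # label predicted by the image classifier
    occasion: Optional[str] = None  # includes its preposition, e.g. "beim Parteitag"
    location: Optional[str] = None  # place name, e.g. "Berlin"


def build_caption(facts: CaptionFacts) -> str:
    """Assemble subject + verb + optional adverbials + object into one sentence."""
    verb, obj = ACTION_PHRASES[facts.action]
    parts = [facts.name, verb]
    if facts.occasion:
        parts.append(facts.occasion)
    if facts.location:
        parts.append(f"in {facts.location}")
    parts.append(obj)
    return " ".join(parts) + "."


facts = CaptionFacts("Erika Mustermann", "speech", "beim Parteitag", "Berlin")
print(build_caption(facts))
# Erika Mustermann hält beim Parteitag in Berlin eine Rede.
```

A fixed template like this only works because the set of actions is small; missing fields are simply left out so the sentence stays grammatical.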

It’s worth emphasizing that this work was done before powerful LLMs became available, meaning that such captioning capabilities had to be built manually and from scratch.

The outcome was a system that exceeded client expectations and demonstrated the real potential of AI-driven automation in public image documentation.

For confidentiality reasons, I’m unable to showcase the actual system. The images shown here are AI-generated and do not represent the real project.
