How to build a recommender from scratch and boost its accuracy while keeping it simple

What would you recommend?

The circumstances of 2020–2021 have pushed more and more business owners to move most communication with their customers online. You may have noticed how much the number of online activities that anticipate, guide and surround a purchase (even an offline one) has changed recently. It seems that every Internet business does its best to maintain a never-ending dialogue with the client. In such a dialogue, the client expects at the very least to receive relevant personal offers from the seller in order to make a choice faster.

Personal offers for customers are generated by so-called recommender systems (recsys).
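
To make the idea concrete, here is a minimal sketch of one simple way such offers could be generated: counting which items tend to be bought together. It is not the method of the article itself; the function names and the purchase histories are hypothetical and made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    co = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co[a][b] += 1
    return co


def recommend(co, owned, top_n=3):
    """Rank items the customer does not own by co-occurrence with owned items."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_n]


# Hypothetical purchase histories (assumption, not data from the article).
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "sleeping bag", "lantern"],
    ["stove", "lantern"],
]
co = build_cooccurrence(baskets)
print(recommend(co, {"tent"}, top_n=2))
```

A customer who bought a tent is offered a sleeping bag first, because that pair appears together most often. Real systems usually add normalization and much more data, but the principle of ranking unseen items by their relation to known ones stays the same.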


A way to build your own object detector and turn semi-structured blocks of data in an image into machine-readable text

Image by author

Document parsing

Document parsing is the initial step in transforming information into valuable business data. That information is often stored in commercial documents in tabular format or, occasionally, in data blocks without distinctive graphical borders. A borderless table may simplify the visual perception of semi-structured data for us humans. From a machine-reading point of view, however, presenting information on a page this way has quite a few shortcomings, which make it difficult to separate the data belonging to a presumed table structure from the surrounding text.

Tabular data extraction as a business challenge may have several ad hoc or heuristic rule-based solutions…
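
As an illustration of what a heuristic rule might look like, the sketch below rebuilds a borderless table from positioned text chunks by treating large jumps in coordinates as column and row boundaries. It is only one possible rule under stated assumptions: each chunk is a hypothetical `(x, y, text)` tuple holding one cell-sized fragment, `y` grows downward as in image coordinates, and the gap thresholds are arbitrary values chosen for the made-up sample.

```python
def cluster_starts(values, gap):
    """Group sorted coordinates; start a new group when the jump exceeds `gap`."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= gap:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [g[0] for g in groups]


def nearest(starts, v):
    """Index of the group start closest to coordinate `v`."""
    return min(range(len(starts)), key=lambda i: abs(starts[i] - v))


def words_to_table(words, col_gap=20.0, row_gap=5.0):
    """Place each (x, y, text) chunk into a grid cell by its nearest row and column."""
    col_starts = cluster_starts([x for x, _, _ in words], col_gap)
    row_starts = cluster_starts([y for _, y, _ in words], row_gap)
    table = [["" for _ in col_starts] for _ in row_starts]
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        r, c = nearest(row_starts, y), nearest(col_starts, x)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


# Hypothetical chunk positions, slightly misaligned as OCR output often is.
words = [
    (50.0, 100.0, "Item"), (200.0, 100.5, "Qty"), (300.0, 100.0, "Price"),
    (50.5, 120.0, "Bolt"), (202.0, 120.0, "10"), (301.0, 119.5, "2.50"),
    (50.0, 140.0, "Nut"), (300.5, 140.0, "1.20"),
]
for row in words_to_table(words):
    print(row)
```

Rules like this work on clean, regular layouts and break easily on skewed scans or multi-line cells, which is part of why such solutions tend to stay ad hoc.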


Hands-on Tutorials

Get a sense of how to deal with context-specific data structures with pdfminer, numpy and pandas

image from https://wiki.atlan.com/unstructured-data/

What is semi-structured data?

In today’s work environment, PDF documents are widely used for exchanging business information, both internally and with trading partners. Naturally, you’ve seen quite a lot of PDFs in the form of invoices, purchase orders, shipping notes, price lists, etc. Despite serving as a digital replacement for paper, PDF documents present a challenge for automated manipulation of the data they store. That data is only as accessible as data written on a piece of paper, since some PDFs are designed to transfer information to us humans, not to computers. Such PDFs can contain unstructured information that does not have a pre-defined data model or…


A few quick notes on how to perform OCR in Python using some popular engines, along with their quirks and tips

Image by author

Modern OCR systems

OCR (Optical Character Recognition) systems transform an image containing valuable information (presumably in text format) into machine-readable data. In most cases, performing OCR through some available means is the initial step for data extraction from paper or scan-based PDF documents.

Although a short web search turns up plenty of links to various open-source and commercial tools, Google Vision and Tesseract have gained a long head start over their competitors as OCR engines, especially in recent years.

Tesseract is an offline, open-source text recognition engine with a full-featured API that can be easily integrated into any business…

Volodymyr Holomb

Passionate value adder to data, providing business intelligence services based on data processing, visualization and analysis
