Andrej Baranovskij Blog: Machine Learning

Monday, November 4, 2024

Structured Output Example with Sparrow UI Shell

Structured output is all you need. I deployed a Sparrow demo UI with Gradio to demonstrate the output Sparrow can produce by running a JSON schema query. You can see examples for the Bonds table, Lab results, and Bank statement.
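As a rough illustration of what querying by schema buys you, the sketch below checks model output against a field template and keeps only the rows that match. The template format, the field names, and the sample output are all hypothetical and are not taken from Sparrow.

```python
import json

# Hypothetical query template: field name -> Python type(s) expected in the output.
BONDS_QUERY = {"instrument_name": str, "valuation": (int, float)}


def conforms(row: dict, template: dict) -> bool:
    """Return True when row has exactly the template's fields with matching types."""
    if set(row) != set(template):
        return False
    return all(isinstance(row[name], expected) for name, expected in template.items())


# Made-up model output, for illustration only.
raw_output = (
    '[{"instrument_name": "Example Bond A", "valuation": 1250.5},'
    ' {"instrument_name": "Example Bond B"}]'
)
rows = json.loads(raw_output)

# Keep only the rows that follow the requested structure.
valid_rows = [row for row in rows if conforms(row, BONDS_QUERY)]
```

The second row is dropped because it lacks the `valuation` field, which is the kind of check a fixed output structure makes possible.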

Sunday, August 18, 2024

Sparrow Parse: Table Data Extraction with Table Transformer and OCR

I explain how we extract data with Sparrow Parse, using Table Transformer to identify the table area and build the table structure, which is then processed by OCR. Sparrow Parse implements additional logic to clean up and improve (removing noise, merging columns, adjusting rows) the table structure generated by Table Transformer.
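One of the clean-up steps mentioned above, merging columns, could look roughly like the sketch below. It is an assumption about the general technique, not Sparrow Parse's implementation: columns are reduced to horizontal ranges `(x_min, x_max)`, and the `min_overlap` threshold is a made-up parameter.

```python
def merge_columns(columns, min_overlap=0.5):
    """Merge column x-ranges that overlap by at least min_overlap of the narrower one."""
    merged = []
    for x_min, x_max in sorted(columns):
        if merged:
            prev_min, prev_max = merged[-1]
            overlap = min(prev_max, x_max) - max(prev_min, x_min)
            narrower = min(prev_max - prev_min, x_max - x_min)
            if narrower > 0 and overlap / narrower >= min_overlap:
                # Treat the two detections as the same column and widen it.
                merged[-1] = (min(prev_min, x_min), max(prev_max, x_max))
                continue
        merged.append((x_min, x_max))
    return merged


# Hypothetical detections: the first and third ranges describe the same column.
detected = [(10, 100), (95, 180), (20, 98), (185, 260)]
cleaned = merge_columns(detected)
```

Here the near-duplicate detection `(20, 98)` is folded into `(10, 100)`, while the slightly touching neighbour `(95, 180)` stays separate.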

Monday, May 29, 2023

Document AI: How To Convert Colab ML Notebook Into FastAPI App

I explain how I converted Donut ML model fine-tuning code, implemented as a Colab notebook, into an API running as a FastAPI app. I share several hints on how to simplify the code refactoring effort.

Monday, May 22, 2023

Speeding Up FastAPI App with Background Tasks

FastAPI runs background tasks in a parallel thread, which prevents blocking app endpoints when a long task executes. I explain it in this video and show the benefit of running time-consuming operations in background tasks.
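In FastAPI itself this is done by declaring a `BackgroundTasks` parameter on the endpoint and calling `add_task`. The sketch below shows only the underlying idea with the Python standard library, so it is not FastAPI's code: the handler and the task are hypothetical stand-ins, and a `threading.Event` plays the role of a slow job finishing.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
release = threading.Event()  # stands in for "the slow work has finished"
results = {}
pending = {}


def slow_task(doc_id: str) -> None:
    """Hypothetical long-running job; it blocks until release is set."""
    release.wait(timeout=5)
    results[doc_id] = "processed"


def handle_request(doc_id: str) -> dict:
    """Hypothetical endpoint handler: schedule the job and respond right away."""
    pending[doc_id] = executor.submit(slow_task, doc_id)
    return {"status": "accepted", "doc_id": doc_id}


response = handle_request("doc-1")
# The handler has already returned while the job is still waiting.
responded_before_done = "doc-1" not in results

release.set()                       # let the slow job finish
pending["doc-1"].result(timeout=5)  # wait for it, only to observe the result here
executor.shutdown()
```

The caller gets its response immediately, and the time-consuming work completes later on a worker thread.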

Monday, April 24, 2023

Efficient Document Data Extraction with Sparrow UI: Streamlit, FastAPI, and Hugging Face's Donut ML

In this easy-to-follow video, I show you how I built Sparrow UI, a tool for pulling data from documents using Streamlit. With Sparrow UI, you can upload a document and quickly run a data extraction task. I'll walk you through how the system works, using a FastAPI app on the backend to run a fine-tuned Donut ML model from Hugging Face. I'll also explain the code that sends POST requests from the Streamlit app, including how it sends files and text to the FastAPI endpoint. This way, you'll get a JSON response with the extracted info from your document.

Monday, February 27, 2023

Document Data Extraction - Data Mapping for Donut Model Fine-Tuning Dataset (Document AI)

I explain the current status of my dataset preparation work for fine-tuning the Donut ML model. I plan to use this model to run data extraction tasks on invoice documents. I share hints about data mapping and how to structure data to achieve better fine-tuning results.

Monday, February 13, 2023

Preparing Dataset for Donut Fine-Tuning (part 3, Document AI)

In this episode, I explain the redesigned Sparrow UI for data annotation. Sparrow UI is improved with the Streamlit grid component (aggrid). I show how to group related fields generated by OCR into a single entity and map it to a label. I will briefly review the code and discuss how you can set up a grid component in Streamlit - a convenient and helpful UI element.

Monday, February 6, 2023

Preparing Dataset for Donut Fine-Tuning (part 2, Document AI)

I explain how to group OCR results into a single entity using the Sparrow annotation tool. This is useful for fields such as an address or an item description, where the field text consists of multiple words.
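The grouping step can be pictured as follows. This is a minimal sketch under assumptions: each OCR word is a dict with `text` and a `box` in `[x_min, y_min, x_max, y_max]` form, and the sample words are invented. The data format used by the Sparrow annotation tool may differ.

```python
def group_words(words, label):
    """Combine OCR words into one labeled entity with a union bounding box."""
    # Read in natural order: top-to-bottom, then left-to-right.
    ordered = sorted(words, key=lambda w: (w["box"][1], w["box"][0]))
    boxes = [w["box"] for w in ordered]
    return {
        "label": label,
        "text": " ".join(w["text"] for w in ordered),
        "box": [
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        ],
    }


# Hypothetical OCR output for a two-line address, in arbitrary order.
selected = [
    {"text": "Street", "box": [60, 10, 110, 22]},
    {"text": "12", "box": [10, 10, 30, 22]},
    {"text": "Main", "box": [35, 10, 55, 22]},
    {"text": "Springfield", "box": [10, 26, 90, 38]},
]
entity = group_words(selected, "address")
```

The four separate words become one `address` entity whose box covers all of them.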

Tuesday, January 31, 2023

Preparing Dataset for Donut Fine-Tuning (part 1, Document AI)

I explain the dataset I will be using to fine-tune the Donut model. I show how PDFs are converted to image files for further processing and OCR data extraction. In the next step, the JSON data is converted to a format that the Sparrow annotation processing/review tool understands.

Monday, January 23, 2023

How To Fine-tune Donut Model

Donut is an awesome Document AI model for extracting data from docs. I share my experience fine-tuning the model with the CORD dataset, based on an example from Transformers Tutorials.

Monday, January 16, 2023

Donut 🍩 - ChatGPT for Document AI

Donut - OCR-free Document Understanding Transformer. This ML model can process documents (images, scans) and return JSON structured info about the content. It works for different use cases: form understanding, visual question answering about the document, document image classification.

Sunday, December 4, 2022

Invoice Annotation with Sparrow/Python

I explain our Streamlit component for invoice/receipt document annotation and labeling. It can be used either to create new annotations or review and edit existing ones. With this component you can add new annotations directly on top of the document image. Existing annotations can be resized/moved and values/labels assigned.

This component is part of Sparrow - our open-source solution for data extraction from invoices/receipts with ML.

Sunday, June 5, 2022

MLUI: Django App Setup

UI plays an essential part in ML apps: it provides access to the ML model API. With a friendly and usable UI, an ML project has a better chance of being successful. I'm building a UI for our ML product Sparrow (data extraction from documents). In a series of videos, I will explain how to build a UI (including security, data model, etc.) for an ML project. Stay tuned, it will be fun and there is lots to learn.

Monday, May 16, 2022

Data Annotation with SVG and JavaScript

I explain how to build a simple data annotation tool with SVG and JavaScript in an HTML page. The sample code renders two boxes in SVG on top of the receipt image. You will learn how to select and switch between annotation boxes. Enjoy!

Tuesday, April 26, 2022

UI for ML - Django, React or Streamlit?

UI is an important part of a successful ML app. In this video I discuss the UI options I looked into for our ML product. When deciding which UI framework or library to use, pay attention to several things, such as ease of data transfer, UI flexibility, and the ability to build user-friendly functionality.

Monday, April 18, 2022

Mindee docTR - Probably the Best Open-Source OCR

Do you want to build an ML pipeline to automate data extraction from business documents (receipts, invoices, forms)? Then your first step should be to integrate OCR for text extraction. OCR extraction quality must be good, because the whole pipeline depends on the quality of the initial text extraction. If the extracted data is accurate, ML models will be able to classify it properly. I spent time researching available OCR solutions, and I think Mindee docTR is currently one of the best open-source OCR solutions available. Check the video, where I run and show multiple tests.

Monday, April 11, 2022

Document Information Extraction Demo on Hugging Face Spaces

This video shows how a fine-tuned LayoutLMv2 document understanding and information extraction model runs in the Hugging Face Spaces demo environment. I show how data extraction works for different receipts and why you should not rely on the OCR that comes pre-configured with the LayoutLMv2 model.

Sunday, March 27, 2022

Hugging Face LayoutLMv2 Model True Inference

I explain why OCR quality matters for the performance of the Hugging Face LayoutLMv2 model in document data classification. If the input from OCR is poor, the ML classification results will be low quality too. This is why it is important to use a high-quality OCR system to extract text and coordinates from the document before applying an ML solution.

Sunday, March 20, 2022

Get Receipt Data with Hugging Face ML Model

This tutorial is about how to use a fine-tuned Hugging Face model to extract data from scanned receipt documents. We run inference by passing the receipt image, along with words and coordinates, to the model. As a result, we get back predictions: class labels assigned to each input. This helps to classify document elements and extract the correct data. I share a hint on how to match input words with classified labels. Input words and coordinates are expected to come from a separate OCR step.
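One common way to match words with predicted labels is sketched below. It may not be the exact hint from the video. It assumes the model predicts one label per sub-token and that a token-to-word index list is available, similar in shape to what Hugging Face fast tokenizers return from `word_ids()`. The words, indices, and label names are made up.

```python
def labels_per_word(words, word_ids, token_labels):
    """Assign each word the label predicted for its first sub-token."""
    word_labels = {}
    for word_id, label in zip(word_ids, token_labels):
        if word_id is None:  # special tokens such as [CLS] / [SEP]
            continue
        # Keep the first label seen for each word, ignore later sub-tokens.
        word_labels.setdefault(word_id, label)
    return [(word, word_labels.get(i)) for i, word in enumerate(words)]


# Hypothetical inputs: "12.50" was split into three sub-tokens.
ocr_words = ["TOTAL", "12.50"]
word_ids = [None, 0, 1, 1, 1, None]
token_labels = ["O", "total_key", "total_value", "total_value", "total_value", "O"]

matched = labels_per_word(ocr_words, word_ids, token_labels)
```

Each OCR word ends up paired with a single class label, even when the tokenizer split it into several pieces.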

Sunday, March 13, 2022

Fine-Tuning with Hugging Face Trainer

In this tutorial, I explain how I used Hugging Face Trainer with PyTorch to fine-tune the LayoutLMv2 model for data extraction from documents (based on the CORD dataset with receipts). The advantage of Hugging Face Trainer is that it simplifies the model fine-tuning pipeline and makes it easy to upload the model to the Hugging Face model hub.