AMuRD: Annotated Multilingual Receipts Dataset for Cross-lingual Key Information Extraction and Classification
Abstract
A multilingual dataset and novel approach using InstructLLaMA achieve high performance in key information extraction and item classification from scanned receipts.
Key information extraction involves recognizing and extracting text from scanned receipts, retrieving the essential content, and organizing it into structured documents. This paper presents a novel multilingual dataset for receipt extraction, addressing key challenges in information extraction and item classification. The dataset comprises 47,720 samples, including annotations for item names, attributes such as price and brand, and classification into 44 product categories. We introduce the InstructLLaMA approach, achieving an F1 score of 0.76 and an accuracy of 0.68 for key information extraction and item classification. Code, datasets, and checkpoints are available at https://github.com/Update-For-Integrated-Business-AI/AMuRD.
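The abstract reports an F1 score and an accuracy but does not state how either is computed. As a rough illustration only, the sketch below assumes a token-overlap F1 between a predicted and a gold field value, and a plain exact-match accuracy over category labels. The function names `token_f1` and `accuracy`, the whitespace tokenization, and the example strings and labels are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between two strings (assumed metric, not the paper's)."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def accuracy(predicted_labels: list[str], gold_labels: list[str]) -> float:
    """Fraction of items whose predicted category equals the gold category."""
    if len(predicted_labels) != len(gold_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


if __name__ == "__main__":
    # Hypothetical item name and category labels, for illustration only.
    print(token_f1("coca cola 330ml", "coca cola"))
    print(accuracy(["Beverages", "Dairy"], ["Beverages", "Snacks"]))
```

In this sketch the extra token `330ml` lowers precision while recall stays at 1, so a partially correct extraction gets partial credit, whereas classification is scored all-or-nothing per item.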