Numerical Tuple Extraction from Tables with Pre-training

Published in KDD, 2022

Recommended citation: Qingping Yang, Yixuan Cao, Yinging Hu, Jianfeng Li, Nanbo Peng, and Ping Luo. Numerical Tuple Extraction from Tables with Pre-training. In KDD, 2022. https://dl.acm.org/doi/abs/10.1145/3534678.3539460

Tables are ubiquitous on the web and in various vertical domains, such as business, academia, and economics, storing massive amounts of valuable data. However, the great flexibility of table layouts hinders machines from understanding this data. To unlock and utilize the knowledge in tables, extracting the data as numerical tuples is the first and most critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of the intricate correlations between cells. These correlations are expressed implicitly in the text and visual appearance of tables and can be roughly classified into Hierarchy and Juxtaposition.
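
To make the tuple notion concrete, the minimal sketch below (illustrative only, with hypothetical field names, not an example from the paper) reads a simple grid-shaped table into numerical tuples by pairing each numeric cell with its row and column headers. Real-world tables with merged cells, multi-level headers, and juxtaposed blocks are exactly where such a naive reading breaks down and a learned model is needed.

```python
# Illustrative only: a hand-crafted reading of a simple financial table
# as numerical tuples (the field names "metric", "period", "value" are
# hypothetical and not taken from the paper).
table = [
    ["",        "2020",  "2021"],   # column headers (juxtaposed periods)
    ["Revenue", "1,200", "1,350"],  # data rows with row headers
    ["Profit",  "300",   "410"],
]

def extract_tuples(table):
    """Pair each numeric cell with its row header and column header."""
    col_headers = table[0][1:]
    tuples = []
    for row in table[1:]:
        row_header, values = row[0], row[1:]
        for col_header, value in zip(col_headers, values):
            tuples.append({
                "metric": row_header,
                "period": col_header,
                "value": float(value.replace(",", "")),
            })
    return tuples

print(extract_tuples(table))
# [{'metric': 'Revenue', 'period': '2020', 'value': 1200.0}, ...]
```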

Although many studies have made considerable progress on data extraction from tables, most of them only consider hierarchical relationships and neglect juxtapositions. Moreover, they evaluate their methods only on relatively small corpora. This paper proposes a new framework for extracting numerical tuples from tables and evaluates it on a large test set. Specifically, we cast the task as a relation extraction problem between cells. To represent cells together with their intricate correlations, we propose a BERT-based pre-trained language model, TableLM, to encode tables with diverse layouts.
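
As a rough illustration of the cell-pair formulation (a minimal PyTorch sketch under assumed names, not the paper's TableLM architecture), the model below replaces the pre-trained table encoder with a stand-in embedding and classifies a relation label for every ordered pair of cells.

```python
# Minimal sketch of relation extraction between table cells.
# The class name, label set, and the nn.Embedding stand-in for a
# pre-trained table encoder are all assumptions for illustration.
import torch
import torch.nn as nn

class CellPairRelationClassifier(nn.Module):
    def __init__(self, num_cells, hidden_dim=64, num_relations=3):
        super().__init__()
        # Stand-in for a pre-trained table encoder such as TableLM.
        self.cell_encoder = nn.Embedding(num_cells, hidden_dim)
        # Scores a relation label (e.g. none / hierarchy / juxtaposition)
        # for an ordered pair of cell representations.
        self.pair_classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_relations),
        )

    def forward(self, cell_ids):
        h = self.cell_encoder(cell_ids)  # (n, hidden_dim)
        n = h.size(0)
        # Concatenate representations of every ordered cell pair (i, j).
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )  # (n, n, 2 * hidden_dim)
        return self.pair_classifier(pairs)  # (n, n, num_relations)

model = CellPairRelationClassifier(num_cells=100)
logits = model(torch.arange(6))  # relation logits for a 6-cell toy table
print(logits.shape)              # torch.Size([6, 6, 3])
```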

To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on this dataset demonstrate the superiority of our framework over a well-designed baseline. We also conduct an ablation study to show the effectiveness of each component of our model.
