Large organizations
like banks, government departments, universities, and corporations
need to handle massive databases of postal addresses. These
databases are often poorly structured and frequently accumulate
several duplicate entries for the same person. Hence, such
organizations periodically engage in a data cleaning or
warehousing activity where addresses are stored in a standard
format, with duplicates removed. A key step in this process is
address segmentation, which extracts individual structured fields
such as 'Landmarks', 'House number', and 'State' from raw address
strings. In the less structured Indian addressing system, existing
commercial approaches require extensive manual effort for several
reasons: non-uniform building numbering schemes, reliance on ad hoc
descriptive landmarks, changing city names, non-standard
abbreviations of state names, non-standard styles of writing
addresses, spelling mistakes, and optional zip codes.
Prof. Sunita Sarawagi and her team at the Kanwal Rekhi School of
Information Technology (KReSIT) have developed a software tool that
'learns' a model for segmenting unseen addresses when 'trained' on
examples of segmented addresses. The underlying model is a
statistical machine-learning technique that handles new data
robustly, is computationally efficient, and is easy for humans to
interpret and tweak when segmentation errors need to be corrected.
Experiments on nationwide, heterogeneous collections of actual
addresses showed encouraging results, with high levels of accuracy.
The software is now licensed to a data cleaning company in India
and is being deployed commercially.
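
To make the idea of 'learning' a segmenter from segmented examples concrete, here is a minimal sketch in Python. It is not the KReSIT tool: the text above does not name the statistical model, so a small hidden Markov model with Viterbi decoding is assumed purely for illustration. The training addresses, the 'Road' and 'City' field names, and the function names train and segment are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: each address is a list of (token, field) pairs.
TRAINING_ADDRESSES = [
    [("12", "House number"), ("MG", "Road"), ("Road", "Road"),
     ("Pune", "City"), ("Maharashtra", "State")],
    [("7", "House number"), ("Link", "Road"), ("Road", "Road"),
     ("Mumbai", "City"), ("Maharashtra", "State")],
    [("221", "House number"), ("Near", "Landmark"), ("Temple", "Landmark"),
     ("Chennai", "City"), ("Tamil", "State"), ("Nadu", "State")],
]


def normalize(token):
    """Map every all-digit token to one symbol so unseen numbers generalize."""
    return "<num>" if token.isdigit() else token.lower()


def train(examples):
    """Count start, transition, and emission events from labeled addresses."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for example in examples:
        prev = None
        for token, label in example:
            word = normalize(token)
            vocab.add(word)
            emit[label][word] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    return {
        "labels": sorted(emit),
        "start": start,
        "trans": trans,
        "emit": emit,
        "vocab_size": len(vocab) + 1,  # one extra slot for unknown words
    }


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability, so unseen events are never impossible."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def segment(model, tokens):
    """Label each token with its most likely field using Viterbi decoding."""
    if not tokens:
        return []
    labels, n, v = model["labels"], len(model["labels"]), model["vocab_size"]
    words = [normalize(t) for t in tokens]
    score = {l: log_prob(model["start"], l, n)
                + log_prob(model["emit"][l], words[0], v) for l in labels}
    backpointers = []
    for word in words[1:]:
        new_score, pointers = {}, {}
        for l in labels:
            best = max(labels,
                       key=lambda p: score[p] + log_prob(model["trans"][p], l, n))
            new_score[l] = (score[best] + log_prob(model["trans"][best], l, n)
                            + log_prob(model["emit"][l], word, v))
            pointers[l] = best
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to recover the path.
    path = [max(labels, key=score.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return list(zip(tokens, path))


model = train(TRAINING_ADDRESSES)
# An address containing tokens ("45", "Station", "Nagpur") never seen in training.
print(segment(model, ["45", "Station", "Road", "Nagpur", "Maharashtra"]))
```

In this sketch the learned counts are directly readable, which is one way a model can stay easy for a human to inspect and adjust. A realistic segmenter would need far more training addresses and richer handling of spelling variants and abbreviations than this toy example provides.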
Contact: Prof. Sunita Sarawagi, sunita@iitb.ac.in