Monday, March 5, 2007

Crime Analytics at NYPD

Text search and text mining are an integral part of Business Intelligence, and text mining is often tightly coupled with data mining. Here is an example where a text search for "sugar" led to sweet success in solving a crime at the NYPD recently:
"Best in class 2007: New York City Police Department"

Text searching capabilities are an integral part of crime-related systems. In this case, one of the witnesses recalled the word "sugar" in the tattoo on the suspect's neck, and the detectives searched the Real Time Crime Center (RTCC) database for this string. Free text is categorised as unstructured data and is often the most challenging aspect of a database-backed software system. Databases are great at handling structured data, that is, data in columns that can be easily queried. In recent times, however, the ability to search free text for words or phrases has been gaining importance. The Googles and Yahoos of the world have mastered the art of searching text, so what is the big deal about searching for text in a database? Free text consists of many words and phrases, and in order to search it quickly, the text engine creates a text index. Unlike an index on a structured field, such as a numeric column or a character column with a few categories of data, a text index has to extract all the important words, discard stop words such as {a, an, the, it, this, etc.}, and then store the occurrence, frequency and relative positions of the remaining words in every row or record of data. Such a text index allows a quick search for a word like "sugar" in the descriptions of suspects, among 120 million or so crime and arrest records.
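To make the idea of a text index concrete, here is a minimal sketch in Python of an inverted index that drops stop words and records where, and how often, each remaining word appears in each record. It is only an illustration of the general technique and not how the RTCC database or any particular text engine is implemented. The record IDs, the sample descriptions, the stop-word list and the function names are all made up for this example.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption; real engines ship much longer lists).
STOP_WORDS = {"a", "an", "the", "it", "this", "on", "of", "in", "and", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_text_index(records):
    """Map each non-stop word to {record_id: [positions of the word]}."""
    index = defaultdict(dict)
    for record_id, text in records.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue  # stop words are not indexed
            index[word].setdefault(record_id, []).append(position)
    return index


def search(index, word):
    """Return {record_id: frequency} for every record containing the word."""
    postings = index.get(word.lower(), {})
    return {record_id: len(positions) for record_id, positions in postings.items()}


# Hypothetical records, invented purely for illustration.
records = {
    101: "Suspect has a tattoo on the neck with the word Sugar",
    102: "Witness saw the suspect leave in a blue sedan",
    103: "Sugar tattoo on left arm, sugar skull design",
}

index = build_text_index(records)
hits = search(index, "sugar")
print(hits)
```

The lookup for "sugar" touches only the entry for that one word instead of scanning every description, which is what makes the search fast even when the number of records is very large.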

Recognizing the growing importance of text search and text mining for data stored in databases, companies like Oracle have tightly coupled a text engine, Oracle Text, with the database. This text engine also works closely with Oracle Data Mining to look for patterns in textual descriptions. In crime incidents, the narrative of the crime report holds a wealth of unstructured data. Now this "wealth" can be mined using the marriage of the in-database text engine and the mining engine.