Statistica Text Miner

Statistica Text Miner is an optional extension of Statistica Data Miner, ideal for translating unstructured text data into meaningful, valuable clusters of decision-making "gold."

As most users familiar with text mining already know, real-world data comes in a variety of forms and is not always organized or ready to analyze. Text mining digs for underlying information that is not readily apparent in traditional structured data. These data sources can also be extremely large. Statistica Text Miner is optimized for working with such data and has recently been further enhanced.

How Can You Use Statistica Text Miner?

[Figure: Example Text Miner Workspace]

  • Analyze the contents of Web pages. For example, users can automatically process and summarize all Web pages of particular companies, message boards, etc.
  • Include unstructured notes in predictive data mining projects. For example, users may include responses to open-ended interview questions, patients' own descriptions of medical symptoms, etc. in data mining projects involving the clustering of patients and symptoms.
  • Analyze large document repositories. For example, users may analyze repositories of documents such as narratives of insurance claims, etc., to include such information in fraud detection projects.

Statistica Text Miner was specifically designed as a general and open-architecture tool for mining unstructured information. The feature extraction/selection and other analytic tools available in Statistica Text Miner are not only applicable to text documents or Web pages, but can also be used to index, classify, cluster, or otherwise include in your analyses unstructured information such as (pre-processed) bitmaps imported as data matrices, etc.

  • Accessing Documents
  • Processing Documents
  • Analyzing Documents

Integration with Statistica, Statistica Data Miner, and Statistica Enterprise

The text miner software is fully integrated into the Statistica line of software. It is not a stand-alone product manufactured by another vendor and "connected" to Statistica. Text mining functionality can be integrated into the Statistica Data Miner workspace environment, Statistica Enterprise, or custom Statistica applications.

For example, a customer may:

  • automatically access data stored in a data warehouse
  • update certain analyses and numeric summaries of the textual information
  • publish results to authorized users via the Internet

Statistica Text Miner is scalable and uses multi-threaded computing technology to extract optimum performance from advanced multiple-processor server hardware.

Accessing Documents

The program contains numerous options for accessing text documents in different formats, including .txt (text), .pdf (Adobe), .html, .xml (Web-formats), and most Microsoft Office formats (e.g., .doc, .rtf).

Flexible user interface options (and automation functions) are provided for selecting large numbers of files via wildcards (e.g., to select all documents in a particular subdirectory structure).
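As a rough illustration of wildcard selection over a subdirectory structure, the following Python sketch collects matching files recursively. It is not Statistica's automation interface; the function name `select_documents` and the default patterns are assumptions made for this example.

```python
from pathlib import Path


def select_documents(root, patterns=("*.txt", "*.html")):
    """Return sorted paths under `root` (recursively) matching any pattern.

    Hypothetical helper for illustration; not part of Statistica.
    """
    root = Path(root)
    matches = set()
    for pattern in patterns:
        # rglob walks the whole subdirectory structure below `root`
        matches.update(p for p in root.rglob(pattern) if p.is_file())
    return sorted(matches)
```

Calling `select_documents("claims", patterns=("*.txt",))` would return every .txt file below a (hypothetical) `claims` folder, at any nesting level.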

The program supports full "Web-crawling" capabilities, so that documents can be extracted from the Web, starting at a particular root Web page (URL). All documents linked to that particular page will be included, as well as the documents linked to those sub-documents, and so on, up to a user-specified level or depth.
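The depth-limited crawl described above can be sketched as a breadth-first traversal. This is a minimal Python illustration, not the product's crawler: the function `crawl` and the `get_links` callback (which would fetch a page and return the URLs it links to) are assumptions, and a small in-memory link graph stands in for the Web.

```python
from collections import deque


def crawl(root_url, get_links, max_depth):
    """Breadth-first crawl; returns {url: depth} for pages within max_depth.

    `get_links(url)` is a caller-supplied (hypothetical) function that
    returns the URLs linked from `url`.
    """
    seen = {root_url: 0}
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the user-specified depth
        for link in get_links(url):
            if link not in seen:  # skip pages already reached
                seen[link] = depth + 1
                queue.append(link)
    return seen


# Toy link graph standing in for real Web pages (made-up names).
SITE = {"root": ["a", "b"], "a": ["c"], "b": ["root"], "c": ["d"]}
pages = crawl("root", lambda url: SITE.get(url, []), max_depth=2)
# pages == {"root": 0, "a": 1, "b": 1, "c": 2}; "d" lies at depth 3 and is skipped
```

Tracking visited URLs keeps the crawl from looping when pages link back to each other, as "b" does to "root" here.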

File names and URLs can also be stored in text variables in Statistica data files. The program can therefore process not only actual text stored in text variables, but also references to text documents or URLs. Numeric and textual information (including large documents) can thus be stored on a per-case (observation) basis, so that meaningful analyses can be performed on data files where each observation has both numeric and (voluminous) unstructured textual information (e.g., patients' age, height, and weight, along with physicians' narrative descriptions of symptoms).

Options are provided to flexibly import such lists of filenames or URLs into the columns of a Statistica spreadsheet.

Processing Documents

Documents can be preprocessed prior to (in practice, concurrently with) indexing. Exclusion rules and stub lists can be applied to remove common but uninformative words such as "a", "the", "to", and "is". A stemming algorithm is then applied so that English words such as "traveled" and "traveling" both count as instances of "travel".
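To make the two preprocessing steps concrete, here is a toy Python sketch of stub-list filtering (such lists are often called stop lists) followed by stemming. The suffix rules below are a deliberately crude assumption for illustration; they are not the stemming algorithm Statistica Text Miner uses, and real stemmers handle many more cases.

```python
import re

STOP_WORDS = {"a", "the", "to", "is"}  # only the examples named in the text
SUFFIXES = ("ing", "ed", "s")          # toy rule set, assumed for this sketch


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("She traveled to the coast and is traveling home")
# tokens == ["she", "travel", "coast", "and", "travel", "home"]
```

Both "traveled" and "traveling" end up as "travel", so they contribute to the same count later on.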

Statistica Text Miner includes stub lists and stemming algorithms for Danish, Dutch, English, French, German, Italian, Portuguese, Spanish, Swedish, and other languages. Please email about your language needs. Stub lists can be edited (augmented) by the user as needed. The program is designed so that support for additional languages can be added with minimum effort.

Next, the program indexes the "stubbed-and-stemmed" documents to create a frequency count of each word in each document. This "raw-data" (count) information is the basis for all subsequent numerical analyses.
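The indexing step amounts to building a document-by-word count matrix. The Python sketch below shows one simple way to do that, assuming each document has already been reduced to a list of tokens; the function name `build_index` and the dense list-of-lists layout are choices made for this example, not a description of the product's internals.

```python
from collections import Counter


def build_index(token_lists):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(tokens) for tokens in token_lists]
    # Union of all words seen in any document, in a stable (sorted) order
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


vocabulary, rows = build_index([
    ["travel", "rome", "travel"],
    ["rome", "claim"],
])
# vocabulary == ["claim", "rome", "travel"]
# rows == [[0, 1, 2], [1, 1, 0]]
```

For large collections most entries are zero, so a sparse representation would normally replace the dense rows used here.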

Before creating a Statistica Data File containing the counts (etc.) to summarize the documents, various additional filters may be applied. For example, the counts for particular (most frequent) words per document can be:

  • normalized based on the length of each document
  • transformed (e.g., log-transformed)
  • optionally "compressed" by, for example, applying various feature extraction algorithms such as SVD (singular value decomposition, specifically optimized to operate on large sparse matrices)
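The first two filters in the list above can be illustrated on a single row of word counts. This Python sketch assumes length normalization means dividing by the document's total count and that the log transform is log(1 + count); the source does not specify either formula, so both are assumptions. The SVD step is not shown.

```python
import math


def relative_frequencies(row):
    """Divide each count by the document's total count (0.0 for empty documents)."""
    total = sum(row)
    return [count / total if total else 0.0 for count in row]


def log_transform(row):
    """Dampen large counts; log(1 + c) keeps zero counts at zero."""
    return [math.log(1 + count) for count in row]


normalized = relative_frequencies([0, 1, 3])  # [0.0, 0.25, 0.75]
damped = log_transform([0, 1, 3])             # first entry stays 0.0
```

Normalizing by length keeps long documents from dominating purely because they contain more words.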

The resulting data file with numeric information (e.g., SVD dimensions, raw counts, relative counts, most-frequent-word counts, and so on) is then ready for further analyses.

Various options are provided for writing the information extracted from text into the input data file, or directly into external databases (see also the description of Statistica In-Place Database Processing technology). 

Analyzing Documents

All statistical analysis methods can be applied to the numeric summaries representing the texts. Simple summary statistics can identify the most common words used in the documents.

By mapping the documents into the SVD dimensions (e.g., via PCA), dimensional maps of documents can be created, to evaluate the similarity of documents, etc.
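One common way to evaluate the similarity of two documents once they are represented as numeric vectors (SVD dimensions or word counts) is cosine similarity. The source does not say which measure Statistica uses, so the Python sketch below is an assumption offered only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score close to 1 regardless of document length, while documents sharing no words score 0.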

By mapping documents into dimensions based on original (transformed) word counts, simultaneous maps of documents and words can be created. Such maps reflect the "meaning" of the documents.

Clustering techniques (such as EM or k-Means) can be applied to identify clusters of similar documents.
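For readers unfamiliar with k-Means, the following is a bare-bones Python sketch of the algorithm applied to document vectors. It is not Statistica's implementation: the function name `k_means`, the fixed iteration count, and the deterministic initialization (the first k points become the starting centroids) are assumptions chosen to keep the example short and reproducible.

```python
def k_means(points, k, iterations=20):
    """Cluster numeric vectors; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # simplistic deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


docs = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
labels, centroids = k_means(docs, k=2)
# labels == [0, 1, 0, 1, 0]
```

In practice the starting centroids are usually chosen at random or by a seeding heuristic, and the loop stops once assignments no longer change.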

Predictive data mining techniques can be used to relate the numerical summaries of documents to other indicators of interest, e.g., fraudulent intent, medical diagnosis, and so on.

Key analytic components requiring extensive data processing are implemented via multi-threaded computing technology, to extract optimum performance from advanced multiple-processor server hardware.