Explain the five phases in a process pipeline for text mining.

The five phases for processing text are as follows:
Phase 1: Text pre-processing enables syntactic/semantic text analysis and involves the
following steps (a short illustrative sketch follows this list):

  1. Text cleanup is the process of removing unnecessary or unwanted information. Text cleanup
    converts the raw data by filling in missing values, identifying and removing outliers, and
    resolving inconsistencies. Examples are removing comments, removing or escaping “%20”
    from URLs of web pages, or cleaning up typing errors such as teh (the) and do n’t (do not)
    [%20 specifies a space in a URL].
  2. Tokenization is the process of splitting the cleaned-up text into tokens (words) using white
    spaces and punctuation marks as delimiters.
  3. Part of Speech (POS) tagging is a method that attempts to label each token (word) with
    an appropriate POS. Tagging helps in recognizing names of people, places, organizations and
    titles. The English language set includes nouns, verbs, adverbs, adjectives, prepositions and
    conjunctions. The Part of Speech annotation system of the Penn Treebank Project has 36
    POS tags.
  4. Word sense disambiguation is a method that identifies the sense in which a word is used in a
    sentence, giving its meaning when the word has multiple meanings. The methods that resolve
    word ambiguity can be context based or proximity based. Some examples of such words are
    bear, bank, cell and bass.
  5. Parsing is a method that generates a parse tree for each sentence. Parsing attempts to
    infer the precise grammatical relationships between the different words in a given sentence.
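
As a minimal illustration of these pre-processing steps, the sketch below uses the NLTK
library (an assumed toolkit; the text does not prescribe one) to clean, tokenize, POS-tag
and disambiguate a short example sentence.

    # Pre-processing sketch with NLTK (assumed toolkit for illustration).
    import nltk
    from nltk.wsd import lesk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)
    nltk.download("wordnet", quiet=True)

    raw = "She deposited teh money in the bank%20yesterday."

    # 1. Text cleanup: unescape "%20" and correct a known typo.
    clean = raw.replace("%20", " ").replace("teh", "the")

    # 2. Tokenization: split the cleaned text into word tokens.
    tokens = nltk.word_tokenize(clean)

    # 3. POS tagging: label each token with a Penn Treebank tag.
    print(nltk.pos_tag(tokens))  # [('She', 'PRP'), ('deposited', 'VBD'), ...]

    # 4. Word sense disambiguation: the Lesk algorithm picks a WordNet
    #    sense for the ambiguous word "bank" from its sentence context.
    sense = lesk(tokens, "bank", "n")
    print(sense.definition() if sense else "no sense found")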

Phase 2: Feature generation is a process that first defines the features (variables,
predictors). Some of the ways of generating features are listed below; a combined sketch
follows the list:

  1. Bag of words—The order of words is not that important for certain applications.
    A text document is represented by the words it contains (and their occurrences). Document
    classification methods commonly use the bag-of-words model. Pre-processing first turns a
    document into a bag of words. Document classification methods then use the occurrence
    (frequency) of each word as a feature for training a classifier. Algorithms do not apply
    directly to the bag of words, but use the frequencies.
  2. Stemming—identifies a word by its root.
    (i) Normalizes or unifies variations of the same concept, such as speak for the three
    variations speaking, speaks, speakers, denoted by [speaking, speaks, speaker → speak].
    (ii) Removes plurals, normalizes verb tenses and removes affixes. Stemming reduces a word
    to its most basic element. For example, impurification → pure.
  3. Removing stop words from the feature space—stop words are common words that are
    unlikely to help text mining, so the search program ignores them. For example, it ignores
    words such as a, for, it, in and are.
  4. Vector Space Model (VSM)—an algebraic model for representing text documents as
    vectors of identifiers, word frequencies or terms in the document index. VSM uses the method
    of term frequency-inverse document frequency (TF-IDF) to evaluate how important a
    word is in a document. When used in document classification, VSM also refers to the bag-of-
    words model. The bag of words is converted into a term vector in VSM. The term vector
    provides the numeric values corresponding to each term appearing in a document, and is
    very helpful in feature generation and selection.
    Term frequency and inverse document frequency (IDF) are important metrics in text analysis.
    TF-IDF weighting is the most common: instead of the simple TF, IDF is used to weight the
    importance of a word in the document.
    TF-IDF stands for “term frequency-inverse document frequency”. It is a numeric measure
    used to score the importance of a word in a document, based on how often the word appears in
    that document and in a given collection of documents. It suggests that if a word appears
    frequently in a document, it is an important word and should therefore score high. But if the
    word appears in many other documents, it is probably not a unique identifier, and should
    therefore be assigned a lower score. The TF-IDF is measured as:
    TF-IDF(t, d) = TF(t, d) × IDF(t), with IDF(t) = log(N / DF(t)),
    where t denotes the term, TF(t, d) the frequency of t in document d, N the total number of
    documents in the collection, and DF(t) the number of documents containing t.
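
To make these feature-generation steps concrete, here is a small sketch using scikit-learn
and NLTK (assumed toolkits; the text names none) that stems a word, builds a bag-of-words
representation with stop words removed, and produces TF-IDF term vectors for a toy corpus.

    # Feature-generation sketch: stemming, bag of words and TF-IDF.
    # scikit-learn and NLTK are assumed toolkits; the corpus is invented.
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = [
        "the speaker speaks about speed",
        "the bank approved the loan",
        "speakers at the bank conference",
    ]

    # Stemming: unify variations of the same concept.
    print(PorterStemmer().stem("speaking"))  # -> 'speak'

    # Bag of words: each document becomes a vector of word counts,
    # ignoring word order; common English stop words are dropped.
    bow = CountVectorizer(stop_words="english")
    counts = bow.fit_transform(docs)
    print(bow.get_feature_names_out())  # retained vocabulary
    print(counts.toarray())             # term frequencies per document

    # TF-IDF: reweight the counts so that words frequent in one document
    # but rare across the collection receive the highest scores.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    print(vectors.toarray().round(2))   # TF-IDF term vectors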

Phase 3: Feature selection is the process that selects a subset of features by rejecting
irrelevant and/or redundant features (variables, predictors or dimensions) according to
defined criteria. The feature selection process does the following (a short sketch follows
this list):

  1. Dimensionality reduction—Feature selection is itself a method of dimension reduction.
    The basic objective is to eliminate irrelevant and redundant data. Redundant features are
    those which provide no extra information. Irrelevant features provide no useful or relevant
    information in any context. Principal Component Analysis (PCA) and Linear Discriminant
    Analysis (LDA) are dimension reduction methods. The discrimination ability of a feature
    measures its relevancy. Correlation helps in finding the redundancy of a feature: two
    features are redundant to each other if their values correlate with each other.
  2. N-gram evaluation—finding sequences of consecutive words of interest and extracting
    them. For example, a 2-gram is a two-word sequence [“tasty food”, “good one”]; a 3-gram is
    a three-word sequence [“Crime Investigation Department”].
  3. Noise detection and evaluation of outliers—methods that identify unusual or suspicious
    items, events or observations in the data set. This step helps in cleaning the data.
    The feature selection algorithm reduces dimensionality, which not only improves the
    performance of the learning algorithm but also reduces the storage requirement for a dataset.
    The process enhances data understanding and its visualization.
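
The n-gram evaluation described above can be sketched in a few lines of plain Python (the
function name extract_ngrams is illustrative, not from the text):

    # Illustrative n-gram extraction: slide a window of n consecutive
    # tokens across the text and collect every sequence it covers.
    def extract_ngrams(text, n):
        tokens = text.lower().split()
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(extract_ngrams("the crime investigation department filed a report", 3))
    # ['the crime investigation', 'crime investigation department', ...]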

Phase 4: Data mining techniques derive insights from the structured data set that resulted
from the previous phases. Examples of techniques are listed below; a classification sketch
follows the list:

  1. Unsupervised learning (for example, clustering)
    (i) The class labels (categories) of the training data are unknown.
    (ii) Establishes the existence of groups or clusters in the data.
    Good clustering methods produce high intra-cluster similarity and low inter-cluster
    similarity. Examples of uses are grouping blogs, and finding patterns and trends.
  2. Supervised learning (for example, classification)
    (i) The training data is labeled to indicate the class.
    (ii) New data is classified based on the training set.
    Classification is correct when the known label of a test sample is identical to the
    class computed from the classification model.
    Examples of uses are news filtering applications, where incoming documents must be
    automatically assigned to pre-defined categories, and email spam filtering, where it is
    identified whether incoming email messages are spam or not.
    Examples of text classification methods are the Naïve Bayes classifier and SVMs.
  3. Identifying evolutionary patterns in temporal text streams—the method is useful in a
    wide range of applications, such as summarizing events in news articles and extracting
    research trends from the scientific literature.
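
As a small illustration of supervised learning in this phase, the following sketch trains a
Naïve Bayes text classifier for spam filtering (scikit-learn is an assumed toolkit, and the
tiny labeled training set is invented purely for illustration):

    # Supervised learning sketch: Naive Bayes spam filtering on toy data.
    # scikit-learn is an assumed toolkit; the training set is invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "win a free prize now", "cheap loans click here",    # spam
        "meeting agenda for monday", "lunch at noon today",  # ham
    ]
    train_labels = ["spam", "spam", "ham", "ham"]

    # Bag-of-words features (Phase 2) feed the classifier (Phase 4).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_texts)

    model = MultinomialNB()
    model.fit(X, train_labels)

    # Classify a new incoming message using the trained model.
    new = vectorizer.transform(["free prize meeting"])
    print(model.predict(new))  # e.g. ['spam']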

Phase 5: Analysing results
(i) Evaluate the outcome of the complete process.
(ii) Interpretation of results: if acceptable, the results obtained can be used as input for
the next set of sequences; else, the results can be discarded, and one should try to
understand what failed in the process and why.
(iii) Visualization: prepare visuals from the data, and build a prototype.
(iv) Use the results for further improvement in activities at the enterprise, industry or
institution.
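
Evaluating the outcome might, for instance, compare the labels predicted in Phase 4 against
known test labels (a minimal sketch using scikit-learn's metrics; the label values here are
illustrative, not real results):

    # Result-analysis sketch: score the Phase 4 classifier against known
    # labels on held-out test data (labels here are illustrative).
    from sklearn.metrics import accuracy_score, classification_report

    y_true = ["spam", "ham", "ham", "spam"]   # known test labels
    y_pred = ["spam", "ham", "spam", "spam"]  # labels computed by the model

    # Classification is correct when the predicted label matches the known one.
    print(accuracy_score(y_true, y_pred))         # fraction correct, 0.75
    print(classification_report(y_true, y_pred))  # precision/recall per class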
