Re: Announce: LibOTS 0.3.0 released

From: Jordi Mas (jmas@softcatala.org)
Date: Tue Jul 15 2003 - 14:44:19 EDT

    Dom Lachowicz wrote:
    > For Catalan, you just need to add a list of the 200 or
    > so most common *meaningless* words in the language.
    > Like:
    >
    > the, a, an, he, she, of, ...

    Hello Dom and the others,

    The summarizer should not only have stop words (meaningless words); it
    should also know the most common words in every language.

    Well, what I wanted to mention is that if you select the stop words
    manually, the quality of the summarisation is going to be low and the
    algorithm is not going to perform well. If you use Word, you may be
    familiar with the concept of not performing well when doing summarisation.

    The right way of getting a list of common words for a language is to get a
    corpus (a collection of documents), calculate the relative frequency of
    each word (the number of times the word appears in all documents, divided
    by the total number of words) and then select the 200 or 300 most common
    words. Then you are going to have exactly what you need. Also, the corpus
    would need to contain texts from different areas of human knowledge.
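
    To make the idea concrete, here is a minimal sketch in Python. It is only
    an illustration: the function name most_common_words, the tiny corpus and
    the tokenisation rule are my own assumptions, not anything in LibOTS.

```python
import re
from collections import Counter

def most_common_words(documents, top_n=200):
    """Return up to top_n (word, relative_frequency) pairs over all documents.

    Hypothetical helper for illustration; not part of LibOTS.
    """
    counts = Counter()
    for text in documents:
        # Lowercase, then take runs of letters (Unicode-aware, so accented
        # characters such as those in Catalan or Spanish are kept).
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return []
    # Relative frequency = occurrences of the word / total words in the corpus.
    return [(word, n / total) for word, n in counts.most_common(top_n)]

# Toy corpus; a real one should cover many different subject areas.
corpus = [
    "The cat sat on the mat.",
    "The dog and the cat are friends.",
]
stop_words = [word for word, _ in most_common_words(corpus, top_n=2)]
```

    With a real corpus you would call it with top_n=200 or 300 and keep the
    resulting words as the list for that language.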

    I know that not everyone has a corpus handy, but we should do this with
    care at least for the major languages (English, Spanish, German); otherwise
    we are not going to perform well.

    I would also suggest implementing another algorithm in the library that has
    been proven to be effective for text summarisation. Lots of texts contain
    phrases like "In conclusion", etc., that should definitely raise the score
    of the sentence, and phrases like "As we said before" or "As you have
    already seen" that should lower it. This works well, especially for formal
    texts.
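
    A rough sketch of this cue-phrase scoring, again in Python and again only
    under my own assumptions: the phrase lists, the weights and the function
    name cue_phrase_adjustment are invented for the example and would need
    tuning per language.

```python
# Illustrative phrase lists; real ones would be longer and per language.
BONUS_PHRASES = ("in conclusion",)
PENALTY_PHRASES = ("as we said before", "as you have already seen")

def cue_phrase_adjustment(sentence, bonus=1.0, penalty=1.0):
    """Return a score adjustment for one sentence based on cue phrases.

    Hypothetical helper: positive if a bonus phrase is present,
    negative if a penalty phrase is present, 0.0 otherwise.
    """
    lowered = sentence.lower()
    adjustment = 0.0
    if any(phrase in lowered for phrase in BONUS_PHRASES):
        adjustment += bonus
    if any(phrase in lowered for phrase in PENALTY_PHRASES):
        adjustment -= penalty
    return adjustment
```

    The result would simply be added to whatever score the summarizer already
    gives the sentence.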

    Finally, one common problem in text summarisation is that the selected
    sentences assume knowledge that you may no longer have. For example, if you
    select "He will do the course with them" or "Also, ..." you no longer have
    these references in the text. We can have a list of pronouns ("pronombres"
    in Spanish) so that if any are present we score the sentence lower, because
    we prefer sentences with no references to text that we no longer have.
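
    Something along these lines, as a sketch only: the word list, the penalty
    value and the name reference_penalty are assumptions made up for this
    example, and a real list would have to be built per language.

```python
import re

# Illustrative English list of words that usually point back to earlier text.
DANGLING_WORDS = {"he", "she", "it", "they", "him", "her", "them", "also"}

def reference_penalty(sentence, per_word=0.5):
    """Return a penalty that grows with the number of back-referring words.

    Hypothetical helper; subtract the result from the sentence score.
    """
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    return per_word * sum(1 for word in words if word in DANGLING_WORDS)
```

    A sentence such as "He will do the course with them" is penalised twice
    here (for "he" and "them"), while a sentence with no such words is left
    alone.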

    Here are my five cents. If you think that some of this is interesting, I
    can give you guys a hand, or two :-)

    Best Regards,

    -- 
    

    Jordi Mas i Hernàndez - Abiword developer - http://www.abisource.com jmas@softcatala.org - Softcatalà member - http://www.softcatala.org - Personal Homepage http://www.softcatala.org/~jmas


