Researchers from the Electrical and Electronics Engineering Institute of the University of the Philippines Diliman (EEEI-UPD) have developed a large database of Filipino words through the Bantay Wika Project. Their goal is a model of the Filipino language that will help in the creation of language-based applications such as machine translators, grammar checkers, and automated essay graders.
Using web crawler and document processing programs, the researchers built a digital database of Filipino text containing around 1.6 million sentences and 41 million words, of which 746,060 are unique. The corpus was then expanded with materials downloaded from the Internet that were originally published in print, some as far back as 1910. These include the corpora from McFarland 3rd, Fil. News Corpora-1, Fil.Net, Filipiniana, and Project Gutenberg. Also included are selected literary works from 1982 to 1993 by Philippine National Artist for Literature Virgilio S. Almario, along with works by other eminent Filipino writers. These literary works include plays, short stories, folkloric materials, poems, novels, series, and works on Philippine history written in the Filipino language. Combined, the digital Filipino text corpora in the database contain 2,059,363 sentences (1,605,904 of them unique) and 49,655,557 words (1,166,375 of them unique).
The Bantay Wika Project is one of the projects under the Interdisciplinary Signal Processing for Pinoys (ISIP) Program, funded by the Philippine Council for Industry, Energy and Emerging Technology Research and Development of the Department of Science and Technology (DOST-PCIEERD) and the University of the Philippines Diliman.
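As an illustration of how sentence and word totals like those reported for the corpus can be tallied, the following is a minimal Python sketch. It is not the project's actual document processing code: the function name `corpus_stats`, the punctuation-based sentence splitting, and the lowercasing of words before counting are all assumptions made for this example.

```python
import re


def corpus_stats(documents):
    """Count total and unique sentences and words in an iterable of text strings.

    Hypothetical helper for illustration only; the splitting rules below are
    simplifying assumptions, not the Bantay Wika Project's method.
    """
    sentences = []
    words = []
    for text in documents:
        # Naive sentence split: break on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if not sentence:
                continue
            sentences.append(sentence)
            # Naive tokenizer: runs of letters, keeping internal hyphens and
            # apostrophes together (e.g. "mag-aral" counts as one word).
            # Words are lowercased so "Ang" and "ang" count as the same word.
            words.extend(
                re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", sentence.lower())
            )
    return {
        "sentences": len(sentences),
        "unique_sentences": len(set(sentences)),
        "words": len(words),
        "unique_words": len(set(words)),
    }


if __name__ == "__main__":
    sample = ["Magandang umaga. Magandang gabi.", "Magandang umaga."]
    print(corpus_stats(sample))
```

A real corpus pipeline would likely need more careful handling of abbreviations, quotations, and spelling variants, so counts from a sketch like this would only approximate published figures.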
For more information, please contact the Program Leader, Dr. Rhandley Cajote of the Digital Signal Processing Laboratory of the EEEI-UPD.