ThilinaTLM/sinhala-song-search-engine


Sinhala Songs Search Engine.

Introduction

A text corpus and search application built to quickly extract information about metaphors and metaphor usage in Sinhala songs.

CS4642 - Data Mining and Information Retrieval Mini Project

Metaphors used in Sinhala songs can be useful to many different people, but information about metaphors and their usage is very hard to find with typical search engines. This project builds a full-text, query-based search application for finding information about metaphors and their usage in Sinhala songs.

Objectives

  • Build a text corpus of Sinhala songs (at least 100), including their metaphors and how those metaphors are used.
  • Build a search application with rich search capabilities, such as full-text search, finding usages of a metaphor, and finding which metaphors are used to convey a certain meaning.

Use Cases

  • Finding the meaning of the metaphors used in a song. This helps people who want to understand what the song means.
  • Finding how metaphors are used in selected songs. This helps people who compose songs, as well as people interested in metaphor usage in songs.
  • Finding which metaphors are used in songs to convey a certain meaning. This helps people who are composing songs.
  • Finding songs that are similar in terms of metaphor usage.

How to set up

  1. Clone the repository.
  2. Use the rest-api/docker-compose.yaml file to set up the MongoDB database and the Elasticsearch cluster.
  3. Create a new database in MongoDB named 'songs-db' and add the corpus as a new collection named 'songs'.
  4. Start the backend API.
  5. Call http://host:3000/api/search/index to create the Elasticsearch index.
  6. Run the frontend app (search-app) using npm run dev.

Note that every Node.js component must be set up by running npm install before it is started.

Text Corpus

More than 100 Sinhala songs were collected, with the following details for each song:

  1. Title (Singlish)
  2. Singer (Singlish)
  3. Composer (Singlish)
  4. Lyricist (Singlish)
  5. Year
  6. Album (Singlish)
  7. Genre
  8. Lyrics (Sinhala Unicode)
  9. Metaphors with Explanations
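
As a rough illustration, one corpus entry with the fields listed above could be typed as follows. The property names, the numeric `year`, and the `findInvalidFields` helper are assumptions made for this sketch; the actual schema in the repository may differ.

```typescript
// Hypothetical shape of one metaphor entry (names are assumptions).
interface Metaphor {
  metaphor: string;    // the metaphorical phrase as it appears in the lyrics
  explanation: string; // what the phrase means
}

// Hypothetical shape of one song in the corpus, mirroring the nine fields above.
interface Song {
  title: string;    // Singlish
  singer: string;   // Singlish
  composer: string; // Singlish
  lyricist: string; // Singlish
  year: number;     // assumption: stored as a number
  album: string;    // Singlish
  genre: string;
  lyrics: string;   // Sinhala Unicode
  metaphors: Metaphor[];
}

const STRING_FIELDS = [
  "title", "singer", "composer", "lyricist", "album", "genre", "lyrics",
];

// Returns the names of fields that are missing or have the wrong basic type.
function findInvalidFields(doc: Record<string, unknown>): string[] {
  const invalid: string[] = [];
  for (const field of STRING_FIELDS) {
    if (typeof doc[field] !== "string") invalid.push(field);
  }
  if (typeof doc["year"] !== "number") invalid.push("year");
  if (!Array.isArray(doc["metaphors"])) invalid.push("metaphors");
  return invalid;
}
```

A check like this could run before a document is inserted into the 'songs' collection, so that incomplete entries are caught early.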

Data Collection

I used a web scraping script to collect basic information about more than 100 songs. I then used the song manager app that I built to add lyrics, metaphors, and explanations to the collected songs.

Preprocessing

Preprocessing is done in the song manager app whenever an updated song is saved to the database:

  • Replace all empty fields (singer, composer, etc.) with 'Unknown'.
  • Remove trailing whitespace and commas.
  • Remove songs with little metaphor usage.
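
The preprocessing steps above can be sketched roughly as below. This is not the song manager's actual code: the function names, the `RawSong` shape, and the `minMetaphors` threshold are assumptions for illustration, since the README does not say how many metaphors count as too few.

```typescript
// Hypothetical minimal song shape for this sketch.
interface RawSong {
  title: string;
  singer: string;
  composer: string;
  lyricist: string;
  album: string;
  metaphors: { metaphor: string; explanation: string }[];
}

// Strips trailing whitespace and commas; a field left empty becomes 'Unknown'.
function cleanField(value: string): string {
  const cleaned = value.replace(/[\s,]+$/, "");
  return cleaned === "" ? "Unknown" : cleaned;
}

// Applies cleanField to every text field of a song, leaving metaphors untouched.
function preprocessSong(song: RawSong): RawSong {
  return {
    title: cleanField(song.title),
    singer: cleanField(song.singer),
    composer: cleanField(song.composer),
    lyricist: cleanField(song.lyricist),
    album: cleanField(song.album),
    metaphors: song.metaphors,
  };
}

// Keeps only songs with at least `minMetaphors` metaphors.
// Assumption: the threshold is a free parameter, not a value from the project.
function dropSparseSongs(songs: RawSong[], minMetaphors: number): RawSong[] {
  return songs.filter((song) => song.metaphors.length >= minMetaphors);
}
```

For example, `cleanField("Amaradeva, ")` yields `"Amaradeva"`, and a field containing only spaces or commas becomes `"Unknown"`.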

Project Architecture

The following components are used in this project:

  • MongoDB as the persistent database
  • Elasticsearch as the search engine
  • NestJS (Node.js) backend API
    • Supports adding new songs
    • Searches for songs with Elasticsearch queries
  • Songs search web app (Svelte)
  • Songs manager web app (Svelte)

[Image: project architecture diagram]

See the end of this README for UI screenshots.

Elasticsearch

Elasticsearch Index

  • The standard analyzer is used for the non-Unicode fields such as title, singer, and composer.
  • A custom analyzer is used for the Sinhala Unicode fields.
analyzer: {
  sinhala_analyzer: {
    type: 'custom',
    tokenizer: 'icu_tokenizer', // better unicode tokenizing
    filter: ['edgeNgram'], // better matching for unicode
    char_filter: ['dotFilter'], // better unicode tokenizing
  },
  singlish_analyzer: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['edgeNgram'], // better fuzzy matching
    char_filter: ['dotFilter'],
  },
  betterFuzzy: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['lowercase', 'edgeNgram'], // better fuzzy matching
  },
}
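
The analyzers above refer to a token filter named 'edgeNgram' and a character filter named 'dotFilter' whose definitions are not shown. A plausible sketch of those definitions follows, using the built-in edge_ngram token filter and pattern_replace character filter. The gram sizes, the dot pattern, and the replacement are assumptions, not values taken from the project. Note also that icu_tokenizer is provided by the analysis-icu plugin, which has to be installed on the cluster. The small `edgeNgrams` helper is only an illustration of which prefixes such a filter emits for one token.

```typescript
// Assumed definitions for the custom names referenced by the analyzers.
const analysisExtras = {
  filter: {
    edgeNgram: {
      type: "edge_ngram",
      min_gram: 2,  // assumption: shortest prefix that is indexed
      max_gram: 20, // assumption: longest prefix that is indexed
    },
  },
  char_filter: {
    dotFilter: {
      type: "pattern_replace",
      pattern: "\\.",   // assumption: matches a literal dot
      replacement: " ", // assumption: dots become spaces before tokenizing
    },
  },
};

// Illustration only: the prefixes an edge n-gram filter emits for one token.
// Slices by UTF-16 code unit, which is adequate for text in the Basic
// Multilingual Plane, such as the main Sinhala block.
function edgeNgrams(token: string, minGram: number, maxGram: number): string[] {
  const grams: string[] = [];
  const longest = Math.min(maxGram, token.length);
  for (let n = minGram; n <= longest; n++) {
    grams.push(token.slice(0, n));
  }
  return grams;
}
```

Indexing prefixes this way is what lets a partially typed word match a full token, at the cost of a larger index.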

Elasticsearch Queries & Features

  • Bool Queries
    Used to combine queries on multiple fields to get the best results.

  • Fuzzy Queries
    Used to match a field even when letters are missing from the text query.

  • Match Queries
    Used to match keyword queries.

  • Nested Queries
    Used when matching metaphors and their explanations.

  • Boost Scores
    Used to give some queries more weight than others. For example, a match on the title field has more impact than a match on lyrics, and a match query has more impact than a fuzzy query.

  • Inner Hits
    Used to get additional information, such as which metaphors or explanations matched in a given hit.
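
To show how these features can fit together, here is a minimal sketch of a query body that combines them. It is not the project's actual query builder: the field names (`title`, `lyrics`, `metaphors.metaphor`, `metaphors.explanation`), the boost values, and the `buildSongQuery` name are assumptions. It also assumes `metaphors` is mapped as a nested field, which the nested query requires.

```typescript
// Assumed boost values: title match > title fuzzy > lyrics match.
const BOOST = { titleMatch: 3, titleFuzzy: 1.5, lyricsMatch: 1 };

// Builds an Elasticsearch query body (plain object) for a free-text search.
function buildSongQuery(text: string) {
  return {
    query: {
      bool: {
        // Any clause may match; scores from matching clauses add up.
        should: [
          { match: { title: { query: text, boost: BOOST.titleMatch } } },
          // fuzzy is a term-level query, so `text` is not analyzed here.
          { fuzzy: { title: { value: text, fuzziness: "AUTO", boost: BOOST.titleFuzzy } } },
          { match: { lyrics: { query: text, boost: BOOST.lyricsMatch } } },
          {
            nested: {
              path: "metaphors",
              query: {
                bool: {
                  should: [
                    { match: { "metaphors.metaphor": { query: text } } },
                    { match: { "metaphors.explanation": { query: text } } },
                  ],
                },
              },
              // Reports which nested metaphor entries matched in each hit.
              inner_hits: {},
            },
          },
        ],
        minimum_should_match: 1,
      },
    },
  };
}
```

The returned object can be passed as the body of a search request. Each hit then carries an inner_hits section listing the metaphor entries that matched.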

Query Building

[Image: query building diagram]

Screenshots

Search App

  1. Search UI: Screenshot_20230119_101345

  2. Basic Search Result: Screenshot_20230119_101750, Screenshot_20230119_101805

  3. Searching for Metaphor: Screenshot_20230119_101504, Screenshot_20230119_101654

  4. Searching for Singer, Composer, Lyricist: Screenshot_20230119_101916

Song Manager

  1. Songs List: Screenshot_20230119_102105

  2. Edit Song: Screenshot_20230119_103445, Screenshot_20230119_103509, Screenshot_20230119_103607, Screenshot_20230119_103618
