Sinhala Songs Search Engine.

Introduction

A text corpus and search application built to quickly extract information about metaphors and metaphor usage in Sinhala songs.

CS4642 - Data Mining and Information Retrieval Mini Project

Metaphors used in Sinhala songs can be useful to many different people, but it is very hard to find metaphors and their usage with typical search engines. This project builds a full-text, query-based search application for finding information about metaphors and their usage in Sinhala songs.

Objectives

Build a text corpus of Sinhala songs (at least 100), including the metaphors they use and how those metaphors are used.
Build a search application with rich search capabilities such as full-text search, finding the usage of metaphors, and finding which metaphors are used to convey a certain meaning.

Use Cases

Finding the meaning of metaphors that are used in a song. This would be helpful to people who are interested in understanding the meaning of the song.
Finding the usage of metaphors in selected songs. This would be helpful to people who are interested in composing songs, as well as people who are interested in how metaphors are used in songs.
Finding which metaphors are used in songs to convey a certain meaning. This would be helpful to people who are composing songs.
Finding matching songs in terms of metaphor usage.

How to set up

Clone the repository.
Use the rest-api/docker-compose.yaml file to set up the MongoDB database and the Elasticsearch cluster.
Create a new database in MongoDB named 'songs-db' and add the corpus as a new collection named 'songs'.
Start the backend API.
Call http://host:3000/api/search/index to create the Elasticsearch index.
Run the frontend app (search-app) using npm run dev.

Note that every Node.js component must be set up by running npm install before it is started.
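The indexing step can also be triggered from a script. The following is a minimal sketch, not part of the repository: the helper names indexUrl and createIndex are hypothetical, and it assumes the endpoint answers a plain GET request, which the README does not confirm. The HTTP client is passed in as a parameter so the sketch does not depend on any particular library.

```typescript
// Minimal shape of the HTTP client this sketch needs (assumption: GET is enough).
type HttpGet = (url: string) => Promise<{ ok: boolean; status: number }>;

// Builds the indexing URL from the setup steps; port 3000 comes from the README.
function indexUrl(host: string, port: number = 3000): string {
  return `http://${host}:${port}/api/search/index`;
}

// Calls the indexing endpoint and fails loudly on a non-success response.
async function createIndex(host: string, httpGet: HttpGet): Promise<void> {
  const response = await httpGet(indexUrl(host));
  if (!response.ok) {
    throw new Error(`Indexing request failed with status ${response.status}`);
  }
}
```

On Node.js 18 or later, the built-in fetch can serve as the client, for example createIndex("localhost", (url) => fetch(url)).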

Text Corpus

More than 100 Sinhala songs were collected, with the following details for each song:

Title (Singlish)
Singer (Singlish)
Composer (Singlish)
Lyricist (Singlish)
Year
Album (Singlish)
Genre
Lyrics (Sinhala Unicode)
Metaphors with Explanations
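
One way to picture a single corpus entry is as a typed record. The sketch below is illustrative only: the property names are assumptions derived from the list above, not the actual field names in the 'songs' collection, and the example values are placeholders rather than real corpus data.

```typescript
// A metaphor found in a song, paired with its explanation.
interface Metaphor {
  metaphor: string;    // the metaphorical phrase as it appears in the lyrics
  explanation: string; // what the phrase is taken to mean
}

// One song in the corpus (property names are hypothetical).
interface Song {
  title: string;    // Singlish
  singer: string;   // Singlish
  composer: string; // Singlish
  lyricist: string; // Singlish
  year: number;
  album: string;    // Singlish
  genre: string;
  lyrics: string;   // Sinhala Unicode
  metaphors: Metaphor[];
}

// Placeholder entry showing the shape only.
const exampleSong: Song = {
  title: "<title>",
  singer: "<singer>",
  composer: "<composer>",
  lyricist: "<lyricist>",
  year: 0,
  album: "<album>",
  genre: "<genre>",
  lyrics: "<lyrics in Sinhala Unicode>",
  metaphors: [{ metaphor: "<phrase>", explanation: "<meaning>" }],
};
```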

Data Collection

I used a web scraping script to collect basic information about more than 100 songs. Then I used a song manager app that I created to add lyrics, metaphors, and explanations to the collected songs.

Preprocessing

Preprocessing was done in the song manager app whenever an updated song was saved to the database:

Replace all empty fields (singer, composer, etc.) with 'Unknown'.
Remove trailing whitespace and commas.
Remove songs that have little metaphor usage.
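
The three preprocessing rules above can be sketched as a small function. This is not the song manager's actual code: the field names, the helper names cleanText and preprocess, and the MIN_METAPHORS threshold are assumptions made for illustration, since the README does not say how many metaphors count as too few.

```typescript
interface RawSong {
  title: string;
  singer: string;
  composer: string;
  lyricist: string;
  album: string;
  metaphors: { metaphor: string; explanation: string }[];
}

// Text fields that get cleaned (hypothetical names).
const TEXT_FIELDS = ["title", "singer", "composer", "lyricist", "album"] as const;

// Assumed threshold: songs with fewer metaphors than this are dropped.
const MIN_METAPHORS = 1;

// Strips trailing whitespace and commas, then falls back to 'Unknown' if nothing is left.
function cleanText(value: string): string {
  const trimmed = value.replace(/[\s,]+$/, "");
  return trimmed === "" ? "Unknown" : trimmed;
}

// Drops songs with too few metaphors and cleans the text fields of the rest.
function preprocess(songs: RawSong[]): RawSong[] {
  return songs
    .filter((song) => song.metaphors.length >= MIN_METAPHORS)
    .map((song) => {
      const cleaned: RawSong = { ...song };
      for (const field of TEXT_FIELDS) {
        cleaned[field] = cleanText(song[field]);
      }
      return cleaned;
    });
}
```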

Project Architecture

The following components are used in this project:

MongoDB as the persistent database
Elasticsearch as the search engine
NestJS (Node.js) backend API
- Supports adding new songs
- Searches for songs with Elasticsearch queries
Songs search web app (Svelte)
Songs manager web app (Svelte)

See the bottom of the README for UI screenshots.

Elasticsearch

Elasticsearch Index

The standard analyzer is used for the non-Unicode (Singlish) fields such as title, singer, and composer.
A custom analyzer is used for the Sinhala Unicode fields.

analyzer: {
  sinhala_analyzer: {
    type: 'custom',
    tokenizer: 'icu_tokenizer', // better unicode tokenizing
    filter: ['edgeNgram'], // better matching for unicode
    char_filter: ['dotFilter'], // better unicode tokenizing
  },
  singlish_analyzer: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['edgeNgram'], // better fuzzy matching
    char_filter: ['dotFilter'],
  },
  betterFuzzy: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['lowercase', 'edgeNgram'], // better fuzzy matching
  },
}
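
To illustrate why an edge n-gram filter helps with partial matches, the sketch below reproduces the idea in plain TypeScript: each token is expanded into its leading prefixes, so a query that contains only the start of a word can still match an indexed term. The function name edgeNgrams and the gram lengths used here are illustrative assumptions; the actual settings of the 'edgeNgram' filter are not shown in this README.

```typescript
// Returns the leading prefixes of a token whose lengths fall between minGram and maxGram.
function edgeNgrams(token: string, minGram: number, maxGram: number): string[] {
  // Split by code point so that characters are never cut in the middle.
  const chars = Array.from(token);
  const grams: string[] = [];
  const longest = Math.min(maxGram, chars.length);
  for (let length = minGram; length <= longest; length++) {
    grams.push(chars.slice(0, length).join(""));
  }
  return grams;
}

// Example: the token "sanda" with lengths 2 to 4 yields "sa", "san", "sand".
const examplePrefixes = edgeNgrams("sanda", 2, 4);
```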

Elasticsearch Queries & Features

Bool Queries
Used to combine multiple fields to get the best results.
Fuzzy Queries
Used to match a field even if letters are missing from the text query.
Match Queries
Used to match keyword queries.
Nested Queries
Used when matching metaphors and explanations.
Boost Scores
Used to give some queries more weight than others. For example, a match on the title field has more impact than a match on lyrics, and a match query has more impact than a fuzzy query.
Inner Hits
Used to get additional information, such as which metaphors or explanations matched in certain hits.
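
The features above can be combined in a single search request body. The following is a rough sketch rather than the query the backend actually sends: the function name buildSongQuery, the field names (title, lyrics, metaphors.metaphor, metaphors.explanation), and the boost values are all assumptions chosen to mirror the description.

```typescript
// Builds an Elasticsearch request body that combines bool, match, fuzzy,
// and nested clauses, with boosts and inner hits (field names are hypothetical).
function buildSongQuery(text: string) {
  return {
    query: {
      bool: {
        should: [
          // An exact-ish title match is weighted highest (assumed boost).
          { match: { title: { query: text, boost: 3 } } },
          // The fuzzy title match tolerates missing letters but counts for less.
          { fuzzy: { title: { value: text, fuzziness: "AUTO", boost: 1 } } },
          // Lyrics matches use the default weight.
          { match: { lyrics: { query: text } } },
          {
            // Metaphors are assumed to be stored as nested objects under 'metaphors'.
            nested: {
              path: "metaphors",
              query: {
                bool: {
                  should: [
                    { match: { "metaphors.metaphor": { query: text } } },
                    { match: { "metaphors.explanation": { query: text } } },
                  ],
                },
              },
              // Ask Elasticsearch to report which nested metaphors matched.
              inner_hits: {},
            },
          },
        ],
        minimum_should_match: 1,
      },
    },
  };
}
```

The returned object is plain JSON, so it could be passed as the body of a search request by whichever Elasticsearch client the backend uses.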
