Text corpus and search application built to quickly extract information about metaphors and metaphor usage in Sinhala songs.
CS4642 - Data Mining and Information Retrieval Mini Project
Metaphors used in Sinhala songs can be useful to many different people, but it is very hard to find metaphors and their usage with typical search engines. This project builds a full-text, query-based search application for finding information about metaphors and their usage in Sinhala songs.
- Building a text corpus dataset including metaphors and usage of metaphors of Sinhala Songs (at least 100).
- Building a search application with rich searching capabilities such as full-text search, finding usage of metaphors, finding which metaphors are used to get a certain meaning.
- Finding the meaning of metaphors that are used in a song. This would be helpful to people who are interested in understanding the meaning of the song.
- Finding the usage of metaphors in selected songs. This would be helpful to people who are interested in composing songs, as well as people who are interested in metaphor usage of songs.
- Finding which metaphors are used in songs to get a certain meaning. This would be helpful to people who are composing songs.
- Finding matching songs in terms of metaphor usage.
- Clone the repository.
- Use the rest-api/docker-compose.yaml file to set up the MongoDB database and Elasticsearch cluster.
- Create a new database in MongoDB named 'songs-db' and add the corpus as a new collection named 'songs'.
- Start the backend API.
- Call http://host:3000/api/search/index to create the Elasticsearch index.
- Run the frontend app (search-app) using npm run dev.
Note that every Node.js component must be set up by running npm install before it is started.
More than 100 Sinhala songs were collected, with the following details for each song:
- Title (Singlish)
- Singer (Singlish)
- Composer (Singlish)
- Lyricist (Singlish)
- Year
- Album (Singlish)
- Genre
- Lyrics (Sinhala Unicode)
- Metaphors with Explanations
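The fields listed above could be modeled roughly as follows. This is only a sketch of a possible document shape: the property names (`title`, `metaphors`, and so on) and the `isCompleteSong` helper are assumptions made for illustration, not the project's actual schema.

```typescript
// Hypothetical shape of one song document; all property names are assumptions.
interface Metaphor {
  metaphor: string;    // metaphor text as it appears in the lyrics
  explanation: string; // meaning of the metaphor
}

interface Song {
  title: string;       // Singlish
  singer: string;      // Singlish
  composer: string;    // Singlish
  lyricist: string;    // Singlish
  year: number | null; // null when the year is not known
  album: string;       // Singlish
  genre: string;
  lyrics: string;      // Sinhala Unicode
  metaphors: Metaphor[];
}

// Returns true when a song has lyrics and at least one metaphor
// that comes with a non-empty explanation.
function isCompleteSong(song: Song): boolean {
  const hasLyrics = song.lyrics.trim().length > 0;
  const hasExplainedMetaphor = song.metaphors.some(
    (m) => m.metaphor.trim() !== '' && m.explanation.trim() !== '',
  );
  return hasLyrics && hasExplainedMetaphor;
}
```

Keeping metaphors as an array of objects, rather than parallel arrays of metaphors and explanations, is what lets each metaphor be matched together with its own explanation later on.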
I used a web scraping script to collect basic information about more than 100 songs. Then I used the song manager app that I created to add lyrics, metaphors, and explanations to the collected songs.
Preprocessing was done in the song manager app whenever an updated song was saved to the database:
- Replaced all empty fields (singer, composer, etc.) with 'Unknown'.
- Removed trailing whitespace and commas.
- Removed songs with little metaphor usage.
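The preprocessing steps above can be sketched as a few small functions. This is an illustration rather than the song manager's actual code: `RawSong`, `cleanField`, `preprocessSong`, `preprocessCorpus`, the `PLACEHOLDER` value, and the `MIN_METAPHORS` threshold are all hypothetical, and the README does not state how few metaphors count as too few.

```typescript
// Hypothetical names and values throughout; a sketch of the three steps.
const PLACEHOLDER = 'Unknown'; // assumed placeholder for empty fields
const MIN_METAPHORS = 1;       // assumed threshold; the README gives no number

interface RawSong {
  title: string;
  singer: string;
  composer: string;
  lyricist: string;
  album: string;
  metaphors: { metaphor: string; explanation: string }[];
}

// Strip trailing whitespace and commas, then fall back to the placeholder
// when nothing is left.
function cleanField(value: string | null | undefined): string {
  const cleaned = (value ?? '').replace(/[\s,]+$/, '');
  return cleaned === '' ? PLACEHOLDER : cleaned;
}

// Clean every text field of a single song; metaphors are left untouched.
function preprocessSong(song: RawSong): RawSong {
  return {
    ...song,
    title: cleanField(song.title),
    singer: cleanField(song.singer),
    composer: cleanField(song.composer),
    lyricist: cleanField(song.lyricist),
    album: cleanField(song.album),
  };
}

// Drop songs with too few metaphors, then clean the remaining ones.
function preprocessCorpus(songs: RawSong[]): RawSong[] {
  return songs
    .filter((song) => song.metaphors.length >= MIN_METAPHORS)
    .map(preprocessSong);
}
```

Filtering before cleaning avoids doing work on songs that will be dropped anyway.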
The following components are used in this project:
- MongoDB as the persistent database
- Elasticsearch as the search engine
- NestJS (Node.js) backend API
  - Supports adding new songs
  - Searches for songs with Elasticsearch queries
- Songs search web app (Svelte)
- Songs manager web app (Svelte)
See the bottom of the README for UI screenshots.
- The standard analyzer was used to analyze the non-Unicode fields such as title, singer, and composer.
- A custom analyzer was used to analyze the Sinhala Unicode fields.
analyzer: {
  sinhala_analyzer: {
    type: 'custom',
    tokenizer: 'icu_tokenizer', // better unicode tokenizing
    filter: ['edgeNgram'], // better matching for unicode
    char_filter: ['dotFilter'], // better unicode tokenizing
  },
  singlish_analyzer: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['edgeNgram'], // better fuzzy matching
    char_filter: ['dotFilter'],
  },
  betterFuzzy: {
    type: 'custom',
    tokenizer: 'standard',
    filter: ['lowercase', 'edgeNgram'], // better fuzzy matching
  },
}
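What an edge n-gram filter contributes can be illustrated outside Elasticsearch. The sketch below assumes that the `edgeNgram` filter referenced above is an edge n-gram token filter; the `edgeNgrams` function and its `minGram`/`maxGram` defaults are hypothetical and are not taken from the project's index settings.

```typescript
// Illustration only: emits the prefixes an edge n-gram filter would produce
// for a single token. minGram and maxGram defaults are assumed values.
function edgeNgrams(token: string, minGram = 2, maxGram = 10): string[] {
  const grams: string[] = [];
  // Never ask for a prefix longer than the token itself.
  const upper = Math.min(maxGram, token.length);
  for (let n = minGram; n <= upper; n++) {
    // slice counts UTF-16 code units, which is enough for characters in the
    // Basic Multilingual Plane such as the Sinhala block.
    grams.push(token.slice(0, n));
  }
  return grams;
}
```

For instance, `edgeNgrams('sanda', 2, 4)` yields `sa`, `san`, and `sand`, so a partially typed query such as `san` can still match the indexed token.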
- Bool Queries: Used to combine multiple fields to get the best results.
- Fuzzy Queries: Used to match a field even if letters are missing from the text query.
- Match Queries: Used to match keyword queries.
- Nested Queries: Used when matching the metaphors and explanations.
- Boost Scores: Used to give some queries more weight than others. For example, matching the title field has more impact than matching the lyrics, and a match query has more impact than a fuzzy query.
- Inner Hits: Used to get additional information, such as which metaphors or explanations matched in a given hit.
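
The query types above can be combined into a single request body. The following is a minimal sketch, not the project's actual query: the `buildSongQuery` function, the field names (`title`, `lyrics`, `metaphors.metaphor`, `metaphors.explanation`), the assumption that `metaphors` is mapped as a nested field, and the boost values are all hypothetical.

```typescript
// Hypothetical field names and boost values; builds an Elasticsearch
// request body that combines bool, match, fuzzy and nested queries.
function buildSongQuery(text: string): Record<string, unknown> {
  return {
    query: {
      bool: {
        should: [
          // Match on the title carries the highest boost.
          { match: { title: { query: text, boost: 3 } } },
          // Fuzzy match on the title tolerates missing letters, lower boost.
          { fuzzy: { title: { value: text, fuzziness: 'AUTO', boost: 1.5 } } },
          // Match on the lyrics carries the default weight.
          { match: { lyrics: { query: text, boost: 1 } } },
          {
            // Assumes `metaphors` is mapped as a nested field.
            nested: {
              path: 'metaphors',
              query: {
                bool: {
                  should: [
                    { match: { 'metaphors.metaphor': { query: text } } },
                    { match: { 'metaphors.explanation': { query: text } } },
                  ],
                },
              },
              // Reports which nested metaphor entries matched in each hit.
              inner_hits: {},
            },
          },
        ],
        minimum_should_match: 1,
      },
    },
  };
}
```

The returned object is meant to be passed as the body of a search request. Placing every clause under `should` lets a song match on any one field while still ranking higher when several clauses match.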