# corpus

Here are 849 public repositories matching this topic...

INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

Updated May 28, 2024
TypeScript

adbar / trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Updated May 28, 2024
Python

luciamariaalvarezcrespo / GalMisoCorpus2023

📑 Galician corpus for misogyny detection

nlp machine-learning corpus corpus-data nlp-machine-learning misogyny galician misogyny-detection

Updated May 28, 2024
Python

PyThaiNLP / thaigov-v2-corpus

Thai News Dataset from Thai government website.

corpus thai-language corpus-data thai-nlp pythainlp

Updated May 28, 2024
Jupyter Notebook

mesolitica / malaysian-dataset

We gather Malaysian datasets! https://malaysian-dataset.readthedocs.io/

text-mining corpus malaysia bahasa-melayu manglish malay-dataset

Updated May 28, 2024
Jupyter Notebook

esteeschwarz / SPUND-LX

linguistics essais

corpus linguistics

Updated May 27, 2024
HTML

spottolaq / Spotted-Poivorrei-ITA2020

This repository houses a comprehensive collection of 14,701 Instagram posts authored by Italian university students between January 2020 and December 2020. These posts offer invaluable insights into the experiences and reflections of students during the challenging period of the COVID-19 lockdown in Italy.

data corpus linguistics dataset italiano linguistic-analysis italian-language linguistics-dataset

Updated May 27, 2024

quasilyte / gocorpus

The code used to serve the gocorpus application

search go golang syntax data query statistics analysis corpus gogrep

Updated May 26, 2024
Go

umarbutler / open-australian-legal-corpus-creator

The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian legislative and judicial documents.

law legal scraping corpus australia open-data dataset web-scraping datasets

Updated May 26, 2024
Python

jdave23 / EAD-corpus

A collection of encoded archival description XML documents for text and content analysis.

archives corpus text-corpus finding-aids ead

Updated May 25, 2024
Shell

tlu-dt-nlp / EstGEC-L2-Corpus

Estonian Grammatical Error Correction (GEC) test and development corpus that contains L2 learner texts error-annotated in the M2 format.

annotation corpus error-corpora estonian-language language-resources benchmark-datasets gold-standard grammatical-error-correction

Updated May 25, 2024
Python

SaiedAlshahrani / leveraging-corpus-metadata

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

metadata translation wikipedia corpus arabic egyptian detection-systems template-based-translation

Updated May 25, 2024
Jupyter Notebook

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

translation tokenizer corpus linguistics tagger literature dependency-parser corpus-linguistics lemmatizer corpus-tools corpus-processing corpus-search corpus-statistics stopword corpus-analysis

Updated May 24, 2024
Python

dracor-org / gerdracor

German Drama Corpus

xml corpus digital-humanities tei drama dramatic-texts

Updated May 23, 2024
CSS

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated May 23, 2024
Python

INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene

Updated May 27, 2024
Java

johentsch / ms3

A parser for annotated MuseScore 3 files.

Updated May 23, 2024
Python

ina-foss / InaGVAD

Voice activity detection and speaker gender segmentation audiovisual corpus

radio benchmark corpus media tv gender audio-segmentation voice-activity-detection gender-prediction speech-dataset gender-bias speech-activity-detection speaker-gender speech-corpus audio-dataset audiovisual-dataset acoustic-diversity gender-representation

Updated May 23, 2024
Jupyter Notebook

Wenhao-Yang / TwoWayRadio

Radio Audio Corpus Collection Toolkit with Hackrf One.

radio pyqt5 corpus gnuradio hackrf-one

Updated May 23, 2024
Python

CLUEbenchmark / CLUE

中文语言理解测评基准 Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard

benchmark tensorflow nlu glue corpus transformers pytorch dataset chinese pretrained-models language-model albert bert roberta chineseglue

Updated May 23, 2024
Python
