Croatian National Corpus
The Croatian National Corpus (HNK) is a systematized collection of selected texts, mainly written in contemporary Croatian, covering different media, genres, styles, fields and topics. The Corpus is accompanied by additional linguistic and non-linguistic data. You can find out more about HNK here.
Croatian Morphological Lexicon
The Croatian Morphological Lexicon is a lexical database that encompasses more than 45,000 lemmas of general language, 15,000 male and female first names and more than 50,000 surnames registered in the Republic of Croatia. From this repository of lemmas, more than 3,900,000 word-forms have been generated. The Lexicon can be highly useful for students of the Croatian language (both native speakers and foreigners), for experts, and for systems that deal with document retrieval (Internet and intranet search engines), information extraction, text mining and computational processing of Croatian.
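As an illustration of how a lemma-to-word-form lexicon can support document retrieval, the following minimal Python sketch expands a search term to every inflected form of its lemma. The dictionary layout, the function names and the tiny two-lemma sample are assumptions made for this example only; they do not reflect the Lexicon's actual format or interface.

```python
from collections import defaultdict

# Tiny illustrative sample (hypothetical layout, a subset of forms only).
LEMMA_TO_FORMS = {
    "knjiga": ["knjiga", "knjige", "knjizi", "knjigu", "knjigom", "knjigama"],
    "grad": ["grad", "grada", "gradu", "gradom", "gradovi", "gradova"],
}


def build_form_index(lemma_to_forms):
    """Invert the lexicon: map every word-form to the lemmas it can belong to."""
    index = defaultdict(set)
    for lemma, forms in lemma_to_forms.items():
        for form in forms:
            index[form].add(lemma)
    return index


def expand_query(term, lemma_to_forms, form_index):
    """Return the term plus all word-forms of every lemma it is a form of."""
    term = term.lower()
    expanded = {term}
    for lemma in form_index.get(term, set()):
        expanded.update(lemma_to_forms[lemma])
    return sorted(expanded)


FORM_INDEX = build_form_index(LEMMA_TO_FORMS)
```

With this sample, a query for "knjigu" would also match documents containing "knjiga" or "knjigama", while a term missing from the lexicon is returned unchanged.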
Croatian Wordnet (CroWN)
Croatian Wordnet is a lexical database for Croatian built through the expand approach and linked to Princeton WordNet 3.0. Sets of synonyms (synsets) in CroWN consist of nouns, verbs, adjectives and adverbs. Synsets are connected with various semantic relations.
CroWN 2.0 has 23,122 synsets comprising 47,906 lexical units.
CroWN 1.0 has 10,031 synsets comprising 31,367 lexical units.
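The synset structure described above can be pictured with a short Python sketch. Everything here is hypothetical: the field names, the identifiers such as "crown-0001", and the placeholder strings standing in for Princeton WordNet 3.0 links are assumptions for illustration, not the actual CroWN schema or data.

```python
from dataclasses import dataclass, field

PARTS_OF_SPEECH = {"n", "v", "a", "r"}  # noun, verb, adjective, adverb


@dataclass
class Synset:
    synset_id: str       # hypothetical identifier
    pos: str             # one of PARTS_OF_SPEECH
    lexical_units: list  # the synonyms grouped in this synset
    pwn30_id: str        # placeholder for the linked Princeton WordNet 3.0 synset
    relations: dict = field(default_factory=dict)  # relation name -> target ids

    def __post_init__(self):
        if self.pos not in PARTS_OF_SPEECH:
            raise ValueError(f"unknown part of speech: {self.pos}")


def related(wordnet, synset_id, relation):
    """Follow one semantic relation from a synset to its target synsets."""
    targets = wordnet[synset_id].relations.get(relation, [])
    return [wordnet[target] for target in targets]


def count_lexical_units(wordnet):
    """Total number of lexical units across all synsets."""
    return sum(len(s.lexical_units) for s in wordnet.values())


# Toy data, not taken from CroWN.
WORDNET = {
    s.synset_id: s
    for s in [
        Synset("crown-0001", "n", ["automobil", "auto"], "pwn30-placeholder-1",
               {"hypernym": ["crown-0002"]}),
        Synset("crown-0002", "n", ["vozilo"], "pwn30-placeholder-2"),
    ]
}
```

Counting synsets and lexical units in this way is the kind of tally behind the version figures listed above.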
CroDeriV
CroDeriV is a morphological database of Croatian verbs. It consists of 14,491 verbs. Each verb in CroDeriV is segmented into lexical and derivational morphemes, and verbs that share a root are linked to one another. This makes it possible to recognize derivationally related families of verbs and, at the same time, to detect the full derivational span of a particular base form. The structure of the database also enables the recognition of a generalized morphological structure applicable to all Croatian verbs. It consists of four slots for derivational prefixes before the lexical morpheme and three slots for derivational suffixes after it.
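The slot structure and root-based linking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the record layout, the field names and the sample segmentations are hypothetical and are not taken from the CroDeriV database, and the infinitive ending is kept outside the derivational slots purely as a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass

MAX_PREFIXES = 4  # derivational prefix slots
MAX_SUFFIXES = 3  # derivational suffix slots


@dataclass(frozen=True)
class VerbEntry:
    """Hypothetical record; field names are not CroDeriV's own."""
    lemma: str
    prefixes: tuple     # derivational prefixes, outermost first
    root: str           # lexical morpheme
    suffixes: tuple     # derivational suffixes, innermost first
    ending: str = "ti"  # infinitive ending, kept outside the slots

    def __post_init__(self):
        if len(self.prefixes) > MAX_PREFIXES:
            raise ValueError("too many derivational prefixes")
        if len(self.suffixes) > MAX_SUFFIXES:
            raise ValueError("too many derivational suffixes")

    def slots(self):
        """Pad the segmentation to the fixed 4 + root + 3 template."""
        pre = (None,) * (MAX_PREFIXES - len(self.prefixes)) + self.prefixes
        suf = self.suffixes + (None,) * (MAX_SUFFIXES - len(self.suffixes))
        return pre + (self.root,) + suf


def families(entries):
    """Group verbs that share a root into derivational families."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.root].append(entry.lemma)
    return dict(grouped)


# Illustrative segmentations, not taken from the database.
SAMPLE = [
    VerbEntry("pisati", (), "pis", ("a",)),
    VerbEntry("napisati", ("na",), "pis", ("a",)),
    VerbEntry("prepisati", ("pre",), "pis", ("a",)),
    VerbEntry("gledati", (), "gled", ("a",)),
]
```

Grouping by root is what lets a base form such as "pisati" be connected to its prefixed derivatives, and the fixed-width template makes every verb comparable slot by slot.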
Croatian Dependency Treebank – HOBS
The Croatian Dependency Treebank site includes two corpora annotated at the morphosyntactic, dependency and semantic-role levels.
The first corpus, the Croatian Dependency Treebank, is part of the Croatian National Corpus, specifically of its newspaper subcorpus (consisting of texts from the Croatia Weekly newspaper, CW2000). The CW2000 subcorpus is lemmatized, morphosyntactically tagged and manually disambiguated. The tagging was carried out in accordance with the MULTEXT-East recommendations for the Croatian language, using the Croatian Lemmatization Server. Furthermore, the subcorpus is manually annotated at the syntactic level according to a modified version of the specification used for the Prague Dependency Treebank, while the semantic roles are assigned according to the specification for Croatian developed at the Institute of Linguistics of the Faculty of Humanities and Social Sciences in Zagreb.
The second corpus, annotated in the same fashion as the first, consists of approximately 500 sentences taken from the Croatian language courses available on the HR4EU web portal (www.hr4eu.hr). This corpus is therefore particularly useful for learners of Croatian.
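To make the annotation layers described above more concrete, here is a small Python sketch of one sentence with a lemma, a morphosyntactic tag, a dependency head with its syntactic function, and an optional semantic role for each token. The record layout, the example sentence, the tag strings and the role labels are all assumptions for illustration; they are not taken from the treebank and may differ from its actual tagsets and file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    idx: int                    # 1-based position in the sentence
    form: str                   # word as it appears in the text
    lemma: str
    msd: str                    # morphosyntactic tag (illustrative value)
    head: int                   # index of the governing token, 0 for the root
    deprel: str                 # syntactic function (illustrative value)
    role: Optional[str] = None  # semantic role (hypothetical label)


# "Marko čita knjigu." ("Marko is reading a book.")
SENTENCE = [
    Token(1, "Marko", "Marko", "Npmsn", 2, "Sb", "Agent"),
    Token(2, "čita", "čitati", "Vmr3s", 0, "Pred"),
    Token(3, "knjigu", "knjiga", "Ncfsa", 2, "Obj", "Patient"),
]


def root(tokens):
    """Return the single token whose head is 0."""
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root token")
    return roots[0]


def dependents(tokens, head_idx):
    """Return the tokens directly governed by the token at head_idx."""
    return [t for t in tokens if t.head == head_idx]
```

Keeping all layers on the same token record makes it straightforward to ask combined questions, for example which lemmas appear as dependents of a given predicate and with what role.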