Chatroulette teen portal
2006)), containing about 700,000 posts to (in total about 140 million words) by almost 20,000 bloggers. Slightly more information seems to be coming from content (75.1% accuracy) than from style (72.0% accuracy). We see the women focusing on personal matters, leading to important content words like love and boyfriend, and important style words like I and other personal pronouns.
For each blogger, metadata is present, including the blogger s self-provided gender, age, industry and astrological sign. The creators themselves used it for various classification tasks, including gender recognition (Koppel et al. The men, on the other hand, seem to be more interested in computers, leading to important content words like software and game, and correspondingly more determiners and prepositions.
The authors do not report the set of slang words, but the non-dictionary words appear to be more related to style than to content, showing that purely linguistic behaviour can contribute information for gender recognition as well.
Gender recognition has also already been applied to Tweets. (2010) examined various traits of authors from India tweeting in English, combining character N-grams and sociolinguistic features like manner of laughing, honorifics, and smiley use.
Later, in 2004, the group collected a Blog Authorship Corpus (BAC; (Schler et al.
One gets the impression that gender recognition is more sociological than linguistic, showing what women and men were blogging about back in A later study (Goswami et al.
2009) managed to increase the gender recognition quality to 89.2%, using sentence length, 35 non-dictionary words, and 52 slang words.
Video chat - a modern tool for online social networking on a variety of topics.
This talk of young girls, and talk of sports fans aged 18 and over, and chat on the theme of love and friendship, the discussion of in politics and sex in the private chat rooms.