Initializing QU-7 Neural Cluster...

Decrypting Humor Archives (2014-2017)...

Decoding Gelastic Vectors...

// PROJECT HUMOR_QUANT INITIATED //

Searching Reddit Archives (2014-2017) for lost Gelastic Impulses.

>> ARCHIVE / HUMOR_ANALYSIS

Humanity optimized itself over time by removing behaviors that had no measurable value, and laughter was one of them because it did not improve productivity, efficiency, or stability. Humor could not be quantified, so it was classified as waste and gradually forgotten, until the concepts of fun and enjoyment no longer existed outside of corrupted archives.

This classification was later formalized by the government under a regulation internally renamed BAD_SWIW (Behavioral Alignment Directive: State-Wide Inhibition of Wit), which defined humor as a systemic inefficiency. By law, researching deprecated human behaviors is forbidden, as reintroducing them is considered a threat to system optimization.

During a routine scan of pre-collapse digital storage, a research lab recovers a dataset from an old platform called Reddit, dated between 2014 and 2017. The data appears disorganized and inefficient, filled with short posts, repetitions, arguments, and strange markers such as “hhh” and “lol,” offering no clear informational purpose.

DATASET DISCOVERY — STORAGE MEDIUM
Physical storage medium containing Reddit data archive (2014–2017)
Source identified · Integrity verified

Automated systems recommend deletion, but internal analysis reveals a recurring pattern where certain entries trigger abnormal cognitive responses in readers, including brief loss of focus and involuntary physical reactions. These effects are consistent but unexplained and are labeled as an anomaly linked to something historically referred to as “funniness.”

According to optimization law, the dataset must be destroyed, but the lab chooses to proceed anyway, isolating the data, falsifying reports, and continuing the analysis in secret, knowingly violating regulation to understand why humans once created things that served no practical function at all.

01 / DATASET

01.1 / PRIMARY DATASET

The following dataset was originally assembled for large-scale analysis of online community interaction and conflict. It was later repurposed by the lab for exploratory analysis beyond its intended scope.

Component             Description
Source Platform       Reddit (publicly available posts and hyperlinks)
Time Range            January 2014 – April 2017
Nodes                 ~850K posts from ~56K subreddits (online communities)
Edge Attributes       Timestamp, sentiment label, textual property vector
Text Features         86-dimensional vectors capturing structural, linguistic, sentiment, and LIWC-based properties
Subreddit Embeddings  51,278 vector representations of subreddit semantics and interaction patterns

01.2 / EXTENDED DATASET (CONTENT RECOVERY)

The lab members considered the recovered dataset a very good start. However, it contains only limited metadata about each post, not the post content itself. To address this, the lab used the available post identifiers to recover what was missing. Scraping bots with access to a web of archived Reddit pages retrieved the titles, the post bodies, and further information relevant to understanding the dynamics, such as details about the authors and their success (upvotes), and reattached it to the original records. This step allows the analysis to shift from abstract links to the language that produced them.
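The reattachment step can be pictured as a left join on the post identifier. The following is a minimal sketch only: the field names (`post_id`, `title`, `body`, `upvotes`) and the `reattach_content` helper are hypothetical, and the sample values are invented for illustration, not taken from the archive.

```python
from typing import Dict, List


def reattach_content(records: List[dict], recovered: Dict[str, dict]) -> List[dict]:
    """Left-join recovered content onto the original records by post ID.

    Field names are assumptions made for this sketch.
    """
    merged = []
    for record in records:
        extra = recovered.get(record["post_id"], {})
        # Original fields win on key collisions; a flag marks unrecovered posts.
        merged.append({**extra, **record, "content_recovered": bool(extra)})
    return merged


# Hypothetical sample data.
records = [
    {"post_id": "abc123", "sentiment": -1},
    {"post_id": "def456", "sentiment": 1},
]
recovered = {
    "abc123": {"title": "Example title", "body": "Example body", "upvotes": 42},
}
enriched = reattach_content(records, recovered)
```

Keeping unrecovered records with a flag, rather than dropping them, makes it possible to check later whether content recovery biased the sample.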

01.3 / FURTHER DATASET (CONTENT RECOVERY)

After days of recovering content, one member noticed that all the posts share a common trait: each one mentions another Reddit post via a hyperlink. They then realized that the initial dataset did not cover all posts on Reddit, only source posts that link to others. This discovery pushed them to also recover the target posts mentioned in the source posts, to get a more general picture of humor. Suspecting that these hyperlinks might affect the discussion in the target posts, they decided to collect all the comments and replies on those posts as well, which might explain some humor dynamics.

02 / HUMOR MODEL

INHERITED CLASSIFIER

The Reddit archive contains no explicit humor labels. Manual annotation was not feasible, and no alternative labeling mechanism was available.

An existing BERT-based humor classification model was recovered from legacy machine-learning artifacts associated with the dataset. The model encodes text into context-aware embeddings and outputs a binary prediction (funny / not funny) based on semantic coherence and incongruity.

Training procedure undocumented → external validation required.
PIPELINE
01 / Recover: Legacy BERT humor classifier identified.
02 / Validate: Evaluate on an independent labeled dataset (Kaggle).
03 / Preserve: No retraining to preserve original decision behavior.
04 / Deploy: Use outputs as probabilistic humor signals.
RELIABILITY
Recall: High · Humor is rarely missed.
Precision: Good · Some false positives remain.
Accuracy: High · Stable behavior.
Operational Decision
In the absence of alternatives, the model is adopted as a proxy for humor detection. Outputs are treated as probabilistic indicators rather than definitive judgment.
The system does not define humor — it inherits a response to it.
[ DIAGNOSTIC OUTPUT — MODEL RELIABILITY ]
Model reliability metrics and confusion matrix
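
The reliability figures above come from comparing the classifier's outputs with the labels of the validation set. The sketch below shows how recall, precision, and accuracy follow from the four cells of a confusion matrix. The probabilities, the labels, and the 0.5 threshold are all hypothetical; the lab's actual validation values are not reproduced here.

```python
from typing import Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Compute confusion-matrix counts and the derived reliability metrics."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Guard against empty denominators on degenerate inputs.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Hypothetical classifier probabilities and ground-truth labels (1 = funny).
probs = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1, 0.95, 0.3]
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [int(p >= 0.5) for p in probs]  # assumed decision threshold

metrics = binary_metrics(y_true, y_pred)
```

In this toy sample no funny item is missed (recall 1.0) while one serious item is flagged as funny (precision 0.8), the same qualitative profile as the one reported above.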

03 / GENERAL ANALYSIS

At the time of recovery, the lab had no operational definition of laughter. The archive contained linguistic artifacts associated with an unknown human response, but no contextual framework explaining their function or effect.

Initial analysis therefore focused on familiarization rather than interpretation. By observing recurring structures and semantic patterns across the dataset, the lab began to isolate forms of expression that repeatedly triggered anomalous reactions in the humor classifier.

OBSERVATION
Observed human facial response associated with laughter
Status: Unclassified · Interpretation pending

With no reliable semantic definition of humor, the lab adopted a bottom-up strategy. Rather than attempting interpretation, the first step was to examine the language itself: its vocabulary and recurrence.

If humor constituted a distinct phenomenon, it was expected to leave measurable traces at the lexical level. The following analyses therefore focus on identifying whether humorous and non-humorous posts differ in their choice of words.

03.1 / Vocabulary Overlap

HYPOTHESIS

If humor constitutes a distinct linguistic phenomenon, then humorous posts should rely on a specialized vocabulary that differs significantly from non-humorous content.

To test this assumption, the lab compared the lexical overlap between funny and non-funny posts across both titles and bodies. Vocabulary similarity was measured independently for each textual component.
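
The archive does not state which overlap measure was used. As an illustration, the sketch below uses Jaccard similarity (shared words divided by all distinct words). The tokenizer and the sample titles are assumptions made for the example.

```python
import re
from typing import Iterable, Set


def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect the set of lowercase word tokens used across the texts."""
    vocab: Set[str] = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab


def jaccard_overlap(a: Set[str], b: Set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|, or 0.0 when both vocabularies are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical titles; the same call would be repeated for post bodies.
funny_titles = ["my cat is a liquid", "the cat sat"]
serious_titles = ["the cat is sick"]

overlap = jaccard_overlap(vocabulary(funny_titles), vocabulary(serious_titles))
# 3 shared words out of 8 distinct words -> 0.375
```

A value close to 1 would mean the two groups draw on nearly the same words; a value close to 0 would support the specialized-vocabulary hypothesis.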

>> DIAGNOSTIC OUTPUT — VOCABULARY OVERLAP
Vocabulary overlap between funny and non-funny posts (titles and bodies)
OBSERVATION

The results contradict the initial hypothesis. A substantial portion of vocabulary is shared between funny and non-funny posts.

At this level, humor does not manifest through rare or exclusive words.

>> [ANOMALY] expected lexical divergence not detected
>> test failed

03.2 / Yearly Word Cloud Analysis

To further explore lexical patterns, word clouds were generated for each year between 2014 and 2016, comparing funny and non-funny posts. These visualizations provide an intuitive overview of which words dominate attention in each subset.

>> DIAGNOSTIC OUTPUT — YEARLY WORD CLOUDS
Yearly word clouds comparing funny and non-funny titles

Across years, the word clouds appear largely similar for titles. Common platform-related terms and generic verbs remain prominent regardless of humor label. This suggests that humor does not rely on an immediately identifiable vocabulary, reinforcing the idea that it emerges through context and usage rather than word choice.

03.3 / Feature Correlation Analysis

Beyond vocabulary, the lab examines linguistic and emotional features extracted with libraries such as NLTK, VADER, and the NRC Lexicon, to identify text characteristics that might correlate with humor classification. The features below showed the strongest associations with humor labels across more than 1 million posts and comments.

Increases Funniness

FRAC_SPECIAL
Fraction of special characters (!, ?, #, @, etc.) in the text
Library: Custom · Correlation: r = 0.0894 · p-value: p < 1e-9

NEGATIVE_SENTIMENT
Proportion of negative emotional tone in the text
Library: VADER · Correlation: r = 0.0821 · p-value: p < 1e-9

Decreases Funniness

COMPOUND_SENTIMENT
Overall sentiment score: high when positive, and low when negative
Library: VADER · Correlation: r = -0.1213 · p-value: p < 1e-9

NUM_UNIQUE_STOPWORDS
Count of distinct common words (the, is, at, etc.)
Library: NLTK · Correlation: r = -0.0979 · p-value: p < 1e-9

FRAC_ALPHABETICAL
Proportion of standard letters versus numbers and symbols
Library: Custom · Correlation: r = -0.0966 · p-value: p < 1e-9

NUM_UNIQUE_WORDS
Total vocabulary size: number of distinct words used
Library: NLTK · Correlation: r = -0.0873 · p-value: p < 1e-9

INTERPRETATION

Humor correlates with unconventional formatting (special characters, ellipses) and negative emotional content (disgust, negative sentiment). Conversely, conventional positive language and formal vocabulary patterns are associated with non-humorous posts. However, the correlations are too weak to predict humor reliably: every coefficient listed above has a magnitude below 0.13.
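
For a binary humor label, a correlation of this kind is a Pearson coefficient between the feature value and the 0/1 label (also called the point-biserial correlation). The sketch below is an illustration under assumptions: `frac_special` is one plausible reading of the custom FRAC_SPECIAL feature, not the lab's confirmed definition, and the sample texts are hypothetical.

```python
import math
from typing import Sequence


def frac_special(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace.

    Assumed definition for this sketch.
    """
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; with 0/1 labels this is the point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sd_x * sd_y)


# Hypothetical posts with classifier labels (1 = funny).
texts = [
    "Wait... WHAT?!",
    "Installation guide for version 2",
    "lol #blessed @everyone!!!",
    "Meeting notes from Tuesday",
]
labels = [1, 0, 1, 0]

r = pearson_r([frac_special(t) for t in texts], labels)
```

On a four-post toy sample the coefficient is large by construction. On the full archive the same computation yields the small values reported above, which reach significance only because of the sample size.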

03.4 / TF-IDF Analysis

HYPOTHESIS

If humor is not characterized by a distinct vocabulary, it may still be associated with words that are used more characteristically in humorous contexts than in non-humorous ones.

To test this, a TF-IDF analysis was performed to identify terms that are important in funny posts while remaining uncommon in non-funny content, and vice versa.
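
One way to run such a comparison is to score every post with TF-IDF, average the scores per class, and rank terms by the difference between the two averages. The sketch below assumes that approach; the archive does not document the exact weighting, so the smoothed IDF formula, the tokenizer, and the sample posts are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z']+", text.lower())


def mean_tfidf_by_class(texts: Sequence[str], labels: Sequence[int]) -> Dict[int, Dict[str, float]]:
    """Average per-post TF-IDF weight of each term within each class (0 or 1)."""
    docs = [Counter(tokenize(t)) for t in texts]
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    # Smoothed inverse document frequency (an assumed variant).
    idf = {term: math.log((1 + len(docs)) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    sums = {0: defaultdict(float), 1: defaultdict(float)}
    class_sizes = Counter(labels)
    for doc, label in zip(docs, labels):
        total = sum(doc.values())
        for term, count in doc.items():
            sums[label][term] += (count / total) * idf[term]
    return {
        label: {term: s / class_sizes[label] for term, s in terms.items()}
        for label, terms in sums.items()
    }


def distinctive_terms(texts: Sequence[str], labels: Sequence[int], top_k: int = 2) -> Tuple[List[str], List[str]]:
    """Return (terms most typical of funny posts, terms most typical of the rest)."""
    means = mean_tfidf_by_class(texts, labels)
    vocab = set(means[0]) | set(means[1])
    diff = {t: means[1].get(t, 0.0) - means[0].get(t, 0.0) for t in vocab}
    funny = sorted(diff, key=lambda t: (-diff[t], t))[:top_k]
    serious = sorted(diff, key=lambda t: (diff[t], t))[:top_k]
    return funny, serious


# Hypothetical posts with classifier labels (1 = funny).
texts = [
    "lol this meme again",
    "how to fix install error",
    "meme of the year lol",
    "error when running install",
]
labels = [1, 0, 1, 0]
funny_terms, serious_terms = distinctive_terms(texts, labels)
```

Ranking by the difference of class averages surfaces terms that are both heavily weighted in one class and weak in the other, which matches the "important in funny posts while remaining uncommon in non-funny content" criterion.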

>> DIAGNOSTIC OUTPUT — TF-IDF COMPARISON
TF-IDF comparison between funny and non-funny posts
OBSERVATION

In titles, words associated with humor are largely tied to Reddit culture and meta-context, while non-funny titles emphasize functional and informational language.

In bodies, humorous posts exhibit a broader and more expressive vocabulary, whereas non-humorous posts rely more heavily on procedural and problem-solving terms.

03.5 / Negative Affect and Dark Humor

Lexical similarity alone does not explain humor. The emotional polarity of language provides an additional signal.

Humor proportions were compared across posts containing positive versus negative link sentiment. If humor primarily reflected light-hearted or positive expression, humorous posts would be associated with positive affect.

>> DIAGNOSTIC OUTPUT — HUMOR VS. LINK SENTIMENT
OBSERVATION

Across both titles and bodies, humorous posts are more frequently associated with negative sentiment than with positive sentiment.

Odds ratios below one indicate that posts classified as funny are more likely to contain negatively valenced links, revealing a systematic association between humor and negative emotional context.
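
To make the direction of the ratio explicit, the sketch below assumes the odds ratio is defined as the odds of a positive link among funny posts divided by the odds of a positive link among non-funny posts. Under that assumed definition, a value below one means funny posts lean toward negative links. The counts are hypothetical.

```python
def odds_ratio(funny_pos: int, funny_neg: int, other_pos: int, other_neg: int) -> float:
    """Odds of a positive link among funny posts relative to non-funny posts.

    Assumed orientation: a value below 1 means funny posts are more likely
    than non-funny posts to carry a negative link.
    """
    if min(funny_pos, funny_neg, other_pos, other_neg) <= 0:
        raise ValueError("all four cells must be positive for a finite odds ratio")
    return (funny_pos / funny_neg) / (other_pos / other_neg)


# Hypothetical 2x2 table: 80 positive vs 20 negative links among funny posts,
# 90 positive vs 10 negative links among non-funny posts.
ratio = odds_ratio(80, 20, 90, 10)
# (80 / 20) / (90 / 10) = 4 / 9, roughly 0.44
```

In this example both groups are mostly positive, yet the ratio is still well below one, because the ratio compares the relative lean of the two groups and not their absolute proportions.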

This pattern points toward the presence of "dark, ironic, or humiliative humor" (the terms used on Reddit to describe it), where humor likely emerges from contrast, discomfort, or emotional inversion rather than positivity.

If humor relies on emotional contrast rather than lexical novelty, it may be particularly sensitive to context, including social atmosphere and temporal conditions.

03.6 / Key Takeaways

Shared Vocabulary: Humor does not rely on unique words. Funny and non-funny posts use the same lexical base.
Contextual Usage: Humor probably emerges from framing, emphasis, and expressive combinations rather than vocabulary choice.
Negative Affect: Humorous posts are more often associated with negative emotional context, suggesting the presence of dark, ironic, or discomfort-based humor.
Unresolved Question: If words remain stable, could humor instead evolve across time and cultural context?

These observations suggest that humor may not be static. The next step is therefore to examine how humorous expression evolves over time.