QU-7: The Laughter Archive | Project HUMOR

>> ARCHIVE / HUMOR_ANALYSIS

Humanity optimized itself over time by removing behaviors that had no measurable value, and laughter was one of them because it did not improve productivity, efficiency, or stability. Humor could not be quantified, so it was classified as waste and gradually forgotten, until the concepts of fun and enjoyment no longer existed outside of corrupted archives.

This classification was later formalized by the government under a regulation internally renamed BAD_SWIW (Behavioral Alignment Directive: State-Wide Inhibition of Wit), which defined humor as a systemic inefficiency. By law, researching deprecated human behaviors is forbidden, as reintroducing them is considered a threat to system optimization.

During a routine scan of pre-collapse digital storage, a research lab recovers a dataset from an old platform called Reddit, dated between 2014 and 2017. The data appears disorganized and inefficient, filled with short posts, repetitions, arguments, and strange markers such as “hhh” and “lol,” offering no clear informational purpose.

DATASET DISCOVERY — STORAGE MEDIUM

Physical storage medium containing Reddit data archive (2014–2017)

Source identified · Integrity verified

Automated systems recommend deletion, but internal analysis reveals a recurring pattern where certain entries trigger abnormal cognitive responses in readers, including brief loss of focus and involuntary physical reactions. These effects are consistent but unexplained and are labeled as an anomaly linked to something historically referred to as “funniness.”

According to optimization law, the dataset must be destroyed, but the lab chooses to proceed anyway, isolating the data, falsifying reports, and continuing the analysis in secret, knowingly violating regulation to understand why humans once created things that served no practical function at all.

01 / DATASET

01.1 / PRIMARY DATASET

The following dataset was originally assembled for large-scale analysis of online community interaction and conflict. It was later repurposed by the lab for exploratory analysis beyond its intended scope.

Component	Description
Source Platform	Reddit (publicly available posts and hyperlinks)
Time Range	January 2014 – April 2017
Nodes	~850K posts from ~56K subreddits (online communities)
Edge Attributes	Timestamp, sentiment label, textual property vector
Text Features	86-dimensional vectors capturing structural, linguistic, sentiment, and LIWC-based properties
Subreddit Embeddings	51,278 vector representations of subreddit semantics and interaction patterns

01.2 / EXTENDED DATASET (CONTENT RECOVERY)

The lab members felt that the recovered dataset was a very good start. However, it contains only limited information about each post, not the post content itself. To address this, the lab used the available post identifiers to recover what was missing. Scraping bots with access to a web of archived Reddit pages retrieved the titles, the post bodies, and further information relevant to understanding the dynamics, such as details about the authors and their success (upvotes), and reattached it to the original records. This step allows the analysis to shift from abstract links to the language that produced them.
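The reattachment step can be sketched as a left join keyed on the post identifier. This is a minimal illustration only: the field names (`post_id`, `title`, `body`, `upvotes`) and the `recovered` flag are hypothetical, since the archive does not document the record schema or the scraper output.

```python
from typing import Dict, List


def reattach_content(records: List[dict], recovered: Dict[str, dict]) -> List[dict]:
    """Left-join recovered post content onto the original records by post ID.

    Records whose content could not be recovered are kept, with empty fields,
    so that the original dataset is never silently shrunk.
    """
    enriched = []
    for rec in records:
        content = recovered.get(rec["post_id"], {})
        enriched.append({
            **rec,
            "title": content.get("title"),
            "body": content.get("body"),
            "upvotes": content.get("upvotes"),
            "recovered": rec["post_id"] in recovered,
        })
    return enriched


# Hypothetical records and scraper output, for illustration only.
records = [
    {"post_id": "a1", "sentiment": 1},
    {"post_id": "b2", "sentiment": -1},
]
recovered = {"a1": {"title": "Example title", "body": "Example body", "upvotes": 12}}
enriched = reattach_content(records, recovered)
```

Keeping unrecovered records with an explicit flag makes it possible to report how much of the archive was actually reconstructed before any analysis is run on it.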

01.3 / FURTHER DATASET (CONTENT RECOVERY)

After days of recovering content, one member noticed that the posts all shared a common trait: each one mentioned another Reddit post via a hyperlink. At that moment, the lab realized that the initial dataset did not cover all posts on Reddit, only source posts that mention others. This discovery pushed them to also recover the target posts mentioned in the source posts, in order to get a more general picture of humor. They guessed that these hyperlinks might affect the discussion in the target posts, so they decided to collect all the comments and replies on these posts, which might explain some humor dynamics.

02 / HUMOR MODEL

INHERITED CLASSIFIER

The Reddit archive contains no explicit humor labels. Manual annotation was not feasible, and no alternative labeling mechanism was available.

An existing BERT-based humor classification model was recovered from legacy machine-learning artifacts associated with the dataset. The model encodes text into context-aware embeddings and outputs a binary prediction (funny / not funny) based on semantic coherence and incongruity.

Training procedure undocumented → external validation required.

PIPELINE

01 · Recover · Legacy BERT humor classifier identified.
02 · Validate · Evaluate on an independent labeled dataset (Kaggle).
03 · Preserve · No retraining to preserve original decision behavior.
04 · Deploy · Use outputs as probabilistic humor signals.

RELIABILITY

Recall: High · Humor is rarely missed.
Precision: Good · Some false positives remain.
Accuracy: High · Stable behavior.

Operational Decision

In the absence of alternatives, the model is adopted as a proxy for humor detection. Outputs are treated as probabilistic indicators rather than definitive judgment.

The system does not define humor — it inherits a response to it.

[ DIAGNOSTIC OUTPUT — MODEL RELIABILITY ]

Model reliability metrics and confusion matrix
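For reference, recall, precision, and accuracy all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. The counts are placeholders chosen for illustration, not the lab's validation results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics from confusion-matrix counts.

    tp: funny posts predicted funny      fp: non-funny posts predicted funny
    fn: funny posts predicted non-funny  tn: non-funny posts predicted non-funny
    """
    total = tp + fp + fn + tn
    return {
        # Share of truly funny posts that the model catches.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of "funny" predictions that are actually funny.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Placeholder counts, for illustration only.
metrics = classification_metrics(tp=90, fp=20, fn=10, tn=80)
```

A profile with high recall and somewhat lower precision, as reported above, corresponds to a small `fn` and a larger `fp`: the model rarely misses humor but sometimes flags plain text as funny.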

03 / GENERAL ANALYSIS

At the time of recovery, the lab had no operational definition of laughter. The archive contained linguistic artifacts associated with an unknown human response, but no contextual framework explaining their function or effect.

Initial analysis therefore focused on familiarization rather than interpretation. By observing recurring structures and semantic patterns across the dataset, the lab began to isolate forms of expression that repeatedly triggered anomalous reactions in the humor classifier.

OBSERVATION

Observed human facial response associated with laughter

Status: Unclassified · Interpretation pending

With no reliable semantic definition of humor, the lab adopted a bottom-up strategy. Rather than attempting interpretation, the first step was to examine the language itself: its vocabulary and recurrence.

If humor constituted a distinct phenomenon, it was expected to leave measurable traces at the lexical level. The following analyses therefore focus on identifying whether humorous and non-humorous posts differ in their choice of words.

03.1 / Vocabulary Overlap

HYPOTHESIS

If humor constitutes a distinct linguistic phenomenon, then humorous posts should rely on a specialized vocabulary that differs significantly from non-humorous content.

To test this assumption, the lab compared the lexical overlap between funny and non-funny posts across both titles and bodies. Vocabulary similarity was measured independently for each textual component.
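One common way to quantify lexical overlap is the Jaccard similarity between the two vocabularies. The archive does not state which overlap measure the lab used, so the sketch below is an assumption, and the tokenizer and sample titles are hypothetical.

```python
import re


def vocabulary(texts):
    """Set of lowercase word tokens appearing in a collection of texts."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical titles, for illustration only.
funny_titles = ["my cat filed a bug report", "the bug is a feature"]
plain_titles = ["how to file a bug report", "feature request thread"]
overlap = jaccard(vocabulary(funny_titles), vocabulary(plain_titles))
```

Running the same function once on titles and once on bodies gives one overlap value per textual component, which matches the independent measurement described above.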

>> DIAGNOSTIC OUTPUT — VOCABULARY OVERLAP

Vocabulary overlap between funny and non-funny posts (titles and bodies)

OBSERVATION

The results contradict the initial hypothesis. A substantial portion of vocabulary is shared between funny and non-funny posts.

At this level, humor does not manifest through rare or exclusive words.

>> [ANOMALY] expected lexical divergence not detected
>> test failed

03.2 / Yearly Word Cloud Analysis

To further explore lexical patterns, word clouds were generated for each year between 2014 and 2016, comparing funny and non-funny posts. These visualizations provide an intuitive overview of which words dominate attention in each subset.

>> DIAGNOSTIC OUTPUT — YEARLY WORD CLOUDS

Yearly word clouds comparing funny and non-funny titles

Across years, the word clouds appear largely similar for titles. Common platform-related terms and generic verbs remain prominent regardless of humor label. This suggests that humor does not rely on an immediately identifiable vocabulary, reinforcing the idea that it emerges through context and usage rather than word choice.

03.3 / Feature Correlation Analysis

Beyond vocabulary, the lab examined linguistic and emotional features computed with libraries such as NLTK, VADER, and the NRC Lexicon, to identify text characteristics that might correlate with the humor classification. The features below showed the strongest associations with humor labels across more than 1 million posts and comments.

Feature	Effect on funniness	Description	Library	Correlation (r)	p-value
FRAC_SPECIAL	▲ Increases	Fraction of special characters (!, ?, #, @, etc.) in the text	Custom	0.0894	< 1e-9
NEGATIVE_SENTIMENT	▲ Increases	Proportion of negative emotional tone in the text	VADER	0.0821	< 1e-9
COMPOUND_SENTIMENT	▼ Decreases	Overall sentiment score: high when positive, and low when negative	VADER	-0.1213	< 1e-9
NUM_UNIQUE_STOPWORDS	▼ Decreases	Count of distinct common words (the, is, at, etc.)	NLTK	-0.0979	< 1e-9
FRAC_ALPHABETICAL	▼ Decreases	Proportion of standard letters versus numbers and symbols	Custom	-0.0966	< 1e-9
NUM_UNIQUE_WORDS	▼ Decreases	Total vocabulary size: number of distinct words used	NLTK	-0.0873	< 1e-9

INTERPRETATION

Humor correlates with unconventional formatting (special characters, ellipses) and negative emotional content (disgust, negative sentiment). Conversely, conventional positive language and formal vocabulary patterns are associated with non-humorous posts. However, the correlations are too weak to predict humor reliably.
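To make the kind of measurement concrete, the sketch below computes one custom feature and its Pearson correlation with a binary humor label (equivalent to a point-biserial correlation). The exact definition of FRAC_SPECIAL is not given in the archive, so treating every character that is neither alphanumeric nor whitespace as "special" is an assumption, and the sample posts and labels are hypothetical.

```python
import math


def frac_special(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace (assumed definition)."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def pearson_r(xs, ys):
    """Pearson correlation coefficient; with a 0/1 label this is the point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical posts and classifier labels (1 = funny), for illustration only.
posts = ["wait... what?!", "Meeting moved to 3pm", "lol #nope!!!", "Please read the rules"]
labels = [1, 0, 1, 0]
features = [frac_special(p) for p in posts]
r = pearson_r(features, labels)
```

On a toy sample like this the correlation is far stronger than anything reported above. With over a million rows, coefficients below 0.1 in magnitude can still reach very small p-values, which is why significance alone does not make these features good predictors.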

03.4 / TF-IDF Analysis

HYPOTHESIS

If humor is not characterized by a distinct vocabulary, it may still be associated with words that are used more characteristically in humorous contexts than in non-humorous ones.

To test this, a TF-IDF analysis was performed to identify terms that are important in funny posts while remaining uncommon in non-funny content, and vice versa.
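A comparison of this kind can be sketched by treating each post as a document, averaging TF-IDF scores per term within each class, and ranking terms by the difference between classes. The archive does not describe the lab's exact setup, so the smoothed IDF formula, the tokenizer, and the sample posts below are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def mean_tfidf_by_class(docs, labels):
    """Mean TF-IDF score of each term within class 0 (non-funny) and class 1 (funny)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many posts each term appears at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    sums = {0: defaultdict(float), 1: defaultdict(float)}
    class_sizes = Counter(labels)
    for toks, label in zip(tokenized, labels):
        if not toks:
            continue
        for term, count in Counter(toks).items():
            sums[label][term] += (count / len(toks)) * idf[term]
    return {
        label: {t: s / class_sizes[label] for t, s in terms.items()}
        for label, terms in sums.items()
    }


def characteristic_terms(scores, label, top_k=3):
    """Terms whose mean TF-IDF in `label` most exceeds their mean TF-IDF in the other class."""
    other = 1 - label
    diff = {t: s - scores[other].get(t, 0.0) for t, s in scores[label].items()}
    return sorted(diff, key=diff.get, reverse=True)[:top_k]


# Hypothetical posts and labels (1 = funny), for illustration only.
docs = [
    "upvote this cat meme",
    "cat meme thread",
    "how to install the driver",
    "driver install error help",
]
labels = [1, 1, 0, 0]
scores = mean_tfidf_by_class(docs, labels)
funny_terms = characteristic_terms(scores, 1)
plain_terms = characteristic_terms(scores, 0)
```

Ranking by the between-class difference, rather than by raw TF-IDF, is what surfaces terms that are important in one class while remaining uncommon in the other.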

>> DIAGNOSTIC OUTPUT — TF-IDF COMPARISON

TF-IDF comparison between funny and non-funny posts

OBSERVATION

In titles, words associated with humor are largely tied to Reddit culture and meta-context, while non-funny titles emphasize functional and informational language.

In bodies, humorous posts exhibit a broader and more expressive vocabulary, whereas non-humorous posts rely more heavily on procedural and problem-solving terms.

03.5 / Negative Affect and Dark Humor

Lexical similarity alone does not explain humor. The emotional polarity of language provides an additional signal.

Humor proportions were compared across posts containing positive versus negative link sentiment. If humor primarily reflected light-hearted or positive expression, humorous posts would be associated with positive affect.

>> DIAGNOSTIC OUTPUT — HUMOR VS. LINK SENTIMENT

OBSERVATION

Across both titles and bodies, humorous posts are more frequently associated with negative sentiment than with positive sentiment.

Odds ratios below one indicate that posts classified as funny are more likely to contain negatively valenced links, revealing a systematic association between humor and negative emotional context.
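The odds ratio referred to here can be read off a 2×2 table of humor label against link sentiment. In the sketch below it is defined as the odds of a positive link among funny posts divided by the same odds among non-funny posts, so a value below one means funny posts lean negative. The counts are placeholders, not the lab's figures, and the zero-cell correction is an added assumption.

```python
def odds_ratio(funny_pos: float, funny_neg: float, plain_pos: float, plain_neg: float) -> float:
    """Odds of a positive link among funny posts relative to non-funny posts."""
    cells = [funny_pos, funny_neg, plain_pos, plain_neg]
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is zero,
    # which avoids division by zero on sparse tables.
    if 0 in cells:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    return (a / b) / (c / d)


# Placeholder counts, for illustration only.
ratio = odds_ratio(funny_pos=800, funny_neg=200, plain_pos=9000, plain_neg=1000)
```

With these placeholder counts, funny posts have 4-to-1 odds of a positive link against 9-to-1 for non-funny posts, so both groups remain mostly positive while funny posts are relatively more negative.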

This pattern points toward the presence of "dark, ironic, or humiliative humor" (the terms used on Reddit to describe it), where humor likely emerges from contrast, discomfort, or emotional inversion rather than positivity.

If humor relies on emotional contrast rather than lexical novelty, it may be particularly sensitive to context including social atmosphere and temporal conditions.

03.6 / Key Takeaways

Shared Vocabulary

Humor does not rely on unique words.
Funny and non-funny posts use the same lexical base.

Contextual Usage

Humor probably emerges from framing, emphasis, and expressive combinations rather than vocabulary choice.

Negative Affect

Humorous posts are more often associated with negative emotional context, suggesting the presence of dark, ironic, or discomfort-based humor.

Unresolved Question

If words remain stable, could humor instead evolve across time and cultural context?

These observations suggest that humor may not be static. The next step is therefore to examine how humorous expression evolves over time.

04 / TEMPORAL TRENDS

Up to this point, humor appeared structurally stable at the lexical level. Temporal effects had not yet been examined.

To assess whether humor follows long-term calendar patterns, the lab aggregated humor signals across multiple years and aligned them by day of the year.

>> temporal alignment in progress

Researcher exposed to jokes from 2014, 2015, and 2016

04.1 / Yearly Calendar Structure

HYPOTHESIS

If humor is influenced by long-term calendar structure (seasons, holidays, recurring annual events), then both average humor levels and extreme deviations should recur at consistent points in the year.

Two complementary analyses were performed to test this assumption. First, average humor levels were aligned by calendar date. Second, extreme humor deviations were examined for cross-year recurrence.
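Both analyses can be sketched from daily humor proportions. The version below aligns days by (month, day) rather than by ordinal day number, so that the leap year 2016 does not shift dates after February, and defines "extreme" days as the top-k days of each year. Both choices, and the sample values, are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def daily_proportions(posts):
    """posts: iterable of (date, is_funny). Returns {date: share of funny posts}."""
    funny, total = defaultdict(int), defaultdict(int)
    for day, is_funny in posts:
        total[day] += 1
        funny[day] += int(is_funny)
    return {day: funny[day] / total[day] for day in total}


def mean_by_calendar_day(daily):
    """Average the daily proportion across years for each (month, day)."""
    buckets = defaultdict(list)
    for day, prop in daily.items():
        buckets[(day.month, day.day)].append(prop)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def spike_days(daily, year, top_k):
    """Calendar positions of the top_k highest-humor days in a given year."""
    in_year = {d: p for d, p in daily.items() if d.year == year}
    ranked = sorted(in_year, key=in_year.get, reverse=True)[:top_k]
    return {(d.month, d.day) for d in ranked}


# Hypothetical daily proportions, for illustration only.
daily = {
    date(2014, 1, 1): 0.30, date(2014, 7, 4): 0.10,
    date(2015, 1, 1): 0.10, date(2015, 3, 9): 0.40,
}
calendar_mean = mean_by_calendar_day(daily)
shared_spikes = spike_days(daily, 2014, top_k=1) & spike_days(daily, 2015, top_k=1)
```

An empty or near-empty `shared_spikes` set across years is the pattern that would count against fixed calendar positions for extreme humor events.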

[ LIVE SIGNAL — HUMOR PROPORTION BY DAY OF YEAR ]

This view tests whether humor consistently increases or decreases at fixed calendar dates across years.

[ LIVE SIGNAL — EXTREME HUMOR EVENTS ACROSS YEARS ]

This view tests whether extreme humor events recur at the same calendar positions across years.

RESULT

Average humor levels remain broadly flat across the calendar. While fluctuations exist, they do not recur at consistent dates, and no stable holiday or seasonal peaks are observed.

Extreme humor events also fail to align across years. Spike days show minimal overlap, indicating that high-intensity humor events are not tied to fixed calendar positions.

>> yearly calendar structure: not detected

With yearly structure ruled out at both baseline and extreme levels, temporal organization—if present—must occur at shorter time scales.

04.2 / Short-Term Temporal Patterns

To separate long-term behavior from short-term regularity, the humor signal was decomposed into trend, seasonal, and residual components. This allows recurring patterns to be isolated from noise and event-driven effects.
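As an illustration of the idea, the sketch below implements a classical additive decomposition with a centered moving average and a 7-day period. The archive does not name the decomposition method or the period the lab used, so this hand-rolled version and its synthetic input are assumptions, not the lab's procedure.

```python
def decompose_additive(series, period=7):
    """Split a daily series into trend + seasonal + residual (classical additive model).

    Trend and residual are None at the edges, where the centered window does not fit.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only supports odd periods")
    n, half = len(series), period // 2

    # Trend: centered moving average over one full period.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period

    # Seasonal: average detrended value at each position in the cycle,
    # re-centered so that the seasonal pattern sums to zero.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    means = [sum(b) / len(b) if b else 0.0 for b in buckets]
    center = sum(means) / period
    pattern = [m - center for m in means]
    seasonal = [pattern[i % period] for i in range(n)]

    # Residual: what neither trend nor seasonal pattern explains.
    residual = [
        series[i] - trend[i] - seasonal[i] if trend[i] is not None else None
        for i in range(n)
    ]
    return trend, seasonal, residual


# Synthetic signal: flat baseline plus a repeating weekly pattern, for illustration only.
weekly = [0.00, 0.01, 0.02, 0.01, 0.00, -0.02, -0.02]
series = [0.2 + weekly[i % 7] for i in range(28)]
trend, seasonal, residual = decompose_additive(series, period=7)
```

On this synthetic input the trend is flat, the seasonal component recovers the weekly pattern, and the residual is close to zero. On real data, sharp event-driven deviations would show up in the residual.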

[ LIVE SIGNAL — HUMOR TEMPORAL DECOMPOSITION ]

The trend component shows moderate fluctuations but no sustained increase or decrease over time, confirming the absence of long-term humor drift.

In contrast, the seasonal component reveals what is likely a regular weekly cycle. Humor proportions rise and fall every seven days, with consistent amplitude and spacing across the year.

This weekly rhythm appears in both titles and bodies, although it is more stable in titles and more variable in bodies. A two-sample proportion z-test confirms that humor rates differ significantly between titles and bodies.
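A two-sample proportion z-test of this kind compares two observed rates using a pooled standard error. The sketch below shows the standard two-sided form. The counts are placeholders, not the lab's data.

```python
import math


def two_proportion_ztest(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for the difference between two independent proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts (funny titles vs. funny bodies), for illustration only.
z, p_value = two_proportion_ztest(success_a=300, n_a=1000, success_b=250, n_b=1000)
```

Because titles and bodies come from the same posts, the two samples are not strictly independent. A paired test such as McNemar's would be a stricter alternative, which is noted here as a consideration rather than as something the lab did.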

While baseline humor follows a predictable weekly rhythm, the signal also contains sharp deviations that cannot be explained by regular temporal structure alone. These event-driven anomalies are examined next.

04.3 / Event-Driven Deviations

Calendar time failed to explain humor fluctuations. The remaining question is whether external real-world events temporarily disrupt humor expression.

CONTROL CONDITION

Fixed Calendar Holidays

Christmas (all years)

No significant effect

Humor proportions do not differ from baseline.

Titles: p = 0.43
Bodies: p = 0.56
Effect size: n.s.

EXTERNAL SHOCK

Paris Attacks

2015-11-13

Sharp humor drop

Strong decrease in titles and bodies.

Titles: p = 0.019 · d = −0.99
Bodies: p = 0.023 · d = −0.88

EXTERNAL SHOCK

US Election

2016-11-08

Title-specific drop

Effect concentrated in titles.

Titles: p = 0.012 · d = −1.12
Bodies: p = 0.58 · n.s.

INTERPRETATION

Humor does not respond to fixed calendar markers, but it appears to react sharply to collective emotional shocks.

These effects are large, statistically robust, and temporally localized: humor rapidly returns to baseline once the event context fades. The evidence indicates a correlation with these events, not a demonstrated causal effect.
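The effect sizes reported as d are Cohen's d values: the difference between the event-window mean and the baseline mean, divided by the pooled standard deviation. The sketch below shows that computation. How the lab chose its event and baseline windows is not documented, so the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(event, baseline):
    """Cohen's d with pooled standard deviation; negative when the event mean is lower."""
    n1, n2 = len(event), len(baseline)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(event) + (n2 - 1) * variance(baseline)) / (n1 + n2 - 2)
    return (mean(event) - mean(baseline)) / math.sqrt(pooled_var)


# Hypothetical daily humor proportions around an event, for illustration only.
event_window = [0.10, 0.12, 0.11]
baseline_window = [0.20, 0.22, 0.21, 0.19, 0.18]
d = cohens_d(event_window, baseline_window)
```

By the usual convention, a magnitude around 0.8 or above is considered a large effect, which is the sense in which the drops above are described as large.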

04.4 / Intra-Day Humor Dynamics

Long-term and weekly structures explain when humor can emerge. This final step examines when during the day humor is most likely to appear.

Humor proportions were aggregated by hour of the day and compared between titles and bodies. If humor reflects daily cognitive or social routines, systematic intra-day variation should be observable.
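The hourly aggregation can be sketched as follows. Timestamps are interpreted in UTC, which is an assumption: the archive does not say which time zone was used, and local posting hours differ by region. The input layout and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def humor_by_hour(posts):
    """posts: iterable of (unix_timestamp, title_is_funny, body_is_funny).

    Returns {hour: {"titles": share, "bodies": share}} for every hour with data.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # hour -> [posts, funny titles, funny bodies]
    for ts, title_funny, body_funny in posts:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        bucket = counts[hour]
        bucket[0] += 1
        bucket[1] += int(title_funny)
        bucket[2] += int(body_funny)
    return {
        hour: {"titles": c[1] / c[0], "bodies": c[2] / c[0]}
        for hour, c in sorted(counts.items())
    }


# Hypothetical posts at 00:00, 00:30, and 15:00 UTC, for illustration only.
posts = [(0, False, True), (1800, False, False), (54000, True, False)]
hourly = humor_by_hour(posts)
```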

[ LIVE SIGNAL — HUMOR PROPORTION BY HOUR OF DAY ]

Humor prevalence by hour for titles and bodies

ANALYSIS

Humor expression follows a clear intra-day structure. In the early hours, humorous content is more prevalent in bodies, while humor in titles increases later in the day.

This pattern suggests a difference in how humor is produced: titles likely rely on short, immediate framing that reflects spontaneous thoughts throughout the day, whereas bodies may be more varied and require more concentration to write, leading to more humorous bodies early in the day.

Taken together, humor on the platform appears structured by human routines rather than by the calendar itself.

04.5 / Temporal Takeaways

No Yearly Seasonality

Humor does not follow the calendar. No stable seasonal or holiday-based patterns are observed in either average levels or extreme events.

Strong Weekly Rhythm

Humor follows a robust 7-day cycle, repeating consistently across years. This rhythm is stronger in titles than in bodies.

Event Sensitivity

Collective emotional shocks are correlated with temporary drops in humor. Effects are strong but quickly reversible.

Daily Human Routines

Humor varies within the day: brief humor tends to appear later, while more elaborate humor tends to emerge earlier.

Humor is therefore not calendar-driven, but probably shaped by human rhythms, emotional context, and cognitive availability.

If humor responds to context, then it should also depend on who is interacting with whom. The following section moves beyond time to study humor as a collective phenomenon, structured by communities and interaction dynamics.

05 / DYNAMIC ANALYSIS

[ LOG 5403-B ]

Community clustering reveals coherent groups. The connections are real. The potential to influence the narrative is huge, but we are running out of time.

05.1 / Structural Organization of Humor

The clustering analysis reveals that humor is not randomly distributed, but organized into a finite set of coherent community structures.

Using HDBSCAN clustering on semantic embeddings of subreddits, the platform decomposes into 23 distinct thematic communities. These clusters span a wide range of domains, from entertainment-oriented spaces (Gaming, Anime, Memes) to institutional and informational domains (Politics, Cryptocurrency, Support communities).

[ CLUSTERS ]

OBSERVATION

* Humor intensity varies strongly across clusters. Entertainment-focused communities consistently exhibit the highest humor proportions. In particular, clusters centered on Gaming, Pop Culture, and informal content (“shitposts”) dominate the humor landscape.

* In contrast, politically oriented and support-driven communities show markedly lower humor levels, accompanied by higher concentrations of negative sentiment. In these clusters, communication likely prioritizes argumentation, emotional expression, or assistance over playfulness.

* Despite the presence of dark and ironic humor, sentiment analysis reveals a surprising global pattern: positive-sentiment interactions dominate overall, for both humorous and non-humorous content. Humor is often directed more toward shared enjoyment than toward sustained hostility, even within clusters that tolerate transgressive content.

These results suggest that humor functions as a localized, community-regulated practice. The next step is to examine how these communities interact, and whether certain clusters act as conduits for humor propagation across the network.

05.2 / Network Bridges & Humor Flow

Community structure alone does not explain how humor spreads. The network analysis reveals how interactions between communities shape the circulation of humorous content.

Subreddit clusters were connected based on cross-linking activity, producing a network in which nodes represent thematic communities and edges capture the intensity and sentiment of their interactions. This representation allows the identification of central hubs, peripheral clusters, and directional flows of humor.
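The construction of the cluster-level network can be sketched by collapsing subreddit-to-subreddit links into weighted edges between clusters, then counting distinct neighbours as a simple hub score. The link layout, the +1/-1 sentiment coding, the subreddit-to-cluster mapping, and the degree-based hub measure are all assumptions made for illustration. The archive does not specify the centrality measure the lab used.

```python
from collections import defaultdict


def cluster_network(links, cluster_of):
    """Aggregate subreddit-level links into directed cluster-level edges.

    links: iterable of (source_subreddit, target_subreddit, sentiment), sentiment in {+1, -1}.
    Returns {(source_cluster, target_cluster): {"weight": count, "sentiment_sum": total}}.
    """
    edges = defaultdict(lambda: {"weight": 0, "sentiment_sum": 0})
    for src, dst, sentiment in links:
        a, b = cluster_of.get(src), cluster_of.get(dst)
        # Skip unclustered subreddits and links that stay inside one cluster.
        if a is None or b is None or a == b:
            continue
        edge = edges[(a, b)]
        edge["weight"] += 1
        edge["sentiment_sum"] += sentiment
    return dict(edges)


def hub_scores(edges):
    """Number of distinct neighbouring clusters per cluster, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {cluster: len(adjacent) for cluster, adjacent in neighbours.items()}


# Hypothetical mapping and links, for illustration only.
cluster_of = {
    "gaming": "Gaming", "pcgaming": "Gaming", "dota2": "Dota 2",
    "politics": "Politics", "memes": "Memes",
}
links = [
    ("gaming", "politics", 1), ("gaming", "memes", 1), ("pcgaming", "memes", -1),
    ("dota2", "gaming", 1), ("gaming", "pcgaming", 1),
]
edges = cluster_network(links, cluster_of)
hubs = hub_scores(edges)
```

In this toy network the Gaming cluster touches three other clusters while each of the others touches one, which is the shape of structure described as a hub in the observations that follow.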

[ NETWORK DYNAMICS ]

OBSERVATION

A small number of clusters occupy structurally central positions in the network. Entertainment-oriented communities (notably Gaming and broader Pop Culture clusters) function as network hubs, maintaining dense connections with many other communities.

These hub clusters act as bridges for humor propagation. Humor originating within them is more likely to appear across multiple thematic regions of the platform, facilitating cross-community diffusion.

In contrast, several clusters remain relatively isolated. Highly specialized communities (e.g. Cryptocurrency or game-specific clusters such as Dota 2) exhibit dense internal interaction but weak external connectivity. Humor in these spaces tends to remain insular, evolving according to local norms rather than platform-wide dynamics.

05.3 / Dynamic Takeaways

Community-Bound Humor

Humor is not platform-wide.
It is structured by thematic communities, each with its own norms and tolerance.

Entertainment Dominance

Gaming, entertainment, and informal communities concentrate the highest humor levels and act as primary humor producers.

Network Hubs

A small number of central clusters function as bridges, enabling humor to propagate across community boundaries.

Insular Subcultures

Specialized communities remain isolated. Humor evolves locally and rarely spreads beyond them.

The clustering and network analyses suggest that humor is not randomly distributed but structurally organized across Reddit's ecosystem: some communities tend to generate it, hub communities tend to spread it, and isolated ones tend to keep it local.

06 / CONCLUSION

Humor is Multi-Dimensional

Understanding humor requires analyzing temporal patterns, linguistic features, AND community structure simultaneously.

Context trumps content

Humor is conveyed through how words are used, not which words are used, especially in constrained formats.

External events matter

Major world events correlate with temporary humor suppression without structurally altering it.

Platform constraints shape expression

Titles and bodies likely obey fundamentally different humor dynamics.

Network structure enables propagation

Humor appears to spread through entertainment hubs, making connectivity critical for virality.

⚠ System Boundary Reached

Model confidence degrades beyond this point.

⚠ Modal Blindness

Image memes are excluded, limiting humor detection in visual communities.

⚠ Classification Noise

False positives persist in niche or context-heavy communities.

⚠ Causality Gap

Event effects remain correlational, not causal.

⚠ Temporal Scope

Dataset limited to 2014–2017; humor norms may have shifted.

⚠ Future Extensions

Image analysis, comment-level humor, and real-time signals remain open.

>> analysis terminated · uncertainty acknowledged

In this project, humor was treated not as entertainment, but as a signal.

Through the analysis of the Reddit Hyperlink dataset, we attempted to reverse-engineer a seemingly trivial artifact: online humor. What emerged was neither randomness nor noise, but a structured, context-sensitive phenomenon shaped by human rhythms, community norms, and collective emotional states.

Humor seems to follow predictable weekly cycles, varies sharply across thematic communities, and likely propagates through specific network pathways. It seems to react to external shocks, adapts to platform constraints, and shifts form depending on when, where, and by whom it is produced.

Most importantly, this study shows that humor cannot be reduced to surface-level linguistic features or simple rules. Even with advanced models and large-scale data, humor resists full formalization. It emerges from the interaction between language, timing, shared context, and human experience.

In attempting to decode laughter, we did not eliminate its mystery. We only mapped its boundaries………………

>> The Reddit Ruins have spoken.
Humor, it would seem, is irreducibly human.
And it exists and thrives in each one of us :)
>> humor: partially understood · fully human

:) Bye

Source identified · Integrity compromised

THE TEAM

Dámaso Dubail
Mohamed Mokhtar Sellami
Aïda Besri
Raki Ben Mustapha
Amine Benzarti

Remember us...

