YouTube dataset of dislikes

In November of 2021, YouTube removed the public dislike count from all of its videos. While creators can still see the number of dislikes, viewers can only see how many likes a video has received.

Info

The dataset has over 4.55 billion records, so be careful about copying and pasting the commands below unless your resources can handle that volume. The commands below were executed on a Production instance of ClickHouse Cloud.

The data is in JSON format and can be downloaded from archive.org. We have made the same data available in S3 so that it can be loaded more efficiently into a ClickHouse Cloud instance.

Here are the steps to create a table in ClickHouse Cloud and insert the data.

Note

The steps below also work on a local install of ClickHouse. The only change is to use the s3 function instead of s3Cluster (unless you have a cluster configured - in which case change default to the name of your cluster).

Step-by-step instructions

  1. Let's see what the data looks like. The s3Cluster table function returns a table, so we can DESCRIBE the result:
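A minimal sketch of such a DESCRIBE query. The bucket URL is a placeholder, not the real location of the dataset, and the JSONLines format is an assumption based on the data being described as JSON:

```sql
-- Hypothetical URL: replace it with the actual S3 path of the dataset files.
DESCRIBE s3Cluster(
    'default',                                      -- cluster name, as mentioned in the note above
    'https://<bucket>.s3.amazonaws.com/<path>/*',   -- placeholder glob over the data files
    'JSONLines'                                     -- assumed: one JSON object per line
);
```

On a local install without a cluster, the same call with s3 instead of s3Cluster (and without the first argument) should behave the same way.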

ClickHouse infers the following schema from the JSON file:

  1. Based on the inferred schema, we cleaned up the data types and added a primary key. Define the following table:
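As a reduced sketch of such a table, the definition below keeps only the fields this page refers to. The column names and types are assumptions, and the real dataset has more fields. The one detail taken from the text is that uploader is the first column of the primary key; the second key column is a guess:

```sql
-- Reduced, assumed schema: not the full table definition.
CREATE TABLE youtube
(
    fetch_date          DateTime,
    upload_date_str     String,   -- original, unparsed upload date
    upload_date         Date,     -- parsed upload date (zero value if parsing failed)
    title               String,
    uploader            String,
    view_count          Int64,
    like_count          Int64,
    dislike_count       Int64,
    has_subtitles       Bool,
    is_comments_enabled Bool,
    description         String
)
ENGINE = MergeTree
ORDER BY (uploader, upload_date);  -- uploader first, so filters on uploader read few rows
```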
  1. The following command streams the records from the S3 files into the youtube table.
Info

This inserts a lot of data - about 4.56 billion rows. If you do not want the entire dataset, simply add a LIMIT clause with the desired number of rows.
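A sketch of what such an INSERT could look like. It assumes a youtube table whose columns are, in order, fetch_date, upload_date_str, upload_date, title, uploader, view_count, like_count, dislike_count, has_subtitles, is_comments_enabled and description; those names, the JSONLines format and the placeholder URL are all assumptions rather than the exact command:

```sql
-- INSERT ... SELECT maps columns by position, so the order below must match the table.
INSERT INTO youtube
SELECT
    parseDateTimeBestEffortUSOrZero(toString(fetch_date)),         -- fetch_date
    upload_date,                                                   -- upload_date_str (raw value)
    toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)),  -- upload_date (parsed)
    ifNull(title, ''),                                             -- title
    ifNull(uploader, ''),                                          -- uploader
    ifNull(view_count, 0),                                         -- view_count
    ifNull(like_count, 0),                                         -- like_count
    ifNull(dislike_count, 0),                                      -- dislike_count
    ifNull(has_subtitles, false),                                  -- has_subtitles
    ifNull(is_comments_enabled, false),                            -- is_comments_enabled
    ifNull(description, '')                                        -- description
FROM s3Cluster(
    'default',
    'https://<bucket>.s3.amazonaws.com/<path>/*',  -- hypothetical URL
    'JSONLines'
)
LIMIT 1000000;  -- optional: remove to load the entire dataset
```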

Some comments about our INSERT command:

  • The parseDateTimeBestEffortUSOrZero function is handy when the incoming date fields may not be in the proper format. If fetch_date cannot be parsed, it is set to 0.
  • The upload_date column contains valid dates, but it also contains strings like "4 hours ago", which is certainly not a valid date. We decided to store the original value in upload_date_str and attempt to parse it with toDate(parseDateTimeBestEffortUSOrZero(upload_date::String)). If the parsing fails, we just get 0.
  • We used ifNull to avoid getting NULL values in our table. If an incoming value is NULL, ifNull replaces it with an empty string.
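To see how these functions behave before loading anything, they can be tried on literal values. This is only an illustration; the sample strings are made up, and no table is needed:

```sql
SELECT
    parseDateTimeBestEffortUSOrZero('2021-12-01 10:15:00')  AS parsed_datetime,   -- a well-formed value
    parseDateTimeBestEffortUSOrZero('4 hours ago')          AS unparsed_datetime, -- not a date
    toDate(parseDateTimeBestEffortUSOrZero('4 hours ago'))  AS unparsed_as_date,
    ifNull(CAST(NULL, 'Nullable(String)'), '')              AS null_replaced;     -- NULL becomes ''
```

Per the comments above, the two columns built from "4 hours ago" should come back as the zero date value rather than raising an error.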
  1. Open a new tab in the SQL Console of ClickHouse Cloud (or a new clickhouse-client window) and watch the count increase. It will take a while to insert 4.56B rows, depending on your server resources. (Without any tweaking of settings, it takes about 4.5 hours.)
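A simple way to watch progress is to re-run a count on the target table. The formatReadableQuantity wrapper is optional and only makes the large number easier to read:

```sql
-- Re-run periodically while the INSERT is in progress.
SELECT formatReadableQuantity(count()) AS rows_so_far
FROM youtube;
```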
  1. Once the data is inserted, go ahead and count the number of dislikes of your favorite videos or channels. Let's see how many videos were uploaded by ClickHouse:
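A sketch of that count, assuming the uploader column holds the channel name exactly as 'ClickHouse':

```sql
SELECT count()
FROM youtube
WHERE uploader = 'ClickHouse';
```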
Note

The query above runs so quickly because we chose uploader as the first column of the primary key - so it only had to process 237k rows.

  1. Let's look at the likes and dislikes of ClickHouse videos:
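One possible form of this query, assuming columns named title, like_count and dislike_count; the sort order is a choice made for this sketch:

```sql
SELECT
    title,
    like_count,
    dislike_count
FROM youtube
WHERE uploader = 'ClickHouse'
ORDER BY dislike_count DESC;  -- most-disliked videos first
```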

The response looks like:

  1. Here is a search for videos with ClickHouse in the title or description fields:
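A sketch of such a search, assuming columns named title, description, view_count, like_count and dislike_count. ILIKE makes the match case-insensitive, and because neither column is part of the primary key, the filter cannot skip any rows:

```sql
SELECT
    title,
    view_count,
    like_count,
    dislike_count
FROM youtube
WHERE (title ILIKE '%ClickHouse%') OR (description ILIKE '%ClickHouse%')
ORDER BY
    like_count DESC,
    view_count DESC;
```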

This query has to process every row, and also parse through two columns of strings. Even then, we get decent performance at 4.15M rows/second:

The results look like:

Questions

If someone disables comments, does it lower the chance that a viewer will actually click like or dislike?

When commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
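One rough way to explore this is to compare the number of like and dislike clicks per view for videos with and without comments. The sketch below assumes columns named is_comments_enabled, view_count, like_count and dislike_count, and it does not control for how popular a video is:

```sql
SELECT
    is_comments_enabled,
    sum(like_count + dislike_count) / sum(view_count) AS clicks_per_view
FROM youtube
WHERE view_count > 0          -- avoid dividing by zero views
GROUP BY is_comments_enabled
ORDER BY is_comments_enabled;
```

A more careful version would also group by a view-count bucket, so that very popular videos do not dominate the ratio.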

Enabling comments seems to be correlated with a higher rate of engagement.

How does the number of videos change over time - notable events?
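A sketch of a monthly breakdown, assuming upload_date is a Date column in which unparsed values were stored as the zero date. Note that uniq returns an approximate distinct count:

```sql
SELECT
    toStartOfMonth(upload_date) AS month,
    uniq(uploader)              AS uploaders,   -- approximate number of distinct uploaders
    count()                     AS num_videos
FROM youtube
WHERE upload_date > '1970-01-01'                -- skip rows whose date failed to parse
GROUP BY month
ORDER BY month;
```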

A spike in uploaders around COVID is noticeable.

More subtitles over time and when

With advances in speech recognition, it's easier than ever to create subtitles for videos. YouTube added auto-captioning in late 2009 - was the jump then?
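The share of videos with subtitles per month can be sketched as follows, assuming a Bool column named has_subtitles and a parsed upload_date column:

```sql
SELECT
    toStartOfMonth(upload_date)       AS month,
    countIf(has_subtitles) / count()  AS share_with_subtitles
FROM youtube
WHERE upload_date > '1970-01-01'      -- skip rows whose date failed to parse
GROUP BY month
ORDER BY month;
```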

The results show a spike in 2009. Apparently, at that time YouTube was removing its community captions feature, which allowed you to upload captions for other people's videos. This prompted a very successful campaign to have creators add captions to their videos for hard-of-hearing and deaf viewers.

Top uploaders over time

How does the like ratio change as views go up?
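One way to approach this is to bucket videos by order of magnitude of views and compute the like ratio per bucket. The bucketing scheme and the column names are assumptions made for this sketch:

```sql
SELECT
    pow(10, ceil(log10(view_count + 1)))               AS view_range,  -- upper bound of the bucket
    sum(like_count) / sum(like_count + dislike_count)  AS like_ratio
FROM youtube
WHERE view_count >= 0
  AND (like_count + dislike_count) > 0                 -- ignore videos with no votes
GROUP BY view_range
ORDER BY view_range;
```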

How are views distributed?
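A quick look at the distribution can be had from a few quantiles of the view count. The chosen levels are arbitrary, and quantiles returns approximate values:

```sql
SELECT quantiles(0.5, 0.9, 0.99, 0.999)(view_count) AS view_quantiles
FROM youtube;
```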