Hello, everyone!
Today I'm going to introduce DAYU.
If you have already provisioned big data products such as MRS, DLI, or DWS, your next step is to get your business running on those services quickly. Data Lake Factory (DLF) is an online, one-stop big data development platform that helps users move to the cloud quickly. This article uses Data Lake Factory (DLF) and Data Lake Insight (DLI) to analyze raw movie-rating data and output the TOP 20 highest-rated and most active movies. Along the way, you will learn DLF script editing, job editing, job scheduling, and other functions, as well as basic DLI SQL syntax.
Data Lake Factory provides a one-stop collaborative big data development platform that helps users easily complete data modeling, data integration, script development, job scheduling, operation and maintenance monitoring, and other tasks. It greatly lowers the barrier to using big data and helps users quickly build a big data processing center.
Data preparation
The demo data comes from https://grouplens.org/datasets/movielens/ ; for convenience, it is saved in the attachment of this article. Please upload the attached data to an OBS bucket; DLF and DLI will then process the data directly in the bucket.
Original data description
movies.csv saves basic movie information, fields: movie ID, name, genre
ratings.csv saves movie ratings, fields: user ID, movie ID, rating, rating time, rating range: 0~5

Create database tables
We need to create two tables, movie and rating, to store the raw data, and two tables, top_rating_movie and top_active_movie, to store the analysis results, that is, the TOP 20 highest-rated and most active movies.
The CREATE TABLE statements can be executed in either the DLI editor or the DLF editor. Below, we use the DLF editor.
First, you need to create a DLF to DLI connection.

Then create a SQL script in "Data Development" and open the script editor.
Enter the following CREATE TABLE statements to create the four tables.
create table movie(movieId int, title varchar(256), genres varchar(256)) using csv options (path 's3a://obs-movies/movies.csv');
create table rating(userId int, movieId int, rating float, rating_date long) using csv options (path 's3a://obs-movies/ratings.csv');
create table top_rating_movie(title varchar(256), avg_rating float, rating_user_number int);
create table top_active_movie(title varchar(256), avg_rating float, rating_user_number int);
Description
movie and rating are OBS tables whose data is stored on OBS; top_rating_movie and top_active_movie are DLI tables, which store the analysis results.
After the movie and rating tables are created successfully, you can use SELECT statements to query the data in the OBS files s3a://obs-movies/movies.csv and s3a://obs-movies/ratings.csv.
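For example, a quick sanity check on the newly created tables might look like the following (illustrative queries; they simply read the CSV data through the table definitions above):

```sql
-- Preview a few movies read from s3a://obs-movies/movies.csv
select * from movie limit 10;

-- Count how many rating records were loaded from s3a://obs-movies/ratings.csv
select count(*) from rating;
```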
Analyze the data
Create SQL scripts that calculate the TOP 20 highest-rated and most active movies from the movie and rating tables, and store the results in the top_rating_movie and top_active_movie tables.
Content of the script that calculates top_rating_movie:
insert overwrite table top_rating_movie
select m.title, r.avg_rating, r.rating_user_number
from
  (select movieId, avg(rating) as avg_rating, count(userId) as rating_user_number
   from rating group by movieId) r
  join movie m on r.movieId = m.movieId
where r.rating_user_number >= 100
order by r.avg_rating desc
limit 20;
This SQL first uses a sub-SELECT on the rating table to compute, per movieId, the average rating and the number of rating users. The sub-SELECT is then joined with the movie table, records with fewer than 100 rating users are filtered out, and the movie title, average rating, and number of rating users are returned. Finally, the result is saved to the top_rating_movie table.
Content of the script that calculates top_active_movie:
insert overwrite table top_active_movie
select m.title, r.avg_rating, r.rating_user_number
from
  (select movieId, avg(rating) as avg_rating, count(userId) as rating_user_number
   from rating group by movieId) r
  join movie m on r.movieId = m.movieId
where r.avg_rating > 3.5
order by r.rating_user_number desc
limit 20;
The most active TOP 20 movies are calculated as follows: among movies whose average rating is greater than 3.5, take the 20 movies with the largest number of rating users.
Refresh the TOP 20 data every day
We assume that the movie and rating tables change in real time, so we want to update the TOP 20 movies every day. This is where DLF's job orchestration and job scheduling functions come in.
First create a DLF job

The job contains one Dummy node and two DLI SQL nodes. Each DLI SQL node is associated with one of the DLI SQL scripts created above (top_rating_movie and top_active_movie), which calculate the highest-rated and most active TOP 20 movies.

After the orchestration is complete, click "Run Test" to test the job. The run log is shown below.
If the log shows that the job runs normally, configure the job scheduling policy.

Scheduling policy: run the job at 00:00 every day. After the configuration is complete, click Save, and then click Submit. The job will then run automatically every day and save the TOP 20 movies to the result tables (top_rating_movie and top_active_movie).
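After a scheduled run completes, the results can be checked with simple queries like the following (illustrative; column names come from the table definitions above):

```sql
-- TOP 20 highest-rated movies (at least 100 rating users)
select title, avg_rating, rating_user_number
from top_rating_movie order by avg_rating desc;

-- TOP 20 most active movies (average rating above 3.5)
select title, avg_rating, rating_user_number
from top_active_movie order by rating_user_number desc;
```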
Job monitoring
The job runs automatically every day, and users want to know promptly whether each execution succeeded or failed. Users can track job results through DLF's job monitoring interface, email notifications, and SMS notifications.
