ETL (Extract, Transform, Load) with command-line tools

https://github.com/nogibjj/coursera-applied-data-eng-projects

GitHub - nogibjj/coursera-applied-data-eng-projects: Project for the Duke Coursera Applied Data Engineering Specialization

This post walks through a hands-on exercise in running ETL jobs with command-line tools.

First, main.py looks like this.

#!/usr/bin/env python3

"""
Main cli or app entry point
"""
import yake
import click

# create a function that reads a file
def read_file(filename):
    with open(filename, encoding="utf-8") as myfile:
        return myfile.read()


# extract (keyword, score) pairs from text with YAKE
def extract_keywords(text):
    kw_extractor = yake.KeywordExtractor()
    keywords = kw_extractor.extract_keywords(text)
    return keywords


# create a function that makes hashtags (#tag)
def make_hashtags(keywords):
    hashtags = []
    for keyword in keywords:  # remove spaces with replace
        hashtags.append("#" + keyword[0].replace(" ", ""))
    return hashtags


@click.group()
def cli():
    """A cli for keyword extraction"""


@cli.command("extract")  # command: main.py extract
@click.argument("filename", default="text.txt")
def extract(filename):
    """Extract keywords from a file"""
    text = read_file(filename)
    keywords = extract_keywords(text)
    click.echo(keywords)


@cli.command("hashtags")  # command: main.py hashtags
@click.argument("filename", default="text.txt")
def hashtagscli(filename):
    """Extract keywords from a file and make hashtags"""
    text = read_file(filename)
    keywords = extract_keywords(text)
    hashtags = make_hashtags(keywords)
    click.echo(hashtags)


if __name__ == "__main__":
    cli()

The result of running these commands:

[Exercise 1]: Add a flag to the command-line tool to limit the number of keywords to a maximum of the top keywords. Note that the lowest score is better.

When keywords are printed, each one comes out as a (word, number) tuple (see the extract output), and the number is the score.

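In other words, the exercise asks for a command and function that print only the top keywords, where "top" means the lowest scores: in YAKE, a lower score marks a more relevant keyword.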

The new command can be written as follows.

@cli.command("top")  # command name
@click.option("--n", default=10, type=int, help="Limit the number of top keywords.")
@click.argument("filename", default="text.txt")
def top_keywords(filename, n):
    """Extract top keywords from a file: top --n {number}"""  # text shown by --help
    text = read_file(filename)
    keywords = extract_keywords(text)
    sorted_keywords = sorted(keywords, key=lambda x: x[1])  # sort ascending, lowest score first
    top = [key[0] for key in sorted_keywords[:n]]  # keep only the keyword strings of the top n
    click.echo(top)
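
The selection step inside the top command can be tried on its own, without YAKE or a text file. The sketch below is only an illustration: the sample keywords and scores are made up, and select_top is a hypothetical helper name that does not exist in main.py. It uses heapq.nsmallest instead of a full sort, which gives the same result for this purpose.

```python
import heapq

# Hypothetical (keyword, score) pairs in the same shape extract_keywords returns.
# The scores are made up; a lower score means a more relevant keyword.
SAMPLE_KEYWORDS = [
    ("data engineering", 0.02),
    ("command-line tools", 0.05),
    ("python", 0.15),
    ("database", 0.09),
]


def select_top(keywords, n):
    """Return the n keywords with the lowest scores, best first."""
    best = heapq.nsmallest(n, keywords, key=lambda pair: pair[1])
    return [keyword for keyword, _score in best]


print(select_top(SAMPLE_KEYWORDS, 2))
```

If n is larger than the number of keywords, the helper simply returns all of them in score order.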

The top command is used as follows.

This is the result of adding the new command and its help text, then running it with several different keyword counts.

To use a file other than text.txt, pass it as the last argument:

python main.py top --n 3 newfile.txt

[Exercise 2]: Grab some text on the internet, say wikipedia or a blog post and create hashtags with it.

This part needs new code that fetches text from a website, and the existing commands have to be modified to use it.

#!/usr/bin/env python3

"""
Main cli or app entry point
"""
import yake
import click
import requests  # used to fetch web pages

# create a function that reads a file or a webpage
def read_file_or_webpage(filename_or_url):  # dispatch on whether the input is a URL or a local file
    if filename_or_url.startswith("http://") or filename_or_url.startswith("https://"):
        return read_webpage(filename_or_url)
    else:
        return read_file(filename_or_url)


# new function to read webpage
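def read_webpage(url):  # for a URL, check that the request succeeded and return the page text
    response = requests.get(url, timeout=10)  # timeout so an unresponsive host cannot hang the CLI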
    if response.status_code == 200:
        return response.text
    else:
        raise ValueError(
            f"Failed to fetch webpage. Status code: {response.status_code}")


# create a function that reads a file (used when the input is a local file)
def read_file(filename):
    with open(filename, encoding="utf-8") as myfile:
        return myfile.read()


# def extract keywords
def extract_keywords(text):
    kw_extractor = yake.KeywordExtractor()
    keywords = kw_extractor.extract_keywords(text)
    return keywords


# create a function that makes hash tags
def make_hashtags(keywords):
    hashtags = []
    for keyword in keywords:
        hashtags.append("#" + keyword[0].replace(" ", ""))
    return hashtags


@click.group()
def cli():
    """A cli for keyword extraction"""


@cli.command("extract")
@click.argument("filename_or_url", default="text.txt")
def extract(filename_or_url):
    """Extract keywords from a file or a webpage"""
    text = read_file_or_webpage(filename_or_url)
    keywords = extract_keywords(text)
    click.echo(keywords)


@cli.command("hashtags")
@click.argument("filename_or_url", default="text.txt")
def hashtagscli(filename_or_url):
    """Extract keywords from a file or a webpage and make hashtags"""
    text = read_file_or_webpage(filename_or_url)
    keywords = extract_keywords(text)
    hashtags = make_hashtags(keywords)
    click.echo(hashtags)

# Add a flag to the command-line tool to limit the number of keywords to a maximum of the top keywords.


@cli.command("top")
@click.option("--n", default=10, type=int, help="Limit the number of top keywords.")
@click.argument("filename_or_url", default="text.txt")
def top_keywords(filename_or_url, n):
    """Extract top keywords from a file or a webpage: top --n {number}"""
    text = read_file_or_webpage(filename_or_url)
    keywords = extract_keywords(text)
    # sort ascending, lowest score first
    sorted_keywords = sorted(keywords, key=lambda x: x[1])
    top = [key[0] for key in sorted_keywords[:n]]  # keep only the keyword strings of the top n
    click.echo(top)


if __name__ == "__main__":
    cli()

To run it against a web address, pass the URL in place of the filename:

python main.py top --n 3 "https://www.example.com"

Some sites may fail because of problems with their SSL certificate.

Wikipedia worked, but when I tried my own blog the request failed with a certificate-related error.
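
Separately from the certificate issue, note that response.text is the raw HTML of the page, so tag names, attributes, and script contents may end up among the extracted keywords. The following is a minimal sketch, using only the standard library, of stripping the markup before keyword extraction. TextExtractor and html_to_text are hypothetical names that are not part of the original main.py, and the sample page is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>ETL</h1><p>Extract and <b>transform</b> text</p></body></html>"
)
print(html_to_text(page))
```

One way to use it would be to return html_to_text(response.text) from read_webpage, so that extract_keywords only sees readable text.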

Next is the file that creates the database and runs queries for the ETL step.

#!/usr/bin/env python3

"""
Extract keywords from a text file and load them into a database.
"""
from main import extract_keywords, make_hashtags, read_file
import click
import sqlite3
import os
from os import path

DATABASE = "keywords.db"

# create a function that returns keywords, score and hashtags
# create a function that loads keywords and score into a database
def load_keywords(keywords, score, hashtags):
    """Load keywords, hashtags  and their scores into a database"""

    db_exists = False
    # if path to the database exists store True in db_exists
    if path.exists(DATABASE):
        db_exists = True

    conn = sqlite3.connect(DATABASE)
    c = conn.cursor()
    c.execute(
        "CREATE TABLE IF NOT EXISTS keywords (keyword TEXT, score REAL, hashtags TEXT)"
    )
    # insert all rows in one call; avoids loop variables that shadow the parameters
    c.executemany(
        "INSERT INTO keywords VALUES (?, ?, ?)", zip(keywords, score, hashtags)
    )
    conn.commit()
    conn.close()
    return db_exists


def collect_extract(filename):
    """Collect keywords from a file and extract them into a database"""
    keywords = []
    score = []
    text = read_file(filename)
    extracted_keyword_score = extract_keywords(text)
    for keyword_score in extracted_keyword_score:
        keywords.append(keyword_score[0])
        score.append(keyword_score[1])
    # feed keyword/score into make_hashtags to generate hashtags
    hashtags = make_hashtags(extracted_keyword_score)
    return keywords, score, hashtags


def extract_and_load(filename):
    """Extract keywords from a file and load them into a database"""
    keywords, score, hashtags = collect_extract(filename)
    status = load_keywords(keywords, score, hashtags)
    return status


# write a function that queries the database and returns the keywords, hashtags and scores
def query_database(order_by="score", limit=10):
    """Query the database and return keywords, hashtags and scores"""
    conn = sqlite3.connect(DATABASE)
    c = conn.cursor()
    c.execute(f"SELECT * FROM keywords ORDER BY {order_by} DESC LIMIT {limit}")
    results = c.fetchall()
    conn.close()
    return results


@click.group()
def cli():
    """An ETL cli"""


@cli.command("etl")
@click.argument("filename", default="text.txt")
def etl(filename):
    """Extract keywords from a file and load them into a database

    Example:
    python etl.py etl text.txt
    """

    path_to_db = path.abspath(DATABASE)
    click.echo(
        click.style(
            f"Running ETL to extract keywords from {filename} and load them into a database: {path_to_db}",
            fg="green",
        )
    )
    result = extract_and_load(filename)
    if result:
        click.echo(click.style("Database already exists", fg="yellow"))
    else:
        click.echo(click.style("Database created", fg="green"))


@cli.command("query")
@click.option("--order_by", default="score", help="Order by score or keyword")
@click.option("--limit", default=10, help="Limit the number of results")
def query(order_by, limit):
    """Query the database and return keywords, hashtags and scores

    Example:
    python etl.py query
    """
    results = query_database(order_by, limit)
    for result in results:
        print(
            click.style(result[0], fg="red"),
            click.style(str(result[1]), fg="green"),
            click.style(result[2], fg="blue"),
        )


@cli.command("delete")
def delete():
    """Delete the database

    Example:
    python etl.py delete
    """
    if path.exists(DATABASE):
        path_to_db = path.abspath(DATABASE)
        click.echo(click.style(f"Deleting database: {path_to_db}", fg="green"))
        os.remove(DATABASE)
    else:
        # database does not exist
        click.echo(click.style("Database does not exist", fg="red"))


if __name__ == "__main__":
    cli()
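
One behavior of this file worth knowing: load_keywords inserts rows on every run and the table has no uniqueness constraint, so running the etl command twice on the same file stores every keyword twice. The sketch below shows one possible way to make the load step idempotent, under the assumption that the keyword itself can serve as the primary key. load_keywords_once and count_rows are hypothetical names, the rows are made up, and an in-memory database is used so the example stays self-contained.

```python
import sqlite3


def load_keywords_once(conn, rows):
    """Insert (keyword, score, hashtag) rows, replacing rows with the same keyword."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords "
        "(keyword TEXT PRIMARY KEY, score REAL, hashtags TEXT)"
    )
    # INSERT OR REPLACE overwrites an existing row that has the same primary key.
    conn.executemany("INSERT OR REPLACE INTO keywords VALUES (?, ?, ?)", rows)
    conn.commit()


def count_rows(conn):
    """Return the number of rows in the keywords table."""
    return conn.execute("SELECT COUNT(*) FROM keywords").fetchone()[0]


# Made-up sample rows in the shape collect_extract produces.
rows = [
    ("data engineering", 0.02, "#dataengineering"),
    ("python", 0.15, "#python"),
]
conn = sqlite3.connect(":memory:")
load_keywords_once(conn, rows)
load_keywords_once(conn, rows)  # the second run does not add duplicates
print(count_rows(conn))
conn.close()
```

The trade-off is that a keyword appearing in two different files keeps only the most recently loaded score, which may or may not be what you want.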

[Exercise 3]: Change the command-line flag to return only five results

The query has to be changed so that it prints only five results. Tracing the function through the code above, the SQL it runs is:

"SELECT * FROM keywords ORDER BY {order_by} DESC LIMIT {limit}"

That is, both the column used for the descending ORDER BY and the LIMIT value can be changed at run time.

Running the query command with --help explains how to use these options. Only the limit needs adjusting, for example: python etl.py query --limit 5

And just like that, the top five rows of the query are printed!
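
Two details of that SQL are worth a closer look. First, the statement is built with an f-string, so whatever is passed as --order_by ends up inside the query text. Second, it sorts with DESC, so when ordering by score the highest (least relevant) scores come first. The sketch below is one possible safer variant, not the repository's code: query_keywords, ALLOWED_COLUMNS, and the descending parameter are hypothetical, and the rows are made up.

```python
import sqlite3

ALLOWED_COLUMNS = {"keyword", "score", "hashtags"}


def query_keywords(conn, order_by="score", limit=5, descending=True):
    """Return up to `limit` rows ordered by a whitelisted column."""
    if order_by not in ALLOWED_COLUMNS:
        raise ValueError(f"order_by must be one of {sorted(ALLOWED_COLUMNS)}")
    direction = "DESC" if descending else "ASC"
    # Column names cannot be bound as parameters, so only the checked name
    # is interpolated; the limit goes through a ? placeholder.
    sql = f"SELECT * FROM keywords ORDER BY {order_by} {direction} LIMIT ?"
    return conn.execute(sql, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (keyword TEXT, score REAL, hashtags TEXT)")
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [("python", 0.15, "#python"), ("etl", 0.02, "#etl"), ("sqlite", 0.09, "#sqlite")],
)
print(query_keywords(conn, limit=2))                    # highest scores first
print(query_keywords(conn, limit=2, descending=False))  # lowest (best) scores first
```

With the ascending variant, a limit of five would return the five most relevant keywords rather than the five with the largest scores.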

[Exercise 4]: Create your own version of this tool in GitHub and extend the database with different metadata using another NLP tool.

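This exercise asks you to build your own version of the tool on GitHub and extend the database with extra metadata from another NLP (Natural Language Processing) tool.

That takes a long time, so I am skipping it in this post for now!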
