Trying the NLP 100 Knocks (言語処理100本ノック) in Python! ~Chapter 1: Warm-up~

It's been a while.

I definitely wasn't slacking off... definitely not...

Excuses aside, let's get on with the post.

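Feeling my lack of fundamentals these days

Python is still the only language I can use, and lately I've been keenly aware that I can't even use Python properly.

So, to build up my fundamentals, I've decided to work through competitive programming problems and the like.

If I'm going to do it, I'm aiming for a level where I can compete in contests.

I'll be studying with speed and conciseness in mind, so if anyone reading this thinks "what is this crap code?", I'd be happy if you left a comment.

As part of that, I tried the NLP 100 Knocks published by the Inui-Okazaki Laboratory at Tohoku University.

I wrote everything in one go in jupyter and am just copy-pasting it here, so individual cells may look a little odd on their own. Please bear with me.

Let's get right to it!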

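Chapter 1: Warm-up

Addendum (4/27): I've collected code I found on the web that is far simpler and faster than mine!
The parts commented "a simpler way" are that code. I learned from it without asking, and I'm thanking you without asking. Thank you!

First, the library imports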

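#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Might not even need graphs this time?
# By the way, elapsed_time is the execution time.
# (%matplotlib inline is a Jupyter magic, not plain Python.)
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time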

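00. Reversing a string

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the end to the start).

start = time.time()
text = "stressed"  # named text so the built-in str is not shadowed
print(text[::-1])  # a step of -1 walks the string backwards
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))

Output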

desserts
elapsed_time:0.000134944915771
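
A side note on the timings in this post: one time.time() measurement of a snippet this small is mostly noise. Below is a minimal sketch of a steadier comparison, assuming Python 3 and the standard timeit module. The helper names reverse_slice and reverse_join are made up for illustration, and the number and repeat values are arbitrary.

```python
import timeit


def reverse_slice(text):
    """Reverse a string with an extended slice."""
    return text[::-1]


def reverse_join(text):
    """Reverse a string with reversed() and join, for comparison."""
    return "".join(reversed(text))


if __name__ == "__main__":
    # Run each candidate many times and keep the best of three rounds.
    # number=100000 and repeat=3 are illustration values, not tuned settings.
    for func in (reverse_slice, reverse_join):
        best = min(timeit.repeat(lambda: func("stressed"), number=100000, repeat=3))
        print("{}: {:.4f} s per 100000 calls".format(func.__name__, best))
```

Taking the minimum over several rounds is the usual way to read timeit results, since the slower rounds mostly reflect other activity on the machine.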

01. 「パタトクカシーー」

Take the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and concatenate them.

start = time.time()
text = u"パタトクカシーー"
#print(text[1])
s = ""
# Indices 0, 2, 4, 6 are the 1st, 3rd, 5th, and 7th characters
for i, ch in enumerate(text):
    if i % 2 == 0:
        s += ch
print(s)
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))
print("")

# A simpler way
start = time.time()
text = u"パタトクカシーー"
print(text[::2])  # a step of 2 takes every other character
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))

Output

パトカー
elapsed_time:0.000699043273926

パトカー
elapsed_time:0.000293016433716
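
The slice in the simpler version has the general form [start:stop:step]. A small sketch, assuming Python 3, of what moving the start by one does to the same string (the variable names are only for illustration):

```python
text = "パタトクカシーー"

# Start at index 0 and take every second character: the 1st, 3rd, 5th, 7th.
odd_positions = text[::2]

# Start at index 1 instead: the 2nd, 4th, 6th, 8th characters.
even_positions = text[1::2]

print(odd_positions)   # パトカー
print(even_positions)  # タクシー
```

So the two slices split the string back into exactly the two words that problem 02 interleaves.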

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

Concatenate the characters of 「パトカー」 and 「タクシー」 alternately from the start to obtain the string 「パタトクカシーー」.

start = time.time()
pa = u"パトカー"
ta = u"タクシー"
s = ""
for i in range(len(pa + ta)):
    # // is integer division in both Python 2 and 3 (plain / breaks on 3.x)
    if i % 2 == 0:
        s += pa[i // 2]
    else:
        s += ta[(i - 1) // 2]
print(s)
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))
print("")

# A simpler way
start = time.time()
pa = u"パトカー"
ta = u"タクシー"
# zip pairs up the characters; sum(..., ()) flattens the pairs into one tuple
print("".join(sum(zip(pa,ta),())))
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))

Output

パタトクカシーー
elapsed_time:0.000337839126587

パタトクカシーー
elapsed_time:0.000164031982422
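
The sum(zip(...), ()) trick builds a new tuple at every step, which is fine for eight characters but grows slowly for long inputs. Here is a hedged alternative sketch, assuming Python 3 and the standard itertools module. The function name interleave is my own, and using zip_longest so that leftover characters of the longer string are kept is a design choice the original problem does not need.

```python
from itertools import chain, zip_longest


def interleave(a, b):
    """Alternate the characters of a and b, keeping any leftover tail."""
    # zip_longest pads the shorter string with "" instead of truncating.
    pairs = zip_longest(a, b, fillvalue="")
    # chain.from_iterable flattens the pairs lazily, without building
    # intermediate tuples the way sum(..., ()) does.
    return "".join(chain.from_iterable(pairs))


print(interleave("パトカー", "タクシー"))  # パタトクカシーー
```

With plain zip the result is the same for these two words, because both have four characters.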

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and build a list of the number of (alphabetic) characters in each word, in order of appearance.

start = time.time()
strList = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.".rstrip().split(" ")
#print(strList)
l = []
for i in strList:
    if i.isalpha():
        l.append(len(i))
    else:
        # The word ends with "," or ".", so drop that last character
        l.append(len(i[:-1]))
print(l)
#print("3.14159265358979")
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))
print("")
# A simpler way (no need to build the list in a loop...)
# So translate accepts None as its first argument. And I had no idea you could just write ".," like that...
# (Python 2 only: str.translate(None, deletechars) deletes the listed characters.)
start = time.time()
s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
print([len(word.translate(None, '.,')) for word in s.split()])
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))

Output

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
elapsed_time:0.000308990478516

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
elapsed_time:0.000229120254517
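
The translate(None, '.,') call above only works on Python 2 byte strings. For reference, a minimal sketch of the same idea on Python 3, where the deletion table comes from str.maketrans. The function name word_lengths is hypothetical, and deleting everything in string.punctuation (rather than only "." and ",") is my own assumption about what counts as non-alphabetic here.

```python
import string


def word_lengths(sentence):
    """Return the length of each word after removing ASCII punctuation."""
    # With three arguments, str.maketrans maps every character in the
    # third argument to None, so translate() deletes it.
    table = str.maketrans("", "", string.punctuation)
    return [len(word.translate(table)) for word in sentence.split()]


sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")
print(word_lengths(sentence))
```

The digits that come out are the leading digits of pi, which is where the problem gets its name.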

04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of words 1, 5, 6, 7, 8, 9, 15, 16, and 19, and the first two characters of every other word. Then build an associative array (a dictionary or map type) from each extracted string to the position of its word (counting from the start).

start = time.time()
strList = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.".rstrip().split(" ")
#print(strList)
dic = {}
num = [1, 5, 6, 7, 8, 9, 15, 16, 19]
# Note: this version maps position -> symbol, the reverse of what the task
# asks for; the simpler version further down maps symbol -> position.
for i,j in enumerate(strList):
    if i+1 in num:
        dic[i+1] = j[0]
    else:
        dic[i+1] = j[0:2]
print(dic)
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))
print("")
# With int keys the dict seems to print in numeric order.
# (That is a CPython implementation detail for small ints, not something to rely on.)

# A simpler way
start = time.time()
s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
d = {}
for i, atom in enumerate(s.split()):
    n = i + 1
    length = 1 if n in (1, 5, 6, 7, 8, 9, 15, 16, 19) else 2
    d[atom[:length]] = n
print(d)
elapsed_time = time.time() - start
print("elapsed_time:{}".format(elapsed_time))

Output

{1: 'H', 2: 'He', 3: 'Li', 4: 'Be', 5: 'B', 6: 'C', 7: 'N', 8: 'O', 9: 'F', 10: 'Ne', 11: 'Na', 12: 'Mi', 13: 'Al', 14: 'Si', 15: 'P', 16: 'S', 17: 'Cl', 18: 'Ar', 19: 'K', 20: 'Ca'}
elapsed_time:0.000382900238037

{'Be': 4, 'C': 6, 'B': 5, 'Ca': 20, 'F': 9, 'S': 16, 'H': 1, 'K': 19, 'Al': 13, 'Mi': 12, 'Ne': 10, 'O': 8, 'Li': 3, 'P': 15, 'Si': 14, 'Ar': 18, 'Na': 11, 'N': 7, 'Cl': 17, 'He': 2}
elapsed_time:0.000311851501465
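
The symbol -> position mapping can also be written as a single dict comprehension. This is only a sketch, assuming Python 3. The name element_positions and its single parameter are made up for illustration.

```python
def element_positions(sentence, single=(1, 5, 6, 7, 8, 9, 15, 16, 19)):
    """Map each extracted symbol to the 1-based position of its word."""
    single = set(single)  # set membership tests are cheap
    return {
        # One character for the listed positions, two for the rest.
        word[:(1 if pos in single else 2)]: pos
        for pos, word in enumerate(sentence.split(), start=1)
    }


sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations "
            "Might Also Sign Peace Security Clause. Arthur King Can.")
symbols = element_positions(sentence)
print(symbols["He"], symbols["K"])  # 2 19
```

enumerate(..., start=1) removes the need for the separate n = i + 1 line.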

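05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, etc.). Use it to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

def get_ngram(n,s,t=1):
    # n: n-gram size, s: string or list, t: 1 = character-wise, 2 = word-wise
    # Note: range(len(...)) also emits trailing fragments shorter than n
    # (visible in the output); range(len(...) - n + 1) would drop them.
    l = []
    if t == 1:# character-wise
        if isinstance(s,list):
            s = ''.join(s)
        #elif isinstance(s,str): 3.x
        elif isinstance(s,basestring): #2.x
            s = s.replace(" ","")  # spaces are removed before windowing

        for i in range(len(s)):
            l.append(s[i:i+n])
        return l

    if t == 2:# word-wise
        # Accept either a string or an already-split list of words
        sList = s.split() if isinstance(s,basestring) else s #2.x
        for i in range(len(sList)):
            l.append(sList[i:i+n])
        return l

if __name__ == "__main__":
    start = time.time()
    print(get_ngram(2,"I am an NLPer",1))
    print(get_ngram(2,"I am an NLPer",2))
    elapsed_time = time.time() - start  # time this cell's own run
    print("elapsed_time:{}".format(elapsed_time))

Output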

['Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er', 'r']
[['I', 'am'], ['am', 'an'], ['an', 'NLPer'], ['NLPer']]
elapsed_time:9.89437103271e-05
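
As the output shows, the function above also returns a leftover 'r' and a lone ['NLPer'], which are not really bi-grams. A minimal sketch of a stricter variant, assuming Python 3: the name ngram is hypothetical, and keeping spaces in the character bi-grams (instead of stripping them as above) is a choice, not something the problem statement settles.

```python
def ngram(seq, n):
    """Return every length-n window of seq, which may be a string or a list."""
    # Stopping at len(seq) - n + 1 means no window is shorter than n.
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


sentence = "I am an NLPer"
word_bigrams = ngram(sentence.split(), 2)  # windows over the list of words
char_bigrams = ngram(sentence, 2)          # windows over the raw characters

print(word_bigrams)
print(char_bigrams)
```

Because slicing works the same way on strings and lists, one function covers both the word and the character case, and the caller decides which by splitting or not.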

06. Sets

Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and find the union, intersection, and difference of X and Y. Also check whether the bi-gram 'se' is contained in X and in Y.

start = time.time()
X = set(get_ngram(2,"paraparaparadise"))
Y = set(get_ngram(2,"paragraph"))

print(X)
print(Y)
print("")
# Union
print("X CUP Y is {}".format(list(X | Y)))
# Intersection
print("X CAP Y is {}".format(list(X & Y)))
# Difference
print("X \\ Y　is {}".format(list(X - Y)))
print("Y \\ X　is {}".format(list(Y - X)))
print("")

print("have \"se\"?")
# Membership can be tested on the set directly
print("X: {}".format("se" in X))
print("Y: {}".format("se" in Y))
print("")

elapsed_time = time.time() - start  # time this cell's own run
print("elapsed_time:{}".format(elapsed_time))

Output

set(['e', 'ad', 'di', 'is', 'ap', 'pa', 'ra', 'ar', 'se'])
set(['gr', 'ag', 'h', 'ap', 'pa', 'ra', 'ph', 'ar'])

X CUP Y is ['e', 'ad', 'ag', 'di', 'h', 'is', 'ap', 'pa', 'ra', 'ph', 'ar', 'se', 'gr']
X CAP Y is ['ap', 'pa', 'ar', 'ra']
X \ Y　is ['is', 'e', 'ad', 'se', 'di']
Y \ X　is ['h', 'ph', 'gr', 'ag']

have "se"?
X: True
Y: False

elapsed_time:0.0019052028656
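
The sets above contain the single characters 'e' and 'h' because get_ngram also returns the short fragment at the end of each string. A sketch, assuming Python 3, of the same exercise with strict bi-grams only (the helper name char_bigrams is made up for illustration):

```python
def char_bigrams(text):
    """Return the set of all two-character windows of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

# sorted() gives a stable print order; sets themselves are unordered.
print("union:       ", sorted(X | Y))
print("intersection:", sorted(X & Y))
print("X - Y:       ", sorted(X - Y))
print("Y - X:       ", sorted(Y - X))
print("'se' in X:", "se" in X, "/ 'se' in Y:", "se" in Y)
```

The intersection and the 'se' checks come out the same as above; only the stray one-character entries disappear from the union and the differences.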

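07. Sentence generation from a template

Implement a function that takes arguments x, y, z and returns the string 「x時のyはz」. Then check the result with x=12, y="気温", z=22.4.

def make_sentence(x,y,z):
    # Fill all three slots with a single format call
    return "{}時の{}は{}".format(x, y, z)

if __name__ == "__main__":
    print(make_sentence(12,"気温",22.4))

Output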

12時の気温は22.4

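08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification:

- Lowercase English letters are replaced with the character whose code is (219 - character code).
- All other characters are output unchanged.

Use this function to encrypt and decrypt an English message.

# This one is so clean that I've pretty much copied someone else's code. Very educational.
def cipher(string):
    return ''.join(chr(219-ord(c)) if c.islower() else c for c in string)
    #return ''.join(chr(219-ord(c)) if 'a'<=c<='z' else c for c in string)
    # Both of the above ran at the same speed.

if __name__=="__main__":
    start = time.time()
    sentence="Hello, world!"
    ciphertext=cipher(sentence)
    print(sentence)
    print(ciphertext)
    print(cipher(ciphertext))
    elapsed_time = time.time() - start  # time this cell's own run
    print("elapsed_time:{}".format(elapsed_time))

Output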

Hello, world!
Hvool, dliow!
Hello, world!
elapsed_time:0.000503063201904
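
Why does applying cipher twice give the original back? 219 is ord('a') + ord('z') = 97 + 122, so 219 - code mirrors the lowercase alphabet: a <-> z, b <-> y, and so on. Mirroring twice lands on the starting letter. A small sketch, assuming Python 3, that checks this. It uses the 'a' <= c <= 'z' test from the commented-out line so that only ASCII lowercase letters are touched.

```python
import string


def cipher(text):
    """Mirror ASCII lowercase letters (a<->z, b<->y, ...); keep the rest."""
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)


# 219 is exactly the sum of the two ends of the lowercase range.
print(ord("a") + ord("z"))  # 219

# The whole alphabet maps onto itself in reverse order...
print(cipher(string.ascii_lowercase) == string.ascii_lowercase[::-1])  # True

# ...so encrypting twice is the same as doing nothing.
message = "Hello, world!"
print(cipher(cipher(message)) == message)  # True
```

This is why the post can use one function for both encryption and decryption.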

09. Typoglycemia

For a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly rearranges the remaining characters. Words of length 4 or less are not rearranged. Give it a suitable English sentence (for example "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

import random

start = time.time()
sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
sList = sentence.split(" ")
l = []
for s in sList:
    if len(s) > 4:
        head = s[0]
        tail = s[-1]
        mid = s[1:-1]
        # random.sample over the full length returns a shuffled copy
        mid = ''.join(random.sample(mid,len(mid)))
        l.append(head+mid+tail)
    else:
        l.append(s)

print(" ".join(l))
elapsed_time = time.time() - start  # time this cell's own run
print("elapsed_time:{}".format(elapsed_time))

Output

I cln'odut bleevie that I cloud acutllay uastnrdned what I was rdneiag : the poeahnneml peowr of the haumn mind .
elapsed_time:0.000520944595337
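
Because the output is random, it's hard to tell at a glance whether the shuffle follows the rules. A hedged sketch, assuming Python 3, that wraps the same logic in a function and accepts a random number generator so a run can be repeated. The function name typoglycemia, the rng parameter, and the seed value are my own choices for illustration.

```python
import random


def typoglycemia(sentence, rng=random):
    """Shuffle the inner letters of every word longer than four characters."""
    words = []
    for word in sentence.split(" "):
        if len(word) > 4:
            middle = list(word[1:-1])
            rng.shuffle(middle)  # in-place shuffle of the inner letters
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)


text = ("I couldn't believe that I could actually understand what I was "
        "reading : the phenomenal power of the human mind .")
# A seeded generator gives the same shuffle on every run (the seed is arbitrary).
print(typoglycemia(text, random.Random(0)))
```

With a seeded random.Random instance the result is reproducible, which makes it possible to check that first and last letters stay put and that each word keeps the same set of letters.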

That's all!!!

Next time is Chapter 2!!
See you then~

MATHGRAM

Mainly about math and programming, occasionally about hobbies.
