stayhigh: Extracting links from a web page file with re.findall("(?P<url>https?://[^\s]+)", myString)

Wednesday, November 9, 2011


import re
re.findall(r"(?P<url>https?://[^\s]+)", myString)
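As a quick illustration of what this call returns, here is a minimal sketch. The sample string and the name `sample_text` are made up for the example and do not come from the original post. Because the pattern contains exactly one group, `re.findall` returns the text of that group for each match, which here is the whole URL.

```python
import re

# Hypothetical input, not from the original post.
sample_text = "Docs at https://docs.python.org/3/ and a mirror at http://example.com/mirror end"

# One capturing group in the pattern, so findall returns a list of that group's text.
urls = re.findall(r"(?P<url>https?://[^\s]+)", sample_text)
print(urls)
```

Note that `[^\s]+` stops only at whitespace, so on raw HTML it also picks up any quote or tag characters that directly follow the URL.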

result = re.findall(r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]", subject)

The regular expression above still has a few problems and needs improvement.
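The post does not say what the problems were. One thing that can be observed, offered here only as a possible cause, is that the character classes in that pattern list uppercase letters only, so without the `re.IGNORECASE` flag it cannot match ordinary lowercase URLs. The sketch below compares the two cases; `PATTERN` and the sample `subject` string are hypothetical names and data chosen for illustration.

```python
import re

# The broad pattern from the post, unchanged.
PATTERN = r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]"

# Hypothetical sample text, not from the original post.
subject = "see http://example.com/path, and www.python.org."

# Without a flag, [A-Z] does not match lowercase letters.
case_sensitive = re.findall(PATTERN, subject)

# With re.IGNORECASE, the same classes also match lowercase letters.
case_insensitive = re.findall(PATTERN, subject, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

With the flag set, the final character class also keeps trailing punctuation such as `,` and `.` out of each match.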

#2011/11/10 Problem solved (code below)

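refinedList = []
for item in resultList:
    # Keep only the http(s) URL inside each candidate; skip items that have none.
    # The named group must be written (?P<url>...), including the leading "?".
    match = re.search(r"(?P<url>https?://[^\s]+)", item)
    if match:
        element = match.group("url")
        print(element)
        refinedList.append(element)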
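The post does not show where `resultList` comes from. Assuming it holds the matches of the broader pattern above, applied case-insensitively, the two steps could be combined roughly as follows. The function name `extract_http_links`, the constants, and the sample `page` string are all hypothetical.

```python
import re

# Broad candidate pattern from the post; IGNORECASE is an assumption made here.
BROAD = re.compile(
    r"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)
# Narrow pattern used in the refinement loop.
HTTP_ONLY = re.compile(r"(?P<url>https?://[^\s]+)")

def extract_http_links(text):
    """Collect candidates with BROAD, then keep only the http(s) ones."""
    refined = []
    for item in BROAD.findall(text):
        match = HTTP_ONLY.search(item)
        if match:
            refined.append(match.group("url"))
    return refined

# Hypothetical HTML fragment for illustration.
page = '<a href="http://example.com/a">A</a> <a href="ftp://example.com/b">B</a> www.example.org'
links = extract_http_links(page)
print(links)
```

In this sketch the `ftp://` and bare `www.` candidates are dropped by the second step, since neither contains an `http://` or `https://` prefix.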
