Getting TF-IDF Weight

How do you get TF-IDF? I've got some ideas for how to do it manually with Screaming Frog and Excel, but I see others using Python scripts. Does anybody have experience with this? What direction should I go? This might be a good reason to learn some Python.
 
Can you give us some details on your use case, the size of the site, etc.? Certainly, this can be handled at a programming level, and there is a lot of existing support (libraries, etc.) for doing this with Python as well as countless other languages.

That being said, if it involves learning a programming language, and if we're talking about a 100-page site, there are easier ways. For example, one option might be a free trial of OnPage.org; last I checked, their crawler has some TF-IDF functionality. If it's a 1M+ page site, then a programming-level method may be necessary, or budgeting for something like OnPage.org to do it for you. Let us know what you're looking at so we can give you more relevant recommendations.
 
That being said, if it involves learning a programming language, and if we're talking about a 100-page site, there are easier ways. For example, one option might be a free trial of OnPage.org; last I checked, their crawler has some TF-IDF functionality.

I'm looking to recreate the way OnPage takes the first 15 results for a seed keyword and gets the TF-IDF. It's possible this is overkill and I could just stick with OnPage for now, but I'd really like to become less reliant on services if I can do this on my own.

So far I've been feeding scraped SERP data into Excel, breaking the words out into transposed columns, and grouping them by URL. I can get the term frequency that way, but I was struggling with the IDF portion of the formula because I didn't have a good way to match the unique terms by document (URL). Also, the LOG() function in Excel doesn't seem to return the expected results. I would expect log(10,000,000 / 1,000) to equal 4, but the function LOG(10000000, 1000) returns 2.33.
 
Also, the LOG() function in Excel doesn't seem to return the expected results. I would expect log(10,000,000 / 1,000) to equal 4, but the function LOG(10000000, 1000) returns 2.33.

The second parameter of the LOG function is the base you want to use. LOG(10000000, 1000) calculates the logarithm with a base of 1000, not base 10 like you were expecting.

The function you were trying to run was LOG(10000000 / 1000).
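Just to show the difference with actual numbers, here's a quick Python check of the same math as those two Excel formulas:

Code:
import math

# Excel's LOG(10000000, 1000) is log base 1000 of 10,000,000
print(math.log(10_000_000, 1000))      # 2.333... (i.e. 7/3, because 1000**(7/3) == 10**7)

# What you were after: LOG(10000000 / 1000) is log base 10 of 10,000
print(math.log10(10_000_000 / 1_000))  # 4.0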
 
@turbin3 I've only looked at TF-IDF in the context of SERP results. Could you elaborate on how you would apply it when looking at a single site?
 
Well, I typed a bunch of other stuff but thought, "Nope!" I don't want to over-complicate it in my typical fashion.

In essence, what you need is to determine a weighting for each term based on frequency. Document frequency is also important to use in addition to TF-IDF.

A word used more frequently across a greater number of pages isn't necessarily a "good" thing (and it isn't necessarily bad; it's case dependent). In some cases, what's optimal may be consolidating pages into a smaller number of authority pages; in others, it's a greater frequency of the term on a single page. Regardless, you still need to determine a weighting for each term and a document frequency.
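For a rough idea of what I mean, here's a minimal sketch of counting term frequency and document frequency per page. The URLs and text are made-up placeholders; in practice you'd feed in your scraped content:

Code:
import math
from collections import Counter

# Made-up example data: each "document" is the scraped text of one page
docs = {
    "https://example.com/page-1": "blue widgets and red widgets for sale",
    "https://example.com/page-2": "widgets on sale this week",
}

term_freq = {}          # per-page term counts
doc_freq = Counter()    # number of pages each term appears on

for url, text in docs.items():
    tokens = text.lower().split()
    term_freq[url] = Counter(tokens)
    doc_freq.update(set(tokens))   # set() so a term counts once per page

# Simple IDF: log(total pages / pages containing the term)
idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}

print(term_freq["https://example.com/page-1"]["widgets"])  # 2
print(doc_freq["widgets"], idf["widgets"])                  # 2 0.0 (appears on every page)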

As far as the "how", there are certainly easier and faster ways, but I definitely recommend taking a look at the Python scikit-learn library. It has a TON of capability, but the main thing is that there are a significant number of resources and tutorials out there for it. There are lots of individual libraries and one-off GitHub repos out there, but much of the time you're not going to find much help for them. With scikit-learn, there are TONS of video tutorials on YouTube alone. I think that would be a good way to go in terms of building your own robust solution for the future.
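For reference, here's roughly what that looks like with scikit-learn's TfidfVectorizer. This is just a bare-bones sketch; the sample strings are placeholders, and in practice you'd pass in one string of extracted text per URL (e.g. from the scraper below):

Code:
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder documents -- swap in one string of page text per URL
documents = [
    "blue widgets and red widgets for sale",
    "widget reviews and widget comparisons",
    "contact us about widget pricing",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)   # rows = documents, columns = terms

# TF-IDF weights for the first document
# (get_feature_names_out() on recent scikit-learn versions; get_feature_names() on older ones)
terms = vectorizer.get_feature_names_out()
for col in tfidf[0].nonzero()[1]:
    print(terms[col], round(tfidf[0, col], 3))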

As far as getting the data, what I've used in the past and like quite a bit is the Python Beautiful Soup 4 library. I don't have all the details in front of me at the moment, but I'll give you what I have. In short, with the bs4, urllib, and csv libraries, you can easily create a simple bot that scrapes a set of pages from a text file, strips most of the HTML, and dumps out a CSV of the actual content for each page. Here's a simple Python script as an example. It's been so long since I've used it that I honestly can't remember whether I wrote it or based it on something from someone else, so I apologize in advance:

Code:
import csv
import urllib.request

from bs4 import BeautifulSoup


def getPageText(url):

    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # strip boilerplate elements so only the main content remains
    for footer in soup(["footer"]):
        footer.extract()
    for header in soup(["header"]):
        header.extract()
    for title in soup(["title"]):
        title.extract()
    for h1 in soup(["h1"]):
        h1.extract()
    # example site-specific filters: a div with id "background", spans with class "stuff"
    for div in soup("div", id="background"):
        div.extract()
    for span in soup("span", "stuff"):
        span.extract()

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())

    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

    # drop blank lines
    text = "\n".join(chunk for chunk in chunks if chunk)

    print(text)

    # append one row per page: the source URL and its extracted text
    with open("output.csv", "a", newline="", encoding="utf-8") as out:
        csv.writer(out).writerow([url, text])


def main():

    # one URL per line in source.txt; skip blank lines
    with open("source.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        getPageText(url)


if __name__ == "__main__":
    main()

Basically, dump your XML sitemap URLs (just the URLs, each on its own line), or any other list of URLs, into the source.txt file (or rename it to whatever you like). Run the script and it should output a CSV with one line per page. If you look in each cell, you'll see all of that page's content separated by new lines (\n) within the cell. Another column in that row holds the original source URL, so you can easily match the content to its source.
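So source.txt would just look something like this (placeholder URLs):

Code:
https://example.com/page-1
https://example.com/page-2
https://example.com/page-3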

Also, take note of the "kill all" section. If you find particularly problematic blocks of HTML, or if you want to narrow down exactly what gets scraped, you can add for loops for the elements you specifically want to filter out.
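For example, something along these lines (the element names and the "sidebar" class are just placeholders for whatever your pages actually use). These would go inside getPageText(), right next to the existing extract() loops:

Code:
    # hypothetical extra filters -- adjust to the markup you actually want to drop
    for nav in soup(["nav", "aside"]):
        nav.extract()
    for div in soup("div", class_="sidebar"):
        div.extract()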

Maybe not the best way, but it's kind of nice for other purposes too, in that the output is fairly readable in that format versus a mass of unformatted words. It's been a long time since I used that script, so let me know if it fails at scraping the content.
 

This is super helpful, thank you! I've got a lot to learn; going to start with scikit-learn and some Python tutorials.
 