Parsey McParseface + Syntactic Ngrams = Weekend Fun

CCarter

Warning: Super Nerd mode

Alright, I've been doing these small weekend projects every now and then to expand my skillset, and I came across this for rudimentary A.I.-level parsing (God forbid anyone wasting time spinning content realizes what Google is slowly becoming capable of): Google is open sourcing "Parsey McParseface", an English language parser.

Parsey McParseface recovers individual dependencies between words with over 94% accuracy, beating our own previous state-of-the-art results, which were already better than any previous approach. While there are no explicit studies in the literature about human performance, we know from our in-house annotation projects that linguists trained for this task agree in 96-97% of the cases.

Sauce: Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source

Get started here: https://github.com/tensorflow/models/tree/master/syntaxnet
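
If you want to poke at it without digging through the whole README, here's a minimal sketch of driving the bundled demo script from Python. It assumes you've already cloned tensorflow/models and built SyntaxNet with bazel per the repo's instructions; syntaxnet/demo.sh is the entry point their README uses.

# Minimal sketch: pipe a sentence into SyntaxNet's bundled demo script.
# Assumes tensorflow/models is cloned and SyntaxNet is built with bazel
# per the README; demo.sh reads one sentence per line on stdin and
# prints an ASCII dependency tree (word, POS tag, dependency label).
import subprocess

result = subprocess.run(
    ["syntaxnet/demo.sh"],
    input="Bob brought the pizza to Alice.",
    capture_output=True,
    text=True,
    cwd="models/syntaxnet",  # adjust to wherever you cloned the repo
)
print(result.stdout)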

I'm going to go crazy running this against the Syntactic Ngrams dataset - over 350 billion words from the 3.5 million English-language books in Google Books - and see the outcome: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
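
For anyone who wants to play along, here's a rough reader for those files. The field layout is my reading of the dataset's documentation - head word, the ngram itself, a total count, then per-year counts, all tab-separated, with each token shaped like word/pos-tag/dep-label/head-index - so double-check it against the docs, and the filename below is just a placeholder.

# Sketch of a reader for the syntactic-ngrams shards. Field layout is my
# reading of the dataset docs (Goldberg & Orwant, 2013): per line,
# head_word <TAB> syntactic-ngram <TAB> total_count <TAB> year,count pairs,
# where each ngram token is word/pos-tag/dep-label/head-index.
import gzip
from collections import namedtuple

Token = namedtuple("Token", ["word", "pos", "dep", "head"])

def read_syntactic_ngrams(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            head_word, ngram, total = fields[0], fields[1], int(fields[2])
            tokens = []
            for tok in ngram.split(" "):
                # rsplit because words themselves can contain slashes;
                # head index 0 means the token is the syntactic root
                word, pos, dep, head = tok.rsplit("/", 3)
                tokens.append(Token(word, pos, dep, int(head)))
            yield head_word, tokens, total

# Placeholder filename - check the index page for the real shard names.
for head, toks, count in read_syntactic_ngrams("arcs.00-of-99.gz"):
    if head == "weekend":
        print(count, " ".join(t.word for t in toks))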

More Sauce: http://research.google.com/research-outreach.html#/research-outreach/research-datasets

I'll keep exploring this language parsing since I've got a couple of ideas that have been sitting on the back burner that I can use it for. I'll report back if anything comes of these small experiments - maybe another SaaS, who knows.

A bit more about Parsey McParseface: Has Google's Parsey McParseface just solved one of the world's biggest language problems?
 
Spun content is toast.

Just like anchor text ratios, all they'll have to do is run a giant filter across their index for pages that have an insane number of tri-grams that don't occur naturally, and deflate their link juice power or deindex them altogether.
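
Back-of-the-napkin version of what that filter could look like - toy Python, invented threshold, obviously nothing like whatever Google actually runs:

# Toy sketch of the filter described above: flag a page when too many of
# its word trigrams never occur in a reference corpus. The reference
# text and the 30% threshold are made up purely for illustration.
import re
from collections import Counter

def trigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:], words[2:]))

def unnatural_ratio(page_text, reference_counts):
    grams = trigrams(page_text)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if reference_counts[g] == 0)
    return unseen / len(grams)

# At Google's scale the reference would be something like the Books ngram
# counts, not an in-memory Counter over one sentence.
reference = Counter(trigrams("the quick brown fox jumps over the lazy dog"))

page = "quick the fox brown over jumps dog lazy the"  # spun-looking word salad
if unnatural_ratio(page, reference) > 0.3:  # arbitrary threshold
    print("flag: likely spun content")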

This will create a huge shortcut for finding bad link nodes too (PBNs, Web 2.0s). If there's already a certain amount of confidence that it's a network, you can add this metric in and raise the confidence. There's going to be another round of destruction coming soon.

I feel bad for anyone who makes their living selling spun content. I predict that these low effort networks and tiers are going to begin using massive amounts of syndicated content. Then the new issue will be getting pages indexed, which will inspire more spam links.

The barrier to entry for all hats is getting higher and higher. There's going to be a lot less competition out there but what's left is going to be ridiculously strong and possibly more savage since more and more will be at risk for each player.
 
Interesting stuff. I pinged Alex from WordAI on Skype and he said they're one step ahead of this:

WordAi addressed everything mentioned in that thread three years ago (http://wordai.com/blog/wordai-now-backed-by-over-50-petabytes-of-intelligence/), so everyone there is a little late to the party. All of the points made in that thread may be completely true for any other spinner, but we are working on the most cutting edge technology before it shows up publicly.

As mentioned in the timestamped WordAi blog post I linked to above, we work directly with the research teams of several Fortune 500 companies on many natural language processing techniques, and there is one team in particular who allows us to train all of WordAi on their entire (50+ petabyte) index. This is bigger than any dataset Google has made publicly available - like Google we have access to datasets that are far bigger and more comprehensive than what the rest of the public has access to. So any techniques that Google may be using related to this technology, we were ready for all the way back in 2013.

Now with that being said, trigram scores by themselves do not properly predict the validity of a sentence, so filtering sites with low trigram scores would be a very bad idea anyways. For instance we have done studies of our own showing that a random article on Ezine Articles will have better trigram scores than a New York Times article, because the New York Times article will use more sophisticated (and uncommon) language.

Thanks for linking me to that thread - it was very flattering. If this is what everyone views as cutting edge it makes me feel very good about the technology we are currently working on. With any luck in three years I'll see another thread with people talking about the things we are researching and implementing right now!
 
Spinners that are just plucking synonyms out of a thesaurus are going to be screwed for the reasons mentioned above, especially users trying to go for 90% or higher uniqueness. It's gibberish and any human eye can see that without even reading it. And this public parser is already at 94% accuracy while trained human linguists agree in 96-97% of cases. It's done for.

Straight syndication will do fine, but indexation will become a nightmare when enough people jump on it.

Spinners who think they're being slick by doing sentence-level and "syntax"-level spinning - actually rearranging sentences based on adjectives, nouns, verbs, and all of that - might do a little better if they don't worry about being completely unique.

The argument quoted above is a misdirection piece. Being able to beat anything publicly available has nothing to do with beating Google's private databases and algorithms. What goes public is already ancient.

I do agree with the actual basis of the argument, particularly the trigram detection. Google will only be able to hit the absolute outliers for the reason stated about vocabulary sophistication. And then spammers will dial down the uniqueness, and Google will chase them until they hit the realm of civilian casualties like they did with Penguin. Once the real fear strikes and they see the amount of spinning decreasing, they'll back off again and move to the next target.
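
You can see that vocabulary effect with a toy model - invented numbers, not WordAi's setup or Google's, just Laplace-smoothed trigram counts:

# Toy illustration: a raw trigram score penalizes rare-but-correct
# vocabulary, so "sophisticated" prose can score worse than plain prose.
import math
from collections import Counter

corpus = ("the cat sat on the mat . the dog sat on the mat . "
          "the cat ran to the door .").split()
counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
vocab_size = len(set(corpus))

def avg_logprob(sentence):
    words = sentence.split()
    grams = list(zip(words, words[1:], words[2:]))
    total = sum(counts.values())
    # Laplace smoothing so unseen trigrams get a small nonzero probability
    return sum(math.log((counts[g] + 1) / (total + vocab_size ** 3))
               for g in grams) / len(grams)

plain = "the cat sat on the mat"            # common phrasing, seen trigrams
fancy = "the indolent feline reposed there" # fine English, unseen trigrams
print(avg_logprob(plain) > avg_logprob(fancy))  # True: rarer words score worse

The point being: a pure frequency filter can't tell a New York Times feature from word salad without a lot more context, which is why it'd only catch the absolute outliers.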
 
I think some of y'all are thinking way too low level if the only thing you can come up with for Parsey McParseface is creating more spun spam content to pollute the internet.

I would never waste an ounce of energy creating bottom-of-the-barrel content when there are real-life communication problems that haven't been solved and can be solved if you approach them from a first-principles angle instead of from current solutions.

Imagine a world where you can communicate with people around the globe to the point where they completely understand you down to the nuances, idioms, and slang, because there is a platform that goes beyond "translating" and actually explains what another person is asking for, commenting on, or trying to convey as an idea.

Ask yourself where you see yourself in 2-5 years, and even 10 years from now. If you truly believe the tactics of today will work tomorrow... well, okay. But for anyone serious about the future, predicting the future is best done by creating it, not going backwards.

Think Forward, not backwards, and definitely Think Bigger.
 