By da5id


2008-12-18 21:25:37 8 Comments

I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.

I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).

Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.

Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.

4 comments

@oknate 2017-07-05 19:51:13

In my case, this worked just fine:

$text = strip_tags($text, '<p><a><em><span>');

Rather than trying to pull out the stuff you don't want, such as embedded Word XML, you can just specify your allowed tags.
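One caveat: strip_tags() leaves attributes on the allowed tags untouched, so Word's class and style attributes survive the whitelist. A minimal sketch of one way to follow up, assuming the Word markup uses double-quoted style attributes and Mso-prefixed class names (clean_word_html is a hypothetical helper name, not part of the answer above):

```php
<?php
// Hypothetical helper: whitelist tags, then drop Word-specific attributes.
function clean_word_html(string $html): string
{
    // Keep only the whitelisted tags. strip_tags() also drops HTML comments,
    // which covers Word's conditional comment blocks.
    $html = strip_tags($html, '<p><a><em><span>');

    // Assumption: Word class names start with "Mso" (e.g. MsoNormal).
    $html = preg_replace('/\s+class="?Mso\w*"?/i', '', $html);

    // Assumption: any double-quoted style attribute containing an mso-
    // declaration can be removed wholesale.
    $html = preg_replace('/\s+style="[^"]*mso-[^"]*"/i', '', $html);

    return $html;
}

echo clean_word_html('<p class="MsoNormal" style="mso-margin-top-alt:auto">Hi <b>there</b></p>');
// prints: <p>Hi there</p>
```

Removing the whole style attribute is blunt; if TinyMCE's own span styles share an attribute with mso- declarations, they would be lost too.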

@Szél Lajos 2016-05-17 19:14:55

In my case, there was a pattern. The unwanted part always started with

<!-- [if gte mso 9]>

and ended with

<![endif]-->

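So my solution was to cut this block out and keep everything before and after it:

$marker = '<![endif]-->';
// Everything before the first comment opener.
$begin = explode('<!-', $string, 2)[0];
// strrchr() only matches the first character of its needle, so find the full marker with strrpos().
$pos = strrpos($string, $marker);
$end = ($pos === false) ? '' : substr($string, $pos + strlen($marker));
echo $begin . $end;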
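The approach above removes everything from the first comment opener to the last closing marker, so if a post contains several separate blocks, any text between them is lost as well. A minimal sketch of a regex variant that removes each block on its own, assuming every block opens with a "[if ...]>" conditional comment and closes with "<![endif]-->" (strip_conditional_comments is a hypothetical name):

```php
<?php
// Hypothetical helper: remove each Word conditional comment block separately.
function strip_conditional_comments(string $html): string
{
    // Lazy .*? stops at the first closing marker; the s flag lets . span newlines.
    return preg_replace('/<!--\s*\[if[^\]]*\]>.*?<!\[endif\]\s*-->/is', '', $html);
}

echo strip_conditional_comments(
    'A<!--[if gte mso 9]><xml>x</xml><![endif]-->B<!-- [if gte mso 9]>y<![endif]-->C'
);
// prints: ABC
```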

@Isra 2015-02-12 10:25:59

The website http://word2cleanhtml.com/ does a good job of converting from Word. I'm using it from PHP by scraping, to process some legacy HTML, and so far it has worked well (the result is very clean <p>, <b> code). Of course, being an external service, it's not a good fit for online processing like your case.

If you try it and it returns many 400 errors, try filtering the HTML with Tidy first.

@Eran Galperin 2008-12-18 21:39:26

HTML Purifier will create standards-compliant markup and filter out many possible attacks (such as XSS).

For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy, which is a binding for the Tidy HTML utility.

If those don't help you, I suggest you switch to FCKEditor which has this feature built-in.

@da5id 2008-12-18 21:48:06

Thanks, but neither of those appears to cope with MS formatting, which is what I'm primarily interested in. HTML Purifier has it planned for version 3.5 but with "research necessary".

@Eran Galperin 2008-12-18 23:02:11

Then I suggest you switch to fckeditor which can deal with word input. Updated my answer.

@da5id 2008-12-18 23:19:13

Hmm. I previously preferred TinyMCE over FCKeditor for a number of other reasons, but this may sway me. Thanks for the tip & pleased to be accepting my +1 :)

@da5id 2008-12-18 23:21:18

Mind you, (if I switch) I still need to clean all the crap that's already been posted...

@Eran Galperin 2008-12-19 01:59:42

Try the non PHP suggestions in the following link - forums.devarticles.com/general-programming-help-4/…

@Kaivosukeltaja 2012-03-20 11:50:55

Also note that FCKEditor is no longer supported and will have problems with modern browsers, so you should use CKEditor instead. ckeditor.com

@Jon L. 2012-04-13 14:56:54

Just a note, Tidy does indeed cope with MS formatting, and has for years. I was using it 4-5 years ago to strip pasted MS Word content... tidy.sourceforge.net/docs/quickref.html#word-2000
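For reference, the word-2000 option can be passed to PHP's Tidy extension as a config array. This is only a sketch, assuming the tidy extension is installed; tidy_word_html is a hypothetical wrapper name, and the extra bare and show-body-only options are my own choice rather than something suggested in the thread:

```php
<?php
// Hypothetical wrapper; returns the input unchanged when ext-tidy is not loaded.
function tidy_word_html(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }

    $config = [
        'word-2000'      => true, // strip surplus markup from Word's "Save as Web page" output
        'bare'           => true, // strip Microsoft-specific HTML, output plain spaces for &nbsp;
        'show-body-only' => true, // return a fragment rather than a full document
    ];

    $clean = tidy_repair_string($html, $config, 'utf8');

    // Fall back to the original string if Tidy could not repair the input.
    return $clean === false ? $html : $clean;
}

echo tidy_word_html('<p class="MsoNormal">Hello</p>');
```

How much Word markup actually disappears depends on the Tidy version, so it is worth checking the output on a few real posts before running it over existing content.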
