
bbPress.org

Opened 7 years ago

Closed 6 years ago

#2194 closed defect (fixed)

Import parser truncates topics & replies when source contains &nbsp;

Reported by: vegasgeek
Owned by: netweb
Milestone: 2.5
Priority: normal
Severity: major
Version: trunk
Component: API - Importers
Keywords: needs-testing
Cc: john@…, stephen@…

Description

While testing out the 2.3 beta 1, I ran multiple imports from SimplePress and found that some of the posts were being cut off. I believe I've tracked it down. It appears that when doing an import, any posts that contain the following string will get truncated:

<p>&nbsp;</p>

If the above shows up in a post, the content before it appears properly, but anything after that is truncated.

Attachments (4)

2194.1.diff (826 bytes) - added by netweb 7 years ago.
sp_sfposts.sql (2.5 KB) - added by netweb 7 years ago.
Example _sfposts.sql export
wp_sfposts.sql (2.4 KB) - added by vegasgeek 7 years ago.
export of single post
2194.2.diff (811 bytes) - added by netweb 7 years ago.
regex now matches 'new line' -> &nbsp; -> 'carriage return'


Change History (22)

#1 @vegasgeek
7 years ago

  • Cc john@… added

#2 @vegasgeek
7 years ago

Just ran another test. In the database the text doesn't have <p></p> wrapped around it; it's simply &nbsp;. However, after running a search/replace on the database first to replace &nbsp; with <br />, the import seems to have worked flawlessly.

#3 @netweb
7 years ago

  • Cc stephen@… added
  • Component changed from General to Importers
  • Summary changed from Import from SimplePress chopping off post content to Import parser truncates topics & replies when source contains &nbsp;

Confirmed with all importers, not just Simple:Press: if the source topic or reply contains a &nbsp;, that topic or reply will be truncated.

Looking into options for a fix in parser.php

Note: &nbsp; is a no-break space, typically used to render multiple consecutive spaces.

e.g. text with a double--space (where each - stands in for a space)
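
As a rough illustration of the note above, the following Python sketch (not the importer's PHP code) shows that the HTML entity &nbsp; decodes to the single Unicode character U+00A0, which is a space and not a line break. The function name `decode_nbsp` and the sample string are made up for this example.

```python
import html

def decode_nbsp(text: str) -> str:
    """Decode HTML entities; &nbsp; becomes the U+00A0 character."""
    return html.unescape(text)

# Hypothetical sample: two no-break spaces used to force a double space.
sample = "double&nbsp;&nbsp;space"
decoded = decode_nbsp(sample)

# Each entity becomes exactly one character, so the text gets shorter.
print(len(sample), len(decoded))

# The decoded character is U+00A0, and no newline has been introduced.
print(decoded.count("\u00a0"), "\n" in decoded)
```

The same content can therefore reach an importer in two forms, the six-character entity or the single U+00A0 character, depending on how the source forum stored it.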

@netweb
7 years ago

#4 @netweb
7 years ago

  • Keywords reporter-feedback added

I'm attaching a patch 2194.1.diff that is specifically for Simple:Press only.

I have tried to keep the scope of this as narrow as possible, so this will ONLY replace the Unicode string '<p>&nbsp;</p>' with the standard HTML '<p>&nbsp;</p>'. If there are other &nbsp; not wrapped in <p></p> tags, the same issue will occur; if it does, let me know and I will modify the patch.

This is not being added to parser.php at this stage, as far too much testing is required and all testing thus far breaks all kinds of htmlspecialchars codes, e.g. &lt; and &gt;. None of the forum databases I have include an HTML or Unicode &nbsp; in the topic or reply content, so I think this is an edge case specific to SimplePress.

If you can test this and let me know that would be great.

#5 @vegasgeek
7 years ago

@netweb Not sure if you saw my comment previous to your patch, but in the database it's not storing <p>&nbsp;</p>, but simply &nbsp;

Here's a screenshot of what the post looks like: http://d.pr/i/KQrO

That being said, I applied the patch and ran a test anyway. And as expected, it didn't work.

@netweb
7 years ago

Example _sfposts.sql export

#6 @netweb
7 years ago

@vegasgeek,

Can you please dump that row from your wp_sfposts table with phpMyAdmin and attach it to this ticket?

  • Open the wp_sfposts table
  • Click Export
  • Export Method: Custom - display all possible options
  • Rows: Dump some row(s)
  • Number of rows: 1
  • Row to begin at: 1234 (Whatever row number that post is)
  • Click 'Go' at the bottom of the page

It should look similar to the sp_sfposts.sql file I just attached to this ticket.

@vegasgeek
7 years ago

export of single post

#7 @vegasgeek
7 years ago

Attached. Let me know if you need anything else.

@netweb
7 years ago

regex now matches 'new line' -> &nbsp; -> 'carriage return'

#8 @netweb
7 years ago

  • Severity changed from normal to major

I uploaded 2194.2.diff and updated the regex to match 'new line' -> &nbsp; -> 'carriage return' and replace it with an HTML line break, <br>.

It works in this specific case using the post you supplied in the SQL.

Again, if there are ANY &nbsp; elsewhere in your data, the same behaviour will occur, with the topic or reply truncated.

I am not really happy with this as a solution to the core problem and will need to look at creating a patch for parser.php. For now, though, this can be used as a workaround for the issue.
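
For readers following along, here is a minimal Python sketch of the kind of replacement described above. The contents of 2194.2.diff are not reproduced in this ticket, so the pattern below is only an assumption based on the description ('new line' -> &nbsp; -> 'carriage return'), and the names `NBSP_LINE` and `replace_nbsp_line` are hypothetical.

```python
import re

# Assumed pattern: a newline, the literal entity "&nbsp;", then a
# carriage return. This is a guess at the shape of the regex, not the
# actual expression from the patch.
NBSP_LINE = re.compile(r"\n&nbsp;\r")

def replace_nbsp_line(text: str) -> str:
    """Replace a &nbsp; that sits alone between line endings with <br>."""
    return NBSP_LINE.sub("<br>", text)

# Hypothetical post body: one &nbsp; on its own line, one used inline.
post = "First paragraph.\n&nbsp;\r\nSecond paragraph with inline&nbsp;space."
converted = replace_nbsp_line(post)

# Only the stand-alone occurrence is replaced; the inline one remains.
print(repr(converted))
```

This also shows why the workaround is narrow: an inline &nbsp; does not match the pattern and is left in place, so it could still trigger the truncation.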

#9 @vegasgeek
7 years ago

I should mention that I solved this issue a different way. I did a search/replace on the wp_sfposts table to replace &nbsp; with nothing. The results for me ended up being spot on. What do you think of doing a str_replace( '&nbsp;', '', $data ) on just the one table prior to the rest of the conversion? It might mess up a little formatting (which was minimal, if not unnoticeable, I might add), but I'd personally take that over truncated posts.

Side note: I have kept a copy of the database prior to conversion so that I can help test patches for this as needed.
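
To make the formatting trade-off concrete, here is a small Python sketch (an illustration only, not bbPress code) comparing two ways of cleaning the entity before conversion: removing it outright, as suggested above, or swapping it for an ordinary space. Both function names and the sample text are hypothetical.

```python
def strip_nbsp(text: str) -> str:
    """Remove every &nbsp; entity, as in the search/replace above."""
    return text.replace("&nbsp;", "")

def nbsp_to_space(text: str) -> str:
    """Alternative: turn every &nbsp; entity into a plain space."""
    return text.replace("&nbsp;", " ")

# Hypothetical post content that uses &nbsp; between words.
sample = "Price:&nbsp;10&nbsp;USD"

print(strip_nbsp(sample))     # words run together
print(nbsp_to_space(sample))  # word spacing is preserved
```

Stripping is the simplest option but joins words that were separated only by the entity; replacing with a plain space keeps them apart at the cost of losing the no-break behaviour.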

#10 @netweb
7 years ago

I originally tried replacing all the &nbsp; occurrences but had issues depending on whether it was used as a line break or inline in the topic/reply text, and whether it was the HTML &nbsp; or the Unicode U+00A0.

The built-in NBBC BBCode parser.php file should be able to parse these properly, and I have tested a couple of options to patch this, but it needs a great deal more testing against all the current forum imports to ensure it works correctly and doesn't break anything else.

For now we can use the above patch and/or a manual search & replace with phpMyAdmin and I will work on testing a patch for a future release.

Last edited 7 years ago by netweb

#11 @vegasgeek
7 years ago

I applied patch 2194.2.diff and re-ran the import process. Happy to report, it ran perfectly and didn't truncate the posts. Woot!

#12 @netweb
7 years ago

Cool... Glad it worked but I am still hesitant to include this in the upcoming release of 2.3.

#13 @vegasgeek
7 years ago

I hear ya. As I said, my forums are already moved, so I'm not itching for it to get in to 2.3, but I'm happy to help test as needed.

#14 @johnjamesjacoby
7 years ago

  • Milestone changed from Awaiting Review to 2.4

Non-break spaces are just spaces, not line breaks.

Moving to 2.4 milestone.

#15 @johnjamesjacoby
7 years ago

  • Milestone changed from 2.4 to 2.5

No time to test. Moving to 2.5.

#16 @netweb
6 years ago

  • Keywords needs-testing added; reporter-feedback removed

Will look at this again as part of testing all the importers before final pages for bbPress 2.5

#17 @johnjamesjacoby
6 years ago

  • Owner set to netweb

#18 @netweb
6 years ago

  • Resolution set to fixed
  • Status changed from new to closed

In 5151:

SimplePress Importer Improvements. Props netweb. Fixes #2194

  • Add reply slug field mapping
  • Add custom regex for non-break spaces in HTML